Machine Learning Made Easy: Customer Journey Analysis using R(plotly)-Part 2

Customer Journey Analysis

Introduction

Customers are the single most important entity in any value chain.Everything and anything under the sun is designed to please customer so that they buy products,use medicines/drugs and what not. They are the ones who actually move the needle and hence it is important to understand what is their different touchpoints with the external stimulus.In other words, how exactly they move in their lets say shopping experience helps companies fine tune the touchpoints to better identify selling opportunities.In this blog, we will take a small dummy example to understand how different customers go through the marketing campaigns driven by a company

We will be using plotly library to create the journey plots or the Sankey Charts.The main task here is to create the dataset that will be used as an input for creating Sankey Chart

package.name<-c("dplyr","stringr","plotly")

for(i in package.name){

  if(!require(i,character.only = T)){

    install.packages(i)
  }
  library(i,character.only = T)

}

Mocking up Input Data

This is a sample data of 20 customers and how various marketing channels are used to target it

set.seed(42)
individual <- as.character(rep(1:20,each=5))
timeperiod <- paste0(rep(c(0, 18,36,54,72),20),"_week")
Type <- factor(sample(c("Email", "Freebees", "Coupon",  "Freebees","Vouchers"), 100, replace=T))
d <- data.frame(individual, timeperiod, Type)
head(d)

  individual timeperiod     Type
1          1     0_week    Email
2          1    18_week Vouchers
3          1    36_week    Email
4          1    54_week    Email
5          1    72_week Freebees
6          2     0_week Freebees

There are 20 customers in this data
Each customer is targeted in 0,18,36,54 and 72nd
Email,Vouchers,Freebees and Coupons are the channels used to target customers

Understanding the data steps

Below are the basic steps laid out to convert transnational data into structure suitable for creating Sankey chart.This step can also be taken as a workflow that creates source to target type of model

Brief description of the steps are :

Step 1:Creating the data at individual, time and channel level which is sorted on time
Step 2:Compressing the channel(separated by |) at individual level
Step 3:Adding source and target fields based on sequence
Step 4:Padding source and target fields based on sequence number

Step 1:Creating the data at individual, time and channel level which is sorted on time

We already have the data at the said granularity.If the data is not in the desired form, we should use aggregation

Step 2:Compressing the channel

d1<-d%>%
  select(individual,Type)%>%
  group_by(individual)%>%
  summarise(Promo_Combined = gsub(", ","|",toString(Type)))%>%
  ungroup()%>%
  group_by(Promo_Combined)%>%
  summarise(Total_Count=n())

# Renaming to give generic names

colnames(d1)<-c("Combined_Key","Total_Records")
head(d1)

# A tibble: 6 x 2
  Combined_Key                               Total_Records
  <chr>                                              <int>
1 Coupon|Email|Email|Coupon|Freebees                     1
2 Coupon|Freebees|Freebees|Freebees|Freebees             1
3 Coupon|Freebees|Vouchers|Email|Freebees                1
4 Coupon|Vouchers|Vouchers|Freebees|Vouchers             1
5 Email|Email|Vouchers|Coupon|Email                      1
6 Email|Freebees|Vouchers|Freebees|Email                 1

Step 3:Adding source and target

For every compressed channel at an individual level, create source and target fields by traversing through the length of the compressed value and assign source and target values in each iteration

l0<-list()
for(k in 1:nrow(d1)){
  
  path_descr<-str_split(d1[k,"Combined_Key"],'[|]')[[1]]
  len<-length(str_split(d1[k,"Combined_Key"],'[|]')[[1]])
  
  l1<-list()
  for(i in 1:len){
    
    if(len!=1 & i!=len){
      
      interim.df<-data.frame(Source=str_c(i,"_",path_descr[i]),
                             Target=str_c(i+1,"_",path_descr[i+1]),
                             Total_Count=d1[k,"Total_Records"])
    
      l1[[i]]<-interim.df
      
    }else{
      
      interim.df<-data.frame(Source=str_c(i,"_",path_descr[i]),
                             Target=str_c(i+1,"_","End of Activity"),
                             Total_Count=d1[k,"Total_Records"])
      
      l1[[i]]<-interim.df
      
    }
  }
  
  l0[[k]]<-do.call(rbind.data.frame,l1)
  
  
}

interim.df<-do.call(rbind.data.frame,l0)%>%
  filter(!is.na(Target))

head(interim.df)

      Source            Target Total_Records
1   1_Coupon           2_Email             1
2    2_Email           3_Email             1
3    3_Email          4_Coupon             1
4   4_Coupon        5_Freebees             1
5 5_Freebees 6_End of Activity             1
6   1_Coupon        2_Freebees             1

Step 4:Padding up source and target

The steps in the code have been labelled as #1, #2,#3 and #4.Their description is as below

1: Removing NA values from target
2: Aggregating total customer count at source and target level
3: Creating source padded variable as Actual_source
4: Creating target padded variable as Actual_target

final.df<-interim.df%>%
  filter(!is.na(Target))%>% # 1
  group_by(Source,Target)%>%
  summarise(Total_Customer=n())%>% # 2
  ungroup()%>%
  mutate(Source_Actual=sapply(Source,function(x){ # 3
    
   # x<-"2_Discount"  
   padding<- as.numeric(str_split(x,"_")[[1]][1])
   src_Val<- str_split(x,"_")[[1]][2]
   z<-str_pad(src_Val, max(nchar(src_Val))+padding, pad = " ", side = "left")
   return(z)
    
  }))%>%
  mutate(Target_Actual=sapply(Target,function(x){ # 4
    
    # x<-"2_Discount"  
    padding<- as.numeric(str_split(x,"_")[[1]][1])
    src_Val<- str_split(x,"_")[[1]][2]
    z<-str_pad(src_Val, max(nchar(src_Val))+padding, pad = " ", side = "left")
    return(z)
    
  }))

head(final.df)

# A tibble: 6 x 5
  Source   Target     Total_Customer Source_Actual Target_Actual
  <chr>    <chr>               <int> <chr>         <chr>        
1 1_Coupon 2_Email                 1 " Coupon"     "  Email"    
2 1_Coupon 2_Freebees              2 " Coupon"     "  Freebees" 
3 1_Coupon 2_Vouchers              1 " Coupon"     "  Vouchers" 
4 1_Email  2_Email                 1 " Email"      "  Email"    
5 1_Email  2_Freebees              1 " Email"      "  Freebees" 
6 1_Email  2_Vouchers              2 " Email"      "  Vouchers"

Step 5:Setting up Node and label for Sankey Chart

Node labels for a Sankey chart will be the unique values from Actual_Source and Actual_Target variable.

nm is nothing but node numbers starting from 0.These are numbers given to Node labels

Here we are also trying to ensure that labels with the same name(without padded spaces) gets the same color in the chart

lb<-unique(as.character(c(final.df$Source_Actual,final.df$Target_Actual)))

ld_df<-data.frame(label = lb,
                  nm = 0:(length(lb)-1))%>%
  mutate(actual_label=trimws(label))%>%
  mutate(color=ifelse(actual_label=="Coupon","blue",
                      ifelse(actual_label=="Email","red",
                             ifelse(actual_label=="Freebees","orange",
                                    ifelse(actual_label=="Vouchers","grey","black")))))

Step 6:Creating the Sankey Chart

Getting the node numbers for Actual_Source and Actual_Target by joining with lb_df

final.df2<-final.df%>%
  left_join(ld_df%>%
              select(label,nm)%>%
              rename(Source_Actual=label),"Source_Actual")%>%
  rename(source_label=nm)%>%
  left_join(ld_df%>%
              select(label,nm)%>%
              rename(Target_Actual=label),"Target_Actual")%>%
  rename(target_label=nm)

fig <- plot_ly(
  type = "sankey",
  orientation = "h",
  
  node = list(
    label = lb,
    color = ld_df$color,
    pad = 15,
    thickness = 20,
    line = list(
      color = "black",
      width = 0.5
    )
  ),
  
  link = list(
    source = as.character(final.df2$source_label),
    target = as.character(final.df2$target_label),
    value =  final.df2$Total_Customer
  )
)
fig <- fig %>% layout(
  title = "Customer Marketing Journey",
  font = list(
    size = 10
  )
)

fig

Final Note

We saw how a powerful and interactive Sankey chart can be created using plotly library using the steps discussed in the blog.This set up can be used in pharma claims data to create patient journeys, do cash flow movement within various share classes and so on

R Complete Guide

https://www.aimlmadeeasy.com/2020/06/r-complete-guide.html

My Youtube Channel

Link

Machine Learning Made Easy

Sunday, January 9, 2022

Customer Journey Analysis using R(plotly)-Part 2