Customer Journey Analysis
Parag Verma
10th Jan, 2022
Introduction
Customers are the single most important entity in any value chain.Everything and anything under the sun is designed to please customer so that they buy products,use medicines/drugs and what not. They are the ones who actually move the needle and hence it is important to understand what is their different touchpoints with the external stimulus.In other words, how exactly they move in their lets say shopping experience helps companies fine tune the touchpoints to better identify selling opportunities.In this blog, we will take a small dummy example to understand how different customers go through the marketing campaigns driven by a company
We will be using plotly library to create the journey plots or the Sankey Charts.The main task here is to create the dataset that will be used as an input for creating Sankey Chart
package.name<-c("dplyr","stringr","plotly")
for(i in package.name){
if(!require(i,character.only = T)){
install.packages(i)
}
library(i,character.only = T)
}
Mocking up Input Data
This is a sample data of 20 customers and how various marketing channels are used to target it
set.seed(42)
individual <- as.character(rep(1:20,each=5))
timeperiod <- paste0(rep(c(0, 18,36,54,72),20),"_week")
Type <- factor(sample(c("Email", "Freebees", "Coupon", "Freebees","Vouchers"), 100, replace=T))
d <- data.frame(individual, timeperiod, Type)
head(d)
individual timeperiod Type
1 1 0_week Email
2 1 18_week Vouchers
3 1 36_week Email
4 1 54_week Email
5 1 72_week Freebees
6 2 0_week Freebees
- There are 20 customers in this data
- Each customer is targeted in 0,18,36,54 and 72nd
- Email,Vouchers,Freebees and Coupons are the channels used to target customers
Understanding the data steps
Below are the basic steps laid out to convert transnational data into structure suitable for creating Sankey chart.This step can also be taken as a workflow that creates source to target type of model
Brief description of the steps are :
- Step 1:Creating the data at individual, time and channel level which is sorted on time
- Step 2:Compressing the channel(separated by |) at individual level
- Step 3:Adding source and target fields based on sequence
- Step 4:Padding source and target fields based on sequence number
Step 1:Creating the data at individual, time and channel level which is sorted on time
We already have the data at the said granularity.If the data is not in the desired form, we should use aggregation
Step 2:Compressing the channel
d1<-d%>%
select(individual,Type)%>%
group_by(individual)%>%
summarise(Promo_Combined = gsub(", ","|",toString(Type)))%>%
ungroup()%>%
group_by(Promo_Combined)%>%
summarise(Total_Count=n())
# Renaming to give generic names
colnames(d1)<-c("Combined_Key","Total_Records")
head(d1)
# A tibble: 6 x 2
Combined_Key Total_Records
<chr> <int>
1 Coupon|Email|Email|Coupon|Freebees 1
2 Coupon|Freebees|Freebees|Freebees|Freebees 1
3 Coupon|Freebees|Vouchers|Email|Freebees 1
4 Coupon|Vouchers|Vouchers|Freebees|Vouchers 1
5 Email|Email|Vouchers|Coupon|Email 1
6 Email|Freebees|Vouchers|Freebees|Email 1
Step 3:Adding source and target
For every compressed channel at an individual level, create source and target fields by traversing through the length of the compressed value and assign source and target values in each iteration
l0<-list()
for(k in 1:nrow(d1)){
path_descr<-str_split(d1[k,"Combined_Key"],'[|]')[[1]]
len<-length(str_split(d1[k,"Combined_Key"],'[|]')[[1]])
l1<-list()
for(i in 1:len){
if(len!=1 & i!=len){
interim.df<-data.frame(Source=str_c(i,"_",path_descr[i]),
Target=str_c(i+1,"_",path_descr[i+1]),
Total_Count=d1[k,"Total_Records"])
l1[[i]]<-interim.df
}else{
interim.df<-data.frame(Source=str_c(i,"_",path_descr[i]),
Target=str_c(i+1,"_","End of Activity"),
Total_Count=d1[k,"Total_Records"])
l1[[i]]<-interim.df
}
}
l0[[k]]<-do.call(rbind.data.frame,l1)
}
interim.df<-do.call(rbind.data.frame,l0)%>%
filter(!is.na(Target))
head(interim.df)
Source Target Total_Records
1 1_Coupon 2_Email 1
2 2_Email 3_Email 1
3 3_Email 4_Coupon 1
4 4_Coupon 5_Freebees 1
5 5_Freebees 6_End of Activity 1
6 1_Coupon 2_Freebees 1
Step 4:Padding up source and target
The steps in the code have been labelled as #1, #2,#3 and #4.Their description is as below
- 1: Removing NA values from target
- 2: Aggregating total customer count at source and target level
- 3: Creating source padded variable as Actual_source
- 4: Creating target padded variable as Actual_target
final.df<-interim.df%>%
filter(!is.na(Target))%>% # 1
group_by(Source,Target)%>%
summarise(Total_Customer=n())%>% # 2
ungroup()%>%
mutate(Source_Actual=sapply(Source,function(x){ # 3
# x<-"2_Discount"
padding<- as.numeric(str_split(x,"_")[[1]][1])
src_Val<- str_split(x,"_")[[1]][2]
z<-str_pad(src_Val, max(nchar(src_Val))+padding, pad = " ", side = "left")
return(z)
}))%>%
mutate(Target_Actual=sapply(Target,function(x){ # 4
# x<-"2_Discount"
padding<- as.numeric(str_split(x,"_")[[1]][1])
src_Val<- str_split(x,"_")[[1]][2]
z<-str_pad(src_Val, max(nchar(src_Val))+padding, pad = " ", side = "left")
return(z)
}))
head(final.df)
# A tibble: 6 x 5
Source Target Total_Customer Source_Actual Target_Actual
<chr> <chr> <int> <chr> <chr>
1 1_Coupon 2_Email 1 " Coupon" " Email"
2 1_Coupon 2_Freebees 2 " Coupon" " Freebees"
3 1_Coupon 2_Vouchers 1 " Coupon" " Vouchers"
4 1_Email 2_Email 1 " Email" " Email"
5 1_Email 2_Freebees 1 " Email" " Freebees"
6 1_Email 2_Vouchers 2 " Email" " Vouchers"
Step 5:Setting up Node and label for Sankey Chart
Node labels for a Sankey chart will be the unique values from Actual_Source and Actual_Target variable.
nm is nothing but node numbers starting from 0.These are numbers given to Node labels
Here we are also trying to ensure that labels with the same name(without padded spaces) gets the same color in the chart
lb<-unique(as.character(c(final.df$Source_Actual,final.df$Target_Actual)))
ld_df<-data.frame(label = lb,
nm = 0:(length(lb)-1))%>%
mutate(actual_label=trimws(label))%>%
mutate(color=ifelse(actual_label=="Coupon","blue",
ifelse(actual_label=="Email","red",
ifelse(actual_label=="Freebees","orange",
ifelse(actual_label=="Vouchers","grey","black")))))
Step 6:Creating the Sankey Chart
Getting the node numbers for Actual_Source and Actual_Target by joining with lb_df
final.df2<-final.df%>%
left_join(ld_df%>%
select(label,nm)%>%
rename(Source_Actual=label),"Source_Actual")%>%
rename(source_label=nm)%>%
left_join(ld_df%>%
select(label,nm)%>%
rename(Target_Actual=label),"Target_Actual")%>%
rename(target_label=nm)
fig <- plot_ly(
type = "sankey",
orientation = "h",
node = list(
label = lb,
color = ld_df$color,
pad = 15,
thickness = 20,
line = list(
color = "black",
width = 0.5
)
),
link = list(
source = as.character(final.df2$source_label),
target = as.character(final.df2$target_label),
value = final.df2$Total_Customer
)
)
fig <- fig %>% layout(
title = "Customer Marketing Journey",
font = list(
size = 10
)
)
fig
Final Note
We saw how a powerful and interactive Sankey chart can be created using plotly library using the steps discussed in the blog.This set up can be used in pharma claims data to create patient journeys, do cash flow movement within various share classes and so on