Sunday, January 9, 2022

Customer Journey Analysis using R(plotly)-Part 2

Customer Journey Analysis


Introduction

Customers are the single most important entity in any value chain.Everything and anything under the sun is designed to please customer so that they buy products,use medicines/drugs and what not. They are the ones who actually move the needle and hence it is important to understand what is their different touchpoints with the external stimulus.In other words, how exactly they move in their lets say shopping experience helps companies fine tune the touchpoints to better identify selling opportunities.In this blog, we will take a small dummy example to understand how different customers go through the marketing campaigns driven by a company

We will be using plotly library to create the journey plots or the Sankey Charts.The main task here is to create the dataset that will be used as an input for creating Sankey Chart

package.name<-c("dplyr","stringr","plotly")

for(i in package.name){

  if(!require(i,character.only = T)){

    install.packages(i)
  }
  library(i,character.only = T)

}

Mocking up Input Data

This is a sample data of 20 customers and how various marketing channels are used to target it

set.seed(42)
individual <- as.character(rep(1:20,each=5))
timeperiod <- paste0(rep(c(0, 18,36,54,72),20),"_week")
Type <- factor(sample(c("Email", "Freebees", "Coupon",  "Freebees","Vouchers"), 100, replace=T))
d <- data.frame(individual, timeperiod, Type)
head(d)
  individual timeperiod     Type
1          1     0_week    Email
2          1    18_week Vouchers
3          1    36_week    Email
4          1    54_week    Email
5          1    72_week Freebees
6          2     0_week Freebees
  • There are 20 customers in this data
  • Each customer is targeted in 0,18,36,54 and 72nd
  • Email,Vouchers,Freebees and Coupons are the channels used to target customers


Understanding the data steps

Below are the basic steps laid out to convert transnational data into structure suitable for creating Sankey chart.This step can also be taken as a workflow that creates source to target type of model


Brief description of the steps are :

  • Step 1:Creating the data at individual, time and channel level which is sorted on time
  • Step 2:Compressing the channel(separated by |) at individual level
  • Step 3:Adding source and target fields based on sequence
  • Step 4:Padding source and target fields based on sequence number


Step 1:Creating the data at individual, time and channel level which is sorted on time

We already have the data at the said granularity.If the data is not in the desired form, we should use aggregation

Step 2:Compressing the channel

d1<-d%>%
  select(individual,Type)%>%
  group_by(individual)%>%
  summarise(Promo_Combined = gsub(", ","|",toString(Type)))%>%
  ungroup()%>%
  group_by(Promo_Combined)%>%
  summarise(Total_Count=n())

# Renaming to give generic names

colnames(d1)<-c("Combined_Key","Total_Records")
head(d1)
# A tibble: 6 x 2
  Combined_Key                               Total_Records
  <chr>                                              <int>
1 Coupon|Email|Email|Coupon|Freebees                     1
2 Coupon|Freebees|Freebees|Freebees|Freebees             1
3 Coupon|Freebees|Vouchers|Email|Freebees                1
4 Coupon|Vouchers|Vouchers|Freebees|Vouchers             1
5 Email|Email|Vouchers|Coupon|Email                      1
6 Email|Freebees|Vouchers|Freebees|Email                 1

Step 3:Adding source and target

For every compressed channel at an individual level, create source and target fields by traversing through the length of the compressed value and assign source and target values in each iteration

l0<-list()
for(k in 1:nrow(d1)){
  
  path_descr<-str_split(d1[k,"Combined_Key"],'[|]')[[1]]
  len<-length(str_split(d1[k,"Combined_Key"],'[|]')[[1]])
  
  l1<-list()
  for(i in 1:len){
    
    if(len!=1 & i!=len){
      
      interim.df<-data.frame(Source=str_c(i,"_",path_descr[i]),
                             Target=str_c(i+1,"_",path_descr[i+1]),
                             Total_Count=d1[k,"Total_Records"])
    
      l1[[i]]<-interim.df
      
    }else{
      
      interim.df<-data.frame(Source=str_c(i,"_",path_descr[i]),
                             Target=str_c(i+1,"_","End of Activity"),
                             Total_Count=d1[k,"Total_Records"])
      
      l1[[i]]<-interim.df
      
    }
  }
  
  l0[[k]]<-do.call(rbind.data.frame,l1)
  
  
}

interim.df<-do.call(rbind.data.frame,l0)%>%
  filter(!is.na(Target))

head(interim.df)
      Source            Target Total_Records
1   1_Coupon           2_Email             1
2    2_Email           3_Email             1
3    3_Email          4_Coupon             1
4   4_Coupon        5_Freebees             1
5 5_Freebees 6_End of Activity             1
6   1_Coupon        2_Freebees             1

Step 4:Padding up source and target

The steps in the code have been labelled as #1, #2,#3 and #4.Their description is as below

  • 1: Removing NA values from target
  • 2: Aggregating total customer count at source and target level
  • 3: Creating source padded variable as Actual_source
  • 4: Creating target padded variable as Actual_target


final.df<-interim.df%>%
  filter(!is.na(Target))%>% # 1
  group_by(Source,Target)%>%
  summarise(Total_Customer=n())%>% # 2
  ungroup()%>%
  mutate(Source_Actual=sapply(Source,function(x){ # 3
    
   # x<-"2_Discount"  
   padding<- as.numeric(str_split(x,"_")[[1]][1])
   src_Val<- str_split(x,"_")[[1]][2]
   z<-str_pad(src_Val, max(nchar(src_Val))+padding, pad = " ", side = "left")
   return(z)
    
  }))%>%
  mutate(Target_Actual=sapply(Target,function(x){ # 4
    
    # x<-"2_Discount"  
    padding<- as.numeric(str_split(x,"_")[[1]][1])
    src_Val<- str_split(x,"_")[[1]][2]
    z<-str_pad(src_Val, max(nchar(src_Val))+padding, pad = " ", side = "left")
    return(z)
    
  }))

head(final.df)
# A tibble: 6 x 5
  Source   Target     Total_Customer Source_Actual Target_Actual
  <chr>    <chr>               <int> <chr>         <chr>        
1 1_Coupon 2_Email                 1 " Coupon"     "  Email"    
2 1_Coupon 2_Freebees              2 " Coupon"     "  Freebees" 
3 1_Coupon 2_Vouchers              1 " Coupon"     "  Vouchers" 
4 1_Email  2_Email                 1 " Email"      "  Email"    
5 1_Email  2_Freebees              1 " Email"      "  Freebees" 
6 1_Email  2_Vouchers              2 " Email"      "  Vouchers" 


Step 5:Setting up Node and label for Sankey Chart

Node labels for a Sankey chart will be the unique values from Actual_Source and Actual_Target variable.

nm is nothing but node numbers starting from 0.These are numbers given to Node labels

Here we are also trying to ensure that labels with the same name(without padded spaces) gets the same color in the chart

lb<-unique(as.character(c(final.df$Source_Actual,final.df$Target_Actual)))

ld_df<-data.frame(label = lb,
                  nm = 0:(length(lb)-1))%>%
  mutate(actual_label=trimws(label))%>%
  mutate(color=ifelse(actual_label=="Coupon","blue",
                      ifelse(actual_label=="Email","red",
                             ifelse(actual_label=="Freebees","orange",
                                    ifelse(actual_label=="Vouchers","grey","black")))))


Step 6:Creating the Sankey Chart

Getting the node numbers for Actual_Source and Actual_Target by joining with lb_df

final.df2<-final.df%>%
  left_join(ld_df%>%
              select(label,nm)%>%
              rename(Source_Actual=label),"Source_Actual")%>%
  rename(source_label=nm)%>%
  left_join(ld_df%>%
              select(label,nm)%>%
              rename(Target_Actual=label),"Target_Actual")%>%
  rename(target_label=nm)


fig <- plot_ly(
  type = "sankey",
  orientation = "h",
  
  node = list(
    label = lb,
    color = ld_df$color,
    pad = 15,
    thickness = 20,
    line = list(
      color = "black",
      width = 0.5
    )
  ),
  
  link = list(
    source = as.character(final.df2$source_label),
    target = as.character(final.df2$target_label),
    value =  final.df2$Total_Customer
  )
)
fig <- fig %>% layout(
  title = "Customer Marketing Journey",
  font = list(
    size = 10
  )
)

fig


Final Note

We saw how a powerful and interactive Sankey chart can be created using plotly library using the steps discussed in the blog.This set up can be used in pharma claims data to create patient journeys, do cash flow movement within various share classes and so on

My Youtube Channel

No comments:

Post a Comment

Web Scraping Tutorial 4- Getting the busy information data from Popular time page from Google

Popular Times Popular Times In this blog we will try to scrape the ...