Thursday, January 13, 2022

Customer Journey Analysis Python(Sankey Chart)

Customer Journey Analysis using Sankey Charts
In [459]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:103% !important;margin-left: -2%; }</style>"))

Introduction

Customers are the single most important entity in any value chain.Everything and anything under the sun is designed to please customer so that they buy products,use medicines/drugs and what not. They are the ones who actually move the needle and hence it is important to understand what is their different touchpoints with the external stimulus.In other words, how exactly they move in their lets say shopping experience helps companies fine tune the touchpoints to better identify selling opportunities.In this blog, we will take a small dummy example to understand how different customers go through the marketing campaigns driven by a company

We will be using plotly library to create the journey plots or the Sankey Charts.The main task here is to create the dataset that will be used as an input for creating Sankey Chart

In [5]:
import pandas as pd
import numpy as np
import random

Mocking up Input Data

In [11]:
individual = np.repeat(list(range(1,21)),5)
timperiod = ['0_week','18_week','36_week','54_week','72_week']*20
Type = np.random.choice(['Email','Freebees','Coupon','Freebees','Vouchers'],100)


df = pd.DataFrame(np.c_[individual,timperiod,Type])
df.columns = ['Individual','timperiod','Type']
df.head()
Out[11]:
Individual timperiod Type
0 1 0_week Freebees
1 1 18_week Freebees
2 1 36_week Vouchers
3 1 54_week Freebees
4 1 72_week Vouchers
  • There are 20 customers in this data
  • Each customer is targeted in 0,18,36,54 and 72nd Week
  • Email,Vouchers,Freebees and Coupons are the channels used to target customers

In [464]:
from IPython.display import Image
Image(filename="basic_workflow.jpg")
Out[464]:

Brief description of the steps are :

  • Step 1:Creating the data at individual, time and channel level which is sorted on time
  • Step 2:Compressing the channel(separated by |) at individual level
  • Step 3:Adding source and target fields based on sequence
  • Step 4:Padding source and target fields based on sequence number
In [25]:
individual = np.repeat(list(range(1,21)),5)
print(timperiod)
['0_week' '0_week' '0_week' '0_week' '0_week' '18_week' '18_week'
 '18_week' '18_week' '18_week' '36_week' '36_week' '36_week' '36_week'
 '36_week' '54_week' '54_week' '54_week' '54_week' '54_week' '72_week'
 '72_week' '72_week' '72_week' '72_week']

Step 1:Creating the data at individual,time and channel level which is sorted on time

We already have the data at the said granularity.If the data is not in the desired form, we should use aggregation

Step 2:Compressing the channel

In [13]:
df.head()
Out[13]:
Individual timperiod Type
0 1 0_week Freebees
1 1 18_week Freebees
2 1 36_week Vouchers
3 1 54_week Freebees
4 1 72_week Vouchers
In [89]:
df['Promo_Combined'] = df.groupby('Individual')['Type'].transform(lambda x: '|'.join(x))
df2 = df[['Individual','Promo_Combined']].drop_duplicates()
d0 = pd.DataFrame(df2['Promo_Combined'])
d0.head()
Out[89]:
Promo_Combined
0 Freebees|Freebees|Vouchers|Freebees|Vouchers
5 Freebees|Coupon|Vouchers|Freebees|Freebees
10 Freebees|Coupon|Freebees|Email|Email
15 Vouchers|Freebees|Freebees|Email|Freebees
20 Vouchers|Freebees|Email|Coupon|Coupon
In [90]:
d1=pd.DataFrame(d0.groupby('Promo_Combined').size())
d1.reset_index(inplace=True)
d1.columns=[['Promo_Combined','Total_Records']]
d1.head()
Out[90]:
Promo_Combined Total_Records
0 Coupon|Freebees|Freebees|Email|Freebees 1
1 Coupon|Freebees|Vouchers|Email|Vouchers 1
2 Coupon|Vouchers|Freebees|Vouchers|Vouchers 1
3 Email|Freebees|Vouchers|Freebees|Email 1
4 Email|Vouchers|Vouchers|Freebees|Freebees 1
In [178]:
d1=d0.groupby('Promo_Combined').size().reset_index(name='Total_Records')
d1.head()
Out[178]:
Promo_Combined Total_Records
0 Coupon|Freebees|Freebees|Email|Freebees 1
1 Coupon|Freebees|Vouchers|Email|Vouchers 1
2 Coupon|Vouchers|Freebees|Vouchers|Vouchers 1
3 Email|Freebees|Vouchers|Freebees|Email 1
4 Email|Vouchers|Vouchers|Freebees|Freebees 1

Step 3:Adding source and target

For every compressed channel at an individual level, create source and target fields by traversing through the length of the compressed value and assign source and target values in each iteration

Logic building using single individual example

In [184]:
test=d1['Promo_Combined'][:1]
test
Out[184]:
0    Coupon|Freebees|Freebees|Email|Freebees
Name: Promo_Combined, dtype: object

Getting individual components

In [185]:
path_descr=test[0].split("|")
path_descr
Out[185]:
['Coupon', 'Freebees', 'Freebees', 'Email', 'Freebees']

Getting length of the element

In [186]:
path_len=len(test[0].split("|"))
path_len
Out[186]:
5
In [187]:
Total_Count=pd.Series(d1['Total_Records'][:1])
Out[187]:
0    1
Name: Total_Records, dtype: int64

Now we have to traverse through path_descr and as we go along we will create source and target values where the first element in each iteration will be the source and next element will be the target

In [188]:
l1=[]
for i in range(0,path_len):
    
    if(path_len!=1 and i!=(path_len-1)):
        
        source=pd.Series(str(i+1) + "_" + path_descr[i])
        target=pd.Series(str(i+2) + "_" + path_descr[i+1])
        Total_Count=pd.Series(d1['Total_Records'][:1])
        interim_df0=pd.DataFrame(np.c_[source,target,Total_Count],columns = ["source","target","Total_Count"])
        l1.append(interim_df0)
    else:
        source=pd.Series(str(i+1) + "_" + path_descr[i])
        target=pd.Series("End of Activity")
        Total_Count=pd.Series(d1['Total_Records'][:1])
        interim_df0=pd.DataFrame(np.c_[source,target,Total_Count],columns = ["source","target","Total_Count"])
        l1.append(interim_df0)

interim_df=pd.concat(l1)
interim_df.head()
Out[188]:
source target Total_Count
0 1_Coupon 2_Freebees 1
0 2_Freebees 3_Freebees 1
0 3_Freebees 4_Email 1
0 4_Email 5_Freebees 1
0 5_Freebees End of Activity 1

We saw how the compressed sequence '['Coupon', 'Freebees', 'Freebees', 'Email', 'Freebees']' was converted into source and target data frame

In [190]:
d1.shape[0]
Out[190]:
19

Repeating the above process for all sequences in d1

In [258]:
l1=[]

for j in range(0,d1.shape[0]):
    test=d1['Promo_Combined'][j]
    path_descr=test.split("|")
    path_len=len(test.split("|"))
    
    l2=[]
    for i in range(0,path_len):

        if(path_len!=1 and i!=(path_len-1)):

            source=pd.Series(str(i+1) + "_" + path_descr[i])
            target=pd.Series(str(i+2) + "_" + path_descr[i+1])
            Total_Count=pd.Series(d1['Total_Records'][:1])
            interim_df0=pd.DataFrame(np.c_[source,target,Total_Count],columns = ["source","target","Total_Count"])
            l2.append(interim_df0)
        else:
            source=pd.Series(str(i+1) + "_" + path_descr[i])
            target=pd.Series("End of Activity")
            Total_Count=pd.Series(d1['Total_Records'][:1])
            interim_df0=pd.DataFrame(np.c_[source,target,Total_Count],columns = ["source","target","Total_Count"])
            l2.append(interim_df0)

    interim_df=pd.concat(l2)
    l1.append(interim_df)
    
In [264]:
interim_df2=pd.concat(l1)
interim_df2.head(10)
Out[264]:
source target Total_Count
0 1_Coupon 2_Freebees 1
0 2_Freebees 3_Freebees 1
0 3_Freebees 4_Email 1
0 4_Email 5_Freebees 1
0 5_Freebees End of Activity 1
0 1_Coupon 2_Freebees 1
0 2_Freebees 3_Vouchers 1
0 3_Vouchers 4_Email 1
0 4_Email 5_Vouchers 1
0 5_Vouchers End of Activity 1

Step 4:Padding up source and target

The steps in the code have been labelled as # 1, # 2,# 3 and # 4.Their description is as below

  • 1: Removing NA values
  • 2:Aggregating total customer count at source and target level
  • 3:Creating source padded variable as Actual_source
  • 4:Creating target padded variable as Actual_target

1: Removing NA values

In [269]:
interim_df3=interim_df2.dropna()
interim_df3.head()
Out[269]:
source target Total_Count
0 1_Coupon 2_Freebees 1
0 2_Freebees 3_Freebees 1
0 3_Freebees 4_Email 1
0 4_Email 5_Freebees 1
0 5_Freebees End of Activity 1

2:Aggregating total customer count at source and target level

In [271]:
interim_df4=interim_df3.groupby(['source','target'])['Total_Count'].sum().reset_index(name='Total_Records')
interim_df4.head()
Out[271]:
source target Total_Records
0 1_Coupon 2_Freebees 2
1 1_Coupon 2_Vouchers 1
2 1_Email 2_Freebees 1
3 1_Email 2_Vouchers 1
4 1_Freebees 2_Coupon 2
In [408]:
def space_padding(x):
    y=x.split("_")[0]
    
    try:
        y1=int(y)
        y2=y1*"_"
        z=y2 + x.split("_")[1]
    except:
        z=y

    return(z)
    
In [409]:
# Testing the function
space_padding("3_Vouchers")
Out[409]:
'___Vouchers'

Adding '_' padding source

In [410]:
source_ls=list(interim_df4['source'])
Actual_source=[space_padding(x) for x in source_ls]
print(Actual_source)
['_Coupon', '_Coupon', '_Email', '_Email', '_Freebees', '_Freebees', '_Freebees', '_Freebees', '_Vouchers', '__Coupon', '__Coupon', '__Email', '__Freebees', '__Freebees', '__Freebees', '__Freebees', '__Vouchers', '__Vouchers', '___Coupon', '___Email', '___Email', '___Freebees', '___Freebees', '___Freebees', '___Vouchers', '___Vouchers', '___Vouchers', '____Coupon', '____Email', '____Email', '____Email', '____Email', '____Freebees', '____Freebees', '____Freebees', '____Freebees', '____Vouchers', '____Vouchers', '____Vouchers', '_____Coupon', '_____Email', '_____Freebees', '_____Vouchers']

Adding '_' padding target

In [411]:
target_ls=list(interim_df4['target'])
Actual_target=[space_padding(x) for x in target_ls]
print(Actual_target)
['__Freebees', '__Vouchers', '__Freebees', '__Vouchers', '__Coupon', '__Email', '__Freebees', '__Vouchers', '__Freebees', '___Freebees', '___Vouchers', '___Coupon', '___Coupon', '___Email', '___Freebees', '___Vouchers', '___Freebees', '___Vouchers', '____Freebees', '____Coupon', '____Vouchers', '____Email', '____Freebees', '____Vouchers', '____Email', '____Freebees', '____Vouchers', '_____Coupon', '_____Coupon', '_____Email', '_____Freebees', '_____Vouchers', '_____Coupon', '_____Email', '_____Freebees', '_____Vouchers', '_____Coupon', '_____Freebees', '_____Vouchers', 'End of Activity', 'End of Activity', 'End of Activity', 'End of Activity']
In [425]:
interim_df4['Actual_source']=Actual_source
interim_df4['Actual_target']=Actual_target
interim_df4.head()
Out[425]:
source target Total_Records Actual_source Actual_target Actual_source2
0 1_Coupon 2_Freebees 2 _Coupon __Freebees Coupon
1 1_Coupon 2_Vouchers 1 _Coupon __Vouchers Coupon
2 1_Email 2_Freebees 1 _Email __Freebees Email
3 1_Email 2_Vouchers 1 _Email __Vouchers Email
4 1_Freebees 2_Coupon 2 _Freebees __Coupon Freebees
In [429]:
interim_df4['Actual_source2'] = interim_df4['Actual_source'].apply(lambda x: x.replace("_", " "))
interim_df4['Actual_target2'] = interim_df4['Actual_target'].apply(lambda x: x.replace("_", " "))
interim_df4.head(10)
Out[429]:
source target Total_Records Actual_source Actual_target Actual_source2 Actual_target2
0 1_Coupon 2_Freebees 2 _Coupon __Freebees Coupon Freebees
1 1_Coupon 2_Vouchers 1 _Coupon __Vouchers Coupon Vouchers
2 1_Email 2_Freebees 1 _Email __Freebees Email Freebees
3 1_Email 2_Vouchers 1 _Email __Vouchers Email Vouchers
4 1_Freebees 2_Coupon 2 _Freebees __Coupon Freebees Coupon
5 1_Freebees 2_Email 1 _Freebees __Email Freebees Email
6 1_Freebees 2_Freebees 4 _Freebees __Freebees Freebees Freebees
7 1_Freebees 2_Vouchers 3 _Freebees __Vouchers Freebees Vouchers
8 1_Vouchers 2_Freebees 4 _Vouchers __Freebees Vouchers Freebees
9 2_Coupon 3_Freebees 1 __Coupon ___Freebees Coupon Freebees

Step 5:Setting up Node and label for Sankey Chart

In [431]:
lb_source=list(np.unique(np.array(interim_df4['Actual_source2'])))
lb_target=list(np.unique(np.array(interim_df4['Actual_target2'])))
lb_combined=list(np.unique(np.array(lb_source + lb_target)))
print(lb_combined)
['     Coupon', '     Email', '     Freebees', '     Vouchers', '    Coupon', '    Email', '    Freebees', '    Vouchers', '   Coupon', '   Email', '   Freebees', '   Vouchers', '  Coupon', '  Email', '  Freebees', '  Vouchers', ' Coupon', ' Email', ' Freebees', ' Vouchers', 'End of Activity']

Step 6:Setting up Node number and color for nodes

In [432]:
node_number=list(range(len(lb_combined)))
node_number
Out[432]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
In [433]:
actual_label=[x.lstrip() for x in lb_combined]
print(actual_label)
['Coupon', 'Email', 'Freebees', 'Vouchers', 'Coupon', 'Email', 'Freebees', 'Vouchers', 'Coupon', 'Email', 'Freebees', 'Vouchers', 'Coupon', 'Email', 'Freebees', 'Vouchers', 'Coupon', 'Email', 'Freebees', 'Vouchers', 'End of Activity']
In [438]:
lb_df=pd.DataFrame(np.c_[lb_combined,actual_label,node_number])
lb_df.columns=['label','actual_label','node_number']
lb_df.head()
Out[438]:
label actual_label node_number
0 Coupon Coupon 0
1 Email Email 1
2 Freebees Freebees 2
3 Vouchers Vouchers 3
4 Coupon Coupon 4
In [439]:
color=np.where(lb_df.actual_label == 'Coupon',"blue",
         np.where(lb_df.actual_label == 'Email',"red",
                 np.where(lb_df.actual_label == 'Freebees',"orange",
                         np.where(lb_df.actual_label == 'Vouchers',"grey","black"))))

lb_df['color']=color
lb_df.head()
Out[439]:
label actual_label node_number color
0 Coupon Coupon 0 blue
1 Email Email 1 red
2 Freebees Freebees 2 orange
3 Vouchers Vouchers 3 grey
4 Coupon Coupon 4 blue

Step 7:Creating the Sankey Chart

In [440]:
interim_df4.head(2)
Out[440]:
source target Total_Records Actual_source Actual_target Actual_source2 Actual_target2
0 1_Coupon 2_Freebees 2 _Coupon __Freebees Coupon Freebees
1 1_Coupon 2_Vouchers 1 _Coupon __Vouchers Coupon Vouchers
In [441]:
lb_df.head(2)
Out[441]:
label actual_label node_number color
0 Coupon Coupon 0 blue
1 Email Email 1 red

Getting node number for source labels

In [447]:
lb_df=lb_df.rename({'label': 'Actual_source2'}, axis=1)
lb_df_interim=lb_df[['Actual_source2','node_number']]
interim_df5=interim_df4.merge(lb_df_interim)
interim_df5=interim_df5.rename({'node_number': 'source_label'}, axis=1)
interim_df5.head()
Out[447]:
source target Total_Records Actual_source Actual_target Actual_source2 Actual_target2 source_label
0 1_Coupon 2_Freebees 2 _Coupon __Freebees Coupon Freebees 16
1 1_Coupon 2_Vouchers 1 _Coupon __Vouchers Coupon Vouchers 16
2 1_Email 2_Freebees 1 _Email __Freebees Email Freebees 17
3 1_Email 2_Vouchers 1 _Email __Vouchers Email Vouchers 17
4 1_Freebees 2_Coupon 2 _Freebees __Coupon Freebees Coupon 18

Getting node number for target labels

In [452]:
lb_df=lb_df.rename({'Actual_source2': 'Actual_target2'}, axis=1)
lb_df.head()
Out[452]:
Actual_target2 actual_label node_number color
0 Coupon Coupon 0 blue
1 Email Email 1 red
2 Freebees Freebees 2 orange
3 Vouchers Vouchers 3 grey
4 Coupon Coupon 4 blue
In [454]:
lb_df_interim=lb_df[['Actual_target2','node_number']]
interim_df6=interim_df5.merge(lb_df_interim)
interim_df6=interim_df6.rename({'node_number': 'target_label'}, axis=1)
interim_df6.head()
Out[454]:
source target Total_Records Actual_source Actual_target Actual_source2 Actual_target2 source_label target_label
0 1_Coupon 2_Freebees 2 _Coupon __Freebees Coupon Freebees 16 14
1 1_Email 2_Freebees 1 _Email __Freebees Email Freebees 17 14
2 1_Freebees 2_Freebees 4 _Freebees __Freebees Freebees Freebees 18 14
3 1_Vouchers 2_Freebees 4 _Vouchers __Freebees Vouchers Freebees 19 14
4 1_Coupon 2_Vouchers 1 _Coupon __Vouchers Coupon Vouchers 16 15
In [462]:
import plotly.graph_objects as go
import plotly.offline as pyo
pyo.init_notebook_mode()
In [463]:
fig = go.Figure(data=[go.Sankey(
    node = dict(
      pad = 15,
      thickness = 20,
      line = dict(color = "black", width = 0.5),
      label = lb_combined,
      color = list(lb_df['color'])
    ),
    link = dict(
      source = list(interim_df6['source_label']),
      target = list(interim_df6['target_label']),
      value = list(interim_df6['Total_Records'])
  ))])

fig.update_layout(title_text="Customer Marketing Journey", font_size=10)
fig.show()

Embed Shiny

Please wait...