Wednesday, January 19, 2022

Retail Targeting using Machine Learning

Retail Targeting using Machine Learning


Introduction

Targeting is usually done as part of Marketing exercise where customers are divided into segments and then specific activities are designed for each segment so that customers responds to these drives.In the current blog, we will look at a retails data and try to use Machine Learning to identify potential customers that should be targeted as part of promotional activities


Step 1: Read Data

The onlineretail dataset is part of onlineretail package. Packages such as ROCR and plotROC are imported to calculate certain evaluation metrics post modelling exercise. Lets import the dataset and look at the first few records

  InvoiceNo StockCode                         Description Quantity
1    536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER        6
2    536365     71053                 WHITE METAL LANTERN        6
3    536365    84406B      CREAM CUPID HEARTS COAT HANGER        8
4    536365    84029G KNITTED UNION FLAG HOT WATER BOTTLE        6
5    536365    84029E      RED WOOLLY HOTTIE WHITE HEART.        6
6    536365     22752        SET 7 BABUSHKA NESTING BOXES        2
          InvoiceDate UnitPrice CustomerID        Country
1 2010-12-01 08:26:00      2.55      17850 United Kingdom
2 2010-12-01 08:26:00      3.39      17850 United Kingdom
3 2010-12-01 08:26:00      2.75      17850 United Kingdom
4 2010-12-01 08:26:00      3.39      17850 United Kingdom
5 2010-12-01 08:26:00      3.39      17850 United Kingdom
6 2010-12-01 08:26:00      7.65      17850 United Kingdom

The brief description of the columns in onlineretail are given below:

  • InvoiceNo: Integral number uniquely assigned to each transaction
  • StockCode: A number uniquely assigned to each product
  • Description: Product Description
  • Quantity: Units order for every transaction
  • InvoiceDate: Date of making the transaction
  • UnitPrice: dollars amount for each unit
  • CustomerID: A number uniquely assigned to each customer
  • Country: The country where a customer resides


In this analysis, we will identify most valued customers by using their total purchased value as the proxy to create a Binary indicator.We will then use a logistic regression to assign probability to each customer. Customers with high probability should be targeted in the next marketing drive


Step 1b: Create additional Features

We will create the following features from the dataset:

  • Frequency: Total number of transaction made by the customer
  • Total_Value: Total dollar value for the transactions done by the customer
  • Time_Spread: Total time difference between first and last transaction of a customer
  • Target: If the total dollar value for a customer is greater than the 80th percentiale of Overall dollar value, then it is “Yes”, otherwise “No”
interim.df<-df%>%
  arrange(Country,InvoiceDate)%>%
  group_by(CustomerID,Country)%>%
  summarise(Min_Date=min(InvoiceDate),
            Max_Date=max(InvoiceDate),
            Frequency=n(),
            Total_Quantity=sum(Quantity),
            Mean_Price=mean(UnitPrice))%>%
  ungroup()%>%
  mutate(Total_Value=Total_Quantity*Mean_Price,
         Time_Spread=Max_Date-Min_Date)%>%
  mutate(Percentile_80=quantile(Total_Value,0.80))%>%
  mutate(Target=ifelse(Total_Value > Percentile_80,"Yes","No" ))%>%
  select(Country,Frequency,Total_Quantity,Total_Value,Time_Spread,Target)%>%
  as.data.frame()%>%
  mutate(Total_Quantity2=ifelse(Total_Quantity <= 0,1,Total_Quantity))%>%
  filter(Total_Value > 0)
  

head(interim.df)
  Country Frequency Total_Quantity Total_Value   Time_Spread Target
1 Iceland       182           2458   6498.9790 31539300 secs    Yes
2 Finland        31           2341  13495.4874 24429840 secs    Yes
3   Italy        73            631   5230.3849        0 secs    Yes
4  Norway        17            197    756.7118        0 secs     No
5  Norway        95            470  10939.1263 22471440 secs    Yes
6 Bahrain         4             20    121.5000        0 secs     No
  Total_Quantity2
1            2458
2            2341
3             631
4             197
5             470
6              20

We have removed customer who have negative Total Quantity.These are basically return orders

Time_Spread currently is in seconds.Lets convert it into days

time_spread<-as.numeric(round(interim.df$Time_Spread/(24*60*60)))
time_spread[1:5]
[1] 365 283   0   0 260
interim.df$Time_Spread2<-time_spread

Step 2: Exploratory Data Analyses

Here we will try to analyse the following:

  • Unique values in the ‘Frequency’ and ‘Target’ variable
  • Mean value of ‘Total_Value’,‘Total_Quantity2’ and ‘Time_Spread2’ variable across unique values of ‘Target’ variable

2a: Unique values in the ‘Target’ variable

table(interim.df[["Target"]])

  No  Yes 
3456  878 

Proportion of customers who purchase is 878/(878+3456) ~ 20% Even though this is typically way less than this(around 1-5%) we will use this as a base to understand some of the key metrics in targeted marketing


2b: Unique values in the ‘Frequency’ variable

Lets look at min andmax dates first

df%>%
  summarise(Min_Date=min(InvoiceDate),
            Max_Date=max(InvoiceDate))
             Min_Date            Max_Date
1 2010-12-01 08:26:00 2011-12-09 12:50:00

So we are looking at 1 year of data.

interim.df%>%
  group_by(Frequency)%>%
  summarise(Total_Customer=n())%>%
  arrange(desc(Total_Customer))%>%
  head(10)
# A tibble: 10 x 2
   Frequency Total_Customer
       <int>          <int>
 1         6             78
 2        10             74
 3         7             72
 4        12             72
 5        11             70
 6         8             68
 7         9             67
 8        16             66
 9        21             66
10        19             62

Maximum numbers of customers have purchased 6 to 12 times in the year


2c: Mean value of ‘Total_Value’,‘Total_Quantity2’ and ‘Time_Spread2’ variable across unique values of ‘Target’ variable

eda.df<-interim.df%>%
  group_by(Target)%>%
  summarise(TotalRecords=n(),
            Total_Quantity=mean(Total_Quantity2),
            Total_Value=mean(Total_Value),
            Time_Spread=mean(Time_Spread2))%>%
  gather(key='Variable',value='Mean',Total_Quantity:Time_Spread)


lapply(c("Total_Quantity","Total_Value","Time_Spread"),function(x){
  
  
  p<-ggplot(data=eda.df%>%
              as.data.frame()%>%
              filter(Variable==x), aes(x=Target, y=Mean)) +
    geom_bar(stat="identity",fill = "orange")+
    xlab("Target")+ylab(paste0("Mean Value of ",x))
  
  return(p)
  
})
[[1]]


[[2]]


[[3]]

We can see that there is a clear difference between mean levels of Total Value,Total Spread2 and Total Quantity2.So these variables should be used to model the potential customers.Also please note that since Target is derived from Total Value, it will not be used in the model


Step 3: Splitting into Train(70%) and Test(30%) Dataset

One should note that the ratio of Yes and No in the target variable should remain almost the same in the test and train datasets.

      Yes 
0.2027695 
      Yes 
0.2021522 

We can see that the proportion of Yes in both the test and train dataset in roughly the same. Hence these two datasets have been created correctly

Step 4: Training Model

Since the range of numeric variables is huge, hence we will take log of Frequency and Total_Quantity2 to improve fit


Call:
glm(formula = Target ~ log(Frequency) + log(Total_Quantity2) + 
    Time_Spread2, family = "binomial", data = train.df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.2763  -0.1770  -0.0311  -0.0018   4.9822  

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -2.763e+01  1.326e+00 -20.839  < 2e-16 ***
log(Frequency)        3.736e-01  1.031e-01   3.625 0.000289 ***
log(Total_Quantity2)  3.570e+00  1.935e-01  18.457  < 2e-16 ***
Time_Spread2          1.728e-03  8.033e-04   2.151 0.031470 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3058.59  on 3032  degrees of freedom
Residual deviance:  982.99  on 3029  degrees of freedom
AIC: 990.99

Number of Fisher Scoring iterations: 8

The results can be interpreted as:

  • Frequency has a positive impact of prospective customer identification.If a customer purchases more frequently, then he should be segmented into high valued segment and targeted accordingly
  • Similar inference can be made around Total_Quantity2 and Time_Spread2


Step 5: Getting the optimal threshold value

Normally the aim of all model building exercise is to maximise accuracy. However, we need to be careful about how many False positive(FP) and False negative(FN) cases are generated by the model. In such circumstances, we would want to minimise FN and FP. This can be done by selection of an appropriate threshold value which is obtained by plotting the Sensitivity and Specificity plots and taking the intersection of these graphs as the optimal point. The logic behind this approach is that the point of intersection of these graphs represents the maximum value for both Sensitivity as well as Specificity which mimimises FN as well as FP

sens.df %>% ggplot(aes(ProbValues,MetricValue)) + 
  geom_line(linetype = "solid",color="blue",lwd = 1.5) + 
  geom_line(data=spec.df, aes(ProbValues,MetricValue,col="red"),color="orange",lwd = 1.5) +
  scale_y_continuous(sec.axis = sec_axis(~., name = "Specificity")) +
  labs(x='Cutoff', y="Sensitivity") +
  theme(axis.title.y.right = element_text(colour = "orange"), legend.position="none") +
  theme(axis.title.y.left = element_text(colour = "blue"), legend.position="none")


The cut off comes around 0.26.Now we will calculate the evaluation metrics such as accuracy, sensitivity etc on test data

Step 6: Generating the Confusion Matrix

Based on the cut-off value of probability score(0.26), let’s generate the confusion matrix for test data


# Getting the confusionmatrix
# Recoding predicted values based on cutoff
cutoff<-0.26

# Calcualting the predicted values on test data set
test.predict<-predict(m1,test.df,type="response")

test.predict.levels<-ifelse(test.predict > cutoff,"Yes","No")
test.predict.levels<-as.factor(test.predict.levels)

# Creating the table
xtab <- table(test.predict.levels, test.df[["Target"]])
xtab
                   
test.predict.levels  No Yes
                No  945  19
                Yes  93 244
# Lets calculate the accuracy
(945+244)/nrow(test.df) 
[1] 0.9139124

Accuracy is 91% which is pretty good

Step 7: Calculating Sensitivity, Spesivity, Precision, Recall and F-Score based on confusion matrix

                   
test.predict.levels  No Yes
                No  945  19
                Yes  93 244
# Sensitivity: TP/TP+FN
sens<-xtab[2,2]/(xtab[2,2]+xtab[2,1])
sens
[1] 0.7240356
# Specificity:TN/(TN+FP)
spec<-xtab[1,1]/(xtab[1,1]+xtab[1,2])
spec
[1] 0.9802905
# Precision: TP/(TP+FN)
prec<-xtab[2,2]/(xtab[2,2]+xtab[1,2])
prec
[1] 0.9277567
# Recall: Same as Sensitivity
# F Score: (2 X Precision X Recall)/(Precision+Recall)
2*sens*spec/(sens+spec)
[1] 0.8328984


Step 8: Plotting the Receiver Operating Characteristics(ROC) and Calculating Area under the curve(AUC)

Receiver Operating Characteristics(ROC) curve is an industry wide standard method to assess the performance of the two level classifier model. It is a plot against true positive rate(sensitivity) and false negative rate(1-specificity). A model with high discrimination ability will have high sensitivity and specificity simultaneously, resulting in an ROC curve that goes close to the top left corner of the plot. A model with no discriminatory power will have an ROC curve with a 45 degree diagonal line

Area under the curve(AUC): One way to assess the ability of the model to differentiate between yes and no is to report the AUC. As already mentioned, a model with high discriminatory power will be close to the left corner and hence have a high AUC. The following rules should be followed to derive insights from AUC:

AUC < 0.5: Model with no differentiation power 0.5 =< AUC < 0.7: Model with good differentiation power *0.7 =< AUC: Model with best differentiation power

With the above rules in mind, we will plot the ROC and calcualte the AUC for our model


[1] 0.9734


Step 9:Gain and Lift Charts

Gain and lift charts are basically used for measuring the benefits of using the model.

Gain is the ratio between the cumulative number of positive observations up to a decile to the total number of positive observations in the data.For example, 60% of the targets covered in top 20%(decile 1 and 2) of the data based on the model.This means that we can identify 60% of the targets by just approaching 20% of the total customers

Lift is the ratio of the number of positive observations up to decile i using the model to the expected number of positives up to that decile i based on a random model.Simply put,it measures how much better we perform with the predictive model as against without using a model.For example, a lift of 3 upto decile 2 means that we can get 3 times the total number of targets in comparison to the ones found by randomly sampling 20% of the data(without using a model)

groups<-10
depvar<-test.df[["TARGET"]]

gain.table<-data.frame(Predicted_prob=round(test.predict,4),
                       Predicted_Labels=test.predict.levels)%>%
  mutate(Gp=ntile(-Predicted_prob, groups))%>%
  group_by(Gp)%>%
  summarise(Sum_Prob=sum(Predicted_prob[which(Predicted_Labels=="Yes")]))%>%
  ungroup()%>%
  mutate(Cum_Sum_Prob=cumsum(Sum_Prob))%>%
  mutate(Gain=100*Cum_Sum_Prob/sum(Cum_Sum_Prob))%>%
  mutate(Lift=Gain/(Gp*(100/groups)))


df2<-gain.table

head(df2)
# A tibble: 6 x 5
     Gp Sum_Prob Cum_Sum_Prob  Gain  Lift
  <int>    <dbl>        <dbl> <dbl> <dbl>
1     1    127.          127.  5.53 0.553
2     2     90.2         217.  9.45 0.473
3     3     27.0         244. 10.6  0.354
4     4      0           244. 10.6  0.266
5     5      0           244. 10.6  0.213
6     6      0           244. 10.6  0.177
# Gain chart
ggplot(df2, aes(Gp,  Gain)) +
  geom_point() +
  geom_line(color="orange",lwd=1) +
  # This makes another geom_line that only sees the first and last row of the data
  geom_line(data = df2 %>% slice(1, n()),linetype = "dashed") +
  scale_y_continuous(breaks = seq(0, 20, by = 10), limits = c(0,20)) +
  scale_x_continuous(breaks = c(1:10)) +
  labs(title = "Cumulative Gains Plot",
       y = "Cumulative Gain %")


We get a gain of ~10% upto 2nd decile which is not so good.Ideally we would expected this to be more than 20%

baseline.df<-data.frame(Gp=c(1,2,4,6,8,10),
                        Lift=c(1,1,1,1,1,1))

ggplot(df2, aes(Gp,  Lift)) +
  geom_point() +
  geom_line(color="orange",lwd=1) +
  geom_line(data = baseline.df %>% slice(1, n()),linetype = "dashed") +
  scale_y_continuous(breaks = seq(0, 1, by = 0.5), limits = c(0,1)) +
  labs(title = "Lift Plot",
       y = "Lift")

Lift values are less than 1 which means that model is not helping much in terms of identifying targets when compared to random sampling

Final Note

In this blog we looked at general introduction to targeted marketing and how we can leverage machine learning to identify customers who can be probable targets.We also looked at various evaluation metrics of a logistic regression model

My Youtube Channel

Embed Shiny

Please wait...