Retail Targeting using Machine Learning
Parag Verma
19th Jan, 2022
Introduction
Targeting is usually done as part of a marketing exercise in which customers are divided into segments and specific activities are designed for each segment so that customers respond to these drives. In this blog, we will look at a retail dataset and use Machine Learning to identify potential customers who should be targeted as part of promotional activities.
Step 1: Read Data
The onlineretail dataset is part of the onlineretail package. Packages such as ROCR and plotROC are imported to calculate evaluation metrics after the modelling exercise. Let's import the dataset and look at the first few records.
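A minimal sketch of the import step, assuming the onlineretail package exposes the data as an object named onlineretail (the object name is an assumption), could look like this:
library(dplyr)        # data manipulation
library(tidyr)        # gather() used later in the EDA step
library(ggplot2)      # plotting
library(ROCR)         # evaluation metrics (sensitivity/specificity, ROC, AUC)
library(plotROC)      # ROC plotting helpers
library(onlineretail) # provides the online retail transactions data

df <- onlineretail    # assumption: the exported data object is called onlineretail
head(df)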
InvoiceNo StockCode Description Quantity
1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
2 536365 71053 WHITE METAL LANTERN 6
3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
InvoiceDate UnitPrice CustomerID Country
1 2010-12-01 08:26:00 2.55 17850 United Kingdom
2 2010-12-01 08:26:00 3.39 17850 United Kingdom
3 2010-12-01 08:26:00 2.75 17850 United Kingdom
4 2010-12-01 08:26:00 3.39 17850 United Kingdom
5 2010-12-01 08:26:00 3.39 17850 United Kingdom
6 2010-12-01 08:26:00 7.65 17850 United Kingdom
A brief description of the columns in onlineretail is given below:
- InvoiceNo: Integral number uniquely assigned to each transaction
- StockCode: A number uniquely assigned to each product
- Description: Product Description
- Quantity: Units ordered in each transaction
- InvoiceDate: Date of making the transaction
- UnitPrice: Price per unit in dollars
- CustomerID: A number uniquely assigned to each customer
- Country: The country where a customer resides
In this analysis, we will identify the most valued customers by using their total purchase value as a proxy to create a binary indicator. We will then use logistic regression to assign a probability to each customer. Customers with a high probability should be targeted in the next marketing drive.
Step 1b: Create additional Features
We will create the following features from the dataset:
- Frequency: Total number of transactions made by the customer
- Total_Value: Total dollar value for the transactions done by the customer
- Time_Spread: Total time difference between first and last transaction of a customer
- Target: “Yes” if the total dollar value for a customer is greater than the 80th percentile of the overall dollar value, otherwise “No”
# Aggregate transactions at the customer level and engineer features
interim.df <- df %>%
  arrange(Country, InvoiceDate) %>%
  group_by(CustomerID, Country) %>%
  summarise(Min_Date = min(InvoiceDate),      # first transaction date
            Max_Date = max(InvoiceDate),      # last transaction date
            Frequency = n(),                  # number of transactions
            Total_Quantity = sum(Quantity),
            Mean_Price = mean(UnitPrice)) %>%
  ungroup() %>%
  mutate(Total_Value = Total_Quantity * Mean_Price,
         Time_Spread = Max_Date - Min_Date) %>%
  mutate(Percentile_80 = quantile(Total_Value, 0.80)) %>%
  # Flag customers above the 80th percentile of overall dollar value
  mutate(Target = ifelse(Total_Value > Percentile_80, "Yes", "No")) %>%
  select(Country, Frequency, Total_Quantity, Total_Value, Time_Spread, Target) %>%
  as.data.frame() %>%
  # Floor non-positive quantities (returns) at 1 so logs can be taken later
  mutate(Total_Quantity2 = ifelse(Total_Quantity <= 0, 1, Total_Quantity)) %>%
  filter(Total_Value > 0)

head(interim.df)
Country Frequency Total_Quantity Total_Value Time_Spread Target
1 Iceland 182 2458 6498.9790 31539300 secs Yes
2 Finland 31 2341 13495.4874 24429840 secs Yes
3 Italy 73 631 5230.3849 0 secs Yes
4 Norway 17 197 756.7118 0 secs No
5 Norway 95 470 10939.1263 22471440 secs Yes
6 Bahrain 4 20 121.5000 0 secs No
Total_Quantity2
1 2458
2 2341
3 631
4 197
5 470
6 20
We have removed customers with a non-positive Total_Value; these are basically return orders.
Time_Spread is currently in seconds. Let's convert it into days.
time_spread<-as.numeric(round(interim.df$Time_Spread/(24*60*60)))
time_spread[1:5]
[1] 365 283 0 0 260
interim.df$Time_Spread2<-time_spread
Step 2: Exploratory Data Analysis
Here we will try to analyse the following:
- Unique values in the ‘Frequency’ and ‘Target’ variables
- Mean values of ‘Total_Value’, ‘Total_Quantity2’ and ‘Time_Spread2’ across the unique values of the ‘Target’ variable
2a: Unique values in the ‘Target’ variable
table(interim.df[["Target"]])
No Yes
3456 878
The proportion of high-value customers (Target = Yes) is 878/(878+3456) ~ 20%. Even though in practice this proportion is typically much lower (around 1-5%), we will use this as a base to understand some of the key metrics in targeted marketing.
2b: Unique values in the ‘Frequency’ variable
Let's look at the min and max dates first.
df%>%
summarise(Min_Date=min(InvoiceDate),
Max_Date=max(InvoiceDate))
Min_Date Max_Date
1 2010-12-01 08:26:00 2011-12-09 12:50:00
So we are looking at 1 year of data.
interim.df%>%
group_by(Frequency)%>%
summarise(Total_Customer=n())%>%
arrange(desc(Total_Customer))%>%
head(10)
# A tibble: 10 x 2
Frequency Total_Customer
<int> <int>
1 6 78
2 10 74
3 7 72
4 12 72
5 11 70
6 8 68
7 9 67
8 16 66
9 21 66
10 19 62
The largest numbers of customers have purchased between 6 and 12 times in the year.
2c: Mean values of ‘Total_Value’, ‘Total_Quantity2’ and ‘Time_Spread2’ across unique values of the ‘Target’ variable
eda.df<-interim.df%>%
group_by(Target)%>%
summarise(TotalRecords=n(),
Total_Quantity=mean(Total_Quantity2),
Total_Value=mean(Total_Value),
Time_Spread=mean(Time_Spread2))%>%
gather(key='Variable',value='Mean',Total_Quantity:Time_Spread)
lapply(c("Total_Quantity","Total_Value","Time_Spread"),function(x){
p<-ggplot(data=eda.df%>%
as.data.frame()%>%
filter(Variable==x), aes(x=Target, y=Mean)) +
geom_bar(stat="identity",fill = "orange")+
xlab("Target")+ylab(paste0("Mean Value of ",x))
return(p)
})
(Three bar charts: mean Total_Quantity, Total_Value and Time_Spread for each level of Target)
We can see that there is a clear difference between the mean levels of Total_Value, Time_Spread2 and Total_Quantity2 across the two Target groups, so these variables should be used to model potential customers. Also note that since Target is derived from Total_Value, Total_Value will not be used in the model.
Step 3: Splitting into Train(70%) and Test(30%) Dataset
One should note that the ratio of Yes and No in the target variable should remain almost the same in the test and train datasets.
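A minimal sketch of the split, assuming a stratified 70/30 split using caret's createDataPartition() (the choice of caret and the seed are assumptions), could look like this:
library(caret)   # assumption: caret is used for the stratified split

set.seed(42)     # hypothetical seed for reproducibility
train.index <- createDataPartition(interim.df$Target, p = 0.7, list = FALSE)
train.df <- interim.df[train.index, ]
test.df  <- interim.df[-train.index, ]

# Proportion of "Yes" in the train and test datasets
prop.table(table(train.df$Target))["Yes"]
prop.table(table(test.df$Target))["Yes"]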
Yes
0.2027695
Yes
0.2021522
We can see that the proportion of Yes in both the train and test datasets is roughly the same. Hence the two datasets have been created correctly.
Step 4: Training Model
Since the range of the numeric variables is huge, we will take the log of Frequency and Total_Quantity2 to improve the fit.
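The model can be fitted as in the Call shown below; note that glm() with family = "binomial" expects the response to be a factor (or 0/1), so converting Target first is assumed here:
# Convert the response to a factor so glm() treats it as a binary outcome
train.df$Target <- as.factor(train.df$Target)
test.df$Target  <- as.factor(test.df$Target)   # for consistency on the test side

m1 <- glm(Target ~ log(Frequency) + log(Total_Quantity2) + Time_Spread2,
          family = "binomial", data = train.df)
summary(m1)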
Call:
glm(formula = Target ~ log(Frequency) + log(Total_Quantity2) +
Time_Spread2, family = "binomial", data = train.df)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.2763 -0.1770 -0.0311 -0.0018 4.9822
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.763e+01 1.326e+00 -20.839 < 2e-16 ***
log(Frequency) 3.736e-01 1.031e-01 3.625 0.000289 ***
log(Total_Quantity2) 3.570e+00 1.935e-01 18.457 < 2e-16 ***
Time_Spread2 1.728e-03 8.033e-04 2.151 0.031470 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3058.59 on 3032 degrees of freedom
Residual deviance: 982.99 on 3029 degrees of freedom
AIC: 990.99
Number of Fisher Scoring iterations: 8
The results can be interpreted as follows:
- Frequency has a positive impact on identifying prospective customers. If a customer purchases more frequently, they should be placed in the high-value segment and targeted accordingly
- Similar inferences can be made for Total_Quantity2 and Time_Spread2 (see the odds-ratio sketch below)
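One way to make these effects concrete is to convert the coefficients into odds ratios; a short sketch, assuming the fitted model object is m1 (as used in the later steps):
# Odds ratios: exp() of the log-odds coefficients
round(exp(coef(m1)), 3)
# e.g. exp(0.3736) is roughly 1.45, so a one-unit increase in log(Frequency)
# multiplies the odds of being a high-value customer by about 1.45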
Step 5: Getting the optimal threshold value
Normally the aim of any model building exercise is to maximise accuracy. However, we also need to be careful about how many false positive (FP) and false negative (FN) cases are generated by the model, so we want to minimise both FN and FP. This can be done by selecting an appropriate threshold, which is obtained by plotting the sensitivity and specificity curves against the cutoff and taking their intersection as the optimal point. The logic behind this approach is that the point of intersection represents a cutoff where both sensitivity and specificity are simultaneously high, which minimises FN as well as FP.
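A minimal sketch for building sens.df and spec.df with ROCR, assuming the curves are computed on the training data:
# Predicted probabilities on the training data
train.predict <- predict(m1, train.df, type = "response")
pred.obj <- prediction(train.predict, train.df$Target)

# Sensitivity and specificity for every candidate cutoff
sens.perf <- performance(pred.obj, "sens")
spec.perf <- performance(pred.obj, "spec")

sens.df <- data.frame(ProbValues  = sens.perf@x.values[[1]],
                      MetricValue = sens.perf@y.values[[1]])
spec.df <- data.frame(ProbValues  = spec.perf@x.values[[1]],
                      MetricValue = spec.perf@y.values[[1]])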
sens.df %>% ggplot(aes(ProbValues,MetricValue)) +
geom_line(linetype = "solid",color="blue",lwd = 1.5) +
geom_line(data=spec.df, aes(ProbValues,MetricValue,col="red"),color="orange",lwd = 1.5) +
scale_y_continuous(sec.axis = sec_axis(~., name = "Specificity")) +
labs(x='Cutoff', y="Sensitivity") +
theme(axis.title.y.right = element_text(colour = "orange"), legend.position="none") +
theme(axis.title.y.left = element_text(colour = "blue"), legend.position="none")
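The intersection can also be located numerically; a quick check, assuming sens.df and spec.df share the same cutoff grid (which they do when built with ROCR as sketched above):
# Cutoff where sensitivity and specificity are closest to each other
cross.idx <- which.min(abs(sens.df$MetricValue - spec.df$MetricValue))
sens.df$ProbValues[cross.idx]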
The cutoff comes out at around 0.26. Now we will calculate the evaluation metrics such as accuracy and sensitivity on the test data.
Step 6: Generating the Confusion Matrix
Based on the probability cutoff of 0.26, let's generate the confusion matrix for the test data.
# Getting the confusion matrix
# Recoding predicted values based on the cutoff
cutoff<-0.26
# Calculating the predicted probabilities on the test data set
test.predict<-predict(m1,test.df,type="response")
test.predict.levels<-ifelse(test.predict > cutoff,"Yes","No")
test.predict.levels<-as.factor(test.predict.levels)
# Creating the table
xtab <- table(test.predict.levels, test.df[["Target"]])
xtab
test.predict.levels No Yes
No 945 19
Yes 93 244
# Let's calculate the accuracy
(945+244)/nrow(test.df)
[1] 0.9139124
Accuracy is about 91%, which is pretty good.
Step 7: Calculating Sensitivity, Specificity, Precision, Recall and F-Score based on the confusion matrix
test.predict.levels No Yes
No 945 19
Yes 93 244
# Sensitivity (Recall): TP/(TP+FN). Rows of xtab are predicted labels, columns are actuals
sens<-xtab[2,2]/(xtab[2,2]+xtab[1,2])
sens
[1] 0.9277567
# Specificity: TN/(TN+FP)
spec<-xtab[1,1]/(xtab[1,1]+xtab[2,1])
spec
[1] 0.9104046
# Precision: TP/(TP+FP)
prec<-xtab[2,2]/(xtab[2,2]+xtab[2,1])
prec
[1] 0.7240356
# Recall: Same as Sensitivity
# F-Score: (2 X Precision X Recall)/(Precision+Recall)
2*prec*sens/(prec+sens)
[1] 0.8133333
Step 8: Plotting the Receiver Operating Characteristics(ROC) and Calculating Area under the curve(AUC)
The Receiver Operating Characteristics (ROC) curve is an industry-standard method to assess the performance of a two-level classifier. It is a plot of the true positive rate (sensitivity) against the false positive rate (1 - specificity). A model with high discrimination ability will have high sensitivity and specificity simultaneously, resulting in an ROC curve that passes close to the top left corner of the plot. A model with no discriminatory power will have an ROC curve along the 45 degree diagonal line.
Area under the curve (AUC): One way to assess the ability of the model to differentiate between Yes and No is to report the AUC. As already mentioned, a model with high discriminatory power will have a curve close to the top left corner and hence a high AUC. The following rules of thumb can be used to interpret the AUC:
- AUC < 0.5: Model with no differentiation power
- 0.5 <= AUC < 0.7: Model with good differentiation power
- 0.7 <= AUC: Model with best differentiation power
With the above rules in mind, we will plot the ROC and calculate the AUC for our model.
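A minimal sketch using ROCR on the test-set predictions (test.predict from Step 6) could look like this:
# ROC curve and AUC on the test data
roc.pred <- prediction(test.predict, test.df$Target)
roc.perf <- performance(roc.pred, "tpr", "fpr")
plot(roc.perf, col = "orange", lwd = 2)
abline(a = 0, b = 1, lty = 2)   # 45 degree reference line (no-discrimination model)

auc.perf <- performance(roc.pred, "auc")
round(auc.perf@y.values[[1]], 4)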
[1] 0.9734
Step 9: Gain and Lift Charts
Gain and lift charts are used to measure the benefit of using the model compared to targeting customers at random.
Gain is the ratio of the cumulative number of positive observations up to a decile to the total number of positive observations in the data. For example, 60% of the targets might be covered in the top 20% (deciles 1 and 2) of the data scored by the model; this means that we can identify 60% of the targets by approaching just 20% of the total customers.
Lift is the ratio of the number of positive observations up to decile i using the model to the expected number of positives up to decile i under a random model. Simply put, it measures how much better we perform with the predictive model compared to not using a model. For example, a lift of 3 up to decile 2 means that we capture 3 times as many targets as we would by randomly sampling 20% of the data (without using a model).
groups<-10
depvar<-test.df[["Target"]]
gain.table<-data.frame(Predicted_prob=round(test.predict,4),
Predicted_Labels=test.predict.levels)%>%
mutate(Gp=ntile(-Predicted_prob, groups))%>%
group_by(Gp)%>%
summarise(Sum_Prob=sum(Predicted_prob[which(Predicted_Labels=="Yes")]))%>%
ungroup()%>%
mutate(Cum_Sum_Prob=cumsum(Sum_Prob))%>%
mutate(Gain=100*Cum_Sum_Prob/sum(Cum_Sum_Prob))%>%
mutate(Lift=Gain/(Gp*(100/groups)))
df2<-gain.table
head(df2)
# A tibble: 6 x 5
Gp Sum_Prob Cum_Sum_Prob Gain Lift
<int> <dbl> <dbl> <dbl> <dbl>
1 1 127. 127. 5.53 0.553
2 2 90.2 217. 9.45 0.473
3 3 27.0 244. 10.6 0.354
4 4 0 244. 10.6 0.266
5 5 0 244. 10.6 0.213
6 6 0 244. 10.6 0.177
# Gain chart
ggplot(df2, aes(Gp, Gain)) +
geom_point() +
geom_line(color="orange",lwd=1) +
# This makes another geom_line that only sees the first and last row of the data
geom_line(data = df2 %>% slice(1, n()),linetype = "dashed") +
scale_y_continuous(breaks = seq(0, 20, by = 10), limits = c(0,20)) +
scale_x_continuous(breaks = c(1:10)) +
labs(title = "Cumulative Gains Plot",
y = "Cumulative Gain %")
We get a gain of only ~10% up to the 2nd decile, which is not so good. Ideally we would expect this to be more than 20%.
baseline.df<-data.frame(Gp=c(1,2,4,6,8,10),
Lift=c(1,1,1,1,1,1))
ggplot(df2, aes(Gp, Lift)) +
geom_point() +
geom_line(color="orange",lwd=1) +
geom_line(data = baseline.df %>% slice(1, n()),linetype = "dashed") +
scale_y_continuous(breaks = seq(0, 1, by = 0.5), limits = c(0,1)) +
labs(title = "Lift Plot",
y = "Lift")
Lift values are less than 1, which means that the model is not helping much in identifying targets when compared to random sampling.
Final Note
In this blog we looked at a general introduction to targeted marketing and how we can leverage machine learning to identify customers who are probable targets. We also looked at various evaluation metrics for a logistic regression model.