Customer Profiling using RFM
Parag Verma
8th May, 2023
Introduction
Every company creates Products or services or both. These are developed keeping in mind the needs of the customer. Hence it is safe enough to say that the customer has the ultimate say in the success or failure of the Product/Service and eventually of the company. You must have frequently seen that many companies carry out market studies to understand the customer perception of their product, how customer perceive a particular feature or how do customer relate themselves to a particular product. All these help companies fine tune their go to market strategies and align their products/services with the customer needs.
Hence we can say that understanding customer behavior is the key determinant of success or failure of a company. Along these lines we will look at a key method to understand customer shopping behavior.This will enable us to do the following:
- Define customer profiles based on their behavior
- Discuss the framework that predicts customer churn and lifetime transactions
- Tailor marketing efforts using customized loyalty programs to better manage customer base
For our analysis, we will be using the rfm_data_orders transaction data set from rfm library. It has data for 995 customers with details about the total sales(revenue) for each transaction done by the customer.
Step 0: Installing libraries
package.name<-c("dplyr","lubridate","openxlsx","tidyr",
"rfm","stringr","ggplot2")
for(i in package.name){
if(!require(i,character.only = T)){
install.packages(i)
}
library(i,character.only = T)
}
Step 1: Importing the dataset
Importing the RFM data set from rfm library
rfm_data_orders
# A tibble: 4,906 × 3
customer_id order_date revenue
<chr> <date> <dbl>
1 Mr. Brion Stark Sr. 2004-12-20 32
2 Ethyl Botsford 2005-05-02 36
3 Hosteen Jacobi 2004-03-06 116
4 Mr. Edw Frami 2006-03-15 99
5 Josef Lemke 2006-08-14 76
6 Julisa Halvorson 2005-05-28 56
7 Judyth Lueilwitz 2005-03-09 108
8 Mr. Mekhi Goyette 2005-09-23 183
9 Hansford Moen PhD 2005-09-07 30
10 Fount Flatley 2006-04-12 13
# … with 4,896 more rows
df<-rfm_data_orders
head(df)
# A tibble: 6 × 3
customer_id order_date revenue
<chr> <date> <dbl>
1 Mr. Brion Stark Sr. 2004-12-20 32
2 Ethyl Botsford 2005-05-02 36
3 Hosteen Jacobi 2004-03-06 116
4 Mr. Edw Frami 2006-03-15 99
5 Josef Lemke 2006-08-14 76
6 Julisa Halvorson 2005-05-28 56
The attributes are as follows:
- customer_id:Customer ID
- order_date: date of sale with year,month and day information
- revenue:Total value of th customer order
For this blog, we will focus on:
- Creating the RFM metrics
- Identify relationship between thes metrics
If you unable to import the data, then you can download the rfm_data_orders.csv dataset form my github repo repository.
Step 2: Extracting the RFM Metrics: An Illustration
The data set is basically a transaction data for every customer.Lets look at one customer id to understand it better.
df%>%
filter(customer_id == "Mr. Brion Stark Sr.")%>%
arrange(desc(order_date))
# A tibble: 8 × 3
customer_id order_date revenue
<chr> <date> <dbl>
1 Mr. Brion Stark Sr. 2006-05-16 75
2 Mr. Brion Stark Sr. 2006-03-31 205
3 Mr. Brion Stark Sr. 2005-08-26 104
4 Mr. Brion Stark Sr. 2005-07-30 212
5 Mr. Brion Stark Sr. 2005-05-17 165
6 Mr. Brion Stark Sr. 2005-04-02 33
7 Mr. Brion Stark Sr. 2005-02-15 26
8 Mr. Brion Stark Sr. 2004-12-20 32
Here the total number of transactions are 8.Hence the Frequency Score is 8.
df%>%
filter(customer_id == "Mr. Brion Stark Sr.")%>%
summarise(Total_Revenue=sum(revenue))
# A tibble: 1 × 1
Total_Revenue
<dbl>
1 852
Total Monetary Value is 852.
Lets now calculate the Recency value which is the total days passed since last purchase.If we assume th reference date as todays date, then the Recency score will be equal to the following
Today’s date(Lets assume we are in Feb 2007 ) - max(order_date)
max_Date<-df%>%
filter(customer_id == "Mr. Brion Stark Sr.")%>%
summarise(Last_Order_date = max(order_date))%>%
select(Last_Order_date)%>%
pull(Last_Order_date)
max_Date
[1] "2006-05-16"
as.numeric(as.Date("2007-02-15") - max_Date)
[1] 275
Hence the RFM metric for this particular customer can be represented as shown in the below table
Customer_ID Recency Frequency Monetary
1 Mr. Brion Stark Sr. 6171 8 275
So we took an example of a particular customer id and calculated the RFM metrics for it.Next we will look at how we can derive the RFM metrics for all th customer ids in the data
Step 3: RFM Metrics for all the customer ids
In this section, we will develop the code to extract the RFM metrics from the customer data set
df_FM<-df%>%
group_by(customer_id)%>%
summarise(Frequency=n(),
Monetary = sum(revenue))
head(df_FM)
# A tibble: 6 × 3
customer_id Frequency Monetary
<chr> <int> <dbl>
1 Abbey O'Reilly DVM 6 472
2 Add Senger 3 340
3 Aden Lesch Sr. 4 405
4 Admiral Senger 5 448
5 Agness O'Keefe 9 843
6 Aileen Barton 9 763
df_R<-df%>%
group_by(customer_id)%>%
summarise(Last_Order_Date=max(order_date))%>%
ungroup()%>%
mutate(Recency = as.numeric(as.Date("2007-02-15") - Last_Order_Date))%>%
select(-Last_Order_Date)
head(df_R)
# A tibble: 6 × 2
customer_id Recency
<chr> <dbl>
1 Abbey O'Reilly DVM 251
2 Add Senger 186
3 Aden Lesch Sr. 240
4 Admiral Senger 178
5 Agness O'Keefe 136
6 Aileen Barton 130
Joining df_FM and df_R to create the final dataset
final.df<-df_FM%>%
inner_join(df_R,by="customer_id")
head(final.df)
# A tibble: 6 × 4
customer_id Frequency Monetary Recency
<chr> <int> <dbl> <dbl>
1 Abbey O'Reilly DVM 6 472 251
2 Add Senger 3 340 186
3 Aden Lesch Sr. 4 405 240
4 Admiral Senger 5 448 178
5 Agness O'Keefe 9 843 136
6 Aileen Barton 9 763 130
Step 4: Lets plot th RFM Metric
ggplot(final.df, aes(x=Frequency , fill="red")) +
geom_histogram(binwidth=1)
The distribution is skewed to the right with max value at 15.Most values ar around 5 and 6.
Lets look at the distributions of Recency and Monetary values as well.
ggplot(final.df, aes(x=Recency , fill="red")) +
geom_histogram(binwidth=50)
ggplot(final.df, aes(x=Monetary , fill="red")) +
geom_histogram(binwidth=50)
We can see that the distributions of R,F and M metrics are skewed towards the right with certain cases of a long tail. Let us now understand the variations by taking Frequency and Recency metrics along with Monetary value.
Step 4a: Frequency and Monetary Profile
This will help us understand how total value of the purchase varies with the frequency of total orders/transactions. One would expect that higher number of orders would result in a higher monetary value
lvls<-as.character(sort(unique(final.df$Frequency)))
final.df$Frequency2<-factor(final.df$Frequency,levels=lvls)
ggplot(final.df, aes(x=Frequency2 , y=Monetary , color="red")) +
geom_boxplot()
We can clearly see that as the total frequency of orders increase, the total monetary value also increases.Apart from the frequency box at 11, this pattern is pretty consistent. Lets check the pattern for Recency as well
Step 4b: Recency and Monetary Profile
# lvls<-as.character(sort(unique(final.df$Recency)))
# final.df$Frequency2<-factor(final.df$Frequency,levels=lvls)
ggplot(final.df, aes(x=Recency , y=Monetary , color="red")) +
geom_point() +
geom_smooth(method=lm, se=FALSE)
`geom_smooth()` using formula = 'y ~ x'
Recent purchases(low value of Recency) tends to have more monetary value in comparison to purchases made some time back
Lets look at all the three metrics together.For this we would have to convert Recency and Frequency into buckets and then for each of these buckets we want to look at the total or average monetary value.
Step 5: Converting Frequency and Monetary values into Buckets
Wee will be converting them into buckets based on percentiles. So the grouping will be based on the following:
- 0-25th Percentile as Score 1
- 25-50th Percentile as Score 2
- 50-75th Percentile as Score 3
- 75-100th Percentile as Score 4
Obviously a higher frequency value is indicated by a higher score.
This grouping will be modified a little for recency as higher recency value would mean that th customer purchased a long time back.So recent purchase value(indicated by a low value and hence low percentile) will be given a high score
final.df2<-final.df%>%
mutate(Frequency_Group = ifelse(Frequency <= quantile(Frequency,0.25),1,
ifelse(Frequency <= quantile(Frequency,0.50),2,
ifelse(Frequency <= quantile(Frequency,0.75),3,4))))%>%
mutate(Recency_Group = ifelse(Recency <= quantile(Recency,0.25),4,
ifelse(Recency <= quantile(Recency,0.50),3,
ifelse(Recency <= quantile(Recency,0.75),2,1))))%>%
select(-Frequency2)
head(final.df2)
# A tibble: 6 × 6
customer_id Frequency Monetary Recency Frequency_Group Recency_Group
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 Abbey O'Reilly DVM 6 472 251 3 3
2 Add Senger 3 340 186 1 3
3 Aden Lesch Sr. 4 405 240 2 3
4 Admiral Senger 5 448 178 2 4
5 Agness O'Keefe 9 843 136 4 4
6 Aileen Barton 9 763 130 4 4
Now for each of the combinations of Frequency_Group X Recency_Group, we want to look at average monetary value
final.df3<-final.df2%>%
group_by(Frequency_Group,Recency_Group)%>%
summarise(Average_Monetary=mean(Monetary))%>%
ungroup()
`summarise()` has grouped output by 'Frequency_Group'. You can override using
the `.groups` argument.
head(final.df3)
# A tibble: 6 × 3
Frequency_Group Recency_Group Average_Monetary
<dbl> <dbl> <dbl>
1 1 1 212.
2 1 2 257.
3 1 3 227.
4 1 4 242.
5 2 1 449.
6 2 2 440.
ggplot(final.df3, aes(x = Frequency_Group,
y = Recency_Group, fill = Average_Monetary)) +
geom_tile(color = "white",
lwd = 1.5,
linetype = 1) +
coord_fixed()
Higher monetary value is concentrated in high frequency groups and is independent of the recency factor
So far we have analysed the relationship between the RFM metrics and how the monetary values tend to be distributed with different values of Recency and Frequency. In most of the blogs that are available online, you will see that they just leverage RFM and create customer segments based on this. But we need some additional piece of information apart from RFM to be able to identify segments.These are churn and Lifetime value.In the next blogs we will use survival models and BTYD(Buy till you die) to calculate these metrcs and then group the customers into logical segments.