Customer Visit Information
2025-03-11
Introduction
Many a times we are given messy data and asked to derive customer visit information from it.For instance, we might be provided with a data dump with no date field.In such cases, identification of visit information becomes a challenge.In this blog I have tried to demonstrate how we can be a little smarter in deriving such information.
Step 1: Importing the libraries
package.name<-c("tidyverse","data.table")
for(i in package.name){
if(!require(i,character.only = T)){
install.packages(i)
}
library(i,character.only = T)
}
Loading required package: tidyverse
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: data.table
Attaching package: 'data.table'
The following objects are masked from 'package:lubridate':
hour, isoweek, mday, minute, month, quarter, second, wday, week,
yday, year
The following objects are masked from 'package:dplyr':
between, first, last
The following object is masked from 'package:purrr':
transpose
Step 2:Creating the illustrative data
dummy.data<-data.frame(Name=c("Roger","Roger","Roger",
"Nadal","Nadal",
"Roger","Roger"),
Item=c("Naan","Lassi","Gulabjamun",
"Roti","Daal",
"Lemon Juice","Ice Cream"))
dummy.data
Name Item
1 Roger Naan
2 Roger Lassi
3 Roger Gulabjamun
4 Nadal Roti
5 Nadal Daal
6 Roger Lemon Juice
7 Roger Ice Cream
Lets say that the above data represents items orderered by different people of various visits.By looking at the data, we can observe the following things:
- In one of Roger’s visit, he order three items
- Nadal order two items in his visit
- In his second visit, Roger order 2 items
Now we know that First visit was done by Roger on lets say day1. Nadal might have also visited the same day or other day.And then finally, Roger visited again on a separate day.So Roger visited twice while Nadal only visited once.While it is easy to deduce this for few observations, but what if the data runs into several thousands of records.In such case we would need a logic to identify visit info and eventually create visit profile.
Step 3: Creating the logic
Difference of Row Number at customer level
row.diff<-dummy.data%>%
mutate(Row_Num=1:n())%>%
arrange(Row_Num)%>%
group_by(Name)%>%
mutate(Lag_Row_Num=lag(Row_Num))%>%
ungroup()%>%
mutate(Diff = Row_Num - Lag_Row_Num)
row.diff
# A tibble: 7 × 5
Name Item Row_Num Lag_Row_Num Diff
<chr> <chr> <int> <int> <int>
1 Roger Naan 1 NA NA
2 Roger Lassi 2 1 1
3 Roger Gulabjamun 3 2 1
4 Nadal Roti 4 NA NA
5 Nadal Daal 5 4 1
6 Roger Lemon Juice 6 3 3
7 Roger Ice Cream 7 6 1
If Diff is NA or greater than 1, then it represents new visit
new.visit<-row.diff%>%
mutate(Cntr = ifelse(Diff == 1,0,"New Visit"))%>%
mutate(Cntr2=ifelse(is.na(Cntr),"New Visit",Cntr))%>%
mutate(Cntr3=ifelse(Cntr2==0,NA,Cntr2))%>%
dplyr::select(-Cntr,-Lag_Row_Num)
new.visit
# A tibble: 7 × 6
Name Item Row_Num Diff Cntr2 Cntr3
<chr> <chr> <int> <int> <chr> <chr>
1 Roger Naan 1 NA New Visit New Visit
2 Roger Lassi 2 1 0 <NA>
3 Roger Gulabjamun 3 1 0 <NA>
4 Nadal Roti 4 NA New Visit New Visit
5 Nadal Daal 5 1 0 <NA>
6 Roger Lemon Juice 6 3 New Visit New Visit
7 Roger Ice Cream 7 1 0 <NA>
Creating Index based on Name and Cntr3
index.df<-new.visit%>%
group_by(Name,Cntr3)%>%
mutate(Index=1:n())%>%
mutate(Visit_Info=ifelse(Cntr3==0,0,Index))%>%
dplyr::select(-Index,-Cntr2,-Cntr3)%>%
ungroup()
Adding missing grouping variables: `Cntr3`
index.df
# A tibble: 7 × 6
Cntr3 Name Item Row_Num Diff Visit_Info
<chr> <chr> <chr> <int> <int> <int>
1 New Visit Roger Naan 1 NA 1
2 <NA> Roger Lassi 2 1 NA
3 <NA> Roger Gulabjamun 3 1 NA
4 New Visit Nadal Roti 4 NA 1
5 <NA> Nadal Daal 5 1 NA
6 New Visit Roger Lemon Juice 6 3 2
7 <NA> Roger Ice Cream 7 1 NA
Filling up visit info(similar to ctrl + D in excel)
fill.df<-index.df%>%
group_by(Name)%>%
fill(Visit_Info)%>%
mutate(Visit_Details=str_c("Visit # ",Visit_Info))%>%
ungroup()%>%
arrange(Row_Num)
fill.df
# A tibble: 7 × 7
Cntr3 Name Item Row_Num Diff Visit_Info Visit_Details
<chr> <chr> <chr> <int> <int> <int> <chr>
1 New Visit Roger Naan 1 NA 1 Visit # 1
2 <NA> Roger Lassi 2 1 1 Visit # 1
3 <NA> Roger Gulabjamun 3 1 1 Visit # 1
4 New Visit Nadal Roti 4 NA 1 Visit # 1
5 <NA> Nadal Daal 5 1 1 Visit # 1
6 New Visit Roger Lemon Juice 6 3 2 Visit # 2
7 <NA> Roger Ice Cream 7 1 2 Visit # 2
Cleaning the data a little
visit.clean<-fill.df%>%
select(-Cntr3,-Diff)
visit.clean
# A tibble: 7 × 5
Name Item Row_Num Visit_Info Visit_Details
<chr> <chr> <int> <int> <chr>
1 Roger Naan 1 1 Visit # 1
2 Roger Lassi 2 1 Visit # 1
3 Roger Gulabjamun 3 1 Visit # 1
4 Nadal Roti 4 1 Visit # 1
5 Nadal Daal 5 1 Visit # 1
6 Roger Lemon Juice 6 2 Visit # 2
7 Roger Ice Cream 7 2 Visit # 2
No comments:
Post a Comment