Reset a running counter within groupby in R
Parag Verma
10th March, 2022
Introduction
There are certain situations where we need to identify the number of consecutive weeks of decline in a share price. Such cases can be modeled by resetting a running counter in R.
In this blog we will look at how dplyr and the rleid function from the data.table library can be used to generate such summaries. We will be using the weekly price of avocados sold in the US. This dataset is available in the ‘avocado’ library.
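To see why rleid is useful here, consider a minimal, made-up example: rleid() assigns a new id every time the value of its input changes, which is exactly the "reset the counter at every change" behaviour we need.
library(data.table)
x<-c(1,1,0,0,0,1)
rleid(x)
[1] 1 1 2 2 2 3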
package.name<-c("dplyr","data.table","avocado")
for(i in package.name){
  if(!require(i,character.only = T)){
    install.packages(i)
  }
  library(i,character.only = T)
}
Step 1: Importing the dataset
We will import the hass dataset from the ‘avocado’ package. It contains weekly Hass avocado sales data for different regions of the US.
data('hass')
df<-hass
head(df)
# A tibble: 6 x 17
week_ending location region avg_price_nonorg plu4046 plu4225 plu4770
<dttm> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2017-01-02 00:00:00 Albany North~ 1.47 4846. 117027. 200.
2 2017-01-02 00:00:00 Atlanta South~ 0.93 224074. 118926. 337.
3 2017-01-02 00:00:00 Baltimore~ Midso~ 1.47 54530. 408952. 14387.
4 2017-01-02 00:00:00 Boise West 0.92 27845. 9409. 11342.
5 2017-01-02 00:00:00 Boston North~ 1.29 4120. 371223. 3934.
6 2017-01-02 00:00:00 Buffalo/R~ North~ 1.43 1285. 58532. 103.
# ... with 10 more variables: small_nonorg_bag <dbl>, large_nonorg_bag <dbl>,
# xlarge_nonorg_bag <dbl>, avg_price_org <dbl>, plu94046 <dbl>,
# plu94225 <dbl>, plu94770 <dbl>, small_org_bag <dbl>, large_org_bag <dbl>,
# xlarge_org_bag <dbl>
In the above dataset, we will look at the average price of non-organic avocados (avg_price_nonorg) across various regions for different week endings. The objective is to find the maximum number of consecutive weeks for which the price declined in a given region.
Step 2: Developing the Logic
To start with, we will develop the logic for one location, say ‘Albany’. The basic steps of the process are the following (a toy illustration follows the list):
- 1. Group by at the location level
- 2. Take the lag of the price
- 3. Wherever the difference between the price and the lagged price is > 0, label it as 0, otherwise 1
- 4. Create a running counter that resets whenever this flag changes
- 5. Filter on flag == 1
- 6. Generate the total count of records grouped by location, flag and counter
- 7. For each location, sort in descending order of the above total count
- 8. The first row for each location gives the maximum number of consecutive weeks of price decline
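Before applying these steps to the data, here is a toy illustration of steps 2 to 4 on a made-up price vector (the numbers are purely for demonstration):
price<-c(10,9,8,11,10,9,8,12)                            # made-up weekly prices
Diff_Flag<-ifelse(price-dplyr::lag(price) > 0,0,1)[-1]   # steps 2-3: 1 = price did not rise; drop the leading NA
Diff_Flag
[1] 1 1 0 1 1 1 0
data.table::rleid(Diff_Flag>0)                           # step 4: counter resets every time the flag changes
[1] 1 1 2 3 3 3 4
Runs 1 and 3 of the counter correspond to streaks of decline; the longest is run 3 with three consecutive weeks, which is exactly what steps 5 to 8 extract.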
df.albany<-df%>%
  filter(location=="Albany")%>%
  select(week_ending,location,avg_price_nonorg)%>%
  arrange(location,week_ending)%>%
  group_by(location)%>% #1
  mutate(avg_price_nonorg_lag=lag(avg_price_nonorg))%>% #2
  filter(!is.na(avg_price_nonorg_lag))%>%
  mutate(Diff=round(avg_price_nonorg-avg_price_nonorg_lag,2))%>% #3
  mutate(Diff_Flag=ifelse(Diff > 0,0,1))%>%
  mutate(R_Cntr=data.table::rleid(Diff_Flag>0))%>% #4
  ungroup()%>%
  filter(Diff_Flag ==1)%>% #5
  group_by(location,Diff_Flag,R_Cntr)%>%
  summarise(Total_Instances=n())%>% #6
  ungroup()%>%
  arrange(location,desc(Total_Instances))%>% #7
  group_by(location)%>%
  mutate(Index=1:n())%>%
  filter(Index==1)%>% #8
  select(location,Total_Instances)
`summarise()` has grouped output by 'location', 'Diff_Flag'. You can override using the `.groups` argument.
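This message is purely informational. If you prefer to silence it, one option (a small optional tweak) is to set the `.groups` argument inside summarise itself, replacing the summarise and the following ungroup lines of the pipeline above with the single line below:
summarise(Total_Instances=n(),.groups="drop")%>% # drops all grouping, so the subsequent ungroup() is not needed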
df.albany
# A tibble: 1 x 2
# Groups: location [1]
location Total_Instances
<chr> <int>
1 Albany 4
Total_Instances equal to 4 means that there were 4 consecutive weeks of price decline. Now let's get the details of those 4 weeks.
Step 3: Getting the corresponding Dates of continuous decline
We will first extract the location and R_Cntr (the counter created in the above code) for the longest streak, and then join them back to the data to get the corresponding weeks.
location_key<-df%>%
  filter(location=="Albany")%>%
  select(week_ending,location,avg_price_nonorg)%>%
  arrange(location,week_ending)%>%
  group_by(location)%>% #1
  mutate(avg_price_nonorg_lag=lag(avg_price_nonorg))%>% #2
  filter(!is.na(avg_price_nonorg_lag))%>%
  mutate(Diff=round(avg_price_nonorg-avg_price_nonorg_lag,2))%>% #3
  mutate(Diff_Flag=ifelse(Diff > 0,0,1))%>%
  mutate(R_Cntr=data.table::rleid(Diff_Flag>0))%>% #4
  ungroup()%>%
  filter(Diff_Flag ==1)%>% #5
  group_by(location,Diff_Flag,R_Cntr)%>%
  summarise(Total_Instances=n())%>% #6
  ungroup()%>%
  arrange(location,desc(Total_Instances))%>% #7
  group_by(location)%>%
  mutate(Index=1:n())%>%
  filter(Index==1)%>% #8
  select(location,R_Cntr)
`summarise()` has grouped output by 'location', 'Diff_Flag'. You can override using the `.groups` argument.
location_key
# A tibble: 1 x 2
# Groups: location [1]
location R_Cntr
<chr> <int>
1 Albany 18
time_duration<-df%>%
  filter(location=="Albany")%>%
  select(week_ending,location,avg_price_nonorg)%>%
  arrange(location,week_ending)%>%
  group_by(location)%>% #1
  mutate(avg_price_nonorg_lag=lag(avg_price_nonorg))%>% #2
  filter(!is.na(avg_price_nonorg_lag))%>%
  mutate(Diff=round(avg_price_nonorg-avg_price_nonorg_lag,2))%>% #3
  mutate(Diff_Flag=ifelse(Diff > 0,0,1))%>%
  mutate(R_Cntr=data.table::rleid(Diff_Flag>0))%>%
  inner_join(location_key,by=c("location","R_Cntr"))%>%
  select(week_ending,location,avg_price_nonorg)
time_duration
# A tibble: 4 x 3
# Groups: location [1]
week_ending location avg_price_nonorg
<dttm> <chr> <dbl>
1 2017-06-04 00:00:00 Albany 1.7
2 2017-06-11 00:00:00 Albany 1.65
3 2017-06-18 00:00:00 Albany 1.58
4 2017-06-25 00:00:00 Albany 1.53
In this blog we developed the logic for a single location. The above code snippets can easily be extended to other locations as well.
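For instance, here is a sketch of that extension (the same pipeline as above with the location filter removed; since rleid is computed inside group_by(location), the counter resets for each location):
df.all<-df%>%
  select(week_ending,location,avg_price_nonorg)%>%
  arrange(location,week_ending)%>%
  group_by(location)%>%
  mutate(avg_price_nonorg_lag=lag(avg_price_nonorg))%>%
  filter(!is.na(avg_price_nonorg_lag))%>%
  mutate(Diff=round(avg_price_nonorg-avg_price_nonorg_lag,2))%>%
  mutate(Diff_Flag=ifelse(Diff > 0,0,1))%>%
  mutate(R_Cntr=data.table::rleid(Diff_Flag>0))%>% # counter resets within each location group
  ungroup()%>%
  filter(Diff_Flag ==1)%>%
  group_by(location,Diff_Flag,R_Cntr)%>%
  summarise(Total_Instances=n(),.groups="drop")%>%
  arrange(location,desc(Total_Instances))%>%
  group_by(location)%>%
  mutate(Index=1:n())%>%
  filter(Index==1)%>% # first row per location = longest streak
  ungroup()%>%
  select(location,Total_Instances)
df.all would then hold, for every location, the length of its longest streak of weekly price declines.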