Reset a running counter within groupby in R

Introduction

There are certain situation where we are required to identify the number of consecutive weeks of share price decline.These cases can be modeled by resetting a running counter in R

In this blog we will look at how dplyr and rleid function from data.table library can be used to generate such summaries.We will be using the weeky price of avacado sold in US.This dataset is available in ‘avocado’ library

package.name<-c("dplyr","data.table","avocado")

for(i in package.name){

  if(!require(i,character.only = T)){

    install.packages(i)
  }
  library(i,character.only = T)

}

Step 1: Importing the dataset

We are creating a dummy data frame with details of daily Sales for 100 products

data('hass')
df<-hass
head(df)

# A tibble: 6 x 17
  week_ending         location   region avg_price_nonorg plu4046 plu4225 plu4770
  <dttm>              <chr>      <chr>             <dbl>   <dbl>   <dbl>   <dbl>
1 2017-01-02 00:00:00 Albany     North~             1.47   4846. 117027.    200.
2 2017-01-02 00:00:00 Atlanta    South~             0.93 224074. 118926.    337.
3 2017-01-02 00:00:00 Baltimore~ Midso~             1.47  54530. 408952.  14387.
4 2017-01-02 00:00:00 Boise      West               0.92  27845.   9409.  11342.
5 2017-01-02 00:00:00 Boston     North~             1.29   4120. 371223.   3934.
6 2017-01-02 00:00:00 Buffalo/R~ North~             1.43   1285.  58532.    103.
# ... with 10 more variables: small_nonorg_bag <dbl>, large_nonorg_bag <dbl>,
#   xlarge_nonorg_bag <dbl>, avg_price_org <dbl>, plu94046 <dbl>,
#   plu94225 <dbl>, plu94770 <dbl>, small_org_bag <dbl>, large_org_bag <dbl>,
#   xlarge_org_bag <dbl>

In the above dataset, we will look at average price on non organic avocado across various regions for different week ending.We will look at the maximum number of weeks for which there was a reduction in price for a given region

Step 2: Developing the Logic

To start with, we will try and develop the logic for on location say ‘Albany’.The basic steps for the process would be the following:

1.Group by at the location level
2.Take the Lag of the Price/Sales etc
3.Wherever the Difference between Price and Lagged Price is > 0, then label it as 0, otherwise 1
4.Create a running counter based on the value of Difference
5.Filter on Difference==1
6.Now generate the total count of records grouped by location,Difference and Counter
7.For each location, sort based on descending value on the above total count
8.The first row for each location is the maximum weeks of continuous decrease for Price/Sales

df.albany<-df%>%
  filter(location=="Albany")%>%
  select(week_ending,location,avg_price_nonorg)%>%
  arrange(location,week_ending)%>%
  group_by(location)%>% #1
  mutate(avg_price_nonorg_lag=lag(avg_price_nonorg))%>% #2
  filter(!is.na(avg_price_nonorg_lag))%>%
  mutate(Diff=round(avg_price_nonorg-avg_price_nonorg_lag,2))%>% #3
  mutate(Diff_Flag=ifelse(Diff > 0,0,1))%>% 
  mutate(R_Cntr=data.table::rleid(Diff_Flag>0))%>% #4
  ungroup()%>%
  filter(Diff_Flag ==1)%>% #5
  group_by(location,Diff_Flag,R_Cntr)%>%
  summarise(Total_Instances=n())%>% #6
  ungroup()%>%
  arrange(location,desc(Total_Instances))%>% #7
  group_by(location)%>%
  mutate(Index=1:n())%>%
  filter(Index==1)%>% #8
  select(location,Total_Instances)

`summarise()` has grouped output by 'location', 'Diff_Flag'. You can override using the `.groups` argument.

df.albany

# A tibble: 1 x 2
# Groups:   location [1]
  location Total_Instances
  <chr>              <int>
1 Albany                 4

Total Instances as 5 means that there were 5 continuous weeks of Priced declined.Now lets get the details of those 4 weeks.

Step 3: Getting the corresponding Dates of continuous decline

We will get the details of location and R_Cntr(Counter created in the above code) and then map it back with the portion of the data to get the details

location_key<-df%>%
  filter(location=="Albany")%>%
  select(week_ending,location,avg_price_nonorg)%>%
  arrange(location,week_ending)%>%
  group_by(location)%>% #1
  mutate(avg_price_nonorg_lag=lag(avg_price_nonorg))%>% #2
  filter(!is.na(avg_price_nonorg_lag))%>%
  mutate(Diff=round(avg_price_nonorg-avg_price_nonorg_lag,2))%>% #3
  mutate(Diff_Flag=ifelse(Diff > 0,0,1))%>% 
  mutate(R_Cntr=data.table::rleid(Diff_Flag>0))%>% #4
  ungroup()%>%
  filter(Diff_Flag ==1)%>% #5
  group_by(location,Diff_Flag,R_Cntr)%>%
  summarise(Total_Instances=n())%>% #6
  ungroup()%>%
  arrange(location,desc(Total_Instances))%>% #7
  group_by(location)%>%
  mutate(Index=1:n())%>%
  filter(Index==1)%>% #8
  select(location,R_Cntr)

`summarise()` has grouped output by 'location', 'Diff_Flag'. You can override using the `.groups` argument.

location_key

# A tibble: 1 x 2
# Groups:   location [1]
  location R_Cntr
  <chr>     <int>
1 Albany       18

time_duration<-df%>%
  filter(location=="Albany")%>%
  select(week_ending,location,avg_price_nonorg)%>%
  arrange(location,week_ending)%>%
  group_by(location)%>% #1
  mutate(avg_price_nonorg_lag=lag(avg_price_nonorg))%>% #2
  filter(!is.na(avg_price_nonorg_lag))%>%
  mutate(Diff=round(avg_price_nonorg-avg_price_nonorg_lag,2))%>% #3
  mutate(Diff_Flag=ifelse(Diff > 0,0,1))%>% 
  mutate(R_Cntr=data.table::rleid(Diff_Flag>0))%>%
  inner_join(location_key,by=c("location","R_Cntr"))%>%
  select(week_ending,location,avg_price_nonorg)

time_duration

# A tibble: 4 x 3
# Groups:   location [1]
  week_ending         location avg_price_nonorg
  <dttm>              <chr>               <dbl>
1 2017-06-04 00:00:00 Albany               1.7 
2 2017-06-11 00:00:00 Albany               1.65
3 2017-06-18 00:00:00 Albany               1.58
4 2017-06-25 00:00:00 Albany               1.53

In this blog we created the logic for a single location.We can use the above code snippets and extend it to other locations as well.

Link to Previous Blogs

https://www.aimlmadeeasy.com

My Youtube Channel

Link

Machine Learning Made Easy

Wednesday, March 9, 2022

Get the Maximum weeks of continuous price decline