Sunday, March 6, 2022

Running Counter in R

Running counter within groupby in R


Introduction

There are certain situation where we are required to create a running counter either in ascending order or descending order to filter out certain records in a group by condition.For example:

  • Identify the top N salaries of each department in a company
  • Identify first and the last value from each group

In this blog we will look at how dplyr can be used to generate such summaries

package.name<-c("dplyr","stringr")

for(i in package.name){

  if(!require(i,character.only = T)){

    install.packages(i)
  }
  library(i,character.only = T)

}

Step 1: Creating a data frame

We are creating a dummy data frame with details of daily Sales for 100 products

df<-data.frame(Product_Line=sample(c("Prod1","Prod2","Prod3"),10,T),
               Sales=rnorm(10,15,6)
               )
head(df)
  Product_Line    Sales
1        Prod2 27.86048
2        Prod2 11.10242
3        Prod1 24.76522
4        Prod3 18.75677
5        Prod3 13.61866
6        Prod1 19.36591


Step 2: Creating a running counter based on Product groupby and Sales

df2<-df%>%
  arrange(Product_Line,Sales)%>%
  group_by(Product_Line)%>%
  mutate(Index1=1:n(),
         Index2=n():1)

df2
# A tibble: 10 x 4
# Groups:   Product_Line [3]
   Product_Line Sales Index1 Index2
   <chr>        <dbl>  <int>  <int>
 1 Prod1         19.4      1      2
 2 Prod1         24.8      2      1
 3 Prod2         11.1      1      5
 4 Prod2         16.6      2      4
 5 Prod2         17.7      3      3
 6 Prod2         19.1      4      2
 7 Prod2         27.9      5      1
 8 Prod3         13.6      1      3
 9 Prod3         15.0      2      2
10 Prod3         18.8      3      1


Index 1 is counter based on Product and increasing sales whereas Index2 is a counter based on Product and decreasing sales.

Step 3: Highest Sales for each Product

sales_highest<-df2%>%
  filter(Index2==1)

sales_highest
# A tibble: 3 x 4
# Groups:   Product_Line [3]
  Product_Line Sales Index1 Index2
  <chr>        <dbl>  <int>  <int>
1 Prod1         24.8      2      1
2 Prod2         27.9      5      1
3 Prod3         18.8      3      1


Step 4: Lowest Sales for each Product

sales_lowest<-df2%>%
  filter(Index1==1)

sales_lowest
# A tibble: 3 x 4
# Groups:   Product_Line [3]
  Product_Line Sales Index1 Index2
  <chr>        <dbl>  <int>  <int>
1 Prod1         19.4      1      2
2 Prod2         11.1      1      5
3 Prod3         13.6      1      3


My Youtube Channel

Embed Shiny

Please wait...