Wednesday, March 2, 2022

Summarise multiple numeric columns in R

Summarise Multiple Columns


Introduction

There are certain situation where we are required to summarise multiple columns based on certain group by conditions.For example:

  • Estimate mean of Sales, Cost and Revenue based on different products
  • Estimate mean of daily blood pressure readings over a month

In this blog we will look at how dplyr can be used to generate such summaries

package.name<-c("dplyr","stringr")

for(i in package.name){

  if(!require(i,character.only = T)){

    install.packages(i)
  }
  library(i,character.only = T)

}

Step 1: Creating a data frame

We are creating a dummy data frame with details of Sales, Cost and Gross Margin for 100 products

df<-data.frame(Product_Line=str_c("Prod_",1:100),
               Product_Category=sample(c("Cat1","Cat2","Cat3"),100,T),
               Sales=rnorm(100,20,4),
               Cost=rnorm(100,10,2),
               Gross_Margin=rnorm(100,5,4))
head(df)
  Product_Line Product_Category    Sales      Cost Gross_Margin
1       Prod_1             Cat2 24.86951  8.642321    5.0200805
2       Prod_2             Cat2 23.83569  9.172208    9.5592231
3       Prod_3             Cat1 18.25577 10.361219    7.8213940
4       Prod_4             Cat3 25.75923  8.931923    0.9667511
5       Prod_5             Cat3 23.26624 10.111367    9.8294859
6       Prod_6             Cat1 12.96809  9.003059    4.5660069


Step 2: Using ‘across’ function within ‘summarise’

We need to generate mean of sales, cost and profit across product category

summary.df<-df%>%
  select(-Product_Line)%>%
  group_by(Product_Category)%>%
  summarise(across(everything(),list(mean)))

summary.df
# A tibble: 3 x 4
  Product_Category Sales_1 Cost_1 Gross_Margin_1
  <chr>              <dbl>  <dbl>          <dbl>
1 Cat1                20.3   9.74           4.72
2 Cat2                20.0   9.56           4.71
3 Cat3                19.5  10.4            5.70


Step 3: Renaming columns to appropriate names

You can see that the summarised sales, cost and gross margin has a ‘1’ suffixed at the end.Lets clean this up

nm<-gsub("_1","",colnames(summary.df)[2:4])
colnames(summary.df)[2:4]<-nm

head(summary.df)
# A tibble: 3 x 4
  Product_Category Sales  Cost Gross_Margin
  <chr>            <dbl> <dbl>        <dbl>
1 Cat1              20.3  9.74         4.72
2 Cat2              20.0  9.56         4.71
3 Cat3              19.5 10.4          5.70

My Youtube Channel

Embed Shiny

Please wait...