Summarise Multiple Columns
Parag Verma
3rd March, 2022
Introduction
There are certain situation where we are required to summarise multiple columns based on certain group by conditions.For example:
- Estimate mean of Sales, Cost and Revenue based on different products
- Estimate mean of daily blood pressure readings over a month
In this blog we will look at how dplyr can be used to generate such summaries
package.name<-c("dplyr","stringr")
for(i in package.name){
if(!require(i,character.only = T)){
install.packages(i)
}
library(i,character.only = T)
}
Step 1: Creating a data frame
We are creating a dummy data frame with details of Sales, Cost and Gross Margin for 100 products
df<-data.frame(Product_Line=str_c("Prod_",1:100),
Product_Category=sample(c("Cat1","Cat2","Cat3"),100,T),
Sales=rnorm(100,20,4),
Cost=rnorm(100,10,2),
Gross_Margin=rnorm(100,5,4))
head(df)
Product_Line Product_Category Sales Cost Gross_Margin
1 Prod_1 Cat2 24.86951 8.642321 5.0200805
2 Prod_2 Cat2 23.83569 9.172208 9.5592231
3 Prod_3 Cat1 18.25577 10.361219 7.8213940
4 Prod_4 Cat3 25.75923 8.931923 0.9667511
5 Prod_5 Cat3 23.26624 10.111367 9.8294859
6 Prod_6 Cat1 12.96809 9.003059 4.5660069
Step 2: Using ‘across’ function within ‘summarise’
We need to generate mean of sales, cost and profit across product category
summary.df<-df%>%
select(-Product_Line)%>%
group_by(Product_Category)%>%
summarise(across(everything(),list(mean)))
summary.df
# A tibble: 3 x 4
Product_Category Sales_1 Cost_1 Gross_Margin_1
<chr> <dbl> <dbl> <dbl>
1 Cat1 20.3 9.74 4.72
2 Cat2 20.0 9.56 4.71
3 Cat3 19.5 10.4 5.70
Step 3: Renaming columns to appropriate names
You can see that the summarised sales, cost and gross margin has a ‘1’ suffixed at the end.Lets clean this up
nm<-gsub("_1","",colnames(summary.df)[2:4])
colnames(summary.df)[2:4]<-nm
head(summary.df)
# A tibble: 3 x 4
Product_Category Sales Cost Gross_Margin
<chr> <dbl> <dbl> <dbl>
1 Cat1 20.3 9.74 4.72
2 Cat2 20.0 9.56 4.71
3 Cat3 19.5 10.4 5.70