There are certain situation where we are required to summarise multiple columns based on certain group by conditions.For example:¶
- Estimate mean of Sales, Cost and Revenue based on different products
- Estimate mean of daily blood pressure readings over a month
In this blog we will look at how pandas can be used to generate such summaries
Step 1: Importing libraries¶
In [1]:
import numpy as np
import pandas as pd
Ste 2: Creating the mock dataset¶
In [18]:
# Creating Product List
product= list(range(1,101))
product_line = ["Prod_" + str(i) for i in product_line]
product_line[:5]
Out[18]:
['Prod_Prod_1', 'Prod_Prod_2', 'Prod_Prod_3', 'Prod_Prod_4', 'Prod_Prod_5']
In [36]:
# Creating Product Category
Product_category=np.random.choice(["Cat1","Cat2","Cat3"],100)
Product_category[:5]
Out[36]:
array(['Cat1', 'Cat3', 'Cat1', 'Cat1', 'Cat2'], dtype='<U4')
In [37]:
# Creating Sales, Cost and Gross Margin variables
sales=np.random.normal(20,4,100)
cost=np.random.normal(10,2,100)
gross_margin=np.random.normal(5,4,100)
In [38]:
# Creating the Data Frame
df=pd.DataFrame([product_line,Product_category,
sales,cost,gross_margin]).T
df.columns=['Product_line','Product_category','Sales','Cost','Gross_Margin']
df.head()
Out[38]:
Product_line | Product_category | Sales | Cost | Gross_Margin | |
---|---|---|---|---|---|
0 | Prod_Prod_1 | Cat1 | 17.998575 | 9.93475 | 6.683216 |
1 | Prod_Prod_2 | Cat3 | 10.013818 | 11.513078 | -1.625103 |
2 | Prod_Prod_3 | Cat1 | 24.596436 | 15.16316 | 9.200483 |
3 | Prod_Prod_4 | Cat1 | 19.28291 | 11.141888 | -1.180816 |
4 | Prod_Prod_5 | Cat2 | 18.665807 | 8.950452 | 7.985282 |
Step 3: Summarising columns are product level¶
In [45]:
interim_df=df.drop(['Product_line'],axis=1).groupby(['Product_category']).mean()
summary_df=interim_df.reset_index()
summary_df
Out[45]:
Product_category | Sales | Cost | Gross_Margin | |
---|---|---|---|---|
0 | Cat1 | 18.344428 | 10.019606 | 3.923550 |
1 | Cat2 | 19.102006 | 9.758918 | 5.571789 |
2 | Cat3 | 18.736389 | 10.235685 | 4.982042 |