There are instances where we are required to identify min, max or some other metrics within a groupby. In most of these cases, we often have to create an Index.In this blog, we will look at how to accomplish that¶
In [1]:
import pandas as pd
import numpy as np
Step 1: Creating the data frame¶
In [5]:
day = ['Day1','Day1','Day1','Day1','Day2','Day2','Day2']
product=['A','B','C','D','A','B','C']
sales=np.random.rand(7)*10
In [14]:
df=pd.DataFrame([day,product,sales]).T
df.columns=['Day','Product','Sales']
df
Out[14]:
Day | Product | Sales | |
---|---|---|---|
0 | Day1 | A | 0.278532 |
1 | Day1 | B | 8.951278 |
2 | Day1 | C | 5.662911 |
3 | Day1 | D | 3.734558 |
4 | Day2 | A | 8.485143 |
5 | Day2 | B | 3.149496 |
6 | Day2 | C | 4.681809 |
In [15]:
df.dtypes
Out[15]:
Day object Product object Sales object dtype: object
Step 2:Converting Sales into numeric data type¶
In [11]:
df['Sales']=df['Sales'].apply(pd.to_numeric,errors= "coerce")
Step 3: Create Cumulative Sales for each Day across different product¶
In [13]:
df['Cumulative_Sales']=df.groupby(['Day'])['Sales'].cumsum()
df
Out[13]:
Day | Product | Sales | Val | Cumulative_Sales | |
---|---|---|---|---|---|
0 | Day1 | A | 0.278532 | 1 | 0.278532 |
1 | Day1 | B | 8.951278 | 1 | 9.229811 |
2 | Day1 | C | 5.662911 | 1 | 14.892722 |
3 | Day1 | D | 3.734558 | 1 | 18.627280 |
4 | Day2 | A | 8.485143 | 1 | 8.485143 |
5 | Day2 | B | 3.149496 | 1 | 11.634639 |
6 | Day2 | C | 4.681809 | 1 | 16.316448 |
Step 4: Minimum sales from each day¶
In [16]:
df['Val']=1
In [18]:
df1=df.sort_values(['Day','Sales'])
df1
Out[18]:
Day | Product | Sales | Val | |
---|---|---|---|---|
0 | Day1 | A | 0.278532 | 1 |
3 | Day1 | D | 3.734558 | 1 |
2 | Day1 | C | 5.662911 | 1 |
1 | Day1 | B | 8.951278 | 1 |
5 | Day2 | B | 3.149496 | 1 |
6 | Day2 | C | 4.681809 | 1 |
4 | Day2 | A | 8.485143 | 1 |
Creating Index¶
In [20]:
df1['Index']=df1.groupby(['Day'])['Val'].cumsum()
df1
Out[20]:
Day | Product | Sales | Val | Index | |
---|---|---|---|---|---|
0 | Day1 | A | 0.278532 | 1 | 1 |
3 | Day1 | D | 3.734558 | 1 | 2 |
2 | Day1 | C | 5.662911 | 1 | 3 |
1 | Day1 | B | 8.951278 | 1 | 4 |
5 | Day2 | B | 3.149496 | 1 | 1 |
6 | Day2 | C | 4.681809 | 1 | 2 |
4 | Day2 | A | 8.485143 | 1 | 3 |
In [21]:
df1.loc[df1['Index']==1,]
Out[21]:
Day | Product | Sales | Val | Index | |
---|---|---|---|---|---|
0 | Day1 | A | 0.278532 | 1 | 1 |
5 | Day2 | B | 3.149496 | 1 | 1 |
Step 5: Maximum sales from each day¶
In [22]:
df2=df.sort_values(['Day','Sales'],ascending=[True,False])
df2
Out[22]:
Day | Product | Sales | Val | |
---|---|---|---|---|
1 | Day1 | B | 8.951278 | 1 |
2 | Day1 | C | 5.662911 | 1 |
3 | Day1 | D | 3.734558 | 1 |
0 | Day1 | A | 0.278532 | 1 |
4 | Day2 | A | 8.485143 | 1 |
6 | Day2 | C | 4.681809 | 1 |
5 | Day2 | B | 3.149496 | 1 |
In [23]:
df2['Index']=df2.groupby(['Day'])['Val'].cumsum()
df2.loc[df2['Index']==1,]
Out[23]:
Day | Product | Sales | Val | Index | |
---|---|---|---|---|---|
1 | Day1 | B | 8.951278 | 1 | 1 |
4 | Day2 | A | 8.485143 | 1 | 1 |