There are certain situations where we need to identify the number of consecutive weeks of decline in a share price. Such cases can be modeled by resetting a running counter in Python.
In this blog we will look at how pandas and numpy can be used to generate such summaries. We will use the weekly price of avocados sold in the US. This dataset is available in the 'avocado' library from R, and we will be using the csv version of it.
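As a quick toy sketch of that counter-reset idea (made-up numbers, not the avocado data): a counter that increments only when the value goes up gives every run of consecutive declines its own group id, and the longest run is simply the size of the largest group.

import pandas as pd

prices = pd.Series([10, 9, 8, 11, 10, 9, 8, 12])  # hypothetical weekly prices
decline = prices.diff() < 0                        # True where the price dropped vs the previous week
counter = (~decline).cumsum()                      # increments whenever the price did NOT drop
print(decline.groupby(counter).sum().max())        # 3 -> longest run of declining weeks (the slide from 11 to 8)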
Step 1: Importing the libraries and dataset
In [296]:
import pandas as pd
import numpy as np
import os
pd.options.mode.chained_assignment = None # To disable warnings
In [307]:
df=pd.read_csv("Price_Data.csv")
df.head()
Out[307]:
| | week_ending | location | region | avg_price_nonorg | plu4046 | plu4225 | plu4770 | small_nonorg_bag | large_nonorg_bag | xlarge_nonorg_bag | avg_price_org | plu94046 | plu94225 | plu94770 | small_org_bag | large_org_bag | xlarge_org_bag |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2017-01-02 | Albany | Northeast | 1.47 | 4845.77 | 117027.41 | 200.36 | 7866.86 | 7.83 | 0.00 | 1.87 | 71.65 | 192.63 | 0.00 | 1112.42 | 0.00 | 0.0 |
1 | 2017-01-02 | Atlanta | Southeast | 0.93 | 224073.54 | 118926.37 | 337.48 | 111599.58 | 92628.91 | 0.00 | 1.81 | 956.73 | 2862.95 | 0.00 | 5.55 | 1517.62 | 0.0 |
2 | 2017-01-02 | Baltimore/Washington | Midsouth | 1.47 | 54530.42 | 408952.26 | 14387.01 | 151345.59 | 2542.41 | 3.12 | 1.92 | 1420.47 | 6298.07 | 325.44 | 5857.48 | 0.00 | 0.0 |
3 | 2017-01-02 | Boise | West | 0.92 | 27845.16 | 9408.92 | 11341.75 | 53093.47 | 2793.61 | 27.20 | 1.05 | 0.00 | 368.63 | 0.00 | 577.91 | 1877.28 | 0.0 |
4 | 2017-01-02 | Boston | Northeast | 1.29 | 4119.90 | 371223.34 | 3933.72 | 79339.78 | 213.75 | 0.00 | 2.06 | 14.80 | 2181.53 | 0.00 | 10636.25 | 605.64 | 0.0 |
In the above dataset, we will look at the average price of non-organic avocados across various regions for different week endings. We will find the maximum number of consecutive weeks for which there was a reduction in price for a given region.
Step 2: Developing the Logic
To start with, we will develop the logic for a single location, say 'Albany'. The basic steps of the process are the following (a condensed sketch of all the steps, applied to every location at once, follows this list):
  - 1. Group the data at the location level
  - 2. Take the lag of the price (or sales, etc.)
  - 3. Wherever the difference between the price and the lagged price is > 0 (an increase), flag it as True, otherwise False
  - 4. Create a running counter that increments whenever the flag is True, so every run of non-increasing weeks shares one counter value
  - 5. Filter on Flag == False to keep only the weeks where the price did not increase
  - 6. Generate the total count of records grouped by location and counter
  - 7. For each location, sort in descending order of the above total count
  - 8. The first row for each location gives the maximum number of consecutive weeks of decline for the price
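Condensed into code, the eight steps above look roughly like the sketch below, applied to every location at once (the column names are the ones in the dataset loaded in Step 1; the intermediate names tmp, up, cntr, streaks and longest_per_location are just illustrative). The cells that follow develop the same logic one step at a time for 'Albany'.

# Steps 1-2: sort, lag the price within each location and drop the first week (no previous price)
tmp = df.sort_values(['location', 'week_ending']).copy()
tmp['lag'] = tmp.groupby('location')['avg_price_nonorg'].shift(1)
tmp = tmp[tmp['lag'].notna()].copy()

# Steps 3-4: flag price increases and build a running counter that moves only on an increase,
# so every run of non-increasing weeks shares one counter value
tmp['diff'] = tmp.groupby('location')['avg_price_nonorg'].diff()
tmp['up'] = (tmp['diff'].fillna(0) > 0).astype(int)
tmp['cntr'] = tmp.groupby('location')['up'].cumsum()

# Steps 5-8: keep the non-increasing weeks, count them per (location, counter)
# and keep the largest count for each location
streaks = (tmp[tmp['up'] == 0]
           .groupby(['location', 'cntr'])
           .size()
           .reset_index(name='Total_Count')
           .sort_values('Total_Count', ascending=False))
longest_per_location = streaks.groupby('location').head(1)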
In [308]:
# Arranging by Location and Week Ending
df_1=df[['location','week_ending','avg_price_nonorg']][df['location']=="Albany"].sort_values(['location','week_ending'])
df_1.head()
Out[308]:
| | location | week_ending | avg_price_nonorg |
---|---|---|---|
0 | Albany | 2017-01-02 | 1.47 |
45 | Albany | 2017-01-08 | 1.55 |
90 | Albany | 2017-01-15 | 1.55 |
135 | Albany | 2017-01-22 | 1.59 |
180 | Albany | 2017-01-29 | 1.31 |
In [311]:
# Group by at the location level and taking lag of Price
df_1['avg_price_nonorg_lag']=df_1.groupby('location').avg_price_nonorg.shift(1)
df_1.head()
Out[311]:
| | location | week_ending | avg_price_nonorg | avg_price_nonorg_lag |
---|---|---|---|---|
0 | Albany | 2017-01-02 | 1.47 | NaN |
45 | Albany | 2017-01-08 | 1.55 | 1.47 |
90 | Albany | 2017-01-15 | 1.55 | 1.55 |
135 | Albany | 2017-01-22 | 1.59 | 1.55 |
180 | Albany | 2017-01-29 | 1.31 | 1.59 |
In [312]:
# Removing the NaN values
df_2=df_1[~np.isnan(df_1['avg_price_nonorg_lag'])]
df_2.head()
Out[312]:
| | location | week_ending | avg_price_nonorg | avg_price_nonorg_lag |
---|---|---|---|---|
45 | Albany | 2017-01-08 | 1.55 | 1.47 |
90 | Albany | 2017-01-15 | 1.55 | 1.55 |
135 | Albany | 2017-01-22 | 1.59 | 1.55 |
180 | Albany | 2017-01-29 | 1.31 | 1.59 |
225 | Albany | 2017-02-05 | 1.49 | 1.31 |
In [313]:
# Creating the Price Diff Column at location level
df_2['Price_Diff']=df_2.groupby('location').avg_price_nonorg.diff()
df_2.head()
Out[313]:
| | location | week_ending | avg_price_nonorg | avg_price_nonorg_lag | Price_Diff |
---|---|---|---|---|---|
45 | Albany | 2017-01-08 | 1.55 | 1.47 | NaN |
90 | Albany | 2017-01-15 | 1.55 | 1.55 | 0.00 |
135 | Albany | 2017-01-22 | 1.59 | 1.55 | 0.04 |
180 | Albany | 2017-01-29 | 1.31 | 1.59 | -0.28 |
225 | Albany | 2017-02-05 | 1.49 | 1.31 | 0.18 |
In [316]:
# Creating a logical flag based on value of Price_Diff
Flag = df_2['Price_Diff'].fillna(0) > 0
In [324]:
# Creating a Counter which increments only when Flag = True
cntr = Flag.cumsum()
df_2['Flag']=Flag
df_2['Cntr']=cntr
df_2.head(10)
Out[324]:
| | location | week_ending | avg_price_nonorg | avg_price_nonorg_lag | Price_Diff | Flag | Cntr |
---|---|---|---|---|---|---|---|
45 | Albany | 2017-01-08 | 1.55 | 1.47 | NaN | False | 0 |
90 | Albany | 2017-01-15 | 1.55 | 1.55 | 0.00 | False | 0 |
135 | Albany | 2017-01-22 | 1.59 | 1.55 | 0.04 | True | 1 |
180 | Albany | 2017-01-29 | 1.31 | 1.59 | -0.28 | False | 1 |
225 | Albany | 2017-02-05 | 1.49 | 1.31 | 0.18 | True | 2 |
270 | Albany | 2017-02-12 | 1.42 | 1.49 | -0.07 | False | 2 |
315 | Albany | 2017-02-19 | 1.67 | 1.42 | 0.25 | True | 3 |
360 | Albany | 2017-02-26 | 1.40 | 1.67 | -0.27 | False | 3 |
405 | Albany | 2017-03-05 | 1.18 | 1.40 | -0.22 | False | 3 |
450 | Albany | 2017-03-12 | 1.54 | 1.18 | 0.36 | True | 4 |
In [325]:
# Filtering on Flag == False keeps the records where the price did not increase; consecutive such weeks share the same Cntr value
df_3=df_2[df_2['Flag']==False]
df_3.head(10)
Out[325]:
| | location | week_ending | avg_price_nonorg | avg_price_nonorg_lag | Price_Diff | Flag | Cntr |
---|---|---|---|---|---|---|---|
45 | Albany | 2017-01-08 | 1.55 | 1.47 | NaN | False | 0 |
90 | Albany | 2017-01-15 | 1.55 | 1.55 | 0.00 | False | 0 |
180 | Albany | 2017-01-29 | 1.31 | 1.59 | -0.28 | False | 1 |
270 | Albany | 2017-02-12 | 1.42 | 1.49 | -0.07 | False | 2 |
360 | Albany | 2017-02-26 | 1.40 | 1.67 | -0.27 | False | 3 |
405 | Albany | 2017-03-05 | 1.18 | 1.40 | -0.22 | False | 3 |
540 | Albany | 2017-03-26 | 1.16 | 1.60 | -0.44 | False | 5 |
630 | Albany | 2017-04-09 | 1.54 | 1.62 | -0.08 | False | 6 |
720 | Albany | 2017-04-23 | 1.34 | 1.62 | -0.28 | False | 7 |
765 | Albany | 2017-04-30 | 1.13 | 1.34 | -0.21 | False | 7 |
In [326]:
# Taking the total number of records by location and counter.
# Sorting by the total count brings the counter group with the
# most consecutive weeks of decline to the top
df_4=df_3[['location','Cntr']].groupby(['location','Cntr']).size().reset_index()
df_4.columns=['location','Cntr','Total_Count']
df_5=df_4.sort_values(['Total_Count'],ascending=False)
df_5.head()
Out[326]:
| | location | Cntr | Total_Count |
---|---|---|---|
59 | Albany | 99 | 4 |
39 | Albany | 68 | 4 |
8 | Albany | 10 | 4 |
43 | Albany | 74 | 3 |
56 | Albany | 95 | 3 |
In [328]:
# Get the weeks with the most consecutive decline. For Albany this is counter group 99
df_2[df_2['Cntr']==99]
Out[328]:
| | location | week_ending | avg_price_nonorg | avg_price_nonorg_lag | Price_Diff | Flag | Cntr |
---|---|---|---|---|---|---|---|
8506 | Albany | 2020-09-27 | 1.33 | 1.01 | 0.32 | True | 99 |
8505 | Albany | 2020-10-04 | 1.23 | 1.33 | -0.10 | False | 99 |
8688 | Albany | 2020-10-11 | 1.10 | 1.23 | -0.13 | False | 99 |
8687 | Albany | 2020-10-18 | 1.03 | 1.10 | -0.07 | False | 99 |
8686 | Albany | 2020-10-25 | 0.96 | 1.03 | -0.07 | False | 99 |
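The counter value 99 above was read off manually from the sorted df_5; the same lookup can be done programmatically along the lines below (df_2 and df_5 are the frames built above; best is just an illustrative name).

best = df_5.iloc[0]                        # top row after sorting by Total_Count
print(best['Cntr'], best['Total_Count'])   # counter group 99 with 4 non-increasing weeks for Albany
df_2[df_2['Cntr'] == best['Cntr']]         # the weeks that make up that longest run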