There are certain situations where we need to identify the number of consecutive weeks of decline in a share price. Such cases can be modeled by resetting a running counter in Python.
In this blog we will look at how pandas and numpy can be used to generate such summaries. We will use the weekly price of avocados sold in the US. This dataset is available in the 'avocado' library from R, and we will be using the csv version of it.
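As a quick toy sketch of that counter-reset idea (made-up numbers, not the avocado data): a counter that increments only when the value goes up gives every run of consecutive declines its own group id, and the longest run is simply the size of the largest group.

import pandas as pd

prices = pd.Series([10, 9, 8, 11, 10, 9, 8, 12])  # hypothetical weekly prices
decline = prices.diff() < 0                        # True where the price dropped vs the previous week
counter = (~decline).cumsum()                      # increments whenever the price did NOT drop
print(decline.groupby(counter).sum().max())        # 3 -> longest run of declining weeks (the slide from 11 to 8)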
Step 1: Importing the libraries and dataset
In [296]:
import pandas as pd
import numpy as np
import os
pd.options.mode.chained_assignment = None # To disable warnings
In [307]:
df=pd.read_csv("Price_Data.csv")
df.head()
Out[307]:
| | week_ending | location | region | avg_price_nonorg | plu4046 | plu4225 | plu4770 | small_nonorg_bag | large_nonorg_bag | xlarge_nonorg_bag | avg_price_org | plu94046 | plu94225 | plu94770 | small_org_bag | large_org_bag | xlarge_org_bag |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2017-01-02 | Albany | Northeast | 1.47 | 4845.77 | 117027.41 | 200.36 | 7866.86 | 7.83 | 0.00 | 1.87 | 71.65 | 192.63 | 0.00 | 1112.42 | 0.00 | 0.0 |
1 | 2017-01-02 | Atlanta | Southeast | 0.93 | 224073.54 | 118926.37 | 337.48 | 111599.58 | 92628.91 | 0.00 | 1.81 | 956.73 | 2862.95 | 0.00 | 5.55 | 1517.62 | 0.0 |
2 | 2017-01-02 | Baltimore/Washington | Midsouth | 1.47 | 54530.42 | 408952.26 | 14387.01 | 151345.59 | 2542.41 | 3.12 | 1.92 | 1420.47 | 6298.07 | 325.44 | 5857.48 | 0.00 | 0.0 |
3 | 2017-01-02 | Boise | West | 0.92 | 27845.16 | 9408.92 | 11341.75 | 53093.47 | 2793.61 | 27.20 | 1.05 | 0.00 | 368.63 | 0.00 | 577.91 | 1877.28 | 0.0 |
4 | 2017-01-02 | Boston | Northeast | 1.29 | 4119.90 | 371223.34 | 3933.72 | 79339.78 | 213.75 | 0.00 | 2.06 | 14.80 | 2181.53 | 0.00 | 10636.25 | 605.64 | 0.0 |
In the above dataset, we will look at the average price of non-organic avocados across various regions for different week endings. We will find the maximum number of consecutive weeks for which there was a reduction in price for a given region.
Step 2: Developing the Logic
To start with, we will develop the logic for a single location, say 'Albany'. The basic steps of the process are the following (a condensed sketch of all the steps, applied to every location at once, follows this list):
  - 1. Group the data at the location level
  - 2. Take the lag of the price (or sales, etc.)
  - 3. Wherever the difference between the price and the lagged price is > 0 (an increase), flag it as True, otherwise False
  - 4. Create a running counter that increments whenever the flag is True, so every run of non-increasing weeks shares one counter value
  - 5. Filter on Flag == False to keep only the weeks where the price did not increase
  - 6. Generate the total count of records grouped by location and counter
  - 7. For each location, sort in descending order of the above total count
  - 8. The first row for each location gives the maximum number of consecutive weeks of decline for the price
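Condensed into code, the eight steps above look roughly like the sketch below, applied to every location at once (the column names are the ones in the dataset loaded in Step 1; the intermediate names tmp, up, cntr, streaks and longest_per_location are just illustrative). The cells that follow develop the same logic one step at a time for 'Albany'.

# Steps 1-2: sort, lag the price within each location and drop the first week (no previous price)
tmp = df.sort_values(['location', 'week_ending']).copy()
tmp['lag'] = tmp.groupby('location')['avg_price_nonorg'].shift(1)
tmp = tmp[tmp['lag'].notna()].copy()

# Steps 3-4: flag price increases and build a running counter that moves only on an increase,
# so every run of non-increasing weeks shares one counter value
tmp['diff'] = tmp.groupby('location')['avg_price_nonorg'].diff()
tmp['up'] = (tmp['diff'].fillna(0) > 0).astype(int)
tmp['cntr'] = tmp.groupby('location')['up'].cumsum()

# Steps 5-8: keep the non-increasing weeks, count them per (location, counter)
# and keep the largest count for each location
streaks = (tmp[tmp['up'] == 0]
           .groupby(['location', 'cntr'])
           .size()
           .reset_index(name='Total_Count')
           .sort_values('Total_Count', ascending=False))
longest_per_location = streaks.groupby('location').head(1)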
In [308]:
# Arranging by Location and Week Ending
df_1=df[['location','week_ending','avg_price_nonorg']][df['location']=="Albany"].sort_values(['location','week_ending'])
df_1.head()
Out[308]:
| | location | week_ending | avg_price_nonorg |
---|---|---|---|
0 | Albany | 2017-01-02 | 1.47 |
45 | Albany | 2017-01-08 | 1.55 |
90 | Albany | 2017-01-15 | 1.55 |
135 | Albany | 2017-01-22 | 1.59 |
180 | Albany | 2017-01-29 | 1.31 |
In [311]:
# Group by at the location level and taking lag of Price
df_1['avg_price_nonorg_lag']=df_1.groupby('location').avg_price_nonorg.shift(1)
df_1.head()
Out[311]:
| | location | week_ending | avg_price_nonorg | avg_price_nonorg_lag |
---|---|---|---|---|
0 | Albany | 2017-01-02 | 1.47 | NaN |
45 | Albany | 2017-01-08 | 1.55 | 1.47 |
90 | Albany | 2017-01-15 | 1.55 | 1.55 |
135 | Albany | 2017-01-22 | 1.59 | 1.55 |
180 | Albany | 2017-01-29 | 1.31 | 1.59 |
In [312]:
# Removing the NaN values
df_2=df_1[~np.isnan(df_1['avg_price_nonorg_lag'])]
df_2.head()
Out[312]:
| | location | week_ending | avg_price_nonorg | avg_price_nonorg_lag |
---|---|---|---|---|
45 | Albany | 2017-01-08 | 1.55 | 1.47 |
90 | Albany | 2017-01-15 | 1.55 | 1.55 |
135 | Albany | 2017-01-22 | 1.59 | 1.55 |
180 | Albany | 2017-01-29 | 1.31 | 1.59 |
225 | Albany | 2017-02-05 | 1.49 | 1.31 |
In [313]:
# Creating the Price Diff Column at location level
df_2['Price_Diff']=df_2.groupby('location').avg_price_nonorg.diff()
df_2.head()
Out[313]:
| | location | week_ending | avg_price_nonorg | avg_price_nonorg_lag | Price_Diff |
---|---|---|---|---|---|
45 | Albany | 2017-01-08 | 1.55 | 1.47 | NaN |
90 | Albany | 2017-01-15 | 1.55 | 1.55 | 0.00 |
135 | Albany | 2017-01-22 | 1.59 | 1.55 | 0.04 |
180 | Albany | 2017-01-29 | 1.31 | 1.59 | -0.28 |
225 | Albany | 2017-02-05 | 1.49 | 1.31 | 0.18 |
In [316]:
# Creating a logical flag based on value of Price_Diff
Flag = df_2['Price_Diff'].fillna(0) > 0
In [324]:
# Creating a Counter which increments only when Flag = True
cntr = Flag.cumsum()
df_2['Flag']=Flag
df_2['Cntr']=cntr
df_2.head(10)
Out[324]:
| | location | week_ending | avg_price_nonorg | avg_price_nonorg_lag | Price_Diff | Flag | Cntr |
---|---|---|---|---|---|---|---|
45 | Albany | 2017-01-08 | 1.55 | 1.47 | NaN | False | 0 |
90 | Albany | 2017-01-15 | 1.55 | 1.55 | 0.00 | False | 0 |
135 | Albany | 2017-01-22 | 1.59 | 1.55 | 0.04 | True | 1 |
180 | Albany | 2017-01-29 | 1.31 | 1.59 | -0.28 | False | 1 |
225 | Albany | 2017-02-05 | 1.49 | 1.31 | 0.18 | True | 2 |
270 | Albany | 2017-02-12 | 1.42 | 1.49 | -0.07 | False | 2 |
315 | Albany | 2017-02-19 | 1.67 | 1.42 | 0.25 | True | 3 |
360 | Albany | 2017-02-26 | 1.40 | 1.67 | -0.27 | False | 3 |
405 | Albany | 2017-03-05 | 1.18 | 1.40 | -0.22 | False | 3 |
450 | Albany | 2017-03-12 | 1.54 | 1.18 | 0.36 | True | 4 |
In [325]:
# Filtering on Flag == False keeps the records where the price did not increase; consecutive such weeks share the same Cntr value
df_3=df_2[df_2['Flag']==False]
df_3.head(10)
Out[325]:
| | location | week_ending | avg_price_nonorg | avg_price_nonorg_lag | Price_Diff | Flag | Cntr |
---|---|---|---|---|---|---|---|
45 | Albany | 2017-01-08 | 1.55 | 1.47 | NaN | False | 0 |
90 | Albany | 2017-01-15 | 1.55 | 1.55 | 0.00 | False | 0 |
180 | Albany | 2017-01-29 | 1.31 | 1.59 | -0.28 | False | 1 |
270 | Albany | 2017-02-12 | 1.42 | 1.49 | -0.07 | False | 2 |
360 | Albany | 2017-02-26 | 1.40 | 1.67 | -0.27 | False | 3 |
405 | Albany | 2017-03-05 | 1.18 | 1.40 | -0.22 | False | 3 |
540 | Albany | 2017-03-26 | 1.16 | 1.60 | -0.44 | False | 5 |
630 | Albany | 2017-04-09 | 1.54 | 1.62 | -0.08 | False | 6 |
720 | Albany | 2017-04-23 | 1.34 | 1.62 | -0.28 | False | 7 |
765 | Albany | 2017-04-30 | 1.13 | 1.34 | -0.21 | False | 7 |
In [326]:
# Taking the total number of records by location and counter.
# Sorting by the total count brings the counter group with the
# most consecutive weeks of decline to the top
df_4=df_3[['location','Cntr']].groupby(['location','Cntr']).size().reset_index()
df_4.columns=['location','Cntr','Total_Count']
df_5=df_4.sort_values(['Total_Count'],ascending=False)
df_5.head()
Out[326]:
| | location | Cntr | Total_Count |
---|---|---|---|
59 | Albany | 99 | 4 |
39 | Albany | 68 | 4 |
8 | Albany | 10 | 4 |
43 | Albany | 74 | 3 |
56 | Albany | 95 | 3 |
In [328]:
# Get the weeks with the most consecutive decline. For Albany this is counter group 99
df_2[df_2['Cntr']==99]
Out[328]:
| | location | week_ending | avg_price_nonorg | avg_price_nonorg_lag | Price_Diff | Flag | Cntr |
---|---|---|---|---|---|---|---|
8506 | Albany | 2020-09-27 | 1.33 | 1.01 | 0.32 | True | 99 |
8505 | Albany | 2020-10-04 | 1.23 | 1.33 | -0.10 | False | 99 |
8688 | Albany | 2020-10-11 | 1.10 | 1.23 | -0.13 | False | 99 |
8687 | Albany | 2020-10-18 | 1.03 | 1.10 | -0.07 | False | 99 |
8686 | Albany | 2020-10-25 | 0.96 | 1.03 | -0.07 | False | 99 |
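The counter value 99 above was read off manually from the sorted df_5; the same lookup can be done programmatically along the lines below (df_2 and df_5 are the frames built above; best is just an illustrative name).

best = df_5.iloc[0]                        # top row after sorting by Total_Count
print(best['Cntr'], best['Total_Count'])   # counter group 99 with 4 non-increasing weeks for Albany
df_2[df_2['Cntr'] == best['Cntr']]         # the weeks that make up that longest run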