Introduction¶
In this blog we will look at how to find the price elasticity of volume using python. We will use log regression models to istimate elasticity
Step 1: Importing the libraries¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sklearn as sk
import statsmodels.api as sm
from statsmodels.api import OLS
sklean package will be used to run the regression model
Step 2: Importing the dataset¶
The dataset we will use is the cheese dataset where volume, price and disp(a measure of marketing activity) has been specified for various retailers in the US
os.getcwd()
'C:\\Users\\shubh'
In my example, the cheese data is present in the above directory
df = pd.read_csv("cheese.csv")
df.head()
RETAILER | VOLUME | DISP | PRICE | |
---|---|---|---|---|
0 | LOS ANGELES - LUCKY | 21374 | 0.162000 | 2.578460 |
1 | LOS ANGELES - RALPHS | 6427 | 0.124113 | 3.727867 |
2 | LOS ANGELES - VONS | 17302 | 0.102000 | 2.711421 |
3 | CHICAGO - DOMINICK | 13561 | 0.027591 | 2.651206 |
4 | CHICAGO - JEWEL | 42774 | 0.090613 | 1.986674 |
The description of the columns are as follows
- RETAILER:List of Retailers
- VOLUME:Number of units sold
- DISP:A Measure of advertising display activity
- PRICE:Unit Price in Dollars
For each retailer, the data captures weekly sales of cheese. Here the dependent variable y is the units sold or VOLUME and PRICE is independent variable x. For sake of explanation, lets look at regression model for only one of the retailers such as ATLANTA - KROGER CO.
Step 3: Relationship between Price and Volume for each retailer¶
In theory, as price decreases volume increases.Since our dataset is at a retailer level, hence it makes sense to look at the correlation between PRICE and VOLUME at a retailer level first.
Lets get the correlation coefficient of PRICE and VOLUME for individual retailers
df.cor = df.groupby('RETAILER')[['VOLUME','PRICE']].corr()
df.cor2 = df.cor.reset_index()
df.cor
C:\Users\shubh\AppData\Local\Temp\ipykernel_18552\2520918798.py:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access df.cor = df.groupby('RETAILER')[['VOLUME','PRICE']].corr() C:\Users\shubh\AppData\Local\Temp\ipykernel_18552\2520918798.py:2: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access df.cor2 = df.cor.reset_index()
VOLUME | PRICE | ||
---|---|---|---|
RETAILER | |||
ALBANY,NY - PRICE CHOPPER | VOLUME | 1.000000 | -0.835684 |
PRICE | -0.835684 | 1.000000 | |
ATLANTA - KROGER CO | VOLUME | 1.000000 | -0.860380 |
PRICE | -0.860380 | 1.000000 | |
ATLANTA - WINN DIXIE | VOLUME | 1.000000 | -0.300265 |
... | ... | ... | ... |
TAMPA/ST. PETE - PUBLIX | PRICE | -0.728956 | 1.000000 |
TAMPA/ST. PETE - WINN DIXIE | VOLUME | 1.000000 | -0.620235 |
PRICE | -0.620235 | 1.000000 | |
WICHITA - DILLON COMPANIES | VOLUME | 1.000000 | -0.607738 |
PRICE | -0.607738 | 1.000000 |
176 rows × 2 columns
We can see that the output is a little jumbled.For every retailer, it sort of gives a small correlation matrix. So what we need to do here is that we will select every other row using iloc and get the Retailer and Price columns
df.cor3 = df.cor2.iloc[::2][['RETAILER','PRICE']]
df.cor3
C:\Users\shubh\AppData\Local\Temp\ipykernel_18552\1296015483.py:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access df.cor3 = df.cor2.iloc[::2][['RETAILER','PRICE']]
RETAILER | PRICE | |
---|---|---|
0 | ALBANY,NY - PRICE CHOPPER | -0.835684 |
2 | ATLANTA - KROGER CO | -0.860380 |
4 | ATLANTA - WINN DIXIE | -0.300265 |
6 | BALTI/WASH - GIANT FOOD INC | -0.909117 |
8 | BALTI/WASH - SAFEWAY | -0.648790 |
... | ... | ... |
166 | SYRACUSE - WEGMANS | -0.881431 |
168 | TAMPA/ST. PETE - KASH N KARRY | -0.553440 |
170 | TAMPA/ST. PETE - PUBLIX | -0.728956 |
172 | TAMPA/ST. PETE - WINN DIXIE | -0.620235 |
174 | WICHITA - DILLON COMPANIES | -0.607738 |
88 rows × 2 columns
Now we will rename the price column to Correlation
new_nm = {'PRICE': 'Correlation_Coefficient'}
df.cor3.rename(columns=new_nm,inplace=True)
df.cor3.sort_values('Correlation_Coefficient', ascending = False)
RETAILER | Correlation_Coefficient | |
---|---|---|
152 | SOUTH CAROLINA - BI LO | 0.405942 |
28 | CHARLOTTE - BI LO | 0.378682 |
142 | SALT LAKE CITY - SMITHS FOOD | -0.099931 |
74 | JACKSONVILLE,FL - FOOD LION | -0.175210 |
154 | SOUTH CAROLINA - FOOD LION | -0.182052 |
... | ... | ... |
40 | CHICAGO - OMNI | -0.903185 |
6 | BALTI/WASH - GIANT FOOD INC | -0.909117 |
36 | CHICAGO - DOMINICK | -0.927275 |
24 | BUFFALO/ROCHESTER - TOPS MARKETS | -0.930342 |
140 | SACRAMENTO - RALEYS | -0.937228 |
88 rows × 2 columns
Correlation coefficient for two retailers are positive.This is a skewed realtionship.Normally price increase leads to decline in volume.
Plotting the bar chart for correlation¶
plt.bar(df.cor3['RETAILER'], df.cor3['Correlation_Coefficient'], color='skyblue')
# Adding labels and title
plt.xticks([])
plt.title('Correlation Chart')
Text(0.5, 1.0, 'Correlation Chart')
Step 4: Fitting a Sactterplot between Price and Volume for each retailer¶
Lets first remove the retailers with positive value of correlation coefficient
reailer_nm = df.cor3[df.cor3['Correlation_Coefficient'] < 0]['RETAILER']
reailer_nm_list = list(reailer_nm)
reailer_nm_list[1:5]
['ATLANTA - KROGER CO', 'ATLANTA - WINN DIXIE', 'BALTI/WASH - GIANT FOOD INC', 'BALTI/WASH - SAFEWAY']
df2 = df [df['RETAILER'].isin(reailer_nm_list)]
df2
RETAILER | VOLUME | DISP | PRICE | |
---|---|---|---|---|
0 | LOS ANGELES - LUCKY | 21374 | 0.162000 | 2.578460 |
1 | LOS ANGELES - RALPHS | 6427 | 0.124113 | 3.727867 |
2 | LOS ANGELES - VONS | 17302 | 0.102000 | 2.711421 |
3 | CHICAGO - DOMINICK | 13561 | 0.027591 | 2.651206 |
4 | CHICAGO - JEWEL | 42774 | 0.090613 | 1.986674 |
... | ... | ... | ... | ... |
5550 | SYRACUSE - WEGMANS | 751 | 0.280321 | 3.338216 |
5551 | NEW ENGLAND (NORTH) - SHOP N SAVE | 5031 | 0.094000 | 2.379845 |
5552 | BIRMINGHAM/MONTGOM - KROGER | 1400 | 0.000000 | 2.654286 |
5553 | NEW YORK (NEW) - A & P | 3448 | 0.043846 | 3.524942 |
5554 | ROANOKE (NEW) - KROGER CO | 3791 | 0.009648 | 2.466631 |
5433 rows × 4 columns
Total reailers in original data frame
df['RETAILER'].nunique()
88
Total reailers in new data frame
df2['RETAILER'].nunique()
86
So the two retailers with positive value of correlation coefficient have been removed
Scatterplot between price and volume for one of the retailers¶
sns.set_style("darkgrid")
sns.lmplot(
x="PRICE",
y="VOLUME",
data=df2[df2['RETAILER'].isin(['SACRAMENTO - RALEYS'])]
)
plt.show()
df2['Log_PRICE'] = np.log(df2['PRICE'])
df2.head()
C:\Users\shubh\AppData\Local\Temp\ipykernel_18552\2844843800.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df2['Log_PRICE'] = np.log(df2['PRICE'])
RETAILER | VOLUME | DISP | PRICE | Log_PRICE | |
---|---|---|---|---|---|
0 | LOS ANGELES - LUCKY | 21374 | 0.162000 | 2.578460 | 0.947192 |
1 | LOS ANGELES - RALPHS | 6427 | 0.124113 | 3.727867 | 1.315836 |
2 | LOS ANGELES - VONS | 17302 | 0.102000 | 2.711421 | 0.997473 |
3 | CHICAGO - DOMINICK | 13561 | 0.027591 | 2.651206 | 0.975015 |
4 | CHICAGO - JEWEL | 42774 | 0.090613 | 1.986674 | 0.686462 |
sns.set_style("darkgrid")
sns.regplot(
x="PRICE",
y="VOLUME",
data=df2[df2['RETAILER'].isin(['SACRAMENTO - RALEYS'])],
order=2
)
plt.show()
sns.regplot(
x="PRICE",
y="VOLUME",
data=df2[df2['RETAILER'].isin(['SACRAMENTO - RALEYS'])],
lowess=True
)
plt.show()
From the above two charts, we can see that there is a non-linear relationship between price and volume.Meaning that for different price ranges, the decrease in volume due to price change might be different.For instance, in llower price range, the impact might be drastic while for higher price range the impact might not be that steep
Step 5: Fitting a regression model between Price and Volume for each retailer¶
from sklearn.linear_model import LinearRegression
mod1 = LinearRegression()
df_model = df2[df2['RETAILER'].isin(['SACRAMENTO - RALEYS'])]
df_model.head()
RETAILER | VOLUME | DISP | PRICE | Log_PRICE | |
---|---|---|---|---|---|
31 | SACRAMENTO - RALEYS | 1538 | 0.0870 | 3.719766 | 1.313661 |
118 | SACRAMENTO - RALEYS | 1302 | 0.0870 | 3.777266 | 1.329000 |
205 | SACRAMENTO - RALEYS | 1551 | 0.2655 | 3.823340 | 1.341124 |
292 | SACRAMENTO - RALEYS | 1506 | 0.0870 | 3.839973 | 1.345465 |
379 | SACRAMENTO - RALEYS | 1643 | 0.1740 | 3.853317 | 1.348934 |
X_Model = df_model[['Log_PRICE']]
X_Model.head()
Log_PRICE | |
---|---|
31 | 1.313661 |
118 | 1.329000 |
205 | 1.341124 |
292 | 1.345465 |
379 | 1.348934 |
X_Model.shape
(68, 1)
Y_Model = df_model[['VOLUME']]
Y_Model.head()
VOLUME | |
---|---|
31 | 1538 |
118 | 1302 |
205 | 1551 |
292 | 1506 |
379 | 1643 |
mod1.fit(X_Model, Y_Model)
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
print('Coefficients: ', mod1.coef_)
Coefficients: [[-5907.86141214]]
print(mod1)
LinearRegression()
Regression Summary using stats model¶
#add constant to predictor variables
X_Model = sm.add_constant(X_Model)
#fit linear regression model
model = sm.OLS(Y_Model, X_Model).fit()
#view model summary
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: VOLUME R-squared: 0.885 Model: OLS Adj. R-squared: 0.883 Method: Least Squares F-statistic: 507.8 Date: Mon, 28 Apr 2025 Prob (F-statistic): 1.05e-32 Time: 21:49:08 Log-Likelihood: -487.26 No. Observations: 68 AIC: 978.5 Df Residuals: 66 BIC: 983.0 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 9430.6101 328.517 28.707 0.000 8774.704 1.01e+04 Log_PRICE -5907.8614 262.172 -22.534 0.000 -6431.305 -5384.418 ============================================================================== Omnibus: 4.406 Durbin-Watson: 1.870 Prob(Omnibus): 0.110 Jarque-Bera (JB): 3.857 Skew: -0.358 Prob(JB): 0.145 Kurtosis: 3.922 Cond. No. 17.4 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Extracting Rsquare,
# Prob of overall model,
# price coefficient and
# p value of price cofficient
coff = list(model.params)
beta = coff[1]
beta
-5907.86141213528
# p values
p_val = list(model.pvalues)
p_val
[5.256819652752006e-39, 1.0524954557400205e-32]
p_val_coff = p_val[1]
p_val_coff
1.0524954557400205e-32
# R qaure
rquare = model.rsquared
rquare
0.8849762419678276
# Overall p vale
model.f_pvalue
1.052495455740018e-32
Step 6: Interpretation¶
For 1% increase in price, the volume reduces by -5907/100 (-59 units)
Step 7: Implementing the model for all retailers¶
# List of retailers
reailer_nm_list[1:5]
['ATLANTA - KROGER CO', 'ATLANTA - WINN DIXIE', 'BALTI/WASH - GIANT FOOD INC', 'BALTI/WASH - SAFEWAY']
len(reailer_nm_list)
86
ls=[]
for i in reailer_nm_list:
#print(i,sep=',')
mod1 = LinearRegression()
df_model = df2[df2['RETAILER'].isin([i])]
X_Model = df_model[['Log_PRICE']]
Y_Model = df_model[['VOLUME']]
#add constant to predictor variables
X_Model = sm.add_constant(X_Model)
#fit linear regression model
model = sm.OLS(Y_Model, X_Model).fit()
#view model summary
# print(model.summary())
# Beta
coff = list(model.params)
beta = round(coff[1],2)
# p values of coff
p_val = list(model.pvalues)
p_val_coff = round(p_val[1],2)
# R quare
rquare = round(model.rsquared,2)
# Overall p value
p_val_overall = round(model.f_pvalue,2)
# regression summaru
summary_df = pd.DataFrame(np.array([i,p_val_overall,rquare,beta,p_val_coff])).T
# Adding to the list
ls.append(summary_df)
ls[0:3]
[ 0 1 2 3 4 0 ALBANY,NY - PRICE CHOPPER 0.0 0.73 -9808.87 0.0, 0 1 2 3 4 0 ATLANTA - KROGER CO 0.0 0.75 -9497.04 0.0, 0 1 2 3 4 0 ATLANTA - WINN DIXIE 0.02 0.09 -6857.56 0.02]
final_df = pd.concat(ls)
final_df.head()
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | ALBANY,NY - PRICE CHOPPER | 0.0 | 0.73 | -9808.87 | 0.0 |
0 | ATLANTA - KROGER CO | 0.0 | 0.75 | -9497.04 | 0.0 |
0 | ATLANTA - WINN DIXIE | 0.02 | 0.09 | -6857.56 | 0.02 |
0 | BALTI/WASH - GIANT FOOD INC | 0.0 | 0.83 | -24919.37 | 0.0 |
0 | BALTI/WASH - SAFEWAY | 0.0 | 0.42 | -11629.34 | 0.0 |
colnames = ['Retailer','P_Value_Overall','R_Square','Beta_Coefficient','P_Value_Price']
colnames
['Retailer', 'P_Value_Overall', 'R_Square', 'Beta_Coefficient', 'P_Value_Price']
final_df.columns = colnames
final_df.sort_values('R_Square', ascending = False).head()
Retailer | P_Value_Overall | R_Square | Beta_Coefficient | P_Value_Price | |
---|---|---|---|---|---|
0 | CHICAGO - DOMINICK | 0.0 | 0.91 | -81594.61 | 0.0 |
0 | BUFFALO/ROCHESTER - TOPS MARKETS | 0.0 | 0.9 | -59225.81 | 0.0 |
0 | SACRAMENTO - RALEYS | 0.0 | 0.88 | -5907.86 | 0.0 |
0 | SYRACUSE - WEGMANS | 0.0 | 0.83 | -11724.76 | 0.0 |
0 | BALTI/WASH - GIANT FOOD INC | 0.0 | 0.83 | -24919.37 | 0.0 |