Price Elasticity Analysis in Python

Introduction¶

In this blog we will look at how to find the price elasticity of volume using python. We will use log regression models to istimate elasticity

Step 1: Importing the libraries¶

In [265]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sklearn as sk
import statsmodels.api as sm
from statsmodels.api import OLS

sklean package will be used to run the regression model

Step 2: Importing the dataset¶

The dataset we will use is the cheese dataset where volume, price and disp(a measure of marketing activity) has been specified for various retailers in the US

Link

In [267]:

os.getcwd()

Out[267]:

'C:\\Users\\shubh'

In my example, the cheese data is present in the above directory

In [269]:

df = pd.read_csv("cheese.csv")

In [271]:

df.head()

Out[271]:

	RETAILER	VOLUME	DISP	PRICE
0	LOS ANGELES - LUCKY	21374	0.162000	2.578460
1	LOS ANGELES - RALPHS	6427	0.124113	3.727867
2	LOS ANGELES - VONS	17302	0.102000	2.711421
3	CHICAGO - DOMINICK	13561	0.027591	2.651206
4	CHICAGO - JEWEL	42774	0.090613	1.986674

The description of the columns are as follows

RETAILER:List of Retailers
VOLUME:Number of units sold
DISP:A Measure of advertising display activity
PRICE:Unit Price in Dollars

For each retailer, the data captures weekly sales of cheese. Here the dependent variable y is the units sold or VOLUME and PRICE is independent variable x. For sake of explanation, lets look at regression model for only one of the retailers such as ATLANTA - KROGER CO.

Step 3: Relationship between Price and Volume for each retailer¶

In theory, as price decreases volume increases.Since our dataset is at a retailer level, hence it makes sense to look at the correlation between PRICE and VOLUME at a retailer level first.

Lets get the correlation coefficient of PRICE and VOLUME for individual retailers

In [273]:

df.cor = df.groupby('RETAILER')[['VOLUME','PRICE']].corr()
df.cor2 = df.cor.reset_index()
df.cor

C:\Users\shubh\AppData\Local\Temp\ipykernel_18552\2520918798.py:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  df.cor = df.groupby('RETAILER')[['VOLUME','PRICE']].corr()
C:\Users\shubh\AppData\Local\Temp\ipykernel_18552\2520918798.py:2: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  df.cor2 = df.cor.reset_index()

Out[273]:

		VOLUME	PRICE
RETAILER
ALBANY,NY - PRICE CHOPPER	VOLUME	1.000000	-0.835684
ALBANY,NY - PRICE CHOPPER	PRICE	-0.835684	1.000000
ATLANTA - KROGER CO	VOLUME	1.000000	-0.860380
ATLANTA - KROGER CO	PRICE	-0.860380	1.000000
ATLANTA - WINN DIXIE	VOLUME	1.000000	-0.300265
...	...	...	...
TAMPA/ST. PETE - PUBLIX	PRICE	-0.728956	1.000000
TAMPA/ST. PETE - WINN DIXIE	VOLUME	1.000000	-0.620235
TAMPA/ST. PETE - WINN DIXIE	PRICE	-0.620235	1.000000
WICHITA - DILLON COMPANIES	VOLUME	1.000000	-0.607738
WICHITA - DILLON COMPANIES	PRICE	-0.607738	1.000000

176 rows × 2 columns

We can see that the output is a little jumbled.For every retailer, it sort of gives a small correlation matrix. So what we need to do here is that we will select every other row using iloc and get the Retailer and Price columns

In [275]:

df.cor3 = df.cor2.iloc[::2][['RETAILER','PRICE']]
df.cor3

C:\Users\shubh\AppData\Local\Temp\ipykernel_18552\1296015483.py:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  df.cor3 = df.cor2.iloc[::2][['RETAILER','PRICE']]

Out[275]:

	RETAILER	PRICE
0	ALBANY,NY - PRICE CHOPPER	-0.835684
2	ATLANTA - KROGER CO	-0.860380
4	ATLANTA - WINN DIXIE	-0.300265
6	BALTI/WASH - GIANT FOOD INC	-0.909117
8	BALTI/WASH - SAFEWAY	-0.648790
...	...	...
166	SYRACUSE - WEGMANS	-0.881431
168	TAMPA/ST. PETE - KASH N KARRY	-0.553440
170	TAMPA/ST. PETE - PUBLIX	-0.728956
172	TAMPA/ST. PETE - WINN DIXIE	-0.620235
174	WICHITA - DILLON COMPANIES	-0.607738

88 rows × 2 columns

Now we will rename the price column to Correlation

In [277]:

new_nm = {'PRICE': 'Correlation_Coefficient'}
df.cor3.rename(columns=new_nm,inplace=True)
df.cor3.sort_values('Correlation_Coefficient', ascending = False)

Out[277]:

	RETAILER	Correlation_Coefficient
152	SOUTH CAROLINA - BI LO	0.405942
28	CHARLOTTE - BI LO	0.378682
142	SALT LAKE CITY - SMITHS FOOD	-0.099931
74	JACKSONVILLE,FL - FOOD LION	-0.175210
154	SOUTH CAROLINA - FOOD LION	-0.182052
...	...	...
40	CHICAGO - OMNI	-0.903185
6	BALTI/WASH - GIANT FOOD INC	-0.909117
36	CHICAGO - DOMINICK	-0.927275
24	BUFFALO/ROCHESTER - TOPS MARKETS	-0.930342
140	SACRAMENTO - RALEYS	-0.937228

88 rows × 2 columns

Correlation coefficient for two retailers are positive.This is a skewed realtionship.Normally price increase leads to decline in volume.

Plotting the bar chart for correlation¶

In [279]:

plt.bar(df.cor3['RETAILER'], df.cor3['Correlation_Coefficient'], color='skyblue')
# Adding labels and title
plt.xticks([])
plt.title('Correlation Chart')

Out[279]:

Text(0.5, 1.0, 'Correlation Chart')

No description has been provided for this image

Step 4: Fitting a Sactterplot between Price and Volume for each retailer¶

Lets first remove the retailers with positive value of correlation coefficient

In [281]:

reailer_nm = df.cor3[df.cor3['Correlation_Coefficient'] < 0]['RETAILER']
reailer_nm_list = list(reailer_nm)
reailer_nm_list[1:5]

Out[281]:

['ATLANTA - KROGER CO',
 'ATLANTA - WINN DIXIE',
 'BALTI/WASH - GIANT FOOD INC',
 'BALTI/WASH - SAFEWAY']

In [283]:

df2 = df [df['RETAILER'].isin(reailer_nm_list)]
df2

Out[283]:

	RETAILER	VOLUME	DISP	PRICE
0	LOS ANGELES - LUCKY	21374	0.162000	2.578460
1	LOS ANGELES - RALPHS	6427	0.124113	3.727867
2	LOS ANGELES - VONS	17302	0.102000	2.711421
3	CHICAGO - DOMINICK	13561	0.027591	2.651206
4	CHICAGO - JEWEL	42774	0.090613	1.986674
...	...	...	...	...
5550	SYRACUSE - WEGMANS	751	0.280321	3.338216
5551	NEW ENGLAND (NORTH) - SHOP N SAVE	5031	0.094000	2.379845
5552	BIRMINGHAM/MONTGOM - KROGER	1400	0.000000	2.654286
5553	NEW YORK (NEW) - A & P	3448	0.043846	3.524942
5554	ROANOKE (NEW) - KROGER CO	3791	0.009648	2.466631

5433 rows × 4 columns

Total reailers in original data frame

In [285]:

df['RETAILER'].nunique()

Out[285]:

Total reailers in new data frame

In [155]:

df2['RETAILER'].nunique()

Out[155]:

So the two retailers with positive value of correlation coefficient have been removed

Scatterplot between price and volume for one of the retailers¶

In [288]:

sns.set_style("darkgrid")
sns.lmplot(
    x="PRICE",
    y="VOLUME",
    data=df2[df2['RETAILER'].isin(['SACRAMENTO - RALEYS'])]
)
plt.show()

In [290]:

df2['Log_PRICE'] = np.log(df2['PRICE'])
df2.head()

C:\Users\shubh\AppData\Local\Temp\ipykernel_18552\2844843800.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2['Log_PRICE'] = np.log(df2['PRICE'])

Out[290]:

	RETAILER	VOLUME	DISP	PRICE	Log_PRICE
0	LOS ANGELES - LUCKY	21374	0.162000	2.578460	0.947192
1	LOS ANGELES - RALPHS	6427	0.124113	3.727867	1.315836
2	LOS ANGELES - VONS	17302	0.102000	2.711421	0.997473
3	CHICAGO - DOMINICK	13561	0.027591	2.651206	0.975015
4	CHICAGO - JEWEL	42774	0.090613	1.986674	0.686462

In [292]:

sns.set_style("darkgrid")
sns.regplot(
    x="PRICE",
    y="VOLUME",
    data=df2[df2['RETAILER'].isin(['SACRAMENTO - RALEYS'])],
    order=2
)
plt.show()

In [294]:

sns.regplot(
    x="PRICE",
    y="VOLUME",
    data=df2[df2['RETAILER'].isin(['SACRAMENTO - RALEYS'])],
    lowess=True
)
plt.show()

From the above two charts, we can see that there is a non-linear relationship between price and volume.Meaning that for different price ranges, the decrease in volume due to price change might be different.For instance, in llower price range, the impact might be drastic while for higher price range the impact might not be that steep

Step 5: Fitting a regression model between Price and Volume for each retailer¶

In [296]:

from sklearn.linear_model import LinearRegression

In [298]:

mod1 = LinearRegression()

In [300]:

df_model = df2[df2['RETAILER'].isin(['SACRAMENTO - RALEYS'])]
df_model.head()

Out[300]:

	RETAILER	VOLUME	DISP	PRICE	Log_PRICE
31	SACRAMENTO - RALEYS	1538	0.0870	3.719766	1.313661
118	SACRAMENTO - RALEYS	1302	0.0870	3.777266	1.329000
205	SACRAMENTO - RALEYS	1551	0.2655	3.823340	1.341124
292	SACRAMENTO - RALEYS	1506	0.0870	3.839973	1.345465
379	SACRAMENTO - RALEYS	1643	0.1740	3.853317	1.348934

In [302]:

X_Model = df_model[['Log_PRICE']]
X_Model.head()

Out[302]:

	Log_PRICE
31	1.313661
118	1.329000
205	1.341124
292	1.345465
379	1.348934

In [304]:

X_Model.shape

Out[304]:

(68, 1)

In [306]:

Y_Model = df_model[['VOLUME']]
Y_Model.head()

Out[306]:

	VOLUME
31	1538
118	1302
205	1551
292	1506
379	1643

In [308]:

mod1.fit(X_Model, Y_Model)

Out[308]:

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

In [310]:

print('Coefficients: ', mod1.coef_)

Coefficients:  [[-5907.86141214]]

In [312]:

print(mod1)

LinearRegression()

Regression Summary using stats model¶

In [314]:

#add constant to predictor variables
X_Model = sm.add_constant(X_Model)

#fit linear regression model
model = sm.OLS(Y_Model, X_Model).fit()

#view model summary
print(model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 VOLUME   R-squared:                       0.885
Model:                            OLS   Adj. R-squared:                  0.883
Method:                 Least Squares   F-statistic:                     507.8
Date:                Mon, 28 Apr 2025   Prob (F-statistic):           1.05e-32
Time:                        21:49:08   Log-Likelihood:                -487.26
No. Observations:                  68   AIC:                             978.5
Df Residuals:                      66   BIC:                             983.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       9430.6101    328.517     28.707      0.000    8774.704    1.01e+04
Log_PRICE  -5907.8614    262.172    -22.534      0.000   -6431.305   -5384.418
==============================================================================
Omnibus:                        4.406   Durbin-Watson:                   1.870
Prob(Omnibus):                  0.110   Jarque-Bera (JB):                3.857
Skew:                          -0.358   Prob(JB):                        0.145
Kurtosis:                       3.922   Cond. No.                         17.4
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [ ]:

# Extracting Rsquare, 
# Prob of overall model, 
# price coefficient and 
# p value of price cofficient

In [341]:

coff = list(model.params)

In [347]:

beta = coff[1]
beta

Out[347]:

-5907.86141213528

In [351]:

# p values
p_val = list(model.pvalues) 
p_val

Out[351]:

[5.256819652752006e-39, 1.0524954557400205e-32]

In [355]:

p_val_coff = p_val[1]
p_val_coff

Out[355]:

1.0524954557400205e-32

In [359]:

# R qaure
rquare = model.rsquared
rquare

Out[359]:

0.8849762419678276

In [361]:

# Overall p vale
model.f_pvalue

Out[361]:

1.052495455740018e-32

Step 6: Interpretation¶

For 1% increase in price, the volume reduces by -5907/100 (-59 units)

Step 7: Implementing the model for all retailers¶

In [321]:

# List of retailers
reailer_nm_list[1:5]

Out[321]:

['ATLANTA - KROGER CO',
 'ATLANTA - WINN DIXIE',
 'BALTI/WASH - GIANT FOOD INC',
 'BALTI/WASH - SAFEWAY']

In [323]:

len(reailer_nm_list)

Out[323]:

In [465]:

ls=[]
for i in reailer_nm_list:
    #print(i,sep=',')
    mod1 = LinearRegression()
    df_model = df2[df2['RETAILER'].isin([i])]
    X_Model = df_model[['Log_PRICE']]
    Y_Model = df_model[['VOLUME']]
    
    #add constant to predictor variables
    X_Model = sm.add_constant(X_Model)
    
    #fit linear regression model
    model = sm.OLS(Y_Model, X_Model).fit()
    
    #view model summary
    # print(model.summary())
    
    # Beta
    coff = list(model.params)
    beta = round(coff[1],2)
    
    # p values of coff
    p_val = list(model.pvalues) 
    p_val_coff = round(p_val[1],2)
    
    # R quare
    rquare = round(model.rsquared,2)
    
    # Overall p value
    p_val_overall = round(model.f_pvalue,2)

    # regression summaru
    summary_df = pd.DataFrame(np.array([i,p_val_overall,rquare,beta,p_val_coff])).T

    # Adding to the list
    ls.append(summary_df)

In [467]:

ls[0:3]

Out[467]:

[                           0    1     2         3    4
 0  ALBANY,NY - PRICE CHOPPER  0.0  0.73  -9808.87  0.0,
                      0    1     2         3    4
 0  ATLANTA - KROGER CO  0.0  0.75  -9497.04  0.0,
                       0     1     2         3     4
 0  ATLANTA - WINN DIXIE  0.02  0.09  -6857.56  0.02]

In [469]:

final_df = pd.concat(ls)
final_df.head()

Out[469]:

0	1	2	3	4
ALBANY,NY - PRICE CHOPPER	0.0	0.73	-9808.87	0.0
ATLANTA - KROGER CO	0.0	0.75	-9497.04	0.0
ATLANTA - WINN DIXIE	0.02	0.09	-6857.56	0.02
BALTI/WASH - GIANT FOOD INC	0.0	0.83	-24919.37	0.0
BALTI/WASH - SAFEWAY	0.0	0.42	-11629.34	0.0

In [439]:

colnames = ['Retailer','P_Value_Overall','R_Square','Beta_Coefficient','P_Value_Price']
colnames

Out[439]:

['Retailer',
 'P_Value_Overall',
 'R_Square',
 'Beta_Coefficient',
 'P_Value_Price']

In [485]:

final_df.columns = colnames
final_df.sort_values('R_Square', ascending = False).head()

Out[485]:

Retailer	R_Square	Beta_Coefficient
CHICAGO - DOMINICK	0.91	-81594.61
BUFFALO/ROCHESTER - TOPS MARKETS	0.9	-59225.81
SACRAMENTO - RALEYS	0.88	-5907.86
SYRACUSE - WEGMANS	0.83	-11724.76
BALTI/WASH - GIANT FOOD INC	0.83	-24919.37

Machine Learning Made Easy

Monday, April 28, 2025

Price Elasticity Model in Python