Linear Regression
Parag Verma
Introduction
Regression is one of the most basic statistical methods for deriving the relationship between variables. It helps us quantify, for example, the impact of a price increase on the sales of a product, or how a product category affects gross margin. In this blog we will look at an econometric problem and a step-by-step approach to executing linear regression in R.
Step 1: Read Data
package.name<-c("dplyr","tidyr","stats","Ecdat","ggplot2","devtools","corrplot","car")
for(i in package.name){
  if(!require(i,character.only = T)){
    install.packages(i)
  }
  library(i,character.only = T)
}
# car is needed later for the vif function
# stats library has the regression functions
# Ecdat package has the cigarette consumption data
data(Cigar)
df<-Cigar
head(df)
state year price pop pop16 cpi ndi sales pimin
1 1 63 28.6 3383 2236.5 30.6 1558.305 93.9 26.1
2 1 64 29.8 3431 2276.7 31.0 1684.073 95.4 27.5
3 1 65 29.8 3486 2327.5 31.5 1809.842 98.5 28.9
4 1 66 31.5 3524 2369.7 32.4 1915.160 96.4 29.5
5 1 67 31.6 3533 2393.7 33.4 2023.546 95.5 29.6
6 1 68 35.6 3522 2405.2 34.8 2202.486 88.4 32.0
It is a panel of 46 US states observed annually from 1963 to 1992, giving 1,380 rows in total. A brief description of the columns is given below:
- state: state abbreviation
- price: price per pack of cigarettes
- pop: population
- pop16: population above the age of 16
- cpi: consumer price index (1983=100)
- ndi: per capita disposable income
- sales: cigarette sales in packs per capita
- pimin: minimum price per pack of cigarettes in adjoining states
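A quick structural check confirms the panel layout described above (a minimal sketch; it only uses the df created earlier):
# Verify the panel structure: 46 states observed over 1963-1992
length(unique(df$state)) # number of states
range(df$year)           # 63 to 92, i.e. 1963 to 1992
nrow(df)                 # 46 states x 30 years = 1380 rows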
Step 2: Scatterplot
In this exercise, we will try to find a relationship between sales and price, pop16, cpi and ndi using scatterplots.
interim.df<-df%>%
select(sales,price,pop16,cpi,ndi)
nm<-expand.grid(colnames(interim.df),colnames(interim.df),stringsAsFactors = F)%>%
filter(Var1!=Var2,Var1=="sales")
invisible(lapply(nm[["Var2"]],function(z){
  # Plot sales (y) against each predictor (x), matching the axis labels below
  plot(interim.df[[z]], interim.df[["sales"]], main="Scatterplot Example",
       xlab=toString(z), ylab="sales", pch=19)
  abline(lm(interim.df[["sales"]]~interim.df[[z]]), col="red") # regression line (y~x)
}))
(Four scatterplots are produced, one for each of price, pop16, cpi and ndi against sales, each with a red regression line.)
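Since ggplot2 is already loaded, the same comparison can also be drawn as a single faceted chart. This is just a sketch of an alternative, and it assumes a tidyr version recent enough to provide pivot_longer:
interim.df%>%
  pivot_longer(-sales,names_to = "variable",values_to = "value")%>%
  ggplot(aes(x = value,y = sales))+
  geom_point(size = 1)+
  geom_smooth(method = "lm",se = FALSE,colour = "red")+ # same y~x regression line
  facet_wrap(~variable,scales = "free_x")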
Let's now calculate the correlation between the continuous variables.
Step 3: Correlation between Variables
cor.df<-df%>%
select(sales,price,pop16,cpi,ndi)%>%
cor()%>%
round(2)
# Visualize Correlation Matrix
corrplot(cor.df, order = "FPC", method = "circle", type = "lower", tl.cex = 0.7, tl.col = rgb(0, 0, 0),
addrect = 2,col=c("black", "orange"))
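The numeric matrix itself can be printed alongside the plot (using the cor.df object computed above):
# Rounded correlation matrix
cor.df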
We can infer the following from the correlation matrix and plot:
- sales has a negative correlation with the rest of the variables. This is quite evident from the scatterplots as well
- price has a high correlation with cpi and ndi
- cpi has a high correlation with ndi
Since these correlations are high, we need to check for multicollinearity. If multicollinearity is present and not addressed, the coefficient estimates become unstable and their standard errors inflated, making the regression results unreliable.
Step 4: Splitting into Test and Train Dataset
set.seed(1)
smpl<-sample(nrow(interim.df),0.70*nrow(interim.df),F)
train.df<-interim.df[smpl,]
test.df<-interim.df[-smpl,]
# Train Data set
nrow(train.df)
[1] 965
# Test Data set
nrow(test.df)
[1] 415
Step 5: Checking for Multicollinearity
In order to check for multicollinearity, we regress sales on all the predictors and compute the Variance Inflation Factor (VIF) for each of them:
m1<-lm(sales~.,train.df)
car::vif(m1)
price pop16 cpi ndi
12.498482 1.100044 14.685892 14.273327
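The VIF of a predictor is 1/(1 - R2), where R2 comes from regressing that predictor on all the remaining ones. As a sanity check, the value reported for cpi can be reproduced manually (a minimal sketch using the objects defined above):
# VIF(cpi) = 1/(1 - R^2) of cpi regressed on the other predictors
r2.cpi<-summary(lm(cpi~price+pop16+ndi,data = train.df))$r.squared
1/(1 - r2.cpi) # should match car::vif(m1)["cpi"]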
We can see that price, cpi and ndi all have VIF values well above the conventional cutoff of 5, with cpi and ndi the highest. Let's remove cpi and ndi and run the regression again.
interim.df2<-train.df%>%
select(-cpi,-ndi)
m2<-lm(sales~.,interim.df2)
car::vif(m2)
price pop16
1.020151 1.020151
The VIF values are acceptable now as they are less than 5.
Let's look at the regression summary.
summary(m2)
Call:
lm(formula = sales ~ ., data = interim.df2)
Residuals:
Min 1Q Median 3Q Max
-73.229 -15.487 -3.058 9.740 162.202
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.428e+02 1.967e+00 72.612 < 2e-16 ***
price -2.330e-01 2.267e-02 -10.277 < 2e-16 ***
pop16 -7.314e-04 2.596e-04 -2.817 0.00494 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 29.99 on 962 degrees of freedom
Multiple R-squared: 0.1143, Adjusted R-squared: 0.1125
F-statistic: 62.07 on 2 and 962 DF, p-value: < 2.2e-16
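The coefficients can be read off directly: holding pop16 fixed, a one-unit increase in price is associated with roughly 0.23 fewer packs sold per capita. Confidence intervals for these estimates can be obtained with confint (shown here as a sketch):
# 95% confidence intervals for the coefficients of m2
confint(m2,level = 0.95)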
Step 6: Increasing R2
In the above regression summary, we can see that R2 and adjusted R2 are around 11%. As far as the fit is concerned, this is a small number. Let's now include the state variable in the model and see if it improves R2.
final.df<-df[smpl,]%>%
select(state,sales,price,pop16)
# Converting state into factor attribute
final.df$state<-as.factor(final.df$state)
m3<-lm(sales~.,final.df)
summary(m3)
Call:
lm(formula = sales ~ ., data = final.df)
Residuals:
Min 1Q Median 3Q Max
-70.106 -7.031 1.238 7.593 73.803
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.213e+02 3.760e+00 32.257 < 2e-16 ***
state3 4.551e+00 4.659e+00 0.977 0.328908
state4 6.985e+00 4.802e+00 1.455 0.146092
state5 -3.065e+01 1.234e+01 -2.484 0.013176 *
state7 1.070e+01 4.530e+00 2.362 0.018396 *
state8 4.516e+01 5.106e+00 8.843 < 2e-16 ***
state9 6.009e+01 5.071e+00 11.849 < 2e-16 ***
state10 7.722e+00 6.055e+00 1.275 0.202488
state11 6.041e+00 4.741e+00 1.274 0.202861
state13 -5.415e-01 4.707e+00 -0.115 0.908442
state14 2.648e+00 6.119e+00 0.433 0.665307
state15 2.946e+01 4.660e+00 6.321 4.05e-10 ***
state16 3.527e+00 4.591e+00 0.768 0.442485
state17 3.573e+00 4.842e+00 0.738 0.460764
state18 7.418e+01 5.020e+00 14.777 < 2e-16 ***
state19 1.348e+01 4.566e+00 2.952 0.003234 **
state20 2.786e+01 4.837e+00 5.760 1.15e-08 ***
state21 1.245e+01 4.676e+00 2.663 0.007878 **
state22 7.808e+00 4.665e+00 1.674 0.094540 .
state23 9.765e+00 5.613e+00 1.740 0.082255 .
state24 -1.477e+00 4.664e+00 -0.317 0.751613
state25 5.033e-01 4.909e+00 0.103 0.918367
state26 1.929e+01 4.776e+00 4.039 5.81e-05 ***
state27 4.512e+00 5.214e+00 0.865 0.387082
state28 -1.554e-01 4.760e+00 -0.033 0.973968
state29 6.968e+01 4.818e+00 14.464 < 2e-16 ***
state30 1.302e+02 4.775e+00 27.261 < 2e-16 ***
state31 4.277e+00 5.115e+00 0.836 0.403273
state32 -1.361e+01 4.926e+00 -2.763 0.005838 **
state33 -1.446e+01 9.285e+00 -1.557 0.119796
state35 -2.300e+00 5.161e+00 -0.446 0.655975
state36 3.607e+00 5.972e+00 0.604 0.546020
state37 1.168e+01 4.743e+00 2.462 0.014007 *
state39 -9.048e+00 6.607e+00 -1.369 0.171208
state40 2.854e+01 4.853e+00 5.881 5.72e-09 ***
state41 6.883e+00 4.580e+00 1.503 0.133224
state42 -2.192e+00 4.955e+00 -0.442 0.658291
state43 8.832e+00 4.740e+00 1.863 0.062741 .
state44 -1.651e+01 7.087e+00 -2.329 0.020077 *
state45 -3.832e+01 4.988e+00 -7.682 4.02e-14 ***
state46 3.637e+01 5.003e+00 7.270 7.71e-13 ***
state47 1.599e+01 4.812e+00 3.323 0.000927 ***
state48 -1.386e+01 4.664e+00 -2.971 0.003046 **
state49 9.411e+00 5.032e+00 1.870 0.061781 .
state50 -1.678e+00 4.632e+00 -0.362 0.717202
state51 2.928e+01 4.995e+00 5.861 6.41e-09 ***
price -2.502e-01 1.346e-02 -18.592 < 2e-16 ***
pop16 2.296e-03 7.547e-04 3.043 0.002411 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.11 on 917 degrees of freedom
Multiple R-squared: 0.7858, Adjusted R-squared: 0.7748
F-statistic: 71.56 on 47 and 917 DF, p-value: < 2.2e-16
We can see that after including state in the model, the adjusted R2 increased to about 77%. This fit is acceptable.
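Because state is a factor, lm internally expands it into one dummy (0/1) variable per state, leaving one state out as the reference level; that is why the summary shows a separate coefficient for each state. The design matrix that lm builds can be inspected directly (a minimal sketch):
# First few columns of the design matrix lm constructs from the factor
head(model.matrix(sales~.,data = final.df)[,1:5])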
Step 7: Diagnostic Plots
par(mfrow=c(2,2))
plot(m3)
- Residuals vs Fitted: this plot should look more or less random. In our case it does
- Normal Q-Q: the points will fall on a straight line if the errors are normally distributed; here a few points in the tails deviate from the line
- Scale-Location: this should also be more or less random. In our plot it appears to have a pattern
- Residuals vs Leverage: the last plot (with Cook's distance contours) tells us which points have the greatest influence on the regression (leverage points). We see that points 184, 185 and 809 have great influence on the model; see the sketch below
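The influential points flagged in the last plot can also be extracted programmatically. A minimal sketch using cooks.distance with the common 4/n rule of thumb as the cutoff:
# Observations with the largest influence on m3
cd<-cooks.distance(m3)
head(sort(cd,decreasing = TRUE)) # most influential points first
which(cd>4/length(cd))           # indices exceeding the 4/n cutoff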
Step 8: Calculating MAPE on Test data set
test.df<-df[-smpl,]%>%
select(state,sales,price,pop16)
test.df$state<-as.factor(test.df$state)
PredictedSales<-predict(m3,test.df)
ActualSales<-test.df$sales
MAPE<- mean(abs((ActualSales - PredictedSales)/ActualSales))*100
print(paste0("The MAPE is around ",round(MAPE,2), "%"))
[1] "The MAPE is around 8.8%"
We can see that the MAPE is less than 10%, which is really good. With the given variables, this is the best fit we can get.
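Beyond a single summary number, it helps to plot predicted against actual sales on the test set; points close to the 45-degree line indicate a good fit. A minimal sketch using the vectors computed above:
# Predicted vs actual sales on the test data set
plot(ActualSales,PredictedSales,xlab = "Actual sales",ylab = "Predicted sales",pch = 19)
abline(0,1,col = "red") # 45-degree reference line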
Final Comments
This is a very basic example from econometrics where we have analysed sales against price and other variables. The steps described are comprehensive and can be applied to most regression exercises.
List of Datasets for Practice
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html