Linear Regression
Parag Verma
Introduction
Regression is one of the most basic statistical methods for deriving the relationship between variables. It helps us quantify, for example, the impact of a price increase on the sales of a product, or how a product category affects gross margin. In this blog we will look at an econometric problem and a step-by-step approach to executing linear regression in R.
Step 1: Read Data
package.name<-c("dplyr","tidyr","stats","Ecdat","ggplot2","devtools","corrplot","car")
for(i in package.name){
  if(!require(i,character.only = T)){
    install.packages(i)
  }
  library(i,character.only = T)
}
# car is needed later for the vif function
# stats library has the regression functions
# Ecdat package has the cigarette consumption data
data(Cigar)
df<-Cigar
head(df)
state year price pop pop16 cpi ndi sales pimin
1 1 63 28.6 3383 2236.5 30.6 1558.305 93.9 26.1
2 1 64 29.8 3431 2276.7 31.0 1684.073 95.4 27.5
3 1 65 29.8 3486 2327.5 31.5 1809.842 98.5 28.9
4 1 66 31.5 3524 2369.7 32.4 1915.160 96.4 29.5
5 1 67 31.6 3533 2393.7 33.4 2023.546 95.5 29.6
6 1 68 35.6 3522 2405.2 34.8 2202.486 88.4 32.0
It is a panel of 46 US states observed annually from 1963 to 1992, giving 1,380 rows in total. A brief description of the columns is given below:
- state: state abbreviation
- price: price per pack of cigarettes
- pop: population
- pop16: population above the age of 16
- cpi: consumer price index (1983=100)
- ndi: per capita disposable income
- sales: cigarette sales in packs per capita
- pimin: minimum price per pack of cigarettes in adjoining states
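A quick structural check confirms the panel layout described above (a minimal sketch; it only uses the df created earlier):
# Verify the panel structure: 46 states observed over 1963-1992
length(unique(df$state)) # number of states
range(df$year)           # 63 to 92, i.e. 1963 to 1992
nrow(df)                 # 46 states x 30 years = 1380 rows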
Step 2: Scatterplot
In this exercise, we will try to find a relationship between sales and price, pop16, cpi and ndi using scatterplots.
interim.df<-df%>%
select(sales,price,pop16,cpi,ndi)
nm<-expand.grid(colnames(interim.df),colnames(interim.df),stringsAsFactors = F)%>%
filter(Var1!=Var2,Var1=="sales")
invisible(lapply(nm[["Var2"]],function(z){
  # Plot sales (y) against each predictor (x), matching the axis labels below
  plot(interim.df[[z]], interim.df[["sales"]], main="Scatterplot Example",
       xlab=toString(z), ylab="sales", pch=19)
  abline(lm(interim.df[["sales"]]~interim.df[[z]]), col="red") # regression line (y~x)
}))
(Four scatterplots are produced, one for each of price, pop16, cpi and ndi against sales, each with a red regression line.)
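Since ggplot2 is already loaded, the same comparison can also be drawn as a single faceted chart. This is just a sketch of an alternative, and it assumes a tidyr version recent enough to provide pivot_longer:
interim.df%>%
  pivot_longer(-sales,names_to = "variable",values_to = "value")%>%
  ggplot(aes(x = value,y = sales))+
  geom_point(size = 1)+
  geom_smooth(method = "lm",se = FALSE,colour = "red")+ # same y~x regression line
  facet_wrap(~variable,scales = "free_x")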
Let's now calculate the correlation between the continuous variables.
Step 3: Correlation between Variables
cor.df<-df%>%
select(sales,price,pop16,cpi,ndi)%>%
cor()%>%
round(2)
# Visualize Correlation Matrix
corrplot(cor.df, order = "FPC", method = "circle", type = "lower", tl.cex = 0.7, tl.col = rgb(0, 0, 0),
addrect = 2,col=c("black", "orange"))
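The numeric matrix itself can be printed alongside the plot (using the cor.df object computed above):
# Rounded correlation matrix
cor.df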
We can infer the following from the correlation matrix and plot:
- sales has a negative correlation with the rest of the variables. This is quite evident from the scatterplots as well
- price has a high correlation with cpi and ndi
- cpi has a high correlation with ndi
Since these correlations are high, we need to check for multicollinearity. If multicollinearity is present and not addressed, the coefficient estimates become unstable and their standard errors inflated, making the regression results unreliable.
Step 4: Splitting into Test and Train Dataset
set.seed(1)
smpl<-sample(nrow(interim.df),0.70*nrow(interim.df),F)
train.df<-interim.df[smpl,]
test.df<-interim.df[-smpl,]
# Train Data set
nrow(train.df)
[1] 965
# Test Data set
nrow(test.df)
[1] 415
Step 5: Checking for Multicollinearity
In order to check for multicollinearity, we regress sales on all the predictors and compute the Variance Inflation Factor (VIF) for each of them:
m1<-lm(sales~.,train.df)
car::vif(m1)
price pop16 cpi ndi
12.498482 1.100044 14.685892 14.273327
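The VIF of a predictor is 1/(1 - R2), where R2 comes from regressing that predictor on all the remaining ones. As a sanity check, the value reported for cpi can be reproduced manually (a minimal sketch using the objects defined above):
# VIF(cpi) = 1/(1 - R^2) of cpi regressed on the other predictors
r2.cpi<-summary(lm(cpi~price+pop16+ndi,data = train.df))$r.squared
1/(1 - r2.cpi) # should match car::vif(m1)["cpi"]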
We can see that price, cpi and ndi all have VIF values well above the conventional cutoff of 5, with cpi and ndi the highest. Let's remove cpi and ndi and run the regression again.
interim.df2<-train.df%>%
select(-cpi,-ndi)
m2<-lm(sales~.,interim.df2)
car::vif(m2)
price pop16
1.020151 1.020151
The VIF values are acceptable now as they are less than 5.
Let's look at the regression summary.
summary(m2)
Call:
lm(formula = sales ~ ., data = interim.df2)
Residuals:
Min 1Q Median 3Q Max
-73.229 -15.487 -3.058 9.740 162.202
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.428e+02 1.967e+00 72.612 < 2e-16 ***
price -2.330e-01 2.267e-02 -10.277 < 2e-16 ***
pop16 -7.314e-04 2.596e-04 -2.817 0.00494 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 29.99 on 962 degrees of freedom
Multiple R-squared: 0.1143, Adjusted R-squared: 0.1125
F-statistic: 62.07 on 2 and 962 DF, p-value: < 2.2e-16
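The coefficients can be read off directly: holding pop16 fixed, a one-unit increase in price is associated with roughly 0.23 fewer packs sold per capita. Confidence intervals for these estimates can be obtained with confint (shown here as a sketch):
# 95% confidence intervals for the coefficients of m2
confint(m2,level = 0.95)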
Step 6: Increasing R2
In the above regression summary, we can see that R2 and adjusted R2 are around 11%. As far as the fit is concerned, this is a small number. Let's now include the state variable in the model and see if it improves R2.
final.df<-df[smpl,]%>%
select(state,sales,price,pop16)
# Converting state into factor attribute
final.df$state<-as.factor(final.df$state)
m3<-lm(sales~.,final.df)
summary(m3)
Call:
lm(formula = sales ~ ., data = final.df)
Residuals:
Min 1Q Median 3Q Max
-70.106 -7.031 1.238 7.593 73.803
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.213e+02 3.760e+00 32.257 < 2e-16 ***
state3 4.551e+00 4.659e+00 0.977 0.328908
state4 6.985e+00 4.802e+00 1.455 0.146092
state5 -3.065e+01 1.234e+01 -2.484 0.013176 *
state7 1.070e+01 4.530e+00 2.362 0.018396 *
state8 4.516e+01 5.106e+00 8.843 < 2e-16 ***
state9 6.009e+01 5.071e+00 11.849 < 2e-16 ***
state10 7.722e+00 6.055e+00 1.275 0.202488
state11 6.041e+00 4.741e+00 1.274 0.202861
state13 -5.415e-01 4.707e+00 -0.115 0.908442
state14 2.648e+00 6.119e+00 0.433 0.665307
state15 2.946e+01 4.660e+00 6.321 4.05e-10 ***
state16 3.527e+00 4.591e+00 0.768 0.442485
state17 3.573e+00 4.842e+00 0.738 0.460764
state18 7.418e+01 5.020e+00 14.777 < 2e-16 ***
state19 1.348e+01 4.566e+00 2.952 0.003234 **
state20 2.786e+01 4.837e+00 5.760 1.15e-08 ***
state21 1.245e+01 4.676e+00 2.663 0.007878 **
state22 7.808e+00 4.665e+00 1.674 0.094540 .
state23 9.765e+00 5.613e+00 1.740 0.082255 .
state24 -1.477e+00 4.664e+00 -0.317 0.751613
state25 5.033e-01 4.909e+00 0.103 0.918367
state26 1.929e+01 4.776e+00 4.039 5.81e-05 ***
state27 4.512e+00 5.214e+00 0.865 0.387082
state28 -1.554e-01 4.760e+00 -0.033 0.973968
state29 6.968e+01 4.818e+00 14.464 < 2e-16 ***
state30 1.302e+02 4.775e+00 27.261 < 2e-16 ***
state31 4.277e+00 5.115e+00 0.836 0.403273
state32 -1.361e+01 4.926e+00 -2.763 0.005838 **
state33 -1.446e+01 9.285e+00 -1.557 0.119796
state35 -2.300e+00 5.161e+00 -0.446 0.655975
state36 3.607e+00 5.972e+00 0.604 0.546020
state37 1.168e+01 4.743e+00 2.462 0.014007 *
state39 -9.048e+00 6.607e+00 -1.369 0.171208
state40 2.854e+01 4.853e+00 5.881 5.72e-09 ***
state41 6.883e+00 4.580e+00 1.503 0.133224
state42 -2.192e+00 4.955e+00 -0.442 0.658291
state43 8.832e+00 4.740e+00 1.863 0.062741 .
state44 -1.651e+01 7.087e+00 -2.329 0.020077 *
state45 -3.832e+01 4.988e+00 -7.682 4.02e-14 ***
state46 3.637e+01 5.003e+00 7.270 7.71e-13 ***
state47 1.599e+01 4.812e+00 3.323 0.000927 ***
state48 -1.386e+01 4.664e+00 -2.971 0.003046 **
state49 9.411e+00 5.032e+00 1.870 0.061781 .
state50 -1.678e+00 4.632e+00 -0.362 0.717202
state51 2.928e+01 4.995e+00 5.861 6.41e-09 ***
price -2.502e-01 1.346e-02 -18.592 < 2e-16 ***
pop16 2.296e-03 7.547e-04 3.043 0.002411 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.11 on 917 degrees of freedom
Multiple R-squared: 0.7858, Adjusted R-squared: 0.7748
F-statistic: 71.56 on 47 and 917 DF, p-value: < 2.2e-16
We can see that after including state in the model, the adjusted R2 increased to about 77%. This fit is acceptable.
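Because state is a factor, lm internally expands it into one dummy (0/1) variable per state, leaving one state out as the reference level; that is why the summary shows a separate coefficient for each state. The design matrix that lm builds can be inspected directly (a minimal sketch):
# First few columns of the design matrix lm constructs from the factor
head(model.matrix(sales~.,data = final.df)[,1:5])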
Step 7: Diagnostic Plots
par(mfrow=c(2,2))
plot(m3)
- Residuals vs Fitted: this plot should look more or less random. In our case it does
- Normal Q-Q: the points will fall on a straight line if the errors are normally distributed; here a few points in the tails deviate from the line
- Scale-Location: this should also be more or less random. In our plot it appears to have a pattern
- Residuals vs Leverage: the last plot (with Cook's distance contours) tells us which points have the greatest influence on the regression (leverage points). We see that points 184, 185 and 809 have great influence on the model; see the sketch below
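The influential points flagged in the last plot can also be extracted programmatically. A minimal sketch using cooks.distance with the common 4/n rule of thumb as the cutoff:
# Observations with the largest influence on m3
cd<-cooks.distance(m3)
head(sort(cd,decreasing = TRUE)) # most influential points first
which(cd>4/length(cd))           # indices exceeding the 4/n cutoff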
Step 8: Calculating MAPE on Test data set
test.df<-df[-smpl,]%>%
select(state,sales,price,pop16)
test.df$state<-as.factor(test.df$state)
PredictedSales<-predict(m3,test.df)
ActualSales<-test.df$sales
MAPE<- mean(abs((ActualSales - PredictedSales)/ActualSales))*100
print(paste0("The MAPE is around ",round(MAPE,2), "%"))
[1] "The MAPE is around 8.8%"
We can see that the MAPE is less than 10%, which is really good. With the given variables, this is the best fit we can get.
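Beyond a single summary number, it helps to plot predicted against actual sales on the test set; points close to the 45-degree line indicate a good fit. A minimal sketch using the vectors computed above:
# Predicted vs actual sales on the test data set
plot(ActualSales,PredictedSales,xlab = "Actual sales",ylab = "Predicted sales",pch = 19)
abline(0,1,col = "red") # 45-degree reference line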
Final Comments
This is a very basic example from econometrics where we have analysed sales against price and other variables. The steps described are comprehensive and can be applied to most regression exercises.
List of Datasets for Practice
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html