Machine Learning Made Easy: Blog 31: Panel Data Regression in R

Panel Data Regression

Introduction

Cross-section data is the data collected for many entities or subject at a given point in time. For example, data of cricket matched held during 2011 ICC world cup, data for Wimbledon matches in the year 2017 and so on. It becomes a panel data when we collect observations for a cross-sectional unit over time. For instance, data of cricket matches held during world cup for the years 2003, 2007, 2011 and 2015. Another example can be data for matches held at Wimbledon for the year 2004, 2005, 2006, 2007 and 2009 (I am a Federer Fan !!!).

For this particular blog, we will look at the state wise traffic fatality rate in US between the years 1982 to 1988. This example is taken as the data is easily available and is illustrative enough for us to draw conclusions. In this particular example we will try to conclude if alcohol tax and drunk driving laws have an impact on road fatalities and ,if yes, how strong the impact is(quantifying the impact)

Step 1: Read Data

The Fatalities dataset is part of AER package. Panel data regression will be done using plm package.Libraries such as dplyr,ggplot etc are imported to assist in the modelling process. Lets import the dataset and look at the first few records

# Link to identify differences between fixed and random effects
# https://rstudio-pubs-static.s3.amazonaws.com/372492_3e05f38dd3f248e89cdedd317d603b9a.html

package.name<-c("AER","plm","stargazer","dplyr","tidyr","stats",
                "ggplot2","devtools","e1071")

for(i in package.name){

  if(!require(i,character.only = T)){

    install.packages(i)
  }
  library(i,character.only = T)

}

# stats library has the regression functions

# ISLR package has the credit Default dataset
data(Fatalities)
df<-Fatalities
head(df)

  state year spirits unemp   income   emppop  beertax baptist  mormon drinkage
1    al 1982    1.37  14.4 10544.15 50.69204 1.539379 30.3557 0.32829    19.00
2    al 1983    1.36  13.7 10732.80 52.14703 1.788991 30.3336 0.34341    19.00
3    al 1984    1.32  11.1 11108.79 54.16809 1.714286 30.3115 0.35924    19.00
4    al 1985    1.28   8.9 11332.63 55.27114 1.652542 30.2895 0.37579    19.67
5    al 1986    1.23   9.8 11661.51 56.51450 1.609907 30.2674 0.39311    21.00
6    al 1987    1.18   7.8 11944.00 57.50988 1.560000 30.2453 0.41123    21.00
      dry youngdrivers    miles breath jail service fatal nfatal sfatal
1 25.0063     0.211572 7233.887     no   no      no   839    146     99
2 22.9942     0.210768 7836.348     no   no      no   930    154     98
3 24.0426     0.211484 8262.990     no   no      no   932    165     94
4 23.6339     0.211140 8726.917     no   no      no   882    146     98
5 23.4647     0.213400 8952.854     no   no      no  1081    172    119
6 23.7924     0.215527 9166.302     no   no      no  1110    181    114
  fatal1517 nfatal1517 fatal1820 nfatal1820 fatal2124 nfatal2124  afatal
1        53          9        99         34       120         32 309.438
2        71          8       108         26       124         35 341.834
3        49          7       103         25       118         34 304.872
4        66          9       100         23       114         45 276.742
5        82         10       120         23       119         29 360.716
6        94         11       127         31       138         30 368.421
      pop  pop1517  pop1820  pop2124 milestot unempus emppopus         gsp
1 3942002 208999.6 221553.4 290000.1    28516     9.7     57.8 -0.02212476
2 3960008 202000.1 219125.5 290000.2    31032     9.6     57.9  0.04655825
3 3988992 197000.0 216724.1 288000.2    32961     7.5     59.5  0.06279784
4 4021008 194999.7 214349.0 284000.3    35091     7.2     60.1  0.02748997
5 4049994 203999.9 212000.0 263000.3    36259     7.0     60.7  0.03214295
6 4082999 204999.8 208998.5 258999.8    37426     6.2     61.5  0.04897637

# Rows and columns
dim(df)

[1] 336  34

There are 336 records and 34 columns

# Number of states are
unique(df[["state"]])

 [1] al az ar ca co ct de fl ga id il in ia ks ky la me md ma mi mn ms mo mt ne
[26] nv nh nj nm ny nc nd oh ok or pa ri sc sd tn tx ut vt va wa wv wi wy
48 Levels: al az ar ca co ct de fl ga id il in ia ks ky la me md ma mi ... wy

There are 48 states

# Number of states are
unique(df[["year"]])

[1] 1982 1983 1984 1985 1986 1987 1988
Levels: 1982 1983 1984 1985 1986 1987 1988

There is data for 7 years. So we can say that the panel consists of data points for 48 entities spread across 7 years. It is also a balanced panel as there is no missing observations.

Step 2: Analyse required columns in the data

colnames(df)

 [1] "state"        "year"         "spirits"      "unemp"        "income"      
 [6] "emppop"       "beertax"      "baptist"      "mormon"       "drinkage"    
[11] "dry"          "youngdrivers" "miles"        "breath"       "jail"        
[16] "service"      "fatal"        "nfatal"       "sfatal"       "fatal1517"   
[21] "nfatal1517"   "fatal1820"    "nfatal1820"   "fatal2124"    "nfatal2124"  
[26] "afatal"       "pop"          "pop1517"      "pop1820"      "pop2124"     
[31] "milestot"     "unempus"      "emppopus"     "gsp"

Lets look at the definition of some of these variables:

unemp:state specific unemployment rate
log(income):log of the real per capita income
miles:state average of miles per driver
drinkage:state specific minimum legal drinking age

apart from the above variables, we will create two more variables: drinkagec and punish with the following definition

fatal_rate:number of fatalities per 10000 inhabitants
drinkagec:a discretized version of drinkage that classifies states into four categories of minimal drinking age;18,19,20,21 and older. R denotes this as [18,19), [19,20), [20,21) and [21,22]. These categories are included as dummy regressors where [21,22] is chosen as the reference category
punish:state specific unemployment rate

Lets create the fatal_rate variable

df[["fatal_rate"]] <- df%>%
                    mutate(fatal_rate=(fatal/pop)*10000)%>%
                    select(fatal_rate)%>%
                    pull(fatal_rate)

df[["fatal_rate"]][1:5]

[1] 2.12836 2.34848 2.33643 2.19348 2.66914

This will be used as a dependent variable in our analysis

Lets create the drinkagec variable

df[["drinkagec"]] <- cut(Fatalities[["drinkage"]],
                            breaks = 18:22, 
                            include.lowest = TRUE, 
                            right = FALSE)

df$drinkagec <- relevel(df$drinkagec, "[21,22]")# This will make the [21,22) as the first or base reference level

dummy.df<-df%>%
  select(drinkage,drinkagec)

head(dummy.df,10)

   drinkage drinkagec
1     19.00   [19,20)
2     19.00   [19,20)
3     19.00   [19,20)
4     19.67   [19,20)
5     21.00   [21,22]
6     21.00   [21,22]
7     21.00   [21,22]
8     19.00   [19,20)
9     19.00   [19,20)
10    19.00   [19,20)

We can see that the drinkage variable has been discretized and converted into a categorical variable

Lets create the punish variable

This will be based on whether drunk driving results in sever punishment (jail) or mandatory community service (service)

df[["punish"]] <- ifelse( (df[["jail"]]=="yes") | (df[["service"]]=="yes"),"yes","no" )
  
dummy.df2<-df%>%
  select(jail,service,punish)

Step 3: Create data subset for the years 1982 and 1988

Here we will subset the main data to include data for yealy 1982 and 1988 only. This will help us study the difference within the state(entity) over time and the impact of legislations on drunk driving cases

# the set of observations on all variables for 1982 and 1988
df.sub <- df%>%
  filter(year==1982 | year==1988)

Step 4: Run regression models

In this section we will run several models and look at the fit for each in the subsequent sections

4a: Regression of fatal_rate on beertax on the overall dataset

mod1 <- lm(fatal_rate ~ beertax, data = df)
summary(mod1)


Call:
lm(formula = fatal_rate ~ beertax, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.09060 -0.37768 -0.09436  0.28548  2.27643 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.85331    0.04357  42.539  < 2e-16 ***
beertax      0.36461    0.06217   5.865 1.08e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5437 on 334 degrees of freedom
Multiple R-squared:  0.09336,   Adjusted R-squared:  0.09065 
F-statistic: 34.39 on 1 and 334 DF,  p-value: 1.082e-08

The impact of beer tax on fatal_rate is positive which is contrary to the expectation. This model cant be used for any generalisations. Lets include state fixed effects into the model and see if there is any improvement

4b: Panel data Regression of fatal_rate on beertax with State fixed effects on the overall dataset

mod2 <- plm(fatal_rate ~ beertax + state, data = df)
summary(mod2)

Oneway (individual) effect Within Model

Call:
plm(formula = fatal_rate ~ beertax + state, data = df)

Balanced Panel: n = 48, T = 7, N = 336

Residuals:
      Min.    1st Qu.     Median    3rd Qu.       Max. 
-0.5869619 -0.0828376 -0.0012701  0.0795454  0.8977960 

Coefficients:
        Estimate Std. Error t-value Pr(>|t|)    
beertax -0.65587    0.18785 -3.4915 0.000556 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    10.785
Residual Sum of Squares: 10.345
R-Squared:      0.040745
Adj. R-Squared: -0.11969
F-statistic: 12.1904 on 1 and 287 DF, p-value: 0.00055597

Coefficient of beer tax comes out as negative which is as per expectations. However, R2 is still very small around 4%

4c: Panel data Regression with State and time fixed effects on the overall dataset

mod3 <- plm(fatal_rate ~ beertax + state + year,
            index=c("state","year"),
            model = "within",
            effect = "twoways",
            data = df)
summary(mod3)

Twoways effects Within Model

Call:
plm(formula = fatal_rate ~ beertax + state + year, data = df, 
    effect = "twoways", model = "within", index = c("state", 
        "year"))

Balanced Panel: n = 48, T = 7, N = 336

Residuals:
      Min.    1st Qu.     Median    3rd Qu.       Max. 
-0.5955623 -0.0809619  0.0014301  0.0823356  0.8388350 

Coefficients:
        Estimate Std. Error t-value Pr(>|t|)   
beertax -0.63998    0.19738 -3.2424 0.001328 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    10.29
Residual Sum of Squares: 9.9193
R-Squared:      0.036065
Adj. R-Squared: -0.14918
F-statistic: 10.5133 on 1 and 281 DF, p-value: 0.001328

Coefficient of beer tax comes out as negative which is as per expectations. However, R2 is still very small around 4%

4d: Panel data Regression with State and time fixed effects:Include drinkargec,punish, miles, unemployment,income details into the model. Here state and year is included in the covariate also(model equation)

mod4 <- plm(fatal_rate ~ beertax + state + year + drinkagec 
                       + punish + miles + unemp + log(income),
            index=c("state","year"),
            model = "within",
            effect = "twoways",
            data = df)
summary(mod4)

Twoways effects Within Model

Call:
plm(formula = fatal_rate ~ beertax + state + year + drinkagec + 
    punish + miles + unemp + log(income), data = df, effect = "twoways", 
    model = "within", index = c("state", "year"))

Unbalanced Panel: n = 48, T = 6-7, N = 335

Residuals:
      Min.    1st Qu.     Median    3rd Qu.       Max. 
-0.4298676 -0.0760229  0.0046877  0.0835909  0.4932074 

Coefficients:
                    Estimate  Std. Error t-value  Pr(>|t|)    
beertax          -4.4534e-01  1.6874e-01 -2.6392  0.008788 ** 
drinkagec[18,19)  2.8446e-02  6.5755e-02  0.4326  0.665648    
drinkagec[19,20) -1.7953e-02  3.9940e-02 -0.4495  0.653427    
drinkagec[20,21)  3.2021e-02  4.5108e-02  0.7099  0.478383    
punishyes         3.8331e-02  5.9861e-02  0.6403  0.522489    
miles             8.2281e-06  8.8215e-06  0.9327  0.351780    
unemp            -6.3260e-02  1.1159e-02 -5.6689 3.651e-08 ***
log(income)       1.8158e+00  3.8014e-01  4.7766 2.914e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    10.289
Residual Sum of Squares: 6.5853
R-Squared:      0.35999
Adj. R-Squared: 0.21698
F-statistic: 19.1942 on 8 and 273 DF, p-value: < 2.22e-16

There is an improvement in R2 as it comes to around 35%

4e: Panel data Regression with sligth variation in mod4

mod5 <- plm(fatal_rate ~ beertax + state + year + drinkagec 
                       + punish + miles,
            index=c("state","year"),
            model = "within",
            effect = "twoways",
            data = df)
summary(mod5)

Twoways effects Within Model

Call:
plm(formula = fatal_rate ~ beertax + state + year + drinkagec + 
    punish + miles, data = df, effect = "twoways", model = "within", 
    index = c("state", "year"))

Unbalanced Panel: n = 48, T = 6-7, N = 335

Residuals:
      Min.    1st Qu.     Median    3rd Qu.       Max. 
-0.5783538 -0.0882132  0.0045584  0.0843095  0.8080503 

Coefficients:
                    Estimate  Std. Error t-value  Pr(>|t|)    
beertax          -6.8992e-01  2.0029e-01 -3.4446 0.0006612 ***
drinkagec[18,19) -1.0372e-02  7.8987e-02 -0.1313 0.8956248    
drinkagec[19,20) -7.5702e-02  4.7236e-02 -1.6026 0.1101620    
drinkagec[20,21) -1.0015e-01  5.1362e-02 -1.9498 0.0522182 .  
punishyes         8.5284e-02  7.1725e-02  1.1891 0.2354448    
miles             1.7436e-05  1.0568e-05  1.6499 0.1001135    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    10.289
Residual Sum of Squares: 9.6077
R-Squared:      0.066246
Adj. R-Squared: -0.13409
F-statistic: 3.25166 on 6 and 275 DF, p-value: 0.0041891

R2 reduces to 6%

4f: Panel data Regression:Removing state from the co-variate(Group of Independent variables)

mod6 <- plm(fatal_rate ~ beertax + year + drinkage 
                       + punish + miles + unemp + log(income), 
                       index = c("state", "year"),
                       model = "within",
                       effect = "twoways",
            data = df)
summary(mod6)

Twoways effects Within Model

Call:
plm(formula = fatal_rate ~ beertax + year + drinkage + punish + 
    miles + unemp + log(income), data = df, effect = "twoways", 
    model = "within", index = c("state", "year"))

Unbalanced Panel: n = 48, T = 6-7, N = 335

Residuals:
      Min.    1st Qu.     Median    3rd Qu.       Max. 
-0.4405581 -0.0771620  0.0047531  0.0766051  0.4843013 

Coefficients:
               Estimate  Std. Error t-value  Pr(>|t|)    
beertax     -4.5647e-01  1.6602e-01 -2.7495  0.006365 ** 
drinkage    -2.1567e-03  1.7761e-02 -0.1214  0.903440    
punishyes    3.8981e-02  5.9683e-02  0.6531  0.514211    
miles        8.9787e-06  8.7384e-06  1.0275  0.305090    
unemp       -6.2694e-02  1.1097e-02 -5.6498 4.009e-08 ***
log(income)  1.7864e+00  3.6260e-01  4.9268 1.445e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    10.289
Residual Sum of Squares: 6.6183
R-Squared:      0.35678
Adj. R-Squared: 0.21878
F-statistic: 25.4229 on 6 and 275 DF, p-value: < 2.22e-16

R2 comes around 35%

4g: Panel data Regression with State and time fixed effects:Applying this on df.sub(data for first year and last year)

mod7 <- plm(fatal_rate ~ beertax + state + year + drinkage 
                       + punish + miles + unemp + log(income), 
                       index = c("state", "year"),
                       model = "within",
                       effect = "twoways",
            data = df.sub)
summary(mod7)

Twoways effects Within Model

Call:
plm(formula = fatal_rate ~ beertax + state + year + drinkage + 
    punish + miles + unemp + log(income), data = df.sub, effect = "twoways", 
    model = "within", index = c("state", "year"))

Unbalanced Panel: n = 48, T = 1-2, N = 95

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-0.268497 -0.082024  0.000000  0.082024  0.268497 

Coefficients:
               Estimate  Std. Error t-value Pr(>|t|)   
beertax     -8.2875e-01  2.8869e-01 -2.8707 0.006517 **
drinkage     9.2741e-03  3.7977e-02  0.2442 0.808322   
punishyes    4.5887e-02  1.2992e-01  0.3532 0.725797   
miles        1.1165e-04  6.3003e-05  1.7721 0.083993 . 
unemp       -8.8908e-02  2.5097e-02 -3.5426 0.001024 **
log(income)  1.2618e+00  6.3396e-01  1.9904 0.053415 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    4.0522
Residual Sum of Squares: 1.412
R-Squared:      0.65155
Adj. R-Squared: 0.18115
F-statistic: 12.4659 on 6 and 40 DF, p-value: 7.183e-08

R2 becomes 65% which is much higher than the initial models.

5: Panel Regression Summary

The sign of the estimate changes as we extend the model by both entity and time fixed effects in models (2) and (3). Furthermore R2 increases substantially as fixed effects are included in the model equation.

Final Comments

In this blog we discussed step by step approach of performing panel data regression in R on traffic fatality dataset.The impact of beer tax comes around as negative(which is as per expectations) but the rest of the state and time specific variables(punish, drinkagec etc) didnt have any impact. The model performance can be further improved if we can bring in the omitted variables

Link to Previous R Blogs

https://ml-withparag.com/

List of Datasets for Practise

https://hofmann.public.iastate.edu/data_in_r_sortable.html

https://vincentarelbundock.github.io/Rdatasets/datasets.html

Machine Learning Made Easy

Sunday, August 16, 2020

Blog 31: Panel Data Regression in R