Panel Data Regression
Parag Verma
Introduction
Cross-section data is the data collected for many entities or subject at a given point in time. For example, data of cricket matched held during 2011 ICC world cup, data for Wimbledon matches in the year 2017 and so on. It becomes a panel data when we collect observations for a cross-sectional unit over time. For instance, data of cricket matches held during world cup for the years 2003, 2007, 2011 and 2015. Another example can be data for matches held at Wimbledon for the year 2004, 2005, 2006, 2007 and 2009 (I am a Federer Fan !!!).
For this particular blog, we will look at the state wise traffic fatality rate in US between the years 1982 to 1988. This example is taken as the data is easily available and is illustrative enough for us to draw conclusions. In this particular example we will try to conclude if alcohol tax and drunk driving laws have an impact on road fatalities and ,if yes, how strong the impact is(quantifying the impact)
Step 1: Read Data
The Fatalities dataset is part of AER package. Panel data regression will be done using plm package.Libraries such as dplyr,ggplot etc are imported to assist in the modelling process. Lets import the dataset and look at the first few records
# Link to identify differences between fixed and random effects
# https://rstudio-pubs-static.s3.amazonaws.com/372492_3e05f38dd3f248e89cdedd317d603b9a.html
package.name<-c("AER","plm","stargazer","dplyr","tidyr","stats",
"ggplot2","devtools","e1071")
for(i in package.name){
if(!require(i,character.only = T)){
install.packages(i)
}
library(i,character.only = T)
}
# stats library has the regression functions
# ISLR package has the credit Default dataset
data(Fatalities)
df<-Fatalities
head(df)
state year spirits unemp income emppop beertax baptist mormon drinkage
1 al 1982 1.37 14.4 10544.15 50.69204 1.539379 30.3557 0.32829 19.00
2 al 1983 1.36 13.7 10732.80 52.14703 1.788991 30.3336 0.34341 19.00
3 al 1984 1.32 11.1 11108.79 54.16809 1.714286 30.3115 0.35924 19.00
4 al 1985 1.28 8.9 11332.63 55.27114 1.652542 30.2895 0.37579 19.67
5 al 1986 1.23 9.8 11661.51 56.51450 1.609907 30.2674 0.39311 21.00
6 al 1987 1.18 7.8 11944.00 57.50988 1.560000 30.2453 0.41123 21.00
dry youngdrivers miles breath jail service fatal nfatal sfatal
1 25.0063 0.211572 7233.887 no no no 839 146 99
2 22.9942 0.210768 7836.348 no no no 930 154 98
3 24.0426 0.211484 8262.990 no no no 932 165 94
4 23.6339 0.211140 8726.917 no no no 882 146 98
5 23.4647 0.213400 8952.854 no no no 1081 172 119
6 23.7924 0.215527 9166.302 no no no 1110 181 114
fatal1517 nfatal1517 fatal1820 nfatal1820 fatal2124 nfatal2124 afatal
1 53 9 99 34 120 32 309.438
2 71 8 108 26 124 35 341.834
3 49 7 103 25 118 34 304.872
4 66 9 100 23 114 45 276.742
5 82 10 120 23 119 29 360.716
6 94 11 127 31 138 30 368.421
pop pop1517 pop1820 pop2124 milestot unempus emppopus gsp
1 3942002 208999.6 221553.4 290000.1 28516 9.7 57.8 -0.02212476
2 3960008 202000.1 219125.5 290000.2 31032 9.6 57.9 0.04655825
3 3988992 197000.0 216724.1 288000.2 32961 7.5 59.5 0.06279784
4 4021008 194999.7 214349.0 284000.3 35091 7.2 60.1 0.02748997
5 4049994 203999.9 212000.0 263000.3 36259 7.0 60.7 0.03214295
6 4082999 204999.8 208998.5 258999.8 37426 6.2 61.5 0.04897637
# Rows and columns
dim(df)
[1] 336 34
There are 336 records and 34 columns
# Number of states are
unique(df[["state"]])
[1] al az ar ca co ct de fl ga id il in ia ks ky la me md ma mi mn ms mo mt ne
[26] nv nh nj nm ny nc nd oh ok or pa ri sc sd tn tx ut vt va wa wv wi wy
48 Levels: al az ar ca co ct de fl ga id il in ia ks ky la me md ma mi ... wy
There are 48 states
# Number of states are
unique(df[["year"]])
[1] 1982 1983 1984 1985 1986 1987 1988
Levels: 1982 1983 1984 1985 1986 1987 1988
There is data for 7 years. So we can say that the panel consists of data points for 48 entities spread across 7 years. It is also a balanced panel as there is no missing observations.
Step 2: Analyse required columns in the data
colnames(df)
[1] "state" "year" "spirits" "unemp" "income"
[6] "emppop" "beertax" "baptist" "mormon" "drinkage"
[11] "dry" "youngdrivers" "miles" "breath" "jail"
[16] "service" "fatal" "nfatal" "sfatal" "fatal1517"
[21] "nfatal1517" "fatal1820" "nfatal1820" "fatal2124" "nfatal2124"
[26] "afatal" "pop" "pop1517" "pop1820" "pop2124"
[31] "milestot" "unempus" "emppopus" "gsp"
Lets look at the definition of some of these variables:
- unemp:state specific unemployment rate
- log(income):log of the real per capita income
- miles:state average of miles per driver
- drinkage:state specific minimum legal drinking age
apart from the above variables, we will create two more variables: drinkagec and punish with the following definition
- fatal_rate:number of fatalities per 10000 inhabitants
- drinkagec:a discretized version of drinkage that classifies states into four categories of minimal drinking age;18,19,20,21 and older. R denotes this as [18,19), [19,20), [20,21) and [21,22]. These categories are included as dummy regressors where [21,22] is chosen as the reference category
- punish:state specific unemployment rate
Lets create the fatal_rate variable
df[["fatal_rate"]] <- df%>%
mutate(fatal_rate=(fatal/pop)*10000)%>%
select(fatal_rate)%>%
pull(fatal_rate)
df[["fatal_rate"]][1:5]
[1] 2.12836 2.34848 2.33643 2.19348 2.66914
This will be used as a dependent variable in our analysis
Lets create the drinkagec variable
df[["drinkagec"]] <- cut(Fatalities[["drinkage"]],
breaks = 18:22,
include.lowest = TRUE,
right = FALSE)
df$drinkagec <- relevel(df$drinkagec, "[21,22]")# This will make the [21,22) as the first or base reference level
dummy.df<-df%>%
select(drinkage,drinkagec)
head(dummy.df,10)
drinkage drinkagec
1 19.00 [19,20)
2 19.00 [19,20)
3 19.00 [19,20)
4 19.67 [19,20)
5 21.00 [21,22]
6 21.00 [21,22]
7 21.00 [21,22]
8 19.00 [19,20)
9 19.00 [19,20)
10 19.00 [19,20)
We can see that the drinkage variable has been discretized and converted into a categorical variable
Lets create the punish variable
This will be based on whether drunk driving results in sever punishment (jail) or mandatory community service (service)
df[["punish"]] <- ifelse( (df[["jail"]]=="yes") | (df[["service"]]=="yes"),"yes","no" )
dummy.df2<-df%>%
select(jail,service,punish)
Step 3: Create data subset for the years 1982 and 1988
Here we will subset the main data to include data for yealy 1982 and 1988 only. This will help us study the difference within the state(entity) over time and the impact of legislations on drunk driving cases
# the set of observations on all variables for 1982 and 1988
df.sub <- df%>%
filter(year==1982 | year==1988)
Step 4: Run regression models
In this section we will run several models and look at the fit for each in the subsequent sections
4a: Regression of fatal_rate on beertax on the overall dataset
mod1 <- lm(fatal_rate ~ beertax, data = df)
summary(mod1)
Call:
lm(formula = fatal_rate ~ beertax, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.09060 -0.37768 -0.09436 0.28548 2.27643
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.85331 0.04357 42.539 < 2e-16 ***
beertax 0.36461 0.06217 5.865 1.08e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5437 on 334 degrees of freedom
Multiple R-squared: 0.09336, Adjusted R-squared: 0.09065
F-statistic: 34.39 on 1 and 334 DF, p-value: 1.082e-08
The impact of beer tax on fatal_rate is positive which is contrary to the expectation. This model cant be used for any generalisations. Lets include state fixed effects into the model and see if there is any improvement
4b: Panel data Regression of fatal_rate on beertax with State fixed effects on the overall dataset
mod2 <- plm(fatal_rate ~ beertax + state, data = df)
summary(mod2)
Oneway (individual) effect Within Model
Call:
plm(formula = fatal_rate ~ beertax + state, data = df)
Balanced Panel: n = 48, T = 7, N = 336
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.5869619 -0.0828376 -0.0012701 0.0795454 0.8977960
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
beertax -0.65587 0.18785 -3.4915 0.000556 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 10.785
Residual Sum of Squares: 10.345
R-Squared: 0.040745
Adj. R-Squared: -0.11969
F-statistic: 12.1904 on 1 and 287 DF, p-value: 0.00055597
Coefficient of beer tax comes out as negative which is as per expectations. However, R2 is still very small around 4%
4c: Panel data Regression with State and time fixed effects on the overall dataset
mod3 <- plm(fatal_rate ~ beertax + state + year,
index=c("state","year"),
model = "within",
effect = "twoways",
data = df)
summary(mod3)
Twoways effects Within Model
Call:
plm(formula = fatal_rate ~ beertax + state + year, data = df,
effect = "twoways", model = "within", index = c("state",
"year"))
Balanced Panel: n = 48, T = 7, N = 336
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.5955623 -0.0809619 0.0014301 0.0823356 0.8388350
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
beertax -0.63998 0.19738 -3.2424 0.001328 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 10.29
Residual Sum of Squares: 9.9193
R-Squared: 0.036065
Adj. R-Squared: -0.14918
F-statistic: 10.5133 on 1 and 281 DF, p-value: 0.001328
Coefficient of beer tax comes out as negative which is as per expectations. However, R2 is still very small around 4%
4d: Panel data Regression with State and time fixed effects:Include drinkargec,punish, miles, unemployment,income details into the model. Here state and year is included in the covariate also(model equation)
mod4 <- plm(fatal_rate ~ beertax + state + year + drinkagec
+ punish + miles + unemp + log(income),
index=c("state","year"),
model = "within",
effect = "twoways",
data = df)
summary(mod4)
Twoways effects Within Model
Call:
plm(formula = fatal_rate ~ beertax + state + year + drinkagec +
punish + miles + unemp + log(income), data = df, effect = "twoways",
model = "within", index = c("state", "year"))
Unbalanced Panel: n = 48, T = 6-7, N = 335
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.4298676 -0.0760229 0.0046877 0.0835909 0.4932074
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
beertax -4.4534e-01 1.6874e-01 -2.6392 0.008788 **
drinkagec[18,19) 2.8446e-02 6.5755e-02 0.4326 0.665648
drinkagec[19,20) -1.7953e-02 3.9940e-02 -0.4495 0.653427
drinkagec[20,21) 3.2021e-02 4.5108e-02 0.7099 0.478383
punishyes 3.8331e-02 5.9861e-02 0.6403 0.522489
miles 8.2281e-06 8.8215e-06 0.9327 0.351780
unemp -6.3260e-02 1.1159e-02 -5.6689 3.651e-08 ***
log(income) 1.8158e+00 3.8014e-01 4.7766 2.914e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 10.289
Residual Sum of Squares: 6.5853
R-Squared: 0.35999
Adj. R-Squared: 0.21698
F-statistic: 19.1942 on 8 and 273 DF, p-value: < 2.22e-16
There is an improvement in R2 as it comes to around 35%
4e: Panel data Regression with sligth variation in mod4
mod5 <- plm(fatal_rate ~ beertax + state + year + drinkagec
+ punish + miles,
index=c("state","year"),
model = "within",
effect = "twoways",
data = df)
summary(mod5)
Twoways effects Within Model
Call:
plm(formula = fatal_rate ~ beertax + state + year + drinkagec +
punish + miles, data = df, effect = "twoways", model = "within",
index = c("state", "year"))
Unbalanced Panel: n = 48, T = 6-7, N = 335
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.5783538 -0.0882132 0.0045584 0.0843095 0.8080503
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
beertax -6.8992e-01 2.0029e-01 -3.4446 0.0006612 ***
drinkagec[18,19) -1.0372e-02 7.8987e-02 -0.1313 0.8956248
drinkagec[19,20) -7.5702e-02 4.7236e-02 -1.6026 0.1101620
drinkagec[20,21) -1.0015e-01 5.1362e-02 -1.9498 0.0522182 .
punishyes 8.5284e-02 7.1725e-02 1.1891 0.2354448
miles 1.7436e-05 1.0568e-05 1.6499 0.1001135
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 10.289
Residual Sum of Squares: 9.6077
R-Squared: 0.066246
Adj. R-Squared: -0.13409
F-statistic: 3.25166 on 6 and 275 DF, p-value: 0.0041891
R2 reduces to 6%
4f: Panel data Regression:Removing state from the co-variate(Group of Independent variables)
mod6 <- plm(fatal_rate ~ beertax + year + drinkage
+ punish + miles + unemp + log(income),
index = c("state", "year"),
model = "within",
effect = "twoways",
data = df)
summary(mod6)
Twoways effects Within Model
Call:
plm(formula = fatal_rate ~ beertax + year + drinkage + punish +
miles + unemp + log(income), data = df, effect = "twoways",
model = "within", index = c("state", "year"))
Unbalanced Panel: n = 48, T = 6-7, N = 335
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.4405581 -0.0771620 0.0047531 0.0766051 0.4843013
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
beertax -4.5647e-01 1.6602e-01 -2.7495 0.006365 **
drinkage -2.1567e-03 1.7761e-02 -0.1214 0.903440
punishyes 3.8981e-02 5.9683e-02 0.6531 0.514211
miles 8.9787e-06 8.7384e-06 1.0275 0.305090
unemp -6.2694e-02 1.1097e-02 -5.6498 4.009e-08 ***
log(income) 1.7864e+00 3.6260e-01 4.9268 1.445e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 10.289
Residual Sum of Squares: 6.6183
R-Squared: 0.35678
Adj. R-Squared: 0.21878
F-statistic: 25.4229 on 6 and 275 DF, p-value: < 2.22e-16
R2 comes around 35%
4g: Panel data Regression with State and time fixed effects:Applying this on df.sub(data for first year and last year)
mod7 <- plm(fatal_rate ~ beertax + state + year + drinkage
+ punish + miles + unemp + log(income),
index = c("state", "year"),
model = "within",
effect = "twoways",
data = df.sub)
summary(mod7)
Twoways effects Within Model
Call:
plm(formula = fatal_rate ~ beertax + state + year + drinkage +
punish + miles + unemp + log(income), data = df.sub, effect = "twoways",
model = "within", index = c("state", "year"))
Unbalanced Panel: n = 48, T = 1-2, N = 95
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.268497 -0.082024 0.000000 0.082024 0.268497
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
beertax -8.2875e-01 2.8869e-01 -2.8707 0.006517 **
drinkage 9.2741e-03 3.7977e-02 0.2442 0.808322
punishyes 4.5887e-02 1.2992e-01 0.3532 0.725797
miles 1.1165e-04 6.3003e-05 1.7721 0.083993 .
unemp -8.8908e-02 2.5097e-02 -3.5426 0.001024 **
log(income) 1.2618e+00 6.3396e-01 1.9904 0.053415 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 4.0522
Residual Sum of Squares: 1.412
R-Squared: 0.65155
Adj. R-Squared: 0.18115
F-statistic: 12.4659 on 6 and 40 DF, p-value: 7.183e-08
R2 becomes 65% which is much higher than the initial models.
5: Panel Regression Summary
The sign of the estimate changes as we extend the model by both entity and time fixed effects in models (2) and (3). Furthermore R2 increases substantially as fixed effects are included in the model equation.
Final Comments
In this blog we discussed step by step approach of performing panel data regression in R on traffic fatality dataset.The impact of beer tax comes around as negative(which is as per expectations) but the rest of the state and time specific variables(punish, drinkagec etc) didnt have any impact. The model performance can be further improved if we can bring in the omitted variables
Link to Previous R Blogs
List of Datasets for Practise
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html