Monday, January 6, 2020

Blog 9: Pivoting and Unpivoting in R

Pivoting/Unpivoting in R


Introduction

In certain analyses we need to do certain modifications to the data so that we can understand it better. The two most common such operations are

  • Pivoting
  • Unpivoting

In the course of this blog we will look at each with a use case to explain concepts better

Installing the library: dplyr and tidyr

if(!require("dplyr")){
  
  install.packages("dplyr")
}else{
  
  library(dplyr)
}


# Library for Pivoting and Unpivoting
if(!require("tidyr")){
  
  install.packages("tidyr")
}else{
  
  library(tidyr)
}


Importing the dataset

For this exercise we will look at the Medical expenses dataset for 5574 respondents. Our aim is to understand the effect of ‘Health’ aspects on the ‘med’ expenses made.We will see how pivoting and unpivoting can help us enhance our understanding and remove certain bottlenecks built within R

if(!require("Ecdat")){
  
  install.packages("Ecdat")
}else{
  
  library(Ecdat)
}

data(MedExp)
df<-MedExp
head(df)
        med lc idp      lpi fmde physlim ndisease    health     linc
1  62.07547  0 yes 6.907755    0      no 13.73189      good 9.528776
2   0.00000  0 yes 6.907755    0      no 13.73189 excellent 9.528776
3  27.76280  0 yes 6.907755    0      no 13.73189 excellent 9.528776
4 290.58220  0 yes 6.907755    0      no 13.73189      good 9.528776
5   0.00000  0 yes 6.109248    0      no 13.73189      good 8.538699
6   2.39521  0 yes 6.109248    0     yes 13.00000      good 8.538699
      lfam educdec      age    sex child black
1 1.386294      12 43.87748   male    no    no
2 1.386294      12 17.59138   male   yes    no
3 1.386294      12 15.49966 female   yes    no
4 1.386294      12 44.14305 female    no    no
5 1.098612      12 14.54962 female   yes    no
6 1.098612      12 16.28268 female   yes    no

The names of the columns are:

colnames(df)
 [1] "med"      "lc"       "idp"      "lpi"      "fmde"     "physlim" 
 [7] "ndisease" "health"   "linc"     "lfam"     "educdec"  "age"     
[13] "sex"      "child"    "black"   

The description of the variables that we will be discussing are:

  • med annual medical expenditures in constant dollars excluding dental and outpatient mental
  • health self-rate health (excellent,good,fair,poor)

Here only two variables have been taken as we want to understand the effect of ‘health’ variable on ‘med’. We will use the spread function(equivalent to pivoting) and see how we can simplify the data


Regression with Dummy Variable: Ahh, not so simple

If we just use lm function to compute the effect of health on med, then we would get some coefficients around ‘good’, ‘excellent’ with coefficient value of one of the level missing. This is because R internally removes one of the level to remove Dummy Variable Bias. But in doing so, making sense of the coefficient becomes difficult. Moreover, if we have to identify the effect of each level, then it becomes a problem !!. Thereis a way around this problem. We will use the spread function in tidyr as shown below

df.interim<-df%>%
  mutate(Index=1:n(),Val=1)%>%
  spread(health,Val,fill=0)

colnames(df.interim)
 [1] "med"       "lc"        "idp"       "lpi"       "fmde"     
 [6] "physlim"   "ndisease"  "linc"      "lfam"      "educdec"  
[11] "age"       "sex"       "child"     "black"     "Index"    
[16] "excellent" "good"      "fair"      "poor"     

Important things to note here are:

  • An index column was created which is a kind of Identifier. This is a column with all unique values and no duplicates.Hence it is essential to create/have a column that acts like a Key before we use the spread function
  • In case we are spreading the categorical column without explicitly having a measure column, then we will also create a Column called ‘Val’ and give it a value of 1.In this way we will get 1 and 0’s in rows wherever the ‘Key’ has values

We can see that there are four columns generated.These are

  • excellent
  • fair
  • good
  • poor


Lets look at the first few records in the dataframe

df.interim%>%
  cbind.data.frame(df%>%select(health))%>%
  select(health,excellent,good,fair,poor)%>%
  head()
     health excellent good fair poor
1      good         0    1    0    0
2 excellent         1    0    0    0
3 excellent         1    0    0    0
4      good         0    1    0    0
5      good         0    1    0    0
6      good         0    1    0    0

Thus we can verify from the above table that the PIVOTING OPERATION has been done successfully

Creating the training dataset

df.train<-df.interim%>%
  select(excellent,good,fair,med)

Note that we have not selected poor in the variable list to avoid Dummy Variable Trap


Applying Regression

model1 <- lm(med ~ . , data = df.train)
summary(model1)

Call:
lm(formula = med ~ ., data = df.train)

Residuals:
   Min     1Q Median     3Q    Max 
 -1055   -142   -108    -50  38127 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1055.48      85.11  12.401   <2e-16 ***
excellent    -935.75      86.33 -10.839   <2e-16 ***
good         -873.28      86.91 -10.048   <2e-16 ***
fair         -774.77      93.22  -8.311   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 793.9 on 5570 degrees of freedom
Multiple R-squared:  0.02269,   Adjusted R-squared:  0.02216 
F-statistic:  43.1 on 3 and 5570 DF,  p-value: < 2.2e-16


The results appear to be in sync with the understanding that a healthier person incurs less towards his medical expenses

Regression without using ‘spread’ function

model2 <- lm(med ~ health , data = df)
summary(model2)

Call:
lm(formula = med ~ health, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
 -1055   -142   -108    -50  38127 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   119.73      14.45   8.284  < 2e-16 ***
healthgood     62.47      22.78   2.743  0.00611 ** 
healthfair    160.98      40.67   3.958 7.66e-05 ***
healthpoor    935.75      86.33  10.839  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 793.9 on 5570 degrees of freedom
Multiple R-squared:  0.02269,   Adjusted R-squared:  0.02216 
F-statistic:  43.1 on 3 and 5570 DF,  p-value: < 2.2e-16


The problem with the above result is that R has internally made ‘excellent’ level as the base level.It poses two problems

  • We cant estimate impact of ‘excellent’ ratings on med expenses
  • Some of the effects are positive(betas).This doesnt make sense as good rating on health is normally negatively related to med expense.Moreover it is difficult to interpret each coefficients


Gather Operation: Unpivoting (Wide to Long format)

We pivoted the medical expenses dataset to ease the Regression analysis but normally while dealing with a dataframe it makes sense to have unpivoted dataset because of the following reasons:

  • Pivoted form distributes the information over several columns.In order to consolidate data, we need unpivoted form(Long Form)
  • Lets say we have to do frequency profiling of the data.It makes more sense for us to have a single column(Health) instead of 4 separate columns

In order to do unpivoting, we use gather function in R.An example is shown below

df.long<-df.interim%>%
  gather(key='Health',value='Values',excellent:poor)

df.long%>%
  select(Health,Values)%>%
  head()
     Health Values
1 excellent      0
2 excellent      1
3 excellent      1
4 excellent      0
5 excellent      0
6 excellent      0

One thing to note here is that records in the original dataset was 5574.However the number of records in df.long is 22296 which is 4 times 5574. This is due to the fact that gather function pads the dataset with 0 ‘Values’ for all levels of Health column even if the record was not present in the original dataset

Final Comments

In this blog, we saw the various a case scenario where spread function was used in regression exercise. This is very important function which can help us understand the impact of each unique level within a categorical variable. This process is specially important in Feature Extraction excercise where we have to select important levels/attributes/indicators affecting the target variable(which can be Sales, Profit, Gross Margin etc)

Web Scraping Tutorial 4- Getting the busy information data from Popular time page from Google

Popular Times Popular Times In this blog we will try to scrape the ...