Pivoting/Unpivoting in R
Parag Verma
Introduction
In certain analyses we need to do certain modifications to the data so that we can understand it better. The two most common such operations are
- Pivoting
- Unpivoting
In the course of this blog we will look at each with a use case to explain concepts better
Installing the library: dplyr and tidyr
if(!require("dplyr")){
install.packages("dplyr")
}else{
library(dplyr)
}
# Library for Pivoting and Unpivoting
if(!require("tidyr")){
install.packages("tidyr")
}else{
library(tidyr)
}
Importing the dataset
For this exercise we will look at the Medical expenses dataset for 5574 respondents. Our aim is to understand the effect of ‘Health’ aspects on the ‘med’ expenses made.We will see how pivoting and unpivoting can help us enhance our understanding and remove certain bottlenecks built within R
if(!require("Ecdat")){
install.packages("Ecdat")
}else{
library(Ecdat)
}
data(MedExp)
df<-MedExp
head(df)
med lc idp lpi fmde physlim ndisease health linc
1 62.07547 0 yes 6.907755 0 no 13.73189 good 9.528776
2 0.00000 0 yes 6.907755 0 no 13.73189 excellent 9.528776
3 27.76280 0 yes 6.907755 0 no 13.73189 excellent 9.528776
4 290.58220 0 yes 6.907755 0 no 13.73189 good 9.528776
5 0.00000 0 yes 6.109248 0 no 13.73189 good 8.538699
6 2.39521 0 yes 6.109248 0 yes 13.00000 good 8.538699
lfam educdec age sex child black
1 1.386294 12 43.87748 male no no
2 1.386294 12 17.59138 male yes no
3 1.386294 12 15.49966 female yes no
4 1.386294 12 44.14305 female no no
5 1.098612 12 14.54962 female yes no
6 1.098612 12 16.28268 female yes no
The names of the columns are:
colnames(df)
[1] "med" "lc" "idp" "lpi" "fmde" "physlim"
[7] "ndisease" "health" "linc" "lfam" "educdec" "age"
[13] "sex" "child" "black"
The description of the variables that we will be discussing are:
- med annual medical expenditures in constant dollars excluding dental and outpatient mental
- health self-rate health (excellent,good,fair,poor)
Here only two variables have been taken as we want to understand the effect of ‘health’ variable on ‘med’. We will use the spread function(equivalent to pivoting) and see how we can simplify the data
Regression with Dummy Variable: Ahh, not so simple
If we just use lm function to compute the effect of health on med, then we would get some coefficients around ‘good’, ‘excellent’ with coefficient value of one of the level missing. This is because R internally removes one of the level to remove Dummy Variable Bias. But in doing so, making sense of the coefficient becomes difficult. Moreover, if we have to identify the effect of each level, then it becomes a problem !!. Thereis a way around this problem. We will use the spread function in tidyr as shown below
df.interim<-df%>%
mutate(Index=1:n(),Val=1)%>%
spread(health,Val,fill=0)
colnames(df.interim)
[1] "med" "lc" "idp" "lpi" "fmde"
[6] "physlim" "ndisease" "linc" "lfam" "educdec"
[11] "age" "sex" "child" "black" "Index"
[16] "excellent" "good" "fair" "poor"
Important things to note here are:
- An index column was created which is a kind of Identifier. This is a column with all unique values and no duplicates.Hence it is essential to create/have a column that acts like a Key before we use the spread function
- In case we are spreading the categorical column without explicitly having a measure column, then we will also create a Column called ‘Val’ and give it a value of 1.In this way we will get 1 and 0’s in rows wherever the ‘Key’ has values
We can see that there are four columns generated.These are
- excellent
- fair
- good
- poor
Lets look at the first few records in the dataframe
df.interim%>%
cbind.data.frame(df%>%select(health))%>%
select(health,excellent,good,fair,poor)%>%
head()
health excellent good fair poor
1 good 0 1 0 0
2 excellent 1 0 0 0
3 excellent 1 0 0 0
4 good 0 1 0 0
5 good 0 1 0 0
6 good 0 1 0 0
Thus we can verify from the above table that the PIVOTING OPERATION has been done successfully
Creating the training dataset
df.train<-df.interim%>%
select(excellent,good,fair,med)
Note that we have not selected poor in the variable list to avoid Dummy Variable Trap
Applying Regression
model1 <- lm(med ~ . , data = df.train)
summary(model1)
Call:
lm(formula = med ~ ., data = df.train)
Residuals:
Min 1Q Median 3Q Max
-1055 -142 -108 -50 38127
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1055.48 85.11 12.401 <2e-16 ***
excellent -935.75 86.33 -10.839 <2e-16 ***
good -873.28 86.91 -10.048 <2e-16 ***
fair -774.77 93.22 -8.311 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 793.9 on 5570 degrees of freedom
Multiple R-squared: 0.02269, Adjusted R-squared: 0.02216
F-statistic: 43.1 on 3 and 5570 DF, p-value: < 2.2e-16
The results appear to be in sync with the understanding that a healthier person incurs less towards his medical expenses
Regression without using ‘spread’ function
model2 <- lm(med ~ health , data = df)
summary(model2)
Call:
lm(formula = med ~ health, data = df)
Residuals:
Min 1Q Median 3Q Max
-1055 -142 -108 -50 38127
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 119.73 14.45 8.284 < 2e-16 ***
healthgood 62.47 22.78 2.743 0.00611 **
healthfair 160.98 40.67 3.958 7.66e-05 ***
healthpoor 935.75 86.33 10.839 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 793.9 on 5570 degrees of freedom
Multiple R-squared: 0.02269, Adjusted R-squared: 0.02216
F-statistic: 43.1 on 3 and 5570 DF, p-value: < 2.2e-16
The problem with the above result is that R has internally made ‘excellent’ level as the base level.It poses two problems
- We cant estimate impact of ‘excellent’ ratings on med expenses
- Some of the effects are positive(betas).This doesnt make sense as good rating on health is normally negatively related to med expense.Moreover it is difficult to interpret each coefficients
Gather Operation: Unpivoting (Wide to Long format)
We pivoted the medical expenses dataset to ease the Regression analysis but normally while dealing with a dataframe it makes sense to have unpivoted dataset because of the following reasons:
- Pivoted form distributes the information over several columns.In order to consolidate data, we need unpivoted form(Long Form)
- Lets say we have to do frequency profiling of the data.It makes more sense for us to have a single column(Health) instead of 4 separate columns
In order to do unpivoting, we use gather function in R.An example is shown below
df.long<-df.interim%>%
gather(key='Health',value='Values',excellent:poor)
df.long%>%
select(Health,Values)%>%
head()
Health Values
1 excellent 0
2 excellent 1
3 excellent 1
4 excellent 0
5 excellent 0
6 excellent 0
One thing to note here is that records in the original dataset was 5574.However the number of records in df.long is 22296 which is 4 times 5574. This is due to the fact that gather function pads the dataset with 0 ‘Values’ for all levels of Health column even if the record was not present in the original dataset
Final Comments
In this blog, we saw the various a case scenario where spread function was used in regression exercise. This is very important function which can help us understand the impact of each unique level within a categorical variable. This process is specially important in Feature Extraction excercise where we have to select important levels/attributes/indicators affecting the target variable(which can be Sales, Profit, Gross Margin etc)
Link to Previous R Blogs
Blog 1-Vectors,Matrics, Lists and Data Frame in R https://mlmadeeasy.blogspot.com/2019/12/2datatypesr.html
Blog 2 - Operators in R https://mlmadeeasy.blogspot.com/2019/12/blog-2-operators-in-r.html
Blog 3 - Loops in R https://mlmadeeasy.blogspot.com/2019/12/blog-3-loops-in-r.html
Blog 4 - Indexing in R https://mlmadeeasy.blogspot.com/2019/12/blog-4-indexing-in-r.html
Blog 5- Handling NA in R https://mlmadeeasy.blogspot.com/2019/12/blog-5-handling-na-in-r.html
Blog 6- tips-to-generate-plots in R https://mlmadeeasy.blogspot.com/2019/12/blog-6tips-to-generate-plots.html
Blog 7- Functions in R https://mlmadeeasy.blogspot.com/2019/12/blog-7-creating-functions-in-r.html
Blog 8- dplyr in R https://mlmadeeasy.blogspot.com/2020/01/blog-8-dplyr-in-r.html
List of Datasets for Practise https://hofmann.public.iastate.edu/data_in_r_sortable.html https://vincentarelbundock.github.io/Rdatasets/datasets.html
No comments:
Post a Comment