Pivot Multiple Columns in R

Introduction

There are certain cases where we need to create a pivot/widen the data based on multiple features.This cant be done simply and requires a two step approach.We will look at this using Medicaid dataset present in the AER package.

package.name<-c("dplyr","tidyr","stringr","AER")

for(i in package.name){

  if(!require(i,character.only = T)){

    install.packages(i)
  }
  library(i,character.only = T)

}

data("Medicaid1986")
df<-Medicaid1986
head(df)

  visits exposure children age income health1 health2 access married gender
1      0      100        1  24 14.500   0.495  -0.854   0.50      no female
2      1       90        3  19  6.000   0.520  -0.969   0.17      no female
3      0      106        4  17  8.377  -1.227   0.317   0.42      no female
4      0      114        2  29  6.000  -1.524   0.457   0.33      no female
5     11      115        1  26  8.500   0.173  -0.599   0.67      no female
6      3      102        1  22  6.000  -0.905   0.062   0.25      no female
  ethnicity school enroll program
1      cauc     13    yes    afdc
2      cauc     11    yes    afdc
3      cauc     12    yes    afdc
4      cauc     12    yes    afdc
5      cauc     16    yes    afdc
6     other     12    yes    afdc

The following columns are categorical in nature:

married
gender
ethnicity
enroll
program

Lets say we want to pivot the dataset based on the above features.In order to do this, we need to use the following two step approach:

Step 1: unpivot the data on the mentioned categorical data
Step 2: then pivot the data based on the key:value pair obtained

Step 1: unpivot the data on the mentioned categorical data

interim.df<-df%>%
  mutate(Index=1:n())%>%
  gather(key = "attributes",value = "values",
         c("married","gender","ethnicity","enroll","program"))%>%
  mutate(key_value=str_c(attributes ,"_",values))%>%
  select(Index,everything())%>%
  unique()

Warning: attributes are not identical across measure variables;
they will be dropped

head(interim.df)

  Index visits exposure children age income health1 health2 access school
1     1      0      100        1  24 14.500   0.495  -0.854   0.50     13
2     2      1       90        3  19  6.000   0.520  -0.969   0.17     11
3     3      0      106        4  17  8.377  -1.227   0.317   0.42     12
4     4      0      114        2  29  6.000  -1.524   0.457   0.33     12
5     5     11      115        1  26  8.500   0.173  -0.599   0.67     16
6     6      3      102        1  22  6.000  -0.905   0.062   0.25     12
  attributes values  key_value
1    married     no married_no
2    married     no married_no
3    married     no married_no
4    married     no married_no
5    married     no married_no
6    married     no married_no

Step 2: then pivot the data based on the key:value pair obtained

final.df<-interim.df%>%
  mutate(key_value=str_c(attributes ,"_",values))%>%
  mutate(TotalCount=1)%>%
  select(-attributes,-values)%>% # Most important step
  spread(key_value,TotalCount,fill=NA)
  
head(final.df)

  Index visits exposure children age income health1 health2 access school
1     1      0      100        1  24 14.500   0.495  -0.854   0.50     13
2     2      1       90        3  19  6.000   0.520  -0.969   0.17     11
3     3      0      106        4  17  8.377  -1.227   0.317   0.42     12
4     4      0      114        2  29  6.000  -1.524   0.457   0.33     12
5     5     11      115        1  26  8.500   0.173  -0.599   0.67     16
6     6      3      102        1  22  6.000  -0.905   0.062   0.25     12
  enroll_no enroll_yes ethnicity_cauc ethnicity_other gender_female gender_male
1        NA          1              1              NA             1          NA
2        NA          1              1              NA             1          NA
3        NA          1              1              NA             1          NA
4        NA          1              1              NA             1          NA
5        NA          1              1              NA             1          NA
6        NA          1             NA               1             1          NA
  married_no married_yes program_afdc program_ssi
1          1          NA            1          NA
2          1          NA            1          NA
3          1          NA            1          NA
4          1          NA            1          NA
5          1          NA            1          NA
6          1          NA            1          NA

final.df contains the pivotted dataset or the wide form dataset.

Final Note

We saw how it is extremely easy to create a pivoted or a wide form dataset using multiple keys in R

Machine Learning Made Easy

Monday, May 31, 2021

Blog 42 Pivot Using multiple Columns in R

Pivot Multiple Columns in R

Parag Verma

31st May, 2021

Introduction

Step 1: unpivot the data on the mentioned categorical data

Step 2: then pivot the data based on the key:value pair obtained

Final Note

Link to Previous Blogs

My Youtube Channel

List of Datasets for Practise

No comments:

Post a Comment

Price Elasticity Model in Python