Sunday, April 5, 2020

Blog 22: Contingency Table

Contingency Table


Introduction

While analysing survey datasets, we are sometimes required to find the cross between gender and spend or say between geneder and product category. This helps in calculating the interaction effects between variables and also helps in EDA.In this blog we will look at an insurance dataset and how dplyr can be used to create contingency table


Installing the library: dplyr,tidyr and Ecdat package

package.name<-c("dplyr","tidyr","Ecdat")

for(i in package.name){

  if(!require(i,character.only = T)){

    install.packages(i)
  }
  library(i,character.only = T)

}


# Ecdat package has the 'Health Insurance and Hours Worked By Wives' data
data(HI)
df<-HI
head(df)
  whrswk hhi whi hhi2  education  race hispanic experience kidslt6 kids618
1      0  no  no   no 13-15years white       no       13.0       2       1
2     50  no yes   no 13-15years white       no       24.0       0       1
3     40 yes  no  yes    12years white       no       43.0       0       0
4     40  no yes  yes 13-15years white       no       17.0       0       1
5      0 yes  no  yes  9-11years white       no       44.5       0       0
6     40 yes yes  yes    12years white       no       32.0       0       0
   husby       region   wght
1 11.960 northcentral 214986
2  1.200 northcentral 210119
3 31.275 northcentral 219955
4  9.000 northcentral 210317
5  0.000 northcentral 219955
6 15.690 northcentral 208148


Insurance Vs Race

Lets look at the count of records in the distribution of whi X race
whi is an indicator of whether wife has a health insurance through her husband.Description of rest of the columns can be found from the below link

https://vincentarelbundock.github.io/Rdatasets/doc/Ecdat/HI.html

interim.df<-df%>%
  select(race,whi)%>%
  group_by(race,whi)%>%
  summarise(CountRecords=n())%>%
  spread(whi,CountRecords)%>%
  as.data.frame()

interim.df
   race    no  yes
1 white 13121 7739
2 black   715  526
3 other   125   46


Contingency table for all categorical variables

colnames(df)
 [1] "whrswk"     "hhi"        "whi"        "hhi2"       "education" 
 [6] "race"       "hispanic"   "experience" "kidslt6"    "kids618"   
[11] "husby"      "region"     "wght"      
varnm<-c("hhi","whi","hhi2","education","race","hispanic","kidslt6","kids618","region")

l1<-list()
test<-1
for(i in varnm){
  
  for(j in varnm){
    
    test<-test+1
    
    if(i==j){
      
      temp<-1
    }else{
      
    final.df<-df%>%
      select(i,j)%>%
      group_by(df[,i],df[,j])%>%
      summarise(CountRecords=n())%>%
      mutate(GroupedVar1=i,GroupedVar2=j)%>%
      select(GroupedVar1,GroupedVar2,everything())
    
    colnames(final.df)<-c("GroupedVar1","GroupedVar1","Var1Val","Var2Val","CountRecords")
    
    l1[[test]]<-final.df
      
    }
    
  }
  
}

results.df<-do.call(rbind.data.frame,l1)
head(results.df)
# A tibble: 6 x 5
  GroupedVar1 GroupedVar1 Var1Val Var2Val CountRecords
  <chr>       <chr>       <fct>   <fct>          <int>
1 hhi         whi         no      no              5260
2 hhi         whi         no      yes             5959
3 hhi         whi         yes     no              8701
4 hhi         whi         yes     yes             2352
5 hhi         hhi2        no      no              8696
6 hhi         hhi2        no      yes             2523


results.df contains a sort of cross tab for all the categorical variable from the input data

Final Comments

This is a useful hack that helps us understand the interaction effects between different categorical variables.This come handy when we are trying to analyse survey data


No comments:

Post a Comment

Web Scraping Tutorial 4- Getting the busy information data from Popular time page from Google

Popular Times Popular Times In this blog we will try to scrape the ...