Contingency Table
Parag Verma
Introduction
While analysing survey datasets, we are sometimes required to find the cross between gender and spend or say between geneder and product category. This helps in calculating the interaction effects between variables and also helps in EDA.In this blog we will look at an insurance dataset and how dplyr can be used to create contingency table
Installing the library: dplyr,tidyr and Ecdat package
package.name<-c("dplyr","tidyr","Ecdat")
for(i in package.name){
if(!require(i,character.only = T)){
install.packages(i)
}
library(i,character.only = T)
}
# Ecdat package has the 'Health Insurance and Hours Worked By Wives' data
data(HI)
df<-HI
head(df)
whrswk hhi whi hhi2 education race hispanic experience kidslt6 kids618
1 0 no no no 13-15years white no 13.0 2 1
2 50 no yes no 13-15years white no 24.0 0 1
3 40 yes no yes 12years white no 43.0 0 0
4 40 no yes yes 13-15years white no 17.0 0 1
5 0 yes no yes 9-11years white no 44.5 0 0
6 40 yes yes yes 12years white no 32.0 0 0
husby region wght
1 11.960 northcentral 214986
2 1.200 northcentral 210119
3 31.275 northcentral 219955
4 9.000 northcentral 210317
5 0.000 northcentral 219955
6 15.690 northcentral 208148
Insurance Vs Race
Lets look at the count of records in the distribution of whi X race
whi is an indicator of whether wife has a health insurance through her husband.Description of rest of the columns can be found from the below link
https://vincentarelbundock.github.io/Rdatasets/doc/Ecdat/HI.html
interim.df<-df%>%
select(race,whi)%>%
group_by(race,whi)%>%
summarise(CountRecords=n())%>%
spread(whi,CountRecords)%>%
as.data.frame()
interim.df
race no yes
1 white 13121 7739
2 black 715 526
3 other 125 46
Contingency table for all categorical variables
colnames(df)
[1] "whrswk" "hhi" "whi" "hhi2" "education"
[6] "race" "hispanic" "experience" "kidslt6" "kids618"
[11] "husby" "region" "wght"
varnm<-c("hhi","whi","hhi2","education","race","hispanic","kidslt6","kids618","region")
l1<-list()
test<-1
for(i in varnm){
for(j in varnm){
test<-test+1
if(i==j){
temp<-1
}else{
final.df<-df%>%
select(i,j)%>%
group_by(df[,i],df[,j])%>%
summarise(CountRecords=n())%>%
mutate(GroupedVar1=i,GroupedVar2=j)%>%
select(GroupedVar1,GroupedVar2,everything())
colnames(final.df)<-c("GroupedVar1","GroupedVar1","Var1Val","Var2Val","CountRecords")
l1[[test]]<-final.df
}
}
}
results.df<-do.call(rbind.data.frame,l1)
head(results.df)
# A tibble: 6 x 5
GroupedVar1 GroupedVar1 Var1Val Var2Val CountRecords
<chr> <chr> <fct> <fct> <int>
1 hhi whi no no 5260
2 hhi whi no yes 5959
3 hhi whi yes no 8701
4 hhi whi yes yes 2352
5 hhi hhi2 no no 8696
6 hhi hhi2 no yes 2523
results.df contains a sort of cross tab for all the categorical variable from the input data
Final Comments
This is a useful hack that helps us understand the interaction effects between different categorical variables.This come handy when we are trying to analyse survey data
Link to Previous R Blogs
List of Datasets for Practise
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html