Machine Learning Made Easy

Tips to Generate Plots from a Data Frame in R

Why should you bother?

Almost all Data Science and ML techniques rely heavily on a structured dataset. Being able to modify and play around with a data frame is key to almost every task. From generating initial exploratory analysis results and plots to creating cohorts, slicing and dicing a dataset is the most important thing a Data Scientist does. In this blog, we will look at a few tips that can simplify the task of filtering a dataset. I will also share some hacks that I use while handling a dataset.

US Health Insurance: Dataset of Choice

Let's use the ‘HI’ dataset from the Ecdat package for our exercise. The HI dataset contains details about health insurance and hours worked by wives in the US.

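# Install Ecdat only if it is missing, then load it
# (require() already loads the package when it is installed)
if(!require("Ecdat")){
  install.packages("Ecdat")
  library(Ecdat)
}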

data(HI)
df<-HI
colnames(df)

 [1] "whrswk"     "hhi"        "whi"        "hhi2"       "education" 
 [6] "race"       "hispanic"   "experience" "kidslt6"    "kids618"   
[11] "husby"      "region"     "wght"

The description of each field is:
whrswk: hours worked per week by wife
hhi: wife covered by husband’s HI ?
whi: wife has HI thru her job ?
hhi2: husband has HI thru own job ?
education: a factor with levels, “<9years”, “9-11years”, “12years”, “13-15years”, “16years”, “>16years”
race: one of white, black, other
hispanic: hispanic ?
experience: years of potential work experience
kidslt6: number of kids under age of 6
kids618: number of kids 6-18 years old
husby: husband’s income in thousands of dollars
region: one of other, northcentral, south, west
wght: sampling weight

head(df[,1:5])

  whrswk hhi whi hhi2  education
1      0  no  no   no 13-15years
2     50  no yes   no 13-15years
3     40 yes  no  yes    12years
4     40  no yes  yes 13-15years
5      0 yes  no  yes  9-11years
6     40 yes yes  yes    12years

The dataset contains information on the weekly working hours of wives and various factors associated with them. It also contains health coverage information in the form of hhi, whi and hhi2. In the subsequent sections we will look at how to take subsets of the dataset to create meaningful cuts.
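As a small illustration of such a cut, rows can be selected with a logical condition on one or more columns. The sketch below uses a tiny made-up data frame (toy and its values are hypothetical; only the column names mirror HI) so that it runs without the package. The same kind of expression should work on df.

```r
# Hypothetical toy data; only the column names mirror the HI dataset
toy <- data.frame(
  whrswk = c(0, 50, 40, 40, 0, 40),
  whi    = c("no", "yes", "no", "yes", "no", "yes"),
  region = c("south", "west", "south", "south", "other", "west"),
  stringsAsFactors = FALSE
)

# Keep rows where the wife has HI through her own job AND lives in the south
cut.df <- toy[toy[["whi"]] == "yes" & toy[["region"]] == "south", ]

nrow(cut.df)               # number of rows in the cut
mean(cut.df[["whrswk"]])   # average weekly hours within the cut
```

Combining conditions with & (and) or | (or) inside the row index keeps the cut in a single readable line.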

Average Hours Clocked by Wives

Let's look at the average number of hours worked per week by wives

mean(df[["whrswk"]])

[1] 25.56681

About 25.6 hours per week! Let's look at the standard deviation

sd(df[["whrswk"]])

[1] 18.71065

A standard deviation of 18.7 is very high: it is about 73% of the mean itself.
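One way to put "very high" on a scale is the coefficient of variation, i.e. the standard deviation divided by the mean. A minimal sketch follows; cv.ratio is a made-up helper name, not part of any package, and the first calculation simply reuses the two figures printed above.

```r
# Hypothetical helper: coefficient of variation = sd / mean
cv.ratio <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

# Using the sd and mean printed above for whrswk
18.71065 / 25.56681      # roughly 0.73

# Small check on made-up numbers: sd = 10, mean = 20
cv.ratio(c(10, 20, 30))  # 0.5

# On the data frame itself this would be: cv.ratio(df[["whrswk"]])
```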
Let's see how the average weekly hours fare when we bring in the ‘region’ variable

Average Hours Clocked by Wives for Different Regions

northcentral

df$region<-as.character(df$region)
df.northcentral<-df[df[["region"]]=="northcentral",]
northcentral<-mean(df.northcentral[["whrswk"]])
northcentral

[1] 26.37935

south

df.south<-df[df[["region"]]=="south",]
south<-mean(df.south[["whrswk"]])
south

[1] 25.81469

west

df.west<-df[df[["region"]]=="west",]
west<-mean(df.west[["whrswk"]])
west

[1] 24.86571

other

df.other<-df[df[["region"]]=="other",]
other<-mean(df.other[["whrswk"]])
other

[1] 25.03424
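The four blocks above repeat the same filter-then-average step. Base R's tapply() can do this for every region in one call. A minimal sketch on a made-up data frame (toy and its values are hypothetical); on the real data the call would be tapply(df[["whrswk"]], df[["region"]], mean).

```r
# Hypothetical toy data with the same two column names as HI
toy <- data.frame(
  whrswk = c(40, 20, 30, 50, 0, 10),
  region = c("south", "south", "west", "west", "other", "other"),
  stringsAsFactors = FALSE
)

# One mean per region; the result is named by region
region.means <- tapply(toy[["whrswk"]], toy[["region"]], mean)
region.means
```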

Group-wise averages like these are also useful when we need to replace/impute NA values within a vector.
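As a sketch of that imputation idea: ave() returns, for every row, a summary of that row's own group, which can then be used to fill the gaps. The data below are made up (whrswk in HI may not contain any NA at all), and replacing by the group mean is just one simple strategy among many.

```r
# Hypothetical toy data with two missing values
toy <- data.frame(
  whrswk = c(40, NA, 20, 30, NA, 50),
  region = c("south", "south", "south", "west", "west", "west"),
  stringsAsFactors = FALSE
)

# Mean of each row's own region, ignoring NA when averaging
group.mean <- ave(toy[["whrswk"]], toy[["region"]],
                  FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing entries with their region's mean
missing <- is.na(toy[["whrswk"]])
toy[["whrswk"]][missing] <- group.mean[missing]
toy
```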

Plotting the results

final.df<-data.frame(Region=c("northcentral","south","west","other"),
                     Average.Hours=c(northcentral,south,west,other))

row.names(final.df)<-final.df$Region
final.df$Region<-NULL
final.df<-as.matrix(final.df)
barplot(t(final.df),ylab="Average Hours",xlab="Region",main="Average Hours Vs Region",col = "orange")
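barplot() also accepts a named numeric vector, so the row-name and matrix steps can be skipped if preferred. A minimal sketch that hard-codes the four averages printed above so that it runs on its own:

```r
# Averages copied from the output printed above
avg.hours <- c(northcentral = 26.37935, south = 25.81469,
               west = 24.86571, other = 25.03424)

# The names of the vector become the bar labels
barplot(avg.hours, ylab = "Average Hours", xlab = "Region",
        main = "Average Hours Vs Region", col = "orange")

# Gap between the highest and lowest regional average, in hours
diff(range(avg.hours))   # about 1.5
```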

We can see that the average work hours barely differ across regions, so region is not a good indicator of why hours vary across the dataset. Next, we will create the same kind of plot for other features using a dynamic piece of code and analyse the results.

Plotting the results: Other Features

library(dplyr)  # across(), all_of() and .groups need dplyr 1.0.0 or later
cat.var<-c("hhi","whi","hhi2","education","race","hispanic","kidslt6","kids618")
target.var<-"whrswk"

for(i in cat.var){

  # Treat the feature as categorical
  df[[i]]<-as.character(df[[i]])

  # Average of the target variable for each level of feature i
  temp.df<-df%>%
    group_by(across(all_of(i)))%>%
    summarise(Average.Hours=mean(.data[[target.var]]),.groups="drop")

  # Named vector: one bar per level, labelled with the level
  avg.hours<-setNames(temp.df$Average.Hours,temp.df[[i]])
  barplot(avg.hours,ylab="Average Hours",xlab=i,main=paste0("Average Hours Vs ",i),col = "orange",
          ylim=range(pretty(c(0, avg.hours))))
}

From the plots, it is clear that all variables except region and hhi2 have a visible impact on weekly hours. The average calculated across the unique values of region (or, say, hhi2) is almost the same, so these two variables are of very little importance in explaining the variance in weekly hours.
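To back a visual reading like this with numbers, one option is to compute, for each feature, the gap between its highest and lowest group average: a small gap suggests the feature separates weekly hours poorly. The sketch below is only an illustration under assumptions and was not part of the analysis above: avg.gap is a hypothetical helper and toy is made-up data.

```r
# Hypothetical helper: for each feature, the difference between the
# largest and smallest group mean of the target variable
avg.gap <- function(data, features, target) {
  sapply(features, function(f) {
    group.means <- tapply(data[[target]], as.character(data[[f]]), mean)
    diff(range(group.means))
  })
}

# Made-up data: whi separates the hours, region does not
toy <- data.frame(
  whrswk = c(40, 40, 0, 0, 20, 20),
  whi    = c("yes", "yes", "no", "no", "no", "no"),
  region = c("south", "west", "south", "west", "south", "west"),
  stringsAsFactors = FALSE
)

avg.gap(toy, c("whi", "region"), "whrswk")
```

On the real data the call would look like avg.gap(df, cat.var, target.var). The gap ignores group sizes, so it is a rough screen rather than a test of significance.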

Final Comments

We saw that, using the dplyr library and a for loop, we can create plots dynamically to analyse how the average varies across the levels of a categorical variable. This is a very powerful hack and gives us a clear understanding of which attributes to consider in a subsequent model-building exercise. I will write a detailed blog on dplyr and data manipulation later on, but wanted to give a flavour of how plots can be generated with ease.

Link to Previous R Blogs

Blog 1 - Vectors, Matrices, Lists and Data Frame in R https://mlmadeeasy.blogspot.com/2019/12/2datatypesr.html
Blog 2 - Operators in R https://mlmadeeasy.blogspot.com/2019/12/blog-2-operators-in-r.html
Blog 3 - Loops in R https://mlmadeeasy.blogspot.com/2019/12/blog-3-loops-in-r.html
Blog 4 - Indexing in R https://mlmadeeasy.blogspot.com/2019/12/blog-4-indexing-in-r.html
Blog 5 - Handling NA in R https://mlmadeeasy.blogspot.com/2019/12/blog-5-handling-na-in-r.html
List of Datasets for Practice https://hofmann.public.iastate.edu/data_in_r_sortable.html