Tips to Generate Plots in a Data Frame in R
Parag Verma
Why should you bother?
Almost all Data Science and ML techniques rely heavily on a structured dataset. Being able to modify and play around with a data frame is key to almost every task. Right from generating initial exploratory analysis results and plots to creating cohorts, slicing and dicing a dataset is the most important thing a Data Scientist does. In this blog, we will look at a few tips that can simplify the task of filtering a dataset. I will also share some hacks that I use while handling a dataset.
US Health Insurance: Dataset of Choice
Let's use the ‘HI’ dataset for our exercise. The HI dataset contains details about Health Insurance and Hours Worked by Wives in the US.
# Install Ecdat if it is missing, then load it (the HI dataset ships with Ecdat)
if(!require("Ecdat")){
install.packages("Ecdat")
library(Ecdat)
}
data(HI)
df<-HI
colnames(df)
[1] "whrswk" "hhi" "whi" "hhi2" "education"
[6] "race" "hispanic" "experience" "kidslt6" "kids618"
[11] "husby" "region" "wght"
The description of each field is:
whrswk: hours worked per week by wife
hhi: wife covered by husband’s HI ?
whi: wife has HI thru her job ?
hhi2: husband has HI thru own job ?
education: a factor with levels, “<9years”, “9-11years”, “12years”, “13-15years”, “16years”, “>16years”
race: one of white, black, other
hispanic: hispanic ?
experience: years of potential work experience
kidslt6: number of kids under age of 6
kids618: number of kids 6-18 years old
husby: husband’s income in thousands of dollars
region: one of other, northcentral, south, west
wght: sampling weight
head(df[,1:5])
whrswk hhi whi hhi2 education
1 0 no no no 13-15years
2 50 no yes no 13-15years
3 40 yes no yes 12years
4 40 no yes yes 13-15years
5 0 yes no yes 9-11years
6 40 yes yes yes 12years
The dataset contains information on the weekly working hours of wives and various factors associated with them. It also contains health coverage information in the form of hhi, whi and hhi2. In the subsequent sections, we will look at how to create subsets of the dataset that give meaningful cuts.
Average Hours clocked by Wives
Let's look at the average number of hours worked per week by wives.
mean(df[["whrswk"]])
[1] 25.56681
About 25 hours per week! Let's look at the standard deviation.
sd(df[["whrswk"]])
[1] 18.71065
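One way to judge whether a standard deviation is large is to divide it by the mean (the coefficient of variation). The sketch below does this for the six whrswk values shown in the head() output earlier, not the full dataset; the variable names are illustrative only.

```r
# First six whrswk values from the head() output shown earlier
hours <- c(0, 50, 40, 40, 0, 40)

avg_hours <- mean(hours)             # arithmetic mean
sd_hours  <- sd(hours)               # sample standard deviation
cv_hours  <- sd_hours / avg_hours    # coefficient of variation (unitless)

round(c(mean = avg_hours, sd = sd_hours, cv = cv_hours), 2)
```

A ratio approaching 1 means the typical deviation is about as large as the average itself, so the mean alone describes the data poorly.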
18.7 is very high: it is nearly three-quarters of the mean itself. Let's see how the average weekly hours fares when we bring in the ‘region’ variable.
Average Hours clocked by Wives for different regions
northcentral
df$region<-as.character(df$region)
df.northcentral<-df[df[["region"]]=="northcentral",]
northcentral<-mean(df.northcentral[["whrswk"]])
northcentral
[1] 26.37935
south
df.south<-df[df[["region"]]=="south",]
south<-mean(df.south[["whrswk"]])
south
[1] 25.81469
west
df.west<-df[df[["region"]]=="west",]
west<-mean(df.west[["whrswk"]])
west
[1] 24.86571
other
df.other<-df[df[["region"]]=="other",]
other<-mean(df.other[["whrswk"]])
other
[1] 25.03424
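The four filter-then-average steps above can also be collapsed into a single call. Here is a minimal sketch using base R's tapply() and aggregate() on a small made-up data frame (the values are invented for illustration; only the column names match the HI dataset).

```r
# Toy data (made-up values) with the same two column names as the post
toy <- data.frame(
  region = c("south", "west", "south", "other", "west", "northcentral"),
  whrswk = c(40, 20, 30, 0, 40, 25)
)

# tapply() splits whrswk by region and applies mean() to each group
region.means <- tapply(toy$whrswk, toy$region, mean)
region.means

# The same averages as a data frame, via the formula interface of aggregate()
region.df <- aggregate(whrswk ~ region, data = toy, FUN = mean)
region.df
```

On the real data, the equivalent call would be tapply(df$whrswk, df$region, mean), which returns one named average per region.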
These region-wise averages will also be useful when we are trying to replace/impute NA values within a vector.
Plotting the results
final.df<-data.frame(Region=c("northcentral","south","west","other"),
Average.Hours=c(northcentral,south,west,other))
row.names(final.df)<-final.df$Region
final.df$Region<-NULL
final.df<-as.matrix(final.df)
barplot(t(final.df),ylab="Average Hours",xlab="Region",main="Average Hours Vs Region",col = "orange")
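The data frame to matrix to transpose route works, but barplot() also accepts a named numeric vector directly. A minimal sketch of that alternative, using the four averages printed earlier (the name avg.hours is illustrative):

```r
# Region averages as printed in the outputs above
avg.hours <- c(northcentral = 26.37935, south = 25.81469,
               west = 24.86571, other = 25.03424)

# barplot() uses the vector's names as bar labels and returns
# the x-coordinates of the bar midpoints
mids <- barplot(avg.hours, ylab = "Average Hours", xlab = "Region",
                main = "Average Hours Vs Region", col = "orange")
```

The returned midpoints are handy if value labels need to be added on top of the bars with text().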
We can see that there is not much difference in the average work hours across regions, and hence region is not a good indicator of why hours differ across the dataset. Now we will create the plots using a dynamic piece of code and analyse the results.
Plotting the results: Other Features
library(dplyr)
# Variables to group by (treated as categorical) and the target to average
cat.var<-c("hhi","whi","hhi2","education","race","hispanic","kidslt6","kids618")
target.var<-"whrswk"
for(i in cat.var){
# Average of the target for each unique value of the current variable
temp.df<-df%>%
group_by(across(all_of(i)))%>%
summarise(Average.Hours=mean(.data[[target.var]]),.groups="drop")
# Named vector: bar heights labelled by the values of the variable
avg<-setNames(temp.df$Average.Hours,as.character(temp.df[[i]]))
barplot(avg,ylab="Average Hours",xlab=i,main=paste0("Average Hours Vs ",i),col = "orange",
ylim=range(pretty(c(0, avg))))
}
From the plots, it is clear that all variables except region and hhi2 show a visible effect on weekly hours. The average calculated across unique values of region (or, say, hhi2) is almost the same, and hence these two are of very little importance in explaining the variance in weekly hours.
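A reading of the plots can be backed up with a number: for each variable, take the gap between its largest and smallest group average. The sketch below is one possible way to do this in base R; the helper mean.spread and the toy data are hypothetical and not part of the original analysis.

```r
# Toy data (made-up values); column names mirror the post
toy <- data.frame(
  whi    = c("yes", "no", "yes", "no", "yes", "no"),
  region = c("south", "west", "west", "south", "other", "other"),
  whrswk = c(40, 10, 45, 5, 35, 15)
)

# Hypothetical helper: gap between the largest and smallest group mean
mean.spread <- function(data, var, target = "whrswk") {
  group.means <- tapply(data[[target]], data[[var]], mean)
  max(group.means) - min(group.means)
}

# One spread per variable; a larger gap suggests a more informative variable
spreads <- sapply(c("whi", "region"), mean.spread, data = toy)
sort(spreads, decreasing = TRUE)
```

A small gap says the group averages are nearly flat, which is the pattern described for region. This is only a rough screen, since it ignores group sizes and within-group variation.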
Final Comments
We saw that by using the dplyr library and a for loop, we can create dynamic plots to analyse the distribution of averages across a categorical variable. This is a very powerful hack and gives us a clear understanding of which attributes to consider in a subsequent model building exercise. I will create a detailed blog on dplyr and data manipulation later on, but wanted to give a flavour of how plots can be generated with ease.
Link to Previous R Blogs
Blog 1 - Vectors, Matrices, Lists and Data Frame in R https://mlmadeeasy.blogspot.com/2019/12/2datatypesr.html
Blog 2 - Operators in R https://mlmadeeasy.blogspot.com/2019/12/blog-2-operators-in-r.html
Blog 3 - Loops in R https://mlmadeeasy.blogspot.com/2019/12/blog-3-loops-in-r.html
Blog 4 - Indexing in R https://mlmadeeasy.blogspot.com/2019/12/blog-4-indexing-in-r.html
Blog 5 - Handling NA in R https://mlmadeeasy.blogspot.com/2019/12/blog-5-handling-na-in-r.html
List of Datasets for Practise https://hofmann.public.iastate.edu/data_in_r_sortable.html