Density Plots
Parag Verma
Introduction
Density plots are a great way to understand the distribution of a numeric variable. They can also show how the different levels of a categorical variable affect a numeric variable.
Installing the libraries: dplyr, tidyr, Ecdat and ggplot2
# Packages used in this post
package.name <- c("dplyr", "tidyr", "Ecdat", "ggplot2")
for (i in package.name) {
  # Install the package only if it cannot be loaded
  if (!require(i, character.only = TRUE)) {
    install.packages(i)
  }
  library(i, character.only = TRUE)
}
# Ecdat package has the 'Health Insurance and Hours Worked By Wives' data
data(HI)
df<-HI
head(df)
  whrswk hhi whi hhi2  education  race hispanic experience kidslt6 kids618
1      0  no  no   no 13-15years white       no       13.0       2       1
2     50  no yes   no 13-15years white       no       24.0       0       1
3     40 yes  no  yes    12years white       no       43.0       0       0
4     40  no yes  yes 13-15years white       no       17.0       0       1
5      0 yes  no  yes  9-11years white       no       44.5       0       0
6     40 yes yes  yes    12years white       no       32.0       0       0
   husby       region   wght
1 11.960 northcentral 214986
2  1.200 northcentral 210119
3 31.275 northcentral 219955
4  9.000 northcentral 210317
5  0.000 northcentral 219955
6 15.690 northcentral 208148
Step 1: Let’s look at the spread of ‘husby’ (husband’s income) across ‘region’
An important step here is to convert the categorical variable into a factor, so that ggplot2 treats each region as a separate group.
interim.df<-df%>%
select(region,husby)
interim.df$region<-as.factor(interim.df$region)
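To check that the conversion did what was intended, it can help to look at the factor levels and the number of rows per level. The sketch below uses a small made-up data frame (the `toy` object and its values are hypothetical, not taken from the HI data) so that it runs with base R alone.

```r
# Hypothetical toy data: region stored as plain character strings
toy <- data.frame(
  region = c("south", "west", "south", "other"),
  husby  = c(10, 20, 15, 5),
  stringsAsFactors = FALSE
)

# Convert the categorical column into a factor
toy$region <- as.factor(toy$region)

# Inspect the result: the levels and how many rows fall in each
is.factor(toy$region)
levels(toy$region)
table(toy$region)
```

The same three calls can be run on `interim.df$region`. If a column is already a factor, `as.factor()` returns it unchanged, so the conversion is safe to apply either way.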
Step 2: Density plot to look at the distribution of ‘husby’, first overall and then split by ‘region’
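# Density of husband's income across all regions
p <- ggplot(interim.df, aes(x = husby)) +
  geom_density(color = "darkblue", fill = "lightblue") +
  # Dashed vertical line at the overall mean
  # ('linewidth' replaces 'size' for lines from ggplot2 3.4.0; use size = 1 on older versions)
  geom_vline(aes(xintercept = mean(husby)),
             color = "blue", linetype = "dashed", linewidth = 1)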
plot(p)
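The shape of a density curve depends on the smoothing bandwidth. As a rough illustration, the sketch below uses base R's `density()` on simulated right-skewed values (the data are made up with `rlnorm()` and only stand in for an income-like variable) and compares the bandwidth actually used for three values of `adjust`.

```r
# Simulated, made-up right-skewed values standing in for an income-like variable
set.seed(42)
x <- rlnorm(500, meanlog = 2.5, sdlog = 0.6)

# Kernel density estimates with different bandwidth multipliers
d.default <- density(x)                # adjust = 1
d.smooth  <- density(x, adjust = 2)    # wider bandwidth, smoother curve
d.rough   <- density(x, adjust = 0.5)  # narrower bandwidth, wigglier curve

# The bandwidth used is the default bandwidth multiplied by 'adjust'
c(default = d.default$bw, smooth = d.smooth$bw, rough = d.rough$bw)

# Overlay the three curves for a visual comparison
plot(d.rough, main = "Effect of 'adjust' on a density estimate", col = "grey50")
lines(d.default, col = "darkblue")
lines(d.smooth, col = "red")
```

`geom_density()` accepts the same `adjust` argument, so something like `geom_density(adjust = 2)` should give a smoother version of the plot above.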
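The ‘fill’ of the plot is worth experimenting with when there is only a single density curve. When several curves are drawn on the same axes, opaque fills hide the curves behind them and make the plot difficult to interpret, so the plot below distinguishes the regions by line color only.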
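# Mean husband's income per region, used for the vertical reference lines
mean.df <- interim.df %>%
  group_by(region) %>%
  summarise(gp.mean = mean(husby))
# One density curve per region, each with a dashed line at its group mean
p <- ggplot(interim.df, aes(x = husby, color = region)) +
  geom_density() +
  geom_vline(data = mean.df, aes(xintercept = gp.mean, color = region),
             linetype = "dashed")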
plot(p)
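If a filled look is still wanted for the grouped plot, one possible compromise is a semi-transparent fill, so that overlapping curves remain visible. This is a sketch rather than part of the original analysis: it reloads the HI data directly, and the `alpha` value of 0.3 and the object name `p.fill` are arbitrary choices.

```r
library(ggplot2)
library(Ecdat)

# Same data as above: 'Health Insurance and Hours Worked By Wives'
data(HI)

# Semi-transparent fill per region; alpha = 0.3 is an arbitrary choice
p.fill <- ggplot(HI, aes(x = husby, fill = region, color = region)) +
  geom_density(alpha = 0.3)

print(p.fill)
```

With more than a handful of groups even transparent fills tend to get muddy, in which case the outline-only version above is usually easier to read.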
Final Comments
We can see that the regions don’t vary much in terms of husband’s income. Hence the regions can be considered similar as far as this indicator is concerned.
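A visual impression like this can be backed up with a few summary numbers per region. The sketch below is one way to do that with dplyr; the choice of statistics and the column names `mean.husby`, `median.husby` and `iqr.husby` are assumptions made for illustration, and `na.rm = TRUE` is added only as a precaution.

```r
library(dplyr)
library(Ecdat)

# Same data as above: 'Health Insurance and Hours Worked By Wives'
data(HI)

# Centre and spread of husband's income within each region
region.summary <- HI %>%
  group_by(region) %>%
  summarise(
    n            = n(),
    mean.husby   = mean(husby, na.rm = TRUE),
    median.husby = median(husby, na.rm = TRUE),
    iqr.husby    = IQR(husby, na.rm = TRUE)
  )

print(region.summary)
```

If the means, medians and interquartile ranges are close across regions, that supports the reading of the density plot; large gaps would suggest looking at the curves again.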
List of Datasets for Practice
https://vincentarelbundock.github.io/Rdatasets/datasets.html