Calculating Percentiles
Parag Verma
Introduction
In this blog we will look at how to calculate percentile summaries for a cigarette consumption dataset. Consumption data is easy to understand, which is why it has been taken as the reference example for this blog. We will mostly look at the following:
- Calculate percentiles
- Summaries in Group
- Visual Representation
- Make sense of Numbers
Installing the libraries: dplyr, tidyr, stringr, ggplot2 and the Ecdat package
# require() loads a package when it is available and returns FALSE otherwise,
# so install the package only when it is missing and then load it
if(!require("dplyr")){
  install.packages("dplyr")
  library(dplyr)
}
if(!require("tidyr")){
  install.packages("tidyr")
  library(tidyr)
}
if(!require("stringr")){
  install.packages("stringr")
  library(stringr)
}
# For downloading the Cigarette Data
if(!require("Ecdat")){
  install.packages("Ecdat")
  library(Ecdat)
}
data(Cigar)
df<-Cigar
dim(df)
[1] 1380 9
# For plotting
if(!require("ggplot2")){
  install.packages("ggplot2")
  library(ggplot2)
}
Description of the Data set
It is a panel of 46 US states observed yearly from 1963 to 1992, which gives the 46 x 30 = 1380 rows reported above. A brief description of the columns is given below:
- state: state abbreviation
- year: the year
- price: price per pack of cigarettes
- pop: population
- pop16: population above the age of 16
- cpi: consumer price index (1983=100)
- ndi: per capita disposable income
- sales: cigarette sales in packs per capita
- pimin: minimum price in adjoining states per pack of cigarettes
We will mostly look at the sales variable and its spread across state and year.
Box plot of sales across state
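# Treat state as a categorical variable so that each state gets its own box
df$state<-as.factor(df$state)
# Basic box plot; fatten=10 thickens the median line.
# sales is not mapped to fill because it is continuous and varies within each box
p <- ggplot(df, aes(x=state, y=sales)) +
  geom_boxplot(fatten=10)
p + theme_classic()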
We can see that the median sales vary considerably across states and fluctuate between 100 and 120. Let us also take a look at the distribution of the IQR.
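The medians and quartiles drawn in the box plot can also be computed as numbers. The following is a minimal sketch, assuming the dplyr and Ecdat packages are installed; the name pct_by_state and the particular percentiles chosen (10th, 25th, 50th, 75th, 90th) are illustrative choices rather than part of the original analysis.

```r
library(dplyr)
library(Ecdat)   # provides the Cigar data set

data(Cigar)

# One row per state with selected percentiles of per-capita sales
pct_by_state <- Cigar %>%
  group_by(state) %>%
  summarise(
    p10 = quantile(sales, 0.10),
    p25 = quantile(sales, 0.25),
    p50 = quantile(sales, 0.50),  # the median
    p75 = quantile(sales, 0.75),
    p90 = quantile(sales, 0.90),
    .groups = "drop"
  )

head(pct_by_state)
```

Sorting this table by p50 is a quick way to see which states sit at the low and high end of the median range.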
Visualizing the Interquartile Range (IQR): Measure of Spread
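# Attach each state's IQR of sales to every row of that state
interim.df<-df%>%
  group_by(state)%>%
  mutate(sales_iqr=IQR(sales))%>%
  ungroup()
# fatten=0 hides the median line, coef=0 removes the whiskers and
# outlier.shape=NA hides the outliers, so only the IQR box is drawn
p <- ggplot(interim.df, aes(x=state, y=sales,fill=sales_iqr))+
  geom_boxplot(fatten=0,outlier.shape = NA, coef = 0)
p2<-p + theme_classic()
p2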
We can see that the bars with small heights have less dispersion. This means that the values are very close to each other and some kind of generalisation can be made. However, the bars with more height have more dispersion in the data, so we need to do some outlier correction before we make generalisations for these states.
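To make the IQR concrete, here is a small worked sketch on made-up numbers (they are not taken from the Cigar data). It computes the IQR as the 75th percentile minus the 25th percentile and then flags outliers with the common 1.5 x IQR rule of thumb, which is also the default whisker length (coef = 1.5) in geom_boxplot. This is only one possible convention for outlier detection, not necessarily the correction intended above.

```r
# Made-up sales figures for one hypothetical state (not taken from Cigar)
sales <- c(10, 20, 30, 40, 50, 100)

# 25th and 75th percentiles
q <- quantile(sales, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]           # same value as IQR(sales)

# Rule of thumb: flag points more than 1.5 * IQR beyond the quartiles
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
is_outlier <- sales < lower | sales > upper

iqr                 # 25
sales[is_outlier]   # 100
```

Here the quartiles are 22.5 and 47.5, so the upper limit is 47.5 + 1.5 x 25 = 85 and only the value 100 is flagged.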
Using IQR to divide the dataset into Groups
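# Cut-offs of 11 and 23 on the per-state IQR split the states into
# low (G1), medium (G2) and high (G3) variability groups
finaldf<-interim.df%>%
  mutate(Group_IQR=ifelse(sales_iqr < 11,"G1",
                   ifelse(sales_iqr < 23 ,"G2","G3")))
# Same IQR-only boxes as before, now coloured by group
p <- ggplot(finaldf, aes(x=state, y=sales,color=Group_IQR,fill=Group_IQR))+
  geom_boxplot(fatten=0,outlier.shape = NA, coef = 0)
p2<-p+ theme_classic()
p2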
We can see that using the IQR, we can divide the states into clusters based on the variability in their sales data. This is a very important step in understanding the differences amongst states.
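To make sense of the numbers behind the plot, it can help to count how many states land in each group. The sketch below rebuilds the per-state IQR directly from the Cigar data and applies the same cut-offs of 11 and 23 used above; the name group_sizes is only illustrative, and dplyr and Ecdat are assumed to be installed.

```r
library(dplyr)
library(Ecdat)   # provides the Cigar data set

data(Cigar)

group_sizes <- Cigar %>%
  group_by(state) %>%
  summarise(sales_iqr = IQR(sales), .groups = "drop") %>%  # one row per state
  mutate(Group_IQR = ifelse(sales_iqr < 11, "G1",
                     ifelse(sales_iqr < 23, "G2", "G3"))) %>%
  count(Group_IQR)                                         # states per group

group_sizes
```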
Final Comments
The main trick here is to use the variability in the data, measured through percentiles, to create natural groups within the data.
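One way to let percentiles choose the group boundaries themselves is to cut the per-state IQR at its own terciles. This is a sketch of an alternative, not the method used in this blog, which used fixed cut-offs; the names state_iqr and cuts are illustrative, and dplyr and Ecdat are assumed to be installed.

```r
library(dplyr)
library(Ecdat)   # provides the Cigar data set

data(Cigar)

# Per-state IQR of sales, one row per state
state_iqr <- Cigar %>%
  group_by(state) %>%
  summarise(sales_iqr = IQR(sales), .groups = "drop")

# Assumption: use the 1/3 and 2/3 percentiles of the IQR values as cut points
cuts <- unname(quantile(state_iqr$sales_iqr, probs = c(1/3, 2/3)))

state_iqr <- state_iqr %>%
  mutate(Group_IQR = case_when(
    sales_iqr < cuts[1] ~ "G1",   # lowest third of variability
    sales_iqr < cuts[2] ~ "G2",   # middle third
    TRUE                ~ "G3"    # highest third
  ))

count(state_iqr, Group_IQR)
```

With this choice the three groups come out roughly equal in size, which may or may not match the natural clusters seen in the plot.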
List of Datasets for Practice
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html