Machine Learning Made Easy

Calculating Percentiles

Introduction

In this blog we will look at how to calculate percentile summaries for a cigarette consumption dataset. Consumption data is easy to understand, which is why it has been chosen as the reference example for this blog. We will mainly look at the following:

Calculate percentiles
Summaries in Group
Visual Representation
Make sense of Numbers

Installing the libraries: dplyr, tidyr, stringr and the Ecdat package

# require() loads the package if it is installed and returns FALSE otherwise;
# in that case install it first and then load it
if(!require("dplyr")){
  install.packages("dplyr")
  library(dplyr)
}

if(!require("tidyr")){
  install.packages("tidyr")
  library(tidyr)
}

if(!require("stringr")){
  install.packages("stringr")
  library(stringr)
}

# For downloading the Cigarette Data
if(!require("Ecdat")){
  install.packages("Ecdat")
  library(Ecdat)
}

data(Cigar)
df<-Cigar
dim(df)

[1] 1380    9

# For plotting
if(!require("ggplot2")){
  install.packages("ggplot2")
  library(ggplot2)
}

Description of the Data set

It is a panel of 46 US states observed yearly from 1963 to 1992, which gives the 1380 rows shown above. A brief description of the columns is given below:

state: state abbreviation
year: the year
price: price per pack of cigarettes
pop: population
pop16: population above the age of 16
cpi: consumer price index (1983=100)
ndi: per capita disposable income
sales: cigarette sales in packs per capita
pimin: minimum price in adjoining states per pack of cigarettes

We will mostly look at the sales variable and its spread across state and year.
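Before plotting, it may help to see what a percentile summary looks like in plain numbers. The following is a minimal base R sketch, not the analysis used in this blog: the `toy` data frame and its values are made up for illustration and are not taken from the Cigar data.

```r
# Toy data (hypothetical values, not taken from the Cigar dataset)
toy <- data.frame(
  state = rep(c("A", "B"), each = 5),
  sales = c(100, 105, 110, 115, 120,   # state A: tightly clustered
            80, 100, 120, 140, 160)    # state B: widely spread
)

# 25th, 50th and 75th percentiles of the whole column
overall <- quantile(toy$sales, probs = c(0.25, 0.5, 0.75))
print(overall)

# The same percentiles computed separately for each state
by_state <- tapply(toy$sales, toy$state, quantile,
                   probs = c(0.25, 0.5, 0.75))
print(by_state[["A"]])
print(by_state[["B"]])
```

The same idea applies to the real data by replacing `toy` with `df`, assuming `df` has been loaded as shown earlier.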

Box plot of sales across state

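df$state<-as.factor(df$state)

# Basic box plot: one box per state
# (sales is continuous and varies within a state, so it is not mapped to fill)
p <- ggplot(df, aes(x=state, y=sales)) + 
  geom_boxplot(fatten=10)   # fatten=10 draws a thicker median line

p+scale_color_grey() + theme_classic()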

We can see that the median sales vary considerably across states, fluctuating between roughly 100 and 120 packs per capita. Let us also take a look at how the IQR is distributed.

Visualizing the Interquartile Range (IQR): A Measure of Spread

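# Attach each state's IQR of sales to every row of that state
interim.df<-df%>%
  group_by(state)%>%
  mutate(sales_iqr=IQR(sales))

# Draw only the boxes (25th to 75th percentile): fatten=0 hides the median line,
# outlier.shape=NA hides the outliers and coef=0 removes the whiskers
p <- ggplot(interim.df, aes(x=state, y=sales,fill=sales_iqr))+
  geom_boxplot(fatten=0,outlier.shape = NA, coef = 0)

p2<-p+scale_color_grey() + theme_classic()
p2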

We can see that the shorter boxes have less dispersion. This means the sales values for those states are very close to each other, so some kind of generalisation can be made about them. The taller boxes, however, have more dispersion in the data, so we need to do some outlier correction before we generalise for these states.
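The height of each box is the IQR, that is, the 75th percentile minus the 25th percentile. A small sketch of that relationship, using two made-up vectors (hypothetical values, not from the Cigar data) and a helper named `iqr_manual` that exists only for this illustration:

```r
# Toy vectors (hypothetical values, not from the Cigar dataset)
tight <- c(100, 105, 110, 115, 120)   # values close together
wide  <- c(80, 100, 120, 140, 160)    # values far apart

# IQR written out by hand: 75th percentile minus 25th percentile
iqr_manual <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  q[2] - q[1]
}

print(iqr_manual(tight))  # short box
print(iqr_manual(wide))   # tall box

# The built-in IQR() gives the same numbers
print(IQR(tight))
print(IQR(wide))
```

The vector with values packed closely together gets the smaller IQR, which is what a short box in the plot represents.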

Using IQR to divide the dataset into Groups

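# Label each state by its IQR: below 11 is G1, from 11 up to 23 is G2, 23 and above is G3
finaldf<-interim.df%>%
  mutate(Group_IQR=ifelse(sales_iqr < 11,"G1",
                          ifelse(sales_iqr < 23 ,"G2","G3")))

# Same box-only plot as before, now coloured by group
p <- ggplot(finaldf, aes(x=state, y=sales,color=Group_IQR,fill=Group_IQR))+
  geom_boxplot(fatten=0,outlier.shape = NA, coef = 0)

p2<-p+ theme_classic()
p2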

We can see that the IQR lets us divide the states into clusters based on the variability of their sales data. This is a very important step in understanding the differences among states.
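The nested ifelse() above can also be written with base R's cut(), which may be easier to read once there are more than two thresholds. This is only an alternative sketch, not the method used in this blog. The `sales_iqr` values below are hypothetical and chosen to sit on and around the thresholds 11 and 23.

```r
# Hypothetical IQR values (illustrative only, not from the Cigar dataset)
sales_iqr <- c(5, 11, 22.9, 23, 40)

# Grouping with nested ifelse(), using the same thresholds as the blog
nested <- ifelse(sales_iqr < 11, "G1",
                 ifelse(sales_iqr < 23, "G2", "G3"))

# Same grouping with cut(); right = FALSE makes the intervals
# [-Inf, 11), [11, 23) and [23, Inf), matching the "<" comparisons
binned <- cut(sales_iqr,
              breaks = c(-Inf, 11, 23, Inf),
              labels = c("G1", "G2", "G3"),
              right = FALSE)

print(nested)
print(as.character(binned))
print(identical(nested, as.character(binned)))
```

Note that cut() returns a factor, so the group order G1, G2, G3 is kept when the result is used in a plot legend.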

Final Comments

The main trick here is to measure the variability in the data through percentiles and use it to create natural groups within the data.
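The thresholds 11 and 23 were fixed by hand in this blog. One possible variation, shown here purely as an assumption and not as the approach used above, is to take the cut points from the percentiles of the IQR values themselves, so that each group holds roughly a third of the states. The `state_iqr` values are made up for illustration.

```r
# Hypothetical per-state IQR values (illustrative only)
state_iqr <- c(A = 4, B = 8, C = 12, D = 18, E = 25, F = 40)

# Cut points at the minimum, the 33rd and 67th percentiles, and the maximum
breaks <- quantile(state_iqr, probs = c(0, 1/3, 2/3, 1), names = FALSE)

# include.lowest = TRUE keeps the state with the smallest IQR in G1
groups <- cut(state_iqr,
              breaks = breaks,
              labels = c("G1", "G2", "G3"),
              include.lowest = TRUE)

print(data.frame(state = names(state_iqr),
                 iqr = unname(state_iqr),
                 group = groups))
```

Percentile-based cut points adapt to whatever dataset they are applied to, while fixed thresholds are easier to explain and to compare across datasets.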

Link to Previous R Blogs

https://ml-withparag.com/

List of Datasets for Practice

https://hofmann.public.iastate.edu/data_in_r_sortable.html

https://vincentarelbundock.github.io/Rdatasets/datasets.html