Tuesday, March 31, 2020

Blog 18: Working with Percentiles

Calculating Percentiles


Introduction

In this blog we will look at how to calculate percentile summaries for a cigarette consumption dataset. Consumption data is easy to understand, which is why it has been taken as the reference example for this blog. We will mostly look at the following:

  • Calculate percentiles
  • Summaries in Group
  • Visual Representation
  • Make sense of Numbers


Installing and loading the packages: dplyr, tidyr, stringr, Ecdat and ggplot2

# require() loads the package when it is installed and returns FALSE otherwise,
# so install only when it is missing and then load it
if(!require("dplyr")){
  install.packages("dplyr")
  library(dplyr)
}

if(!require("tidyr")){
  install.packages("tidyr")
  library(tidyr)
}

if(!require("stringr")){
  install.packages("stringr")
  library(stringr)
}

# For downloading the Cigarette Data
if(!require("Ecdat")){
  install.packages("Ecdat")
  library(Ecdat)
}

data(Cigar)
df<-Cigar
dim(df)
[1] 1380    9
# For plotting
if(!require("ggplot2")){
  install.packages("ggplot2")
  library(ggplot2)
}


Description of the Data set

It is a panel of 46 US states observed every year from 1963 to 1992, which gives 46 x 30 = 1380 rows. A brief description of the columns is given below:

  • state: state abbreviation
  • year: year of the observation
  • price: price per pack of cigarettes
  • pop: population
  • pop16: population above the age of 16
  • cpi: consumer price index (1983=100)
  • ndi: per capita disposable income
  • sales: cigarette sales in packs per capita
  • pimin: minimum price in adjoining states per pack of cigarettes

We will mostly look at the sales variable and its spread across states and years.
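Before plotting, it can help to see the percentile numbers themselves. The following is a minimal sketch, assuming a small made-up data frame `toy` (hypothetical, not the Cigar data) that has the same `state` and `sales` columns; the same `group_by()` and `summarise()` call should carry over to `df`.

```r
library(dplyr)

# Hypothetical toy data standing in for the Cigar panel (values are made up)
toy <- data.frame(
  state = rep(c("A", "B"), each = 5),
  sales = c(100, 110, 120, 130, 140, 90, 95, 100, 105, 200)
)

# One row per state with the 25th, 50th and 75th percentiles of sales
pct_summary <- toy %>%
  group_by(state) %>%
  summarise(
    p25 = quantile(sales, 0.25),
    p50 = quantile(sales, 0.50),
    p75 = quantile(sales, 0.75)
  )

pct_summary
```

The 50th percentile is the median, and the gap between `p75` and `p25` is the interquartile range used later in this blog.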

Box plot of sales across state

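# Treat state as a categorical variable so that each state gets its own box
df$state<-as.factor(df$state)

# Basic box plot of sales per state; fatten=10 thickens the median line.
# fill=sales is not mapped because sales varies within each box
p <- ggplot(df, aes(x=state, y=sales)) + 
  geom_boxplot(fatten=10)


p + theme_classic()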

We can see that the median sales vary considerably across states, with most medians lying roughly between 100 and 120 packs per capita. Let us also take a look at the distribution of the IQR.
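A reading taken from a plot can be checked numerically. Here is a small sketch in base R, again on made-up values (the `toy` data frame is hypothetical); replacing `toy` with `df` would give the per-state medians for the real data.

```r
# Hypothetical toy data: three states with three sales values each
toy <- data.frame(
  state = rep(c("A", "B", "C"), each = 3),
  sales = c(100, 110, 120, 90, 100, 130, 115, 120, 125)
)

# Median sales per state, returned as a named vector
state_median <- tapply(toy$sales, toy$state, median)

# Smallest and largest of the state medians
range(state_median)
```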


Visualizing the Interquartile Range (IQR): Measure of Spread

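# Attach each state's IQR of sales to every row of that state
interim.df<-df%>%
  group_by(state)%>%
  mutate(sales_iqr=IQR(sales))

# Draw only the IQR box: fatten=0 hides the median line, outliers are not
# drawn and coef=0 gives whiskers of zero length
p <- ggplot(interim.df, aes(x=state, y=sales,fill=sales_iqr))+
  geom_boxplot(fatten=0,outlier.shape = NA, coef = 0)

p2<-p + theme_classic()
p2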


We can see that the states with short boxes have less dispersion. This means that their sales values are very close to each other, so some kind of generalisation can be made. The states with taller boxes have more dispersion in the data, so we need to do some outlier correction before we make generalisations for these states.
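The IQR is the 75th percentile minus the 25th percentile. The blog does not say which outlier correction it has in mind, so the sketch below assumes one common convention, Tukey's 1.5 x IQR rule, and uses made-up sales values for a single state.

```r
# Hypothetical sales values for one state (made up for illustration)
sales <- c(90, 95, 100, 105, 200)

# First and third quartiles, and the IQR as their difference
q <- quantile(sales, probs = c(0.25, 0.75))
iqr_value <- unname(q[2] - q[1])

# Assumed rule: flag values more than 1.5 * IQR outside the quartiles
lower <- unname(q[1]) - 1.5 * iqr_value
upper <- unname(q[2]) + 1.5 * iqr_value
is_outlier <- sales < lower | sales > upper

iqr_value          # same value as IQR(sales)
sales[is_outlier]  # values that fall outside the fences
```

Whether flagged values are dropped, capped or kept is a separate decision that depends on the analysis.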

Using IQR to divide the dataset into Groups

# G1: IQR below 11, G2: IQR from 11 up to (not including) 23, G3: IQR of 23 or more
finaldf<-interim.df%>%
  mutate(Group_IQR=ifelse(sales_iqr < 11,"G1",
                          ifelse(sales_iqr < 23 ,"G2","G3")))


p <- ggplot(finaldf, aes(x=state, y=sales,color=Group_IQR,fill=Group_IQR))+
  geom_boxplot(fatten=0,outlier.shape = NA, coef = 0)

p2<-p+ theme_classic()
p2


We can see that, using the IQR, we can divide the states into groups based on the variability in their sales data. This is a very important step in understanding the differences amongst states.
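The cut points 11 and 23 above are fixed by hand. As an alternative sketch (an assumption, not the method used in this blog), the cut points could themselves be taken from percentiles of the per-state IQR values, for example the terciles. The `state_iqr` vector below is made up.

```r
# Hypothetical per-state IQR values (made up; one value per state)
state_iqr <- c(A = 5, B = 8, C = 12, D = 18, E = 25, F = 40)

# Tercile cut points of the IQR distribution
breaks <- quantile(state_iqr, probs = c(0, 1/3, 2/3, 1))

# Assign each state to G1 (low spread), G2 or G3 (high spread)
group_iqr <- cut(state_iqr, breaks = breaks,
                 labels = c("G1", "G2", "G3"), include.lowest = TRUE)

table(group_iqr)
```

With tercile breaks each group receives about the same number of states, whereas fixed thresholds let the group sizes follow the data.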

Final Comments

The main trick here is to measure the variability in the data through percentiles and use it to create natural groups within the data.

