Saturday, February 1, 2020

Blog 12: Cumulative Sum in R

Cumulative Sum using R


Introduction

In almost all business problems, the focus is always on the key takeaways from any analysis. Most academic disciplines teach a concept known as the Pareto Principle, which states that 80% of the effect comes from 20% of the causes (also read as cases). From an analysis standpoint, it is always advisable to look at the LOW HANGING FRUITS first. In my view, these low hanging fruits are identified keeping the Pareto Principle in mind.

The purpose of writing this blog is to introduce the concept of cumulative sum in R, which helps you deal with practical scenarios while working with data sets. Some of the problems we face while handling data sets are listed below:

  • Create cumulative sums of Sales/Gross Margin/Net Cash Flow/Cases around key attributes such as Country, Product Category etc.
  • Identify groups/combinations that contribute to 80% of the impact (Sales, Gross Margin etc.)
  • Identify products that can help a Sales Rep achieve the Sales Target
  • Quota allocation exercises where the quota is given based on the relative contribution of each Product or Region

In this blog, we will look at how to proceed for each of the cases mentioned above.
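Before working with an actual data set, here is a minimal sketch of what the cumsum() function does on a plain numeric vector (the sales figures are made up purely for illustration):

# cumsum() returns the running total of a numeric vector
sales <- c(10, 25, 5, 40, 20)
cumsum(sales)
# 10  35  40  80 100

# The same idea gives a running percentage contribution
round(100 * cumsum(sales) / sum(sales), 1)
# 10  35  40  80 100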

Installing the libraries: dplyr and tidyr

if(!require("dplyr")){
  
  install.packages("dplyr")
}else{
  
  library(dplyr)
}

if(!require("tidyr")){
  
  install.packages("tidyr")
}else{
  
  library(tidyr)
}
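As a side note, the same install-then-load logic can be written once as a small loop instead of repeating the if block per package. This is just an optional sketch using the package names from above:

# Install any missing packages, then load them all
pkgs <- c("dplyr", "tidyr")
for (p in pkgs) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
    library(p, character.only = TRUE)
  }
}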


Importing the dataset

For this exercise, we will look at the cybersecurity breaches reported to the US Department of Health and Human Services. There are 1151 records in the data set with 9 variables. Our aim is to calculate the percentage of cases reported in each State at an Entity level. Here we will look at the concept of cumulative sum in R.

# Ecfun and Ecdat libraries for importing the dataset

# Install Ecfun if it is not already available, then load it
if(!require("Ecfun")){
  install.packages("Ecfun", dependencies = TRUE)
  library(Ecfun)
}

# Install Ecdat if it is not already available, then load it
if(!require("Ecdat")){
  install.packages("Ecdat", dependencies = TRUE)
  library(Ecdat)
}

# Load the breaches data and preview the first three columns
data(HHSCyberSecurityBreaches)
df <- HHSCyberSecurityBreaches

head(df[, 1:3]) %>%
  knitr::kable()
|Name.of.Covered.Entity |State |Covered.Entity.Type |
|:---|:---|:---|
|Brooke Army Medical Center |TX |Healthcare Provider |
|Mid America Kidney Stone Association, LLC |MO |Healthcare Provider |
|Alaska Department of Health and Social Services |AK |Healthcare Provider |
|Health Services for Children with Special Needs, Inc. |DC |Health Plan |
|L. Douglas Carlson, M.D. |CA |Healthcare Provider |
|David I. Cohen, MD |CA |Healthcare Provider |

The names of the columns are:

  • Name.of.Covered.Entity: A character vector identifying the organization involved in the breach.
  • State: State abbreviation
  • Covered.Entity.Type: A factor giving the organization type of the covered entity
  • Individuals.Affected: An integer giving the number of humans whose records were compromised in the breach.
  • Breach.Submission.Date
  • Type.of.Breach: A factor giving one of 29 different combinations of 7 different breach types.
  • Location.of.Breached.Information: A factor giving one of 47 different combinations of 8 different location categories: “Desktop Computer”, “Electronic Medical Record”, “Email”, “Laptop”, “Network Server”, “Other”, “Other Portable Electronic Device”, “Paper/Films”
  • Business.Associate.Present: Logical = (Covered.Entity.Type == “Business Associate”)
  • Web.Description: A character vector giving a narrative description of the incident.
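As a quick optional check, we can confirm that the data set has the 1151 records and 9 variables described above:

dim(df)
# 1151    9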


Percentage of Cases within Each State at an Entity Level

df.interim<-df%>%
  group_by(Covered.Entity.Type,State)%>%
  summarise(Total_Cases=sum(Individuals.Affected))%>% # total individuals affected per Entity Type and State
  arrange(Covered.Entity.Type,desc(Total_Cases))%>%   # sort States within each Entity Type
  ungroup()%>%
  group_by(Covered.Entity.Type)%>%                    # regroup by Entity Type only
  mutate(Perc_Contribution=round(100*Total_Cases/sum(Total_Cases),1)) # each State's share of its Entity Type

knitr::kable(df.interim%>%
               head())
|Covered.Entity.Type |State |Total_Cases |Perc_Contribution |
|:---|:---|---:|---:|
|Business Associate |VA |5093217 |22.7 |
|Business Associate |TN |4547412 |20.3 |
|Business Associate |NJ |2964033 |13.2 |
|Business Associate |TX |2582619 |11.5 |
|Business Associate |NY |2033804 |9.1 |
|Business Associate |PR |1069166 |4.8 |
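As an optional sanity check (not part of the original output), the percentage contributions within each Covered.Entity.Type should add up to roughly 100, apart from small rounding differences:

df.interim%>%
  summarise(Total_Perc=sum(Perc_Contribution))%>%
  knitr::kable()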

Here we can see that the records are only related to Business Associate since I took the head of the resultant data frame. Let's now look at the top 5 records within each group.


Top 5 Records from Each Covered Entity Type

Top5<-df.interim%>%
  mutate(Index=1:n())%>% # row number within each Covered Entity Type group
  filter(Index<6)%>%     # keep the first five rows of each group
  select(-Index)

knitr::kable(Top5)
|Covered.Entity.Type |State |Total_Cases |Perc_Contribution |
|:---|:---|---:|---:|
|Business Associate |VA |5093217 |22.7 |
|Business Associate |TN |4547412 |20.3 |
|Business Associate |NJ |2964033 |13.2 |
|Business Associate |TX |2582619 |11.5 |
|Business Associate |NY |2033804 |9.1 |
|Health Plan |FL |1240326 |23.0 |
|Health Plan |TN |1110698 |20.6 |
|Health Plan |MT |1062509 |19.7 |
|Health Plan |TX |568640 |10.6 |
|Health Plan |NY |390830 |7.3 |
|Healthcare Clearing House |FL |10000 |56.3 |
|Healthcare Clearing House |SC |4204 |23.7 |
|Healthcare Clearing House |PA |2300 |13.0 |
|Healthcare Clearing House |WA |1250 |7.0 |
|Healthcare Provider |IL |4335571 |32.5 |
|Healthcare Provider |CA |2098208 |15.7 |
|Healthcare Provider |FL |1583608 |11.9 |
|Healthcare Provider |AL |977995 |7.3 |
|Healthcare Provider |GA |491932 |3.7 |
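For reference, the same top-5 selection can also be written with dplyr helpers instead of a manual index column. This is only a sketch and assumes a reasonably recent dplyr version (slice_head() was introduced in dplyr 1.0.0); both variants rely on df.interim already being arranged by descending Total_Cases within each group:

# Equivalent ways of taking the first five rows of each group
Top5_alt<-df.interim%>%
  slice_head(n=5)

Top5_alt2<-df.interim%>%
  filter(row_number()<=5)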


Let's now look at the States contributing to 80% of the cases within each Entity Type.

States Contributing to 80% of the Cases

I will break this into two steps.

  • First, we will calculate the cumulative percentage contribution within each Entity Type
  • Second, we will keep only the rows where the cumulative percentage is below 80%

Perc_80<-df.interim%>%
  mutate(CumulativePerc_Contribution=cumsum(Perc_Contribution))%>% # First step: running total within each Entity Type
  filter(CumulativePerc_Contribution < 80)                         # Second step: keep rows below the 80% mark

knitr::kable(Perc_80)
|Covered.Entity.Type |State |Total_Cases |Perc_Contribution |CumulativePerc_Contribution |
|:---|:---|---:|---:|---:|
|Business Associate |VA |5093217 |22.7 |22.7 |
|Business Associate |TN |4547412 |20.3 |43.0 |
|Business Associate |NJ |2964033 |13.2 |56.2 |
|Business Associate |TX |2582619 |11.5 |67.7 |
|Business Associate |NY |2033804 |9.1 |76.8 |
|Health Plan |FL |1240326 |23.0 |23.0 |
|Health Plan |TN |1110698 |20.6 |43.6 |
|Health Plan |MT |1062509 |19.7 |63.3 |
|Health Plan |TX |568640 |10.6 |73.9 |
|Healthcare Clearing House |FL |10000 |56.3 |56.3 |
|Healthcare Provider |IL |4335571 |32.5 |32.5 |
|Healthcare Provider |CA |2098208 |15.7 |48.2 |
|Healthcare Provider |FL |1583608 |11.9 |60.1 |
|Healthcare Provider |AL |977995 |7.3 |67.4 |
|Healthcare Provider |GA |491932 |3.7 |71.1 |
|Healthcare Provider |TN |467261 |3.5 |74.6 |
|Healthcare Provider |TX |341041 |2.6 |77.2 |
|Healthcare Provider |NY |334068 |2.5 |79.7 |
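Note that filtering on CumulativePerc_Contribution < 80 drops the State that actually pushes the running total past 80%, so each group above adds up to a little under 80%. If you want to include that boundary State as well, one option (a sketch, not part of the original output) is to filter on the lagged cumulative percentage instead:

Perc_80_inclusive<-df.interim%>%
  mutate(CumulativePerc_Contribution=cumsum(Perc_Contribution))%>%
  filter(lag(CumulativePerc_Contribution,default=0) < 80) # keep rows until the previous row has reached 80%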

Final Comments

We have seen how we can use the group_by and cumsum functions to get an idea of the most important contributors at an overall level. This helps us gain an understanding of the States driving the numbers at an Entity level.
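For readers who prefer base R, a rough equivalent of the group-wise cumulative sum can be written with ave(). This is only a sketch and reuses the df.interim data frame built earlier:

# Group-wise cumulative percentage using base R
df.base<-as.data.frame(df.interim)
df.base$CumulativePerc_Contribution<-ave(df.base$Perc_Contribution,
                                          df.base$Covered.Entity.Type,
                                          FUN=cumsum)
subset(df.base,CumulativePerc_Contribution < 80)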
