Sunday, January 19, 2020

Blog 11: Bang Bang Operator in R

Bang Bang using R


Introduction

If you read the title of this blog and were offended, I sincerely apologise. I am not doing this to increase views; the simple fact that such a thing exists in R makes it all the more intriguing. Jokes apart, the purpose of this blog is to introduce a very useful technique in R that helps with practical scenarios when working with datasets. Some of the problems we face while handling datasets are listed below:


  • We need to create a variable using mutate. Its name has to be typed every time we analyse that variable, and retyping it leads to errors and unwieldy code
  • We have to do a group by, but the grouping variables are decided at run time, so nothing can be hard-coded
  • Each process creates a set of variables (not known beforehand), and we need to add ’_New’ to their names after the process

In this blog, we will look at the cases mentioned above to gain an understanding of the bang bang operator !! (also called unquote)

Installing the libraries: dplyr, tidyr and stringr

# require() loads the package when it is already installed.
# When it is missing, install it and then load it.
if(!require("dplyr")){
  install.packages("dplyr")
  library(dplyr)
}

if(!require("tidyr")){
  install.packages("tidyr")
  library(tidyr)
}

if(!require("stringr")){
  install.packages("stringr")
  library(stringr)
}


Importing the dataset

For this exercise we will look at the cybersecurity breaches reported to the US Department of Health and Human Services. There are 1151 records in the dataset with 9 variables. Our aim is to first look at the problems we might run into while doing analyses, and then at how we can circumvent them with !!

# Ecdat library for importing the dataset

if(!require("Ecdat")){
  # Install only when the package is missing, then load it
  install.packages("Ecdat")
  library(Ecdat)
}

data(HHSCyberSecurityBreaches)
df<-HHSCyberSecurityBreaches
head(df[,1:3])%>%
knitr::kable()

| Name.of.Covered.Entity | State | Covered.Entity.Type |
|:---|:---|:---|
| Brooke Army Medical Center | TX | Healthcare Provider |
| Mid America Kidney Stone Association, LLC | MO | Healthcare Provider |
| Alaska Department of Health and Social Services | AK | Healthcare Provider |
| Health Services for Children with Special Needs, Inc. | DC | Health Plan |
| L. Douglas Carlson, M.D. | CA | Healthcare Provider |
| David I. Cohen, MD | CA | Healthcare Provider |

The names of the columns are:
  • Name.of.Covered.Entity: A character vector identifying the organization involved in the breach.
  • State: State abbreviation
  • Covered.Entity.Type: A factor giving the organization type of the covered entity
  • Individuals.Affected: An integer giving the number of humans whose records were compromised in the breach.
  • Breach.Submission.Date
  • Type.of.Breach: A factor giving one of 29 different combinations of 7 different breach types
  • Location.of.Breached.Information: A factor giving one of 47 different combinations of 8 different location categories: “Desktop Computer”, “Electronic Medical Record”, “Email”, “Laptop”, “Network Server”, “Other”, “Other Portable Electronic Device”, “Paper/Films”
  • Business.Associate.Present: Logical = (Covered.Entity.Type == “Business Associate”)
  • Web.Description: A character vector giving a narrative description of the incident.


Create Variable dynamically

Let's say we have been given a list of new variables that we need to create each time we process the dataset. Today, we have been asked to create a new variable based on ‘Covered.Entity.Type’ that identifies whether the entity is a provider or not.

base.variable<-"Covered.Entity.Type"

df.interim<-  df%>%
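  # !! unquotes the computed name; .data[[ ]] looks the column up by its string name
  mutate(!!paste0(base.variable,"_Provider") := str_detect(.data[[base.variable]],"Provider"))%>%
  # all_of() makes it explicit that base.variable holds a column name
  select(all_of(base.variable),!!paste0(base.variable,"_Provider"))%>%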
  head()

knitr::kable(df.interim)

| Covered.Entity.Type | Covered.Entity.Type_Provider |
|:---|:---|
| Healthcare Provider | TRUE |
| Healthcare Provider | TRUE |
| Healthcare Provider | TRUE |
| Health Plan | FALSE |
| Healthcare Provider | TRUE |
| Healthcare Provider | TRUE |
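
As a minimal sketch of why this works: the name on the left-hand side of := is computed first, and !! substitutes it into the call before mutate runs. The toy data frame and the helper name add_flag below are hypothetical and only serve as an illustration; they are not part of the breaches dataset.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data, standing in for the breaches dataset
toy <- data.frame(Entity.Type = c("Healthcare Provider", "Health Plan"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: flags the rows of `column` that contain `pattern`
add_flag <- function(data, column, pattern) {
  new.name <- paste0(column, "_", pattern)
  data %>%
    # !! inserts the value of new.name as the new column's name;
    # := is needed because the left-hand side is not a literal name
    mutate(!!new.name := str_detect(.data[[column]], pattern))
}

result <- add_flag(toy, "Entity.Type", "Provider")
names(result)
```

Wrapping the pattern in a function means the base variable and the suffix are typed once, at the call site, instead of being repeated throughout the analysis.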


Create Variable dynamically: Another Variation

base.variable<-"Covered.Entity.Type"

df%>%
  mutate(!!paste0(base.variable,"_Provider") := str_detect(.data[[base.variable]],"Provider"))%>%
  select(all_of(base.variable),!!paste0(base.variable,"_Provider"))%>%
  # Drop the original column, then give its name to the new variable
  select(-all_of(base.variable))%>%
  rename(!!base.variable := !!paste0(base.variable,"_Provider"))%>%
  head()%>%
  knitr::kable()

| Covered.Entity.Type |
|:---|
| TRUE |
| TRUE |
| TRUE |
| FALSE |
| TRUE |
| TRUE |

We see that the original Covered.Entity.Type column has been removed and replaced by the newly created variable, which now carries its name.
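
The same idea extends to the third scenario from the introduction, where a process creates variables that are not known beforehand and ’_New’ has to be added to their names. The sketch below is one possible way to do it, not the only one; the toy data frame and the names new.vars and lookup are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical toy data: x and y stand in for variables created by a process
toy <- data.frame(id = 1:2, x = c(10, 20), y = c(5, 6))

# Names that are only known at run time (hard-coded here for illustration)
new.vars <- c("x", "y")

# Named vector of the form c(new_name = "old_name")
lookup <- setNames(new.vars, paste0(new.vars, "_New"))

# !!! splices each element of lookup in as a separate new = old argument
renamed <- toy %>%
  rename(!!!lookup)

names(renamed)
```

Because the old and new names live in one vector, the renaming step does not change when the process starts producing a different set of variables.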

Group by Using Multiple variables

Let's say we need to create a summary based on some group by criteria, with more than one grouping variable. Let's see how we can do that using !!!

group.attributes<-c("Covered.Entity.Type","Type.of.Breach","Location.of.Breached.Information")

df%>%
  group_by(!!!syms(group.attributes))%>%
  summarise(Record.Count=n())%>%
  ungroup()%>%
  head()%>%
  knitr::kable()

| Covered.Entity.Type | Type.of.Breach | Location.of.Breached.Information | Record.Count |
|:---|:---|:---|---:|
| Business Associate | Hacking/IT Incident | Desktop Computer | 1 |
| Business Associate | Hacking/IT Incident | Network Server | 21 |
| Business Associate | Hacking/IT Incident | Other | 2 |
| Business Associate | Hacking/IT Incident, Other | Network Server | 1 |
| Business Associate | Hacking/IT Incident, Theft, Unauthorized Access/Disclosure | Desktop Computer, Network Server | 1 |
| Business Associate | Hacking/IT Incident, Unauthorized Access/Disclosure | Email | 1 |

The trick here is the use of the syms function together with !!! (called “unquote-splice”, and pronounced bang-bang-bang). The syms function converts the strings into symbols, and !!! splices those symbols into group_by as separate arguments, which is what makes the dynamic grouping possible.
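
To make the two steps visible, here is a small sketch on a hypothetical toy data frame (the column names g1 and g2 are made up for illustration). It compares the dynamic call with the hard-coded one it is meant to replace.

```r
library(dplyr)

# Hypothetical toy data with two grouping columns
toy <- data.frame(g1 = c("a", "a", "b"),
                  g2 = c("x", "x", "y"),
                  stringsAsFactors = FALSE)

group.vars <- c("g1", "g2")

# syms() turns each string into a symbol and returns them as a list
syms(group.vars)

# !!! splices that list in, so this call reads as group_by(g1, g2)
dynamic <- toy %>%
  group_by(!!!syms(group.vars)) %>%
  summarise(Record.Count = n()) %>%
  ungroup()

# The hard-coded version, for comparison
fixed <- toy %>%
  group_by(g1, g2) %>%
  summarise(Record.Count = n()) %>%
  ungroup()

all.equal(as.data.frame(dynamic), as.data.frame(fixed))
```

Without syms, group_by would receive plain strings and group by constant values rather than by the columns those strings name.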


Final Comments

We have seen how we can use !! and !!! to set up variables at the start of the program and write code chunks that are more dynamic and more robust to human error.
