Bang Bang using R
Parag Verma
Introduction
If the title of this blog offended you, I sincerely apologise. I am not doing this to increase views; the fact that such a thing exists in R simply makes it more intriguing. Jokes apart, the purpose of this blog is to introduce an amazing hack in R that helps you deal with practical scenarios while working with datasets. Some of the problems we face while handling a dataset are listed below:
- We need to create a variable using mutate. Its name has to be typed every time we analyse that variable, and typing the same name repeatedly is error-prone and unwieldy
- We have to do a group by, but the grouping variables are decided at run time, so nothing can be hard coded
- Each process creates a set of variables (not known beforehand), and we need to add ‘_New’ to their names after the process
In this blog, we will work through these scenarios to gain an understanding of the bang bang operator !! (also called unquote)
Installing the libraries: dplyr, tidyr and stringr
# require() loads the package if it is available; otherwise install it and then load it
if(!require("dplyr")){
install.packages("dplyr")
library(dplyr)
}
if(!require("tidyr")){
install.packages("tidyr")
library(tidyr)
}
if(!require("stringr")){
install.packages("stringr")
library(stringr)
}
Importing the dataset
For this exercise we will look at the cybersecurity breaches reported to the US Department of Health and Human Services. There are 1151 records in the dataset with 9 variables. Our aim is to look at the problems we might run into during analysis and how we can get around them with !!
# Ecdat library for importing the dataset
if(!require("Ecdat")){
install.packages("Ecdat")
library(Ecdat)
}
data(HHSCyberSecurityBreaches)
df<-HHSCyberSecurityBreaches
head(df[,1:3])%>%
knitr::kable()
Name.of.Covered.Entity | State | Covered.Entity.Type |
---|---|---|
Brooke Army Medical Center | TX | Healthcare Provider |
Mid America Kidney Stone Association, LLC | MO | Healthcare Provider |
Alaska Department of Health and Social Services | AK | Healthcare Provider |
Health Services for Children with Special Needs, Inc. | DC | Health Plan |
L. Douglas Carlson, M.D. | CA | Healthcare Provider |
David I. Cohen, MD | CA | Healthcare Provider |
The names of the columns are:
- Name.of.Covered.Entity: A character vector identifying the organization involved in the breach.
- State: State abbreviation
- Covered.Entity.Type: A factor giving the organization type of the covered entity
- Individuals.Affected: An integer giving the number of humans whose records were compromised in the breach.
- Breach.Submission.Date
- Type.of.Breach: A factor giving one of 29 different combinations of 7 different breach types
- Location.of.Breached.Information: A factor giving one of 47 different combinations of 8 different location categories: “Desktop Computer”, “Electronic Medical Record”, “Email”, “Laptop”, “Network Server”, “Other”, “Other Portable Electronic Device”, “Paper/Films”
- Business.Associate.Present: Logical = (Covered.Entity.Type == “Business Associate”)
- Web.Description: A character vector giving a narrative description of the incident.
Create Variable dynamically
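Let's say we have been given a list of new variables that we need to create each time we process the dataset. Today, we have been asked to create a new variable based on ‘Covered.Entity.Type’ that identifies whether the entity is a provider or not
base.variable<-"Covered.Entity.Type"
# build the new column name once, from the base name
new.variable<-paste0(base.variable,"_Provider")
df.interim<- df%>%
# !! unquotes the string so that its value becomes the name of the new column
mutate(!!new.variable := str_detect(.data[[base.variable]],"Provider"))%>%
# all_of() selects columns whose names are stored in a character vector
select(all_of(c(base.variable,new.variable)))%>%
head()
knitr::kable(df.interim)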
Covered.Entity.Type | Covered.Entity.Type_Provider |
---|---|
Healthcare Provider | TRUE |
Healthcare Provider | TRUE |
Healthcare Provider | TRUE |
Health Plan | FALSE |
Healthcare Provider | TRUE |
Healthcare Provider | TRUE |
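Because the column name now lives in a single string, the same step can be wrapped in a function and reused for other columns. The sketch below is one way to do that on a small made-up data frame `toy`. The helper name `add_flag` and its arguments are hypothetical and not part of dplyr, and base R's `grepl` stands in for `str_detect` so that the example only needs dplyr.

```r
library(dplyr)

# Hypothetical helper: adds a logical column named "<column>_<pattern>"
# that is TRUE where the values of `column` contain `pattern`
add_flag <- function(data, column, pattern) {
  new.variable <- paste0(column, "_", pattern)
  data %>%
    mutate(!!new.variable := grepl(pattern, .data[[column]], fixed = TRUE))
}

# Made-up data, for illustration only
toy <- data.frame(
  Entity.Type = c("Healthcare Provider", "Health Plan", "Business Associate"),
  stringsAsFactors = FALSE
)

toy.flagged <- add_flag(toy, "Entity.Type", "Provider")
names(toy.flagged)
toy.flagged$Entity.Type_Provider
```

The column name and the pattern are each typed once, in the call to `add_flag`, which is the point of unquoting with !!.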
Create Variable dynamically: Another Variation
base.variable<-"Covered.Entity.Type"
new.variable<-paste0(base.variable,"_Provider")
df%>%
mutate(!!new.variable := str_detect(.data[[base.variable]],"Provider"))%>%
select(all_of(c(base.variable,new.variable)))%>%
# drop the original column, then give its name to the new column
select(-all_of(base.variable))%>%
rename(!!base.variable := !!new.variable)%>%
head()%>%
knitr::kable()
Covered.Entity.Type |
---|
TRUE |
TRUE |
TRUE |
FALSE |
TRUE |
TRUE |
We see that the Covered.Entity.Type column with its original values has been removed and replaced by the newly created variable
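If the goal is only to overwrite the original column, the separate drop and rename steps can probably be skipped by unquoting the original name on the left-hand side of `:=`. This is a minimal sketch on made-up data: the data frame `toy` is an assumption, and base R's `grepl` stands in for `str_detect` so that the example only needs dplyr.

```r
library(dplyr)

# Made-up data, for illustration only
toy <- data.frame(
  Covered.Entity.Type = c("Healthcare Provider", "Health Plan"),
  stringsAsFactors = FALSE
)
base.variable <- "Covered.Entity.Type"

# The name on the left of := already exists, so mutate() overwrites that column
toy.replaced <- toy %>%
  mutate(!!base.variable := grepl("Provider", .data[[base.variable]], fixed = TRUE))

toy.replaced
```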
Group by Using Multiple variables
Let's say we need to create a summary based on some group by criteria, and there is more than one grouping variable. Let's see how we can do that using !!!
group.attributes<-c("Covered.Entity.Type","Type.of.Breach","Location.of.Breached.Information")
df%>%
group_by(!!!syms(group.attributes))%>%
summarise(Record.Count=n())%>%
ungroup()%>%
head()%>%
knitr::kable()
Covered.Entity.Type | Type.of.Breach | Location.of.Breached.Information | Record.Count |
---|---|---|---|
Business Associate | Hacking/IT Incident | Desktop Computer | 1 |
Business Associate | Hacking/IT Incident | Network Server | 21 |
Business Associate | Hacking/IT Incident | Other | 2 |
Business Associate | Hacking/IT Incident, Other | Network Server | 1 |
Business Associate | Hacking/IT Incident, Theft, Unauthorized Access/Disclosure | Desktop Computer, Network Server | 1 |
Business Associate | Hacking/IT Incident, Unauthorized Access/Disclosure | 1 | |
The trick here is the use of the syms function and !!! (called “unquote-splice”, and pronounced bang-bang-bang). The syms function converts the character vector into a list of symbols, and !!! splices that list into group_by so that each symbol is passed as a separate grouping variable
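To make the two steps visible, the sketch below first prints what `syms` returns and then groups by the spliced symbols. It also applies the same splice to the third scenario from the introduction, adding a ‘_New’ suffix to a set of columns chosen at run time. The data frame `toy` and the vector `process.variables` are made up for illustration, and splicing a named character vector into `rename` is one possible approach rather than the only one.

```r
library(dplyr)

# Made-up data, for illustration only
toy <- data.frame(
  type  = c("A", "A", "B"),
  place = c("x", "x", "y"),
  value = c(10, 20, 30),
  stringsAsFactors = FALSE
)

group.attributes <- c("type", "place")

# syms() turns each string into a symbol; the result is a list of symbols
group.symbols <- syms(group.attributes)
print(group.symbols)

# !!! splices the list, so this call behaves like group_by(type, place)
toy.summary <- toy %>%
  group_by(!!!group.symbols) %>%
  summarise(Record.Count = n(), .groups = "drop")

# Renaming a run-time set of columns: build a named character vector of the
# form c(new_name = "old_name") and splice it into rename()
process.variables <- c("type", "value")
rename.map <- setNames(process.variables, paste0(process.variables, "_New"))
toy.renamed <- toy %>%
  rename(!!!rename.map)
names(toy.renamed)
```

Only `group.attributes` and `process.variables` need to change when the columns change; the pipeline itself stays the same.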
Final Comments
We have seen how we can use !! and !!! to set up variables at the start of the program and write code chunks that are more dynamic and more robust to human error
Link to Previous R Blogs
Blog 1 - Vectors, Matrices, Lists and Data Frame in R https://mlmadeeasy.blogspot.com/2019/12/2datatypesr.html
Blog 2 - Operators in R https://mlmadeeasy.blogspot.com/2019/12/blog-2-operators-in-r.html
Blog 3 - Loops in R https://mlmadeeasy.blogspot.com/2019/12/blog-3-loops-in-r.html
Blog 4 - Indexing in R https://mlmadeeasy.blogspot.com/2019/12/blog-4-indexing-in-r.html
Blog 5 - Handling NA in R https://mlmadeeasy.blogspot.com/2019/12/blog-5-handling-na-in-r.html
Blog 6 - Tips to Generate Plots in R https://mlmadeeasy.blogspot.com/2019/12/blog-6tips-to-generate-plots.html
Blog 7 - Functions in R https://mlmadeeasy.blogspot.com/2019/12/blog-7-creating-functions-in-r.html
Blog 8 - dplyr in R https://mlmadeeasy.blogspot.com/2020/01/blog-8-dplyr-in-r.html
Blog 9 - Unpivoting/Pivoting in R https://mlmadeeasy.blogspot.com/2020/01/blog-9-pivoting-and-unpivoting-in-r.html
Blog 10 - Apply Family of Functions in R https://mlmadeeasy.blogspot.com/2020/01/blog-10-apply-functions-in-r.html
List of Datasets for Practise
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html