Machine Learning Made Easy

dplyr in R

Introduction to dplyr library

One of the most important reason I use R is the dplyr library. It is used for data manipulation activities like filtering, wrangling, summarisation etc on a dataset.It gives emmense power in the hand of the programer to analyse a data frame and generate various data insights from it. In the course of the blog we will look at various functions within the library and also analyse a use case to make the concept crystal clear.
Most frequently used functions in dplyr

select
filter
mutate
summarise
groupby
arrange
n()

Installing the library: dplyr and ggplot2

if(!require("dplyr")){
  
  install.packages("dplyr")
}else{
  
  library(dplyr)
}


if(!require("stringr")){
  
  install.packages("stringr")
}else{
  
  library(stringr)
}

Introduction to Obamacare:Affordable Care Act

Affordable Care Act(ACA also known as Obamacare,was introduced with the intention to reduce the cost of Healthcare especially Medicare and Medicaid in March 2010.It called for better coordination between various elements of value chain such as Providers, Doctors and Pharmacist.Al records were supposed to be electronically stored for better patient management. The whole emphasis was on delivering quality care and not the number of tests the doctor prescribed. The cost reduction was to be facilitated by the creation of a Marketplace where providers would offer healthcare services and buyer would have the option of selecting a suitable plan. The analyses presented below is based on the voting data on the the Bill

Importing the dataset

For this exercise we will look at the Congressional Votes on a Health Insurance Bill. This bill is related to the Obama Healthcare Act and represents certain additional information related to uninsured percentage for non-senior citizens. The datasets is available in ‘Stat2Data’ package by the names InsuranceVote

if(!require("Stat2Data")){
  
  install.packages("Stat2Data")
}else{
  
  library(Stat2Data)
}

data(InsuranceVote)
df<-InsuranceVote
head(df)

  Party Dist. InsVote Rep Dem Private Public Uninsured Obama
1     R  AL-1       0   1   0   0.663  0.178     0.194     0
2     D  AL-2       0   0   1   0.693  0.204     0.145     0
3     R  AL-3       0   1   0   0.689  0.205     0.151     0
4     R  AL-4       0   1   0   0.638  0.205     0.200     0
5     D  AL-5       0   0   1   0.733  0.142     0.155     0
6     R  AL-6       0   1   0   0.818  0.102     0.106     0

The names of the columns are:

colnames(df)

[1] "Party"     "Dist."     "InsVote"   "Rep"       "Dem"       "Private"  
[7] "Public"    "Uninsured" "Obama"

The description of the variables are:

Party Party affiliation: D=Democrat or R=Republican
Dist. Congressional district (State-Number)
InsVote Vote on the health insurance bill: 1=yes or 0=no
Rep Indicator for Republicans
Dem Indicator for Democrats
Private Percentage of non-senior citizens in district with private health insurance
Public Percentage of non-senior citizens in district with public health insurance
Uninsured Percentage of non-senior citizens in district with no health insurance
Obama District winner in 2008 presidential election: 1=Obama 0=McCain

We will look at the various functions within dplyr using certain key questions asked against the dataset

How many Republicans voted in faviour of the Bill

Here we have to select Party and InsVote columns and then filter the Party column to have only Democrats data. In this example we will also come accross what is known as the %>%(pipe) operator.It is used for passing on the result of an intermediate step to a subsequent step

df.republicans<-df%>%
  select(Party,InsVote)%>%
  filter(Party=="D",InsVote=="1")

dim(df.republicans)[1]

[1] 219

219 Republicans voted for the Bill

This result can be obtained using the group_by and summarise concept as well.We will also look at the Vote Percentage between Republicans and Democrats

df.republicans<-df%>%
  select(Party,InsVote)%>%
  group_by(Party,InsVote)%>%
  summarise(Total_Votes=n())%>%
  mutate(Total_Votes_WithinEach=sum(Total_Votes),
         Perc_Votes=round(100*Total_Votes/Total_Votes_WithinEach,2))%>%
  select(-Total_Votes_WithinEach)

df.republicans

# A tibble: 4 x 4
# Groups:   Party [2]
  Party InsVote Total_Votes Perc_Votes
  <fct>   <int>       <int>      <dbl>
1 D           0          39      15.1 
2 D           1         219      84.9 
3 R           0         176      99.4 
4 R           1           1       0.56

The following things can be seen:

85% Democrats voted in Favuor
Only 1 Republican voted for the Bill

Create a New Column about State and District

Before moving further, let us extract the State information form the data. ‘Dist.’ column cnotains values in the form of AL-1, AL-2 etc. Let us split this into two separate columns

df$Dist.<-as.character(df$Dist.)

df.interim<-df%>%
  mutate(State=sapply(Dist., function(y){str_split(y,"-")[[1]][1]}),
         District=sapply(Dist., function(y){str_split(y,"-")[[1]][2]}))

df.interim%>%
  select(Dist.,Party,State,District)%>%
  head(10)

   Dist. Party State District
1   AL-1     R    AL        1
2   AL-2     D    AL        2
3   AL-3     R    AL        3
4   AL-4     R    AL        4
5   AL-5     D    AL        5
6   AL-6     R    AL        6
7   AL-7     D    AL        7
8   AK-1     R    AK        1
9   AZ-1     D    AZ        1
10  AZ-2     R    AZ        2

Top 2 Districts with the most Uninsured within each State

Usually, in almost all the cases, we are required to find the top 2,3 records within each group. These groups can be identified based on Insured % or Uninsured % and so on. We will look at a very simple method of achieving this

df.interim%>%
  select(State,District,Uninsured)%>%
  group_by(State)%>%
  mutate(Index=1:n())%>%
  arrange(State,desc(Uninsured),Index)%>%
  mutate(Index2=1:n())%>%
  select(-Index)%>%
  filter(Index2 < 3)%>%
  select(-Index2)%>%
  head(10)

# A tibble: 10 x 3
# Groups:   State [6]
   State District Uninsured
   <chr> <chr>        <dbl>
 1 AK    1            0.214
 2 AL    4            0.2  
 3 AL    1            0.194
 4 AR    4            0.222
 5 AR    3            0.218
 6 AZ    4            0.345
 7 AZ    1            0.26 
 8 CA    31           0.424
 9 CA    34           0.38 
10 CO    1            0.25

Average Public, Private and Uninsured % with each State

df.uninsured<-df.interim%>%
  select(State,Public,Private,Uninsured)%>%
  group_by(State)%>%
  summarise(Avg.Uninsured=mean(Uninsured))%>%
  arrange(desc(Avg.Uninsured))%>%
  mutate(Rank=1:n())

head(df.uninsured,10)

# A tibble: 10 x 3
   State Avg.Uninsured  Rank
   <chr>         <dbl> <int>
 1 TX            0.268     1
 2 FL            0.249     2
 3 NM            0.242     3
 4 NV            0.240     4
 5 OK            0.224     5
 6 AK            0.214     6
 7 MT            0.212     7
 8 AZ            0.212     8
 9 GA            0.209     9
10 AR            0.208    10

Affordable Care Act (ACA) mainly aims at targetting the Uninsured. Hence moving forth we will discuss only the Uninsured % Based on the above two tables, the following points can be drwan:

States such as TX, FL, NM etc have the highest Uninsured %.
- The same has been indicates by Rank column
TX as FL are the two most populous states
This means that in terms of absolute numbers,these two states will have more people uninsured

Now let us find out the number of district within each state of ‘df.uninsured’ and also the favuoarble number of votes from each State

df.districts<-df.interim%>%
  select(District,InsVote,Party,State)%>%
  group_by(State)%>%
  summarise(Total_Districts=n())

df.districts%>%
  head()

# A tibble: 6 x 2
  State Total_Districts
  <chr>           <int>
1 AK                  1
2 AL                  7
3 AR                  4
4 AZ                  8
5 CA                 53
6 CO                  7

Total Favourable votes from each State

df.fav.votes<-df.interim%>%
  select(District,InsVote,Party,State)%>%
  filter(InsVote==1)%>%
  group_by(State,InsVote)%>%
  summarise(Total_Votes=n())%>%
  select(-InsVote)

df.fav.votes%>%
  head()

# A tibble: 6 x 2
# Groups:   State [6]
  State Total_Votes
  <chr>       <int>
1 AR              2
2 AZ              5
3 CA             34
4 CO              4
5 CT              5
6 FL              8

Creating the State wise view of Total Districts and Favuorable Votes

 df.districts%>%
  left_join(df.fav.votes,by="State")%>%
  mutate(Prop_Votes=round(Total_Votes/Total_Districts,2))%>%
  arrange(desc(Total_Districts),Total_Votes)%>%
  left_join(df.uninsured%>%
              select(State,Rank),by="State")%>%
  head(10)

# A tibble: 10 x 5
   State Total_Districts Total_Votes Prop_Votes  Rank
   <chr>           <int>       <int>      <dbl> <int>
 1 CA                 53          34       0.64    15
 2 TX                 32          11       0.34     1
 3 NY                 29          24       0.83    35
 4 FL                 25           8       0.32     2
 5 PA                 19          10       0.53    43
 6 IL                 19          12       0.63    28
 7 OH                 18           8       0.44    33
 8 MI                 15           8       0.53    34
 9 GA                 13           4       0.31     9
10 NC                 13           5       0.38    19

The following inferences can be drawn from the above table:

The table has been arranged to give States with highest number of districts at the top
The States with high population such as California(CA),Florida(FL) and Texas(TX) for instance should have favourable votes. But the data shows that not all the district in these States voted in favour.This means that the house was divided on the Health reform.
- Texas for instance was also the State with the highest number of Uninsured People(Indicate by Rank column)
- The states with high number of Uninsured should have favoured the Bill

Final Comments

In this blog, we saw the various functions from the dplyr library using a small exploratory exercise on the Affordable Healthcare Act. The dataset comprised of the t=voting numbers between Republican and Democrats and how they were related to Insured/Uninsured %. Various aspects of group by, sorting and ranking were discussed in the this blog.