Apply Family of Functions in R
Parag Verma
Introduction
If you have interacted with any intermediate programer in R you would know their reluctance to use loops in their program.Instead they would rely on using the apply functions.It is mostly used to manipulate slices of data from matrix, array, list and data frame.It comes very handy when we are trying to break our code into modules/section. It reduces time to debug the code as well and above all it is internally optimised to run faster
 Most frequently used apply functions in dplyr
- apply(mostly for Data Frame and Matrices)
- lapply(with list and vectors)
- sapply(with list and vectors)
- tapply
- tapply
- vapply
- rapply
We will mostly talk about apply, lapply and sapply during the course of this blog. We will look at practical problem and use cases where they can be applied 
Installing the library: dplyr and ggplot2
if(!require("dplyr")){
  
  install.packages("dplyr")
}else{
  
  library(dplyr)
}
if(!require("stringr")){
  
  install.packages("stringr")
}else{
  
  library(stringr)
}Importing the dataset
For this exercise we will look at the Medical expenses dataset for 5574 respondents. Our aim is to first look at the data types of each column and then use the apply functions to derive certain insights from the data set
if(!require("dplyr")){
  
  install.packages("dplyr")
}else{
  
  library(dplyr)
}
if(!require("tidyr")){
  
  install.packages("tidyr")
}else{
  
  library(tidyr)
}
# Ecdat library for importing the dataset
if(!require("Ecdat")){
  
  install.packages("Ecdat")
}else{
  
  library(Ecdat)
}
data(MedExp)
df<-MedExp
head(df)        med lc idp      lpi fmde physlim ndisease    health     linc
1  62.07547  0 yes 6.907755    0      no 13.73189      good 9.528776
2   0.00000  0 yes 6.907755    0      no 13.73189 excellent 9.528776
3  27.76280  0 yes 6.907755    0      no 13.73189 excellent 9.528776
4 290.58220  0 yes 6.907755    0      no 13.73189      good 9.528776
5   0.00000  0 yes 6.109248    0      no 13.73189      good 8.538699
6   2.39521  0 yes 6.109248    0     yes 13.00000      good 8.538699
      lfam educdec      age    sex child black
1 1.386294      12 43.87748   male    no    no
2 1.386294      12 17.59138   male   yes    no
3 1.386294      12 15.49966 female   yes    no
4 1.386294      12 44.14305 female    no    no
5 1.098612      12 14.54962 female   yes    no
6 1.098612      12 16.28268 female   yes    noThe names of the columns are:
colnames(df) [1] "med"      "lc"       "idp"      "lpi"      "fmde"     "physlim" 
 [7] "ndisease" "health"   "linc"     "lfam"     "educdec"  "age"     
[13] "sex"      "child"    "black"   Data Type of each columns
Lets use sapply to get the data types of each column(lapply and sapply produce the same output.It is just that lapply produces the output in a list while sapply produces the output as a vector)
data.type<-sapply(df,function(x){return(class(x))})
data.type[1:4]      med        lc       idp       lpi 
"numeric" "numeric"  "factor" "numeric" We can see that certain variables such as idp, health etc that are have text data in them. Currently they are represented as factors but should be represented as character.Lets try to change it
df.dtype<-data.frame(Variables=names(data.type),Dtype=as.character(data.type))
df.dtype   Variables   Dtype
1        med numeric
2         lc numeric
3        idp  factor
4        lpi numeric
5       fmde numeric
6    physlim  factor
7   ndisease numeric
8     health  factor
9       linc numeric
10      lfam numeric
11   educdec numeric
12       age numeric
13       sex  factor
14     child  factor
15     black  factorWe can see that the attributes such as idp, physlim and health are character vectors and not factors.Therefore, we need to convert them to character data type
Changing factor to character data type
change.columns<-df.dtype%>%
  filter(Dtype=="factor")%>%
  select(Variables)%>%
  pull(Variables)%>%
  as.character()
  
df[change.columns]<-sapply(df[change.columns],as.character)
New.Dtype<-data.frame(Variables=names(lapply(df,class)),
           df.dtype%>%select(Dtype)%>%rename(OldDtype=Dtype),
           NewDtype=as.character(lapply(df,class)))
New.Dtype   Variables OldDtype  NewDtype
1        med  numeric   numeric
2         lc  numeric   numeric
3        idp   factor character
4        lpi  numeric   numeric
5       fmde  numeric   numeric
6    physlim   factor character
7   ndisease  numeric   numeric
8     health   factor character
9       linc  numeric   numeric
10      lfam  numeric   numeric
11   educdec  numeric   numeric
12       age  numeric   numeric
13       sex   factor character
14     child   factor character
15     black   factor characterWe can see that the required change has been made. In this example we saw the usage of both lapply and sapply.Now since we know the columns that are numeric and categorical(charatcer), we can create certain summary stats around them
Summary Stats for Numeric column
Here we will use the apply function to calculate certain stats for Numeric columns such as:
- Mean
- Standard Deviation
- Max
- 25th, 50th and 75th Percentile
- Missing Values
columns.numeric<-New.Dtype%>%
  filter(NewDtype=="numeric")%>%
  select(Variables)%>%
  pull(Variables)%>%
  as.character()
# Creating the function to compute the Numeric Summary
Stats.numeric.summary<-function(x){
  
  summary.numeric<-c(mean(x,na.rm=T),sd(x,na.rm=T),max(x,na.rm=T),
         as.numeric(quantile(x,0.25,na.rm=T)),
         as.numeric(quantile(x,0.50,na.rm=T)),
         as.numeric(quantile(x,0.75,na.rm=T)),
        
          sum(is.na(x)))
  
  
  return(summary.numeric)
  
}
df.numeric.stats<-apply(df%>%
        select(columns.numeric)%>%
        as.matrix(),
        2,
        Stats.numeric.summary)
df.numeric.stats[,2:4]           lc      lpi     fmde
[1,] 2.420739 4.726834 4.065015
[2,] 2.043883 2.681354 3.450558
[3,] 4.564348 7.163699 8.294049
[4,] 0.000000 4.094345 0.000000
[5,] 3.258096 6.109248 6.160541
[6,] 4.564348 6.620073 6.959049
[7,] 0.000000 0.000000 0.000000Formatting the df.numeric.stats
# Adding a Summary Type column
Summary.Type<-c("Mean","SD","Max","25th Perc","Median","75th Perc","Missing Values")
df.numeric.stats%>%
  as.data.frame()%>%
  mutate('Summary.Type'=Summary.Type)%>%
  select(Summary.Type,everything())%>%
  gather(key='Attribute', value='Values', med:age)%>%
  spread('Summary.Type',Values)%>%
  select(Attribute,'25th Perc','75th Perc','Max')  Attribute 25th Perc  75th Perc          Max
1       age 11.340860  37.197810    63.275150
2   educdec 11.000000  13.000000    25.000000
3      fmde  0.000000   6.959049     8.294049
4        lc  0.000000   4.564348     4.564348
5      lfam  1.098612   1.609438     2.564949
6      linc  8.569546   9.250909    10.283240
7       lpi  4.094345   6.620073     7.163699
8       med  3.883493 101.201100 39182.020000
9  ndisease  6.900000  13.731890    58.600000We have created the Numeric Summary and created a data frame around it.Lets do the same thing for Categorical Variables
Frequency Profiling for Categorical column
required.columns<-New.Dtype%>%
  filter(NewDtype=="character")%>%
  pull(Variables)
df.interim<-df%>%
  select(required.columns)
l1<-lapply(df.interim,function(x){
  z<-x%>%
    as.data.frame()
  
  colnames(z)<-"Level"
  
  z1<-z%>%
    group_by(Level)%>%
    summarise(Total=n())
  
  return(z1)
})
df.final<-do.call(rbind.data.frame,l1)
df.final$Feature<-row.names(df.final)
row.names(df.final)<-NULL
Df.Final<-df.final%>%
  mutate(Feature2=sapply(Feature, function(x){
    
    str_split(x,"[.]")[[1]][1]
    
  }))%>%
  select(-Feature)%>%
  rename(Feature=Feature2)
head(Df.Final)# A tibble: 6 x 3
  Level    Total Feature 
  <chr>    <int> <chr>   
1 0          369 ndisease
2 3.4        561 ndisease
3 4.3         73 ndisease
4 6.9        539 ndisease
5 8.7         92 ndisease
6 9.967326   546 ndiseasePlease also check my blog on dplyr for Frequency Profiling on categorical variable https://mlmadeeasy.blogspot.com/2020/01/blog-8-dplyr-in-r.html
K-Means Analysis:Usage of apply function
In a K-Mean analysis, all numeric columns are mean centred. This can be achieved using apply function as we will see in the below section
if(!require("C50")){
  
  install.packages("C50")
}else{
  
  library(C50)
}
data(churn)
df.segment<-churnTrain
head(df.segment[,5:8])  voice_mail_plan number_vmail_messages total_day_minutes total_day_calls
1             yes                    25             265.1             110
2             yes                    26             161.6             123
3              no                     0             243.4             114
4              no                     0             299.4              71
5              no                     0             166.7             113
6              no                     0             223.4              98colnames(df.segment) [1] "state"                         "account_length"               
 [3] "area_code"                     "international_plan"           
 [5] "voice_mail_plan"               "number_vmail_messages"        
 [7] "total_day_minutes"             "total_day_calls"              
 [9] "total_day_charge"              "total_eve_minutes"            
[11] "total_eve_calls"               "total_eve_charge"             
[13] "total_night_minutes"           "total_night_calls"            
[15] "total_night_charge"            "total_intl_minutes"           
[17] "total_intl_calls"              "total_intl_charge"            
[19] "number_customer_service_calls" "churn"                        The columns from number_vmail_messages to number_customer_service_calls are all numeric. Before doing the segmentation exercise, we will mean center them
df.seg.numeric<-apply(df.segment%>%
  select(number_vmail_messages:number_customer_service_calls),
  2,
  function(x){
    
    y<-x-mean(x,na.rm=T)
    z<-y/sd(y,na.rm=T)
    
    return(z)
    
    }
  
  )
head(df.seg.numeric[,5:8])     total_eve_minutes total_eve_calls total_eve_charge
[1,]       -0.07059903     -0.05593196      -0.07041609
[2,]       -0.10806414      0.14484479      -0.10753331
[3,]       -1.57314731      0.49620411      -1.57366351
[4,]       -2.74245326     -0.60806803      -2.74285594
[5,]       -1.03877646      1.09853437      -1.03778364
[6,]        0.38686974      0.04445642       0.38658968
     total_night_minutes
[1,]          0.86661319
[2,]          1.05841193
[3,]         -0.75675551
[4,]         -0.07853935
[5,]         -0.27627001
[6,]          0.05987211Doing the K-Means Analysis
if(!require("stats")){
  
  install.packages("stats")
}else{
  
  library(stats)
}
model.segment<-kmeans(df.seg.numeric,5)
model.segment$size[1] 637 696 689 686 625Final Comments
apply family of functions are very powerful and internally optimised.In this blog we saw how we can do a repetitive task without using for loop. Examples ranging from changing the data type , generating numeric summary and using it in k-means Machine Learning algorithm gives us some flavour of what it is capable of accomplishing.I would strongly recommend folks to use them in place of for loop.
Link to Previous R Blogs
Blog 1-Vectors,Matrics, Lists and Data Frame in R https://mlmadeeasy.blogspot.com/2019/12/2datatypesr.html
Blog 2 - Operators in R https://mlmadeeasy.blogspot.com/2019/12/blog-2-operators-in-r.html
Blog 3 - Loops in R https://mlmadeeasy.blogspot.com/2019/12/blog-3-loops-in-r.html
Blog 4 - Indexing in R https://mlmadeeasy.blogspot.com/2019/12/blog-4-indexing-in-r.html
Blog 5- Handling NA in R https://mlmadeeasy.blogspot.com/2019/12/blog-5-handling-na-in-r.html
Blog 6- tips-to-generate-plots in R https://mlmadeeasy.blogspot.com/2019/12/blog-6tips-to-generate-plots.html
Blog 7- Functions in R https://mlmadeeasy.blogspot.com/2019/12/blog-7-creating-functions-in-r.html
Blog 8- dplyr in R https://mlmadeeasy.blogspot.com/2020/01/blog-8-dplyr-in-r.html
Blog 9- Unpivoting/Pivoting in R https://mlmadeeasy.blogspot.com/2020/01/blog-9-pivoting-and-unpivoting-in-r.html
List of Datasets for Practise
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html
 
No comments:
Post a Comment