Machine Learning Made Easy: Blog 10: apply functions in R

Apply Family of Functions in R

Introduction

If you have interacted with any intermediate programer in R you would know their reluctance to use loops in their program.Instead they would rely on using the apply functions.It is mostly used to manipulate slices of data from matrix, array, list and data frame.It comes very handy when we are trying to break our code into modules/section. It reduces time to debug the code as well and above all it is internally optimised to run faster

Most frequently used apply functions in dplyr

apply(mostly for Data Frame and Matrices)
lapply(with list and vectors)
sapply(with list and vectors)
tapply
tapply
vapply
rapply

We will mostly talk about apply, lapply and sapply during the course of this blog. We will look at practical problem and use cases where they can be applied

Installing the library: dplyr and ggplot2

if(!require("dplyr")){
  
  install.packages("dplyr")
}else{
  
  library(dplyr)
}


if(!require("stringr")){
  
  install.packages("stringr")
}else{
  
  library(stringr)
}

Importing the dataset

For this exercise we will look at the Medical expenses dataset for 5574 respondents. Our aim is to first look at the data types of each column and then use the apply functions to derive certain insights from the data set

if(!require("dplyr")){
  
  install.packages("dplyr")
}else{
  
  library(dplyr)
}

if(!require("tidyr")){
  
  install.packages("tidyr")
}else{
  
  library(tidyr)
}

# Ecdat library for importing the dataset

if(!require("Ecdat")){
  
  install.packages("Ecdat")
}else{
  
  library(Ecdat)
}

data(MedExp)
df<-MedExp
head(df)

        med lc idp      lpi fmde physlim ndisease    health     linc
1  62.07547  0 yes 6.907755    0      no 13.73189      good 9.528776
2   0.00000  0 yes 6.907755    0      no 13.73189 excellent 9.528776
3  27.76280  0 yes 6.907755    0      no 13.73189 excellent 9.528776
4 290.58220  0 yes 6.907755    0      no 13.73189      good 9.528776
5   0.00000  0 yes 6.109248    0      no 13.73189      good 8.538699
6   2.39521  0 yes 6.109248    0     yes 13.00000      good 8.538699
      lfam educdec      age    sex child black
1 1.386294      12 43.87748   male    no    no
2 1.386294      12 17.59138   male   yes    no
3 1.386294      12 15.49966 female   yes    no
4 1.386294      12 44.14305 female    no    no
5 1.098612      12 14.54962 female   yes    no
6 1.098612      12 16.28268 female   yes    no

The names of the columns are:

colnames(df)

 [1] "med"      "lc"       "idp"      "lpi"      "fmde"     "physlim" 
 [7] "ndisease" "health"   "linc"     "lfam"     "educdec"  "age"     
[13] "sex"      "child"    "black"

Data Type of each columns

Lets use sapply to get the data types of each column(lapply and sapply produce the same output.It is just that lapply produces the output in a list while sapply produces the output as a vector)

data.type<-sapply(df,function(x){return(class(x))})
data.type[1:4]

      med        lc       idp       lpi 
"numeric" "numeric"  "factor" "numeric"

We can see that certain variables such as idp, health etc that are have text data in them. Currently they are represented as factors but should be represented as character.Lets try to change it

df.dtype<-data.frame(Variables=names(data.type),Dtype=as.character(data.type))
df.dtype

   Variables   Dtype
1        med numeric
2         lc numeric
3        idp  factor
4        lpi numeric
5       fmde numeric
6    physlim  factor
7   ndisease numeric
8     health  factor
9       linc numeric
10      lfam numeric
11   educdec numeric
12       age numeric
13       sex  factor
14     child  factor
15     black  factor

We can see that the attributes such as idp, physlim and health are character vectors and not factors.Therefore, we need to convert them to character data type

Changing factor to character data type

change.columns<-df.dtype%>%
  filter(Dtype=="factor")%>%
  select(Variables)%>%
  pull(Variables)%>%
  as.character()
  

df[change.columns]<-sapply(df[change.columns],as.character)
New.Dtype<-data.frame(Variables=names(lapply(df,class)),
           df.dtype%>%select(Dtype)%>%rename(OldDtype=Dtype),
           NewDtype=as.character(lapply(df,class)))

New.Dtype

   Variables OldDtype  NewDtype
1        med  numeric   numeric
2         lc  numeric   numeric
3        idp   factor character
4        lpi  numeric   numeric
5       fmde  numeric   numeric
6    physlim   factor character
7   ndisease  numeric   numeric
8     health   factor character
9       linc  numeric   numeric
10      lfam  numeric   numeric
11   educdec  numeric   numeric
12       age  numeric   numeric
13       sex   factor character
14     child   factor character
15     black   factor character

We can see that the required change has been made. In this example we saw the usage of both lapply and sapply.Now since we know the columns that are numeric and categorical(charatcer), we can create certain summary stats around them

Summary Stats for Numeric column

Here we will use the apply function to calculate certain stats for Numeric columns such as:

Mean
Standard Deviation
Max
25th, 50th and 75th Percentile
Missing Values

columns.numeric<-New.Dtype%>%
  filter(NewDtype=="numeric")%>%
  select(Variables)%>%
  pull(Variables)%>%
  as.character()

# Creating the function to compute the Numeric Summary
Stats.numeric.summary<-function(x){
  
  summary.numeric<-c(mean(x,na.rm=T),sd(x,na.rm=T),max(x,na.rm=T),
         as.numeric(quantile(x,0.25,na.rm=T)),
         as.numeric(quantile(x,0.50,na.rm=T)),
         as.numeric(quantile(x,0.75,na.rm=T)),
        
          sum(is.na(x)))
  
  
  return(summary.numeric)
  
}

df.numeric.stats<-apply(df%>%
        select(columns.numeric)%>%
        as.matrix(),
        2,
        Stats.numeric.summary)

df.numeric.stats[,2:4]

           lc      lpi     fmde
[1,] 2.420739 4.726834 4.065015
[2,] 2.043883 2.681354 3.450558
[3,] 4.564348 7.163699 8.294049
[4,] 0.000000 4.094345 0.000000
[5,] 3.258096 6.109248 6.160541
[6,] 4.564348 6.620073 6.959049
[7,] 0.000000 0.000000 0.000000

Formatting the df.numeric.stats

# Adding a Summary Type column

Summary.Type<-c("Mean","SD","Max","25th Perc","Median","75th Perc","Missing Values")


df.numeric.stats%>%
  as.data.frame()%>%
  mutate('Summary.Type'=Summary.Type)%>%
  select(Summary.Type,everything())%>%
  gather(key='Attribute', value='Values', med:age)%>%
  spread('Summary.Type',Values)%>%
  select(Attribute,'25th Perc','75th Perc','Max')

  Attribute 25th Perc  75th Perc          Max
1       age 11.340860  37.197810    63.275150
2   educdec 11.000000  13.000000    25.000000
3      fmde  0.000000   6.959049     8.294049
4        lc  0.000000   4.564348     4.564348
5      lfam  1.098612   1.609438     2.564949
6      linc  8.569546   9.250909    10.283240
7       lpi  4.094345   6.620073     7.163699
8       med  3.883493 101.201100 39182.020000
9  ndisease  6.900000  13.731890    58.600000

We have created the Numeric Summary and created a data frame around it.Lets do the same thing for Categorical Variables

Frequency Profiling for Categorical column

required.columns<-New.Dtype%>%
  filter(NewDtype=="character")%>%
  pull(Variables)

df.interim<-df%>%
  select(required.columns)

l1<-lapply(df.interim,function(x){

  z<-x%>%
    as.data.frame()
  
  colnames(z)<-"Level"
  
  z1<-z%>%
    group_by(Level)%>%
    summarise(Total=n())
  
  return(z1)
})

df.final<-do.call(rbind.data.frame,l1)
df.final$Feature<-row.names(df.final)
row.names(df.final)<-NULL


Df.Final<-df.final%>%
  mutate(Feature2=sapply(Feature, function(x){
    
    str_split(x,"[.]")[[1]][1]
    
  }))%>%
  select(-Feature)%>%
  rename(Feature=Feature2)

head(Df.Final)

# A tibble: 6 x 3
  Level    Total Feature 
  <chr>    <int> <chr>   
1 0          369 ndisease
2 3.4        561 ndisease
3 4.3         73 ndisease
4 6.9        539 ndisease
5 8.7         92 ndisease
6 9.967326   546 ndisease

Please also check my blog on dplyr for Frequency Profiling on categorical variable https://mlmadeeasy.blogspot.com/2020/01/blog-8-dplyr-in-r.html

K-Means Analysis:Usage of apply function

In a K-Mean analysis, all numeric columns are mean centred. This can be achieved using apply function as we will see in the below section

if(!require("C50")){
  
  install.packages("C50")
}else{
  
  library(C50)
}

data(churn)
df.segment<-churnTrain
head(df.segment[,5:8])

  voice_mail_plan number_vmail_messages total_day_minutes total_day_calls
1             yes                    25             265.1             110
2             yes                    26             161.6             123
3              no                     0             243.4             114
4              no                     0             299.4              71
5              no                     0             166.7             113
6              no                     0             223.4              98

colnames(df.segment)

 [1] "state"                         "account_length"               
 [3] "area_code"                     "international_plan"           
 [5] "voice_mail_plan"               "number_vmail_messages"        
 [7] "total_day_minutes"             "total_day_calls"              
 [9] "total_day_charge"              "total_eve_minutes"            
[11] "total_eve_calls"               "total_eve_charge"             
[13] "total_night_minutes"           "total_night_calls"            
[15] "total_night_charge"            "total_intl_minutes"           
[17] "total_intl_calls"              "total_intl_charge"            
[19] "number_customer_service_calls" "churn"

The columns from number_vmail_messages to number_customer_service_calls are all numeric. Before doing the segmentation exercise, we will mean center them

df.seg.numeric<-apply(df.segment%>%
  select(number_vmail_messages:number_customer_service_calls),
  2,
  function(x){
    
    y<-x-mean(x,na.rm=T)
    z<-y/sd(y,na.rm=T)
    
    return(z)
    
    }
  
  )

head(df.seg.numeric[,5:8])

     total_eve_minutes total_eve_calls total_eve_charge
[1,]       -0.07059903     -0.05593196      -0.07041609
[2,]       -0.10806414      0.14484479      -0.10753331
[3,]       -1.57314731      0.49620411      -1.57366351
[4,]       -2.74245326     -0.60806803      -2.74285594
[5,]       -1.03877646      1.09853437      -1.03778364
[6,]        0.38686974      0.04445642       0.38658968
     total_night_minutes
[1,]          0.86661319
[2,]          1.05841193
[3,]         -0.75675551
[4,]         -0.07853935
[5,]         -0.27627001
[6,]          0.05987211

Doing the K-Means Analysis

if(!require("stats")){
  
  install.packages("stats")
}else{
  
  library(stats)
}


model.segment<-kmeans(df.seg.numeric,5)
model.segment$size

[1] 637 696 689 686 625

Final Comments

apply family of functions are very powerful and internally optimised.In this blog we saw how we can do a repetitive task without using for loop. Examples ranging from changing the data type , generating numeric summary and using it in k-means Machine Learning algorithm gives us some flavour of what it is capable of accomplishing.I would strongly recommend folks to use them in place of for loop.