Apply Family of Functions in R
Parag Verma
Introduction
If you have interacted with any intermediate programer in R you would know their reluctance to use loops in their program.Instead they would rely on using the apply functions.It is mostly used to manipulate slices of data from matrix, array, list and data frame.It comes very handy when we are trying to break our code into modules/section. It reduces time to debug the code as well and above all it is internally optimised to run faster
Most frequently used apply functions in dplyr
- apply(mostly for Data Frame and Matrices)
- lapply(with list and vectors)
- sapply(with list and vectors)
- tapply
- tapply
- vapply
- rapply
We will mostly talk about apply, lapply and sapply during the course of this blog. We will look at practical problem and use cases where they can be applied
Installing the library: dplyr and ggplot2
if(!require("dplyr")){
install.packages("dplyr")
}else{
library(dplyr)
}
if(!require("stringr")){
install.packages("stringr")
}else{
library(stringr)
}
Importing the dataset
For this exercise we will look at the Medical expenses dataset for 5574 respondents. Our aim is to first look at the data types of each column and then use the apply functions to derive certain insights from the data set
if(!require("dplyr")){
install.packages("dplyr")
}else{
library(dplyr)
}
if(!require("tidyr")){
install.packages("tidyr")
}else{
library(tidyr)
}
# Ecdat library for importing the dataset
if(!require("Ecdat")){
install.packages("Ecdat")
}else{
library(Ecdat)
}
data(MedExp)
df<-MedExp
head(df)
med lc idp lpi fmde physlim ndisease health linc
1 62.07547 0 yes 6.907755 0 no 13.73189 good 9.528776
2 0.00000 0 yes 6.907755 0 no 13.73189 excellent 9.528776
3 27.76280 0 yes 6.907755 0 no 13.73189 excellent 9.528776
4 290.58220 0 yes 6.907755 0 no 13.73189 good 9.528776
5 0.00000 0 yes 6.109248 0 no 13.73189 good 8.538699
6 2.39521 0 yes 6.109248 0 yes 13.00000 good 8.538699
lfam educdec age sex child black
1 1.386294 12 43.87748 male no no
2 1.386294 12 17.59138 male yes no
3 1.386294 12 15.49966 female yes no
4 1.386294 12 44.14305 female no no
5 1.098612 12 14.54962 female yes no
6 1.098612 12 16.28268 female yes no
The names of the columns are:
colnames(df)
[1] "med" "lc" "idp" "lpi" "fmde" "physlim"
[7] "ndisease" "health" "linc" "lfam" "educdec" "age"
[13] "sex" "child" "black"
Data Type of each columns
Lets use sapply to get the data types of each column(lapply and sapply produce the same output.It is just that lapply produces the output in a list while sapply produces the output as a vector)
data.type<-sapply(df,function(x){return(class(x))})
data.type[1:4]
med lc idp lpi
"numeric" "numeric" "factor" "numeric"
We can see that certain variables such as idp, health etc that are have text data in them. Currently they are represented as factors but should be represented as character.Lets try to change it
df.dtype<-data.frame(Variables=names(data.type),Dtype=as.character(data.type))
df.dtype
Variables Dtype
1 med numeric
2 lc numeric
3 idp factor
4 lpi numeric
5 fmde numeric
6 physlim factor
7 ndisease numeric
8 health factor
9 linc numeric
10 lfam numeric
11 educdec numeric
12 age numeric
13 sex factor
14 child factor
15 black factor
We can see that the attributes such as idp, physlim and health are character vectors and not factors.Therefore, we need to convert them to character data type
Changing factor to character data type
change.columns<-df.dtype%>%
filter(Dtype=="factor")%>%
select(Variables)%>%
pull(Variables)%>%
as.character()
df[change.columns]<-sapply(df[change.columns],as.character)
New.Dtype<-data.frame(Variables=names(lapply(df,class)),
df.dtype%>%select(Dtype)%>%rename(OldDtype=Dtype),
NewDtype=as.character(lapply(df,class)))
New.Dtype
Variables OldDtype NewDtype
1 med numeric numeric
2 lc numeric numeric
3 idp factor character
4 lpi numeric numeric
5 fmde numeric numeric
6 physlim factor character
7 ndisease numeric numeric
8 health factor character
9 linc numeric numeric
10 lfam numeric numeric
11 educdec numeric numeric
12 age numeric numeric
13 sex factor character
14 child factor character
15 black factor character
We can see that the required change has been made. In this example we saw the usage of both lapply and sapply.Now since we know the columns that are numeric and categorical(charatcer), we can create certain summary stats around them
Summary Stats for Numeric column
Here we will use the apply function to calculate certain stats for Numeric columns such as:
- Mean
- Standard Deviation
- Max
- 25th, 50th and 75th Percentile
- Missing Values
columns.numeric<-New.Dtype%>%
filter(NewDtype=="numeric")%>%
select(Variables)%>%
pull(Variables)%>%
as.character()
# Creating the function to compute the Numeric Summary
Stats.numeric.summary<-function(x){
summary.numeric<-c(mean(x,na.rm=T),sd(x,na.rm=T),max(x,na.rm=T),
as.numeric(quantile(x,0.25,na.rm=T)),
as.numeric(quantile(x,0.50,na.rm=T)),
as.numeric(quantile(x,0.75,na.rm=T)),
sum(is.na(x)))
return(summary.numeric)
}
df.numeric.stats<-apply(df%>%
select(columns.numeric)%>%
as.matrix(),
2,
Stats.numeric.summary)
df.numeric.stats[,2:4]
lc lpi fmde
[1,] 2.420739 4.726834 4.065015
[2,] 2.043883 2.681354 3.450558
[3,] 4.564348 7.163699 8.294049
[4,] 0.000000 4.094345 0.000000
[5,] 3.258096 6.109248 6.160541
[6,] 4.564348 6.620073 6.959049
[7,] 0.000000 0.000000 0.000000
Formatting the df.numeric.stats
# Adding a Summary Type column
Summary.Type<-c("Mean","SD","Max","25th Perc","Median","75th Perc","Missing Values")
df.numeric.stats%>%
as.data.frame()%>%
mutate('Summary.Type'=Summary.Type)%>%
select(Summary.Type,everything())%>%
gather(key='Attribute', value='Values', med:age)%>%
spread('Summary.Type',Values)%>%
select(Attribute,'25th Perc','75th Perc','Max')
Attribute 25th Perc 75th Perc Max
1 age 11.340860 37.197810 63.275150
2 educdec 11.000000 13.000000 25.000000
3 fmde 0.000000 6.959049 8.294049
4 lc 0.000000 4.564348 4.564348
5 lfam 1.098612 1.609438 2.564949
6 linc 8.569546 9.250909 10.283240
7 lpi 4.094345 6.620073 7.163699
8 med 3.883493 101.201100 39182.020000
9 ndisease 6.900000 13.731890 58.600000
We have created the Numeric Summary and created a data frame around it.Lets do the same thing for Categorical Variables
Frequency Profiling for Categorical column
required.columns<-New.Dtype%>%
filter(NewDtype=="character")%>%
pull(Variables)
df.interim<-df%>%
select(required.columns)
l1<-lapply(df.interim,function(x){
z<-x%>%
as.data.frame()
colnames(z)<-"Level"
z1<-z%>%
group_by(Level)%>%
summarise(Total=n())
return(z1)
})
df.final<-do.call(rbind.data.frame,l1)
df.final$Feature<-row.names(df.final)
row.names(df.final)<-NULL
Df.Final<-df.final%>%
mutate(Feature2=sapply(Feature, function(x){
str_split(x,"[.]")[[1]][1]
}))%>%
select(-Feature)%>%
rename(Feature=Feature2)
head(Df.Final)
# A tibble: 6 x 3
Level Total Feature
<chr> <int> <chr>
1 0 369 ndisease
2 3.4 561 ndisease
3 4.3 73 ndisease
4 6.9 539 ndisease
5 8.7 92 ndisease
6 9.967326 546 ndisease
Please also check my blog on dplyr for Frequency Profiling on categorical variable https://mlmadeeasy.blogspot.com/2020/01/blog-8-dplyr-in-r.html
K-Means Analysis:Usage of apply function
In a K-Mean analysis, all numeric columns are mean centred. This can be achieved using apply function as we will see in the below section
if(!require("C50")){
install.packages("C50")
}else{
library(C50)
}
data(churn)
df.segment<-churnTrain
head(df.segment[,5:8])
voice_mail_plan number_vmail_messages total_day_minutes total_day_calls
1 yes 25 265.1 110
2 yes 26 161.6 123
3 no 0 243.4 114
4 no 0 299.4 71
5 no 0 166.7 113
6 no 0 223.4 98
colnames(df.segment)
[1] "state" "account_length"
[3] "area_code" "international_plan"
[5] "voice_mail_plan" "number_vmail_messages"
[7] "total_day_minutes" "total_day_calls"
[9] "total_day_charge" "total_eve_minutes"
[11] "total_eve_calls" "total_eve_charge"
[13] "total_night_minutes" "total_night_calls"
[15] "total_night_charge" "total_intl_minutes"
[17] "total_intl_calls" "total_intl_charge"
[19] "number_customer_service_calls" "churn"
The columns from number_vmail_messages to number_customer_service_calls are all numeric. Before doing the segmentation exercise, we will mean center them
df.seg.numeric<-apply(df.segment%>%
select(number_vmail_messages:number_customer_service_calls),
2,
function(x){
y<-x-mean(x,na.rm=T)
z<-y/sd(y,na.rm=T)
return(z)
}
)
head(df.seg.numeric[,5:8])
total_eve_minutes total_eve_calls total_eve_charge
[1,] -0.07059903 -0.05593196 -0.07041609
[2,] -0.10806414 0.14484479 -0.10753331
[3,] -1.57314731 0.49620411 -1.57366351
[4,] -2.74245326 -0.60806803 -2.74285594
[5,] -1.03877646 1.09853437 -1.03778364
[6,] 0.38686974 0.04445642 0.38658968
total_night_minutes
[1,] 0.86661319
[2,] 1.05841193
[3,] -0.75675551
[4,] -0.07853935
[5,] -0.27627001
[6,] 0.05987211
Doing the K-Means Analysis
if(!require("stats")){
install.packages("stats")
}else{
library(stats)
}
model.segment<-kmeans(df.seg.numeric,5)
model.segment$size
[1] 637 696 689 686 625
Final Comments
apply family of functions are very powerful and internally optimised.In this blog we saw how we can do a repetitive task without using for loop. Examples ranging from changing the data type , generating numeric summary and using it in k-means Machine Learning algorithm gives us some flavour of what it is capable of accomplishing.I would strongly recommend folks to use them in place of for loop.
Link to Previous R Blogs
Blog 1-Vectors,Matrics, Lists and Data Frame in R https://mlmadeeasy.blogspot.com/2019/12/2datatypesr.html
Blog 2 - Operators in R https://mlmadeeasy.blogspot.com/2019/12/blog-2-operators-in-r.html
Blog 3 - Loops in R https://mlmadeeasy.blogspot.com/2019/12/blog-3-loops-in-r.html
Blog 4 - Indexing in R https://mlmadeeasy.blogspot.com/2019/12/blog-4-indexing-in-r.html
Blog 5- Handling NA in R https://mlmadeeasy.blogspot.com/2019/12/blog-5-handling-na-in-r.html
Blog 6- tips-to-generate-plots in R https://mlmadeeasy.blogspot.com/2019/12/blog-6tips-to-generate-plots.html
Blog 7- Functions in R https://mlmadeeasy.blogspot.com/2019/12/blog-7-creating-functions-in-r.html
Blog 8- dplyr in R https://mlmadeeasy.blogspot.com/2020/01/blog-8-dplyr-in-r.html
Blog 9- Unpivoting/Pivoting in R https://mlmadeeasy.blogspot.com/2020/01/blog-9-pivoting-and-unpivoting-in-r.html
List of Datasets for Practise
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html
No comments:
Post a Comment