Loops in R
Parag Verma
20th Dec, 2019
Loops are used in almost all programming languages. They let us run through each element of, say, a vector or a list, or through each column of a data frame. In this blog we will look only at the ‘for’ loop, as it suffices for most problems. We will look at a few scenarios where the for loop is used in R.
For Loop: General Syntax
for (val in sequence) {
statement
}
Print the square of numbers in a vector
# Creating a vector
vector1<-c(1:4)
for(j in vector1) {
print(j*j)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
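When the results need to be kept rather than printed, a common pattern is to loop over positions with seq_along() and write into a preallocated vector. A minimal sketch of that idea follows; the name `squares` is introduced here for illustration only.

```r
# Same input as in the example above
vector1 <- c(1:4)

# Preallocate the result vector (the name `squares` is illustrative)
squares <- numeric(length(vector1))

# seq_along() yields the positions 1, 2, ..., length(vector1)
for (j in seq_along(vector1)) {
  squares[j] <- vector1[j]^2
}

print(squares)
```

Looping over positions instead of values makes it possible to store each result at the matching index.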
Loop used with a Data Frame
Most tasks in R involve processing a Data Frame. Given below are a few instances where a for loop is used with a Data Frame. Let's create a Data Frame mimicking a scenario at a manufacturing plant.
dep.data <- data.frame(
X.Dept_name = c("Production","Finance","HR","Quality Control","Marketting","Sales"),
X.Head_count=c(100,20,5,10,40,70),
X.Avg_salary = c(623.3,515.2,611.0,729.0,843.25,790.50) ,
X.Incentive_given=c("Yes","No","No","Yes","Yes","Yes"),
stringsAsFactors = F
)
dep.data
## X.Dept_name X.Head_count X.Avg_salary X.Incentive_given
## 1 Production 100 623.30 Yes
## 2 Finance 20 515.20 No
## 3 HR 5 611.00 No
## 4 Quality Control 10 729.00 Yes
## 5 Marketting 40 843.25 Yes
## 6 Sales 70 790.50 Yes
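Calculating Basic Stats for each column
# Collect one summary value per column
l1<-list()
for(i in colnames(dep.data)){
  if(class(dep.data[,i]) %in% c("character","logical")){
    # Categorical column: keep the most frequent value
    l1[[i]]<-names(sort(table(dep.data[[i]]),decreasing=TRUE)[1])
  }else if(class(dep.data[,i]) %in% c("numeric","integer")){
    # Numeric column: keep the mean, ignoring missing values
    l1[[i]]<-mean(dep.data[,i],na.rm=TRUE)
  }
  # Columns of any other class are skipped
}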
# Print the list
l1
## $X.Dept_name
## [1] "Finance"
##
## $X.Head_count
## [1] 40.83333
##
## $X.Avg_salary
## [1] 685.375
##
## $X.Incentive_given
## [1] "Yes"
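One caveat with testing class() inside if(): some objects carry more than one class (an ordered factor, for example), and recent versions of R raise an error when if() receives a condition of length greater than one. A hedged alternative is to test with is.numeric() and treat everything else as categorical. The sketch below uses a small made-up data frame `toy` and a result list `summary.list`; both names and all values are assumptions for illustration, not part of the example above.

```r
# Small stand-in data frame (values are illustrative, not from the post)
toy <- data.frame(
  grade = factor(c("low", "high", "low"), levels = c("low", "high"), ordered = TRUE),
  score = c(10, 20, 60)
)

# An ordered factor has two classes: "ordered" and "factor"
print(class(toy$grade))

summary.list <- list()
for (i in colnames(toy)) {
  column.values <- toy[[i]]
  if (is.numeric(column.values)) {
    # Numeric column: mean, ignoring missing values
    summary.list[[i]] <- mean(column.values, na.rm = TRUE)
  } else {
    # Anything else is treated as categorical: most frequent value
    summary.list[[i]] <- names(sort(table(column.values), decreasing = TRUE)[1])
  }
}

print(summary.list)
```

is.numeric() returns a single TRUE or FALSE regardless of how many classes the column has, so the condition is always of length one.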
Using for loop to correct the column names in the Data Frame
We can see that there is an ‘X.’ prefix at the beginning of each column name. A prefix like this can appear when data is imported from a csv and a column name is not a valid R name, for example when it consists only of digits. For instance, if one of the columns in the csv file is ‘2012’, it is read into R as X2012 by default. In the section below we will look at how to deal with this scenario. We will use a regular expression for handling this.
# Install stringr if it is missing, then load it
if(!require("stringr")){
  install.packages("stringr")
  library(stringr)
}
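A quick way to see how R rewrites such column names on import is to read a tiny csv supplied inline. This is a minimal sketch; the csv text and the variable names `csv.text`, `default.names` and `raw.names` are made up for illustration.

```r
# A tiny CSV supplied inline; its header consists of digits only
csv.text <- "2012,2013\n10,20\n30,40"

# Default behaviour: names are passed through make.names(), which prepends "X"
default.names <- colnames(read.csv(text = csv.text))

# check.names = FALSE keeps the header exactly as written in the file
raw.names <- colnames(read.csv(text = csv.text, check.names = FALSE))

print(default.names)
print(raw.names)
```

Reading with check.names = FALSE avoids the prefix altogether, at the cost of column names that need backticks when used with `$`.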
# Lets output the column names
colnames(dep.data)
## [1] "X.Dept_name" "X.Head_count" "X.Avg_salary"
## [4] "X.Incentive_given"
We will use the gsub function to replace the ‘X.’ prefix with an empty string, which removes it from the column names. We will store the result in the new.names vector and assign it back to colnames(dep.data).
# Remove the leading "X." ("^" anchors to the start, "\\." matches a literal dot)
new.names<-gsub(pattern="^X\\.",replacement ="",x=colnames(dep.data))
colnames(dep.data)<-new.names
head(dep.data)
## Dept_name Head_count Avg_salary Incentive_given
## 1 Production 100 623.30 Yes
## 2 Finance 20 515.20 No
## 3 HR 5 611.00 No
## 4 Quality Control 10 729.00 Yes
## 5 Marketting 40 843.25 Yes
## 6 Sales 70 790.50 Yes
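gsub is vectorised, so the renaming needs no explicit loop. In keeping with the theme of this blog, the same result can also be sketched with a for loop that fixes one name at a time. The names in `old.names` (including the made-up "X.MAX_salary"), `new.names.loop` and `unanchored` are assumptions for illustration. The last line shows what can go wrong with an unanchored pattern in which "." matches any character.

```r
# Stand-in column names; "X.MAX_salary" is a made-up name added for illustration
old.names <- c("X.Dept_name", "X.Head_count", "X.MAX_salary")

new.names.loop <- character(length(old.names))
for (k in seq_along(old.names)) {
  # "^" anchors the match to the start; "\\." matches a literal dot
  new.names.loop[k] <- sub("^X\\.", "", old.names[k])
}
print(new.names.loop)

# For comparison: an unanchored pattern also removes the "X_" inside the name
unanchored <- gsub("X.", "", "X.MAX_salary")
print(unanchored)
```

Anchoring the pattern and escaping the dot keeps the replacement limited to the prefix itself.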
Updating names in the list
In a similar way, we will update the names in the list l1 so that they match the corrected column names.
names(l1)<-new.names
names(l1)
## [1] "Dept_name" "Head_count" "Avg_salary" "Incentive_given"
Frequency Profiling
Before proceeding with any analysis, we need to calculate certain statistics that describe the central tendency of each variable/column. If the variable is categorical in nature, the frequency count is widely reported. We will take a dataset from the Ecdat package and create a frequency profile summary out of it.
# Install dplyr if it is missing, then load it
if(!require("dplyr")){
  install.packages("dplyr")
  library(dplyr)
}
# Ecdat library for importing the dataset
if(!require("Ecdat")){
  install.packages("Ecdat")
  library(Ecdat)
}
data(MedExp)
df<-MedExp
head(df)
## med lc idp lpi fmde physlim ndisease health linc
## 1 62.07547 0 yes 6.907755 0 no 13.73189 good 9.528776
## 2 0.00000 0 yes 6.907755 0 no 13.73189 excellent 9.528776
## 3 27.76280 0 yes 6.907755 0 no 13.73189 excellent 9.528776
## 4 290.58220 0 yes 6.907755 0 no 13.73189 good 9.528776
## 5 0.00000 0 yes 6.109248 0 no 13.73189 good 8.538699
## 6 2.39521 0 yes 6.109248 0 yes 13.00000 good 8.538699
## lfam educdec age sex child black
## 1 1.386294 12 43.87748 male no no
## 2 1.386294 12 17.59138 male yes no
## 3 1.386294 12 15.49966 female yes no
## 4 1.386294 12 44.14305 female no no
## 5 1.098612 12 14.54962 female yes no
## 6 1.098612 12 16.28268 female yes no
Let's list the column names to see which columns are categorical
colnames(df)
## [1] "med" "lc" "idp" "lpi" "fmde" "physlim"
## [7] "ndisease" "health" "linc" "lfam" "educdec" "age"
## [13] "sex" "child" "black"
The following columns are categorical in nature
- idp
- physlim
- health
- sex
- child
- black
Let's use the above columns to create a frequency profile
required.columns<-c("idp","physlim","health",
"sex","child","black")
ls.interim<-list()
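for(i in required.columns){
  temp.df<-df%>%
    mutate(Feature_Name=i)%>%        # tag every row with the column being profiled
    group_by(df[,i],Feature_Name)%>% # group by the levels of that column
    summarise(Number_Records=n())    # count the rows in each level
  # Give every piece the same column names so the pieces can be stacked
  colnames(temp.df)<-c("Level","Feature_Name","Count_of_Records")
  ls.interim[[i]]<-temp.df
}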
freq.profile<-do.call(rbind.data.frame,ls.interim)
row.names(freq.profile)<-NULL
head(freq.profile)
## # A tibble: 6 x 3
## Level Feature_Name Count_of_Records
## <fct> <chr> <int>
## 1 no idp 4115
## 2 yes idp 1459
## 3 no physlim 4657
## 4 yes physlim 917
## 5 excellent health 3017
## 6 good health 2034
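The same loop-and-stack idea can be sketched in base R with table(), without dplyr. The data frame `toy.df` and its values are made up for illustration and are not taken from MedExp; `pieces` and `toy.profile` are likewise hypothetical names.

```r
# Small illustrative data frame (values are made up, not taken from MedExp)
toy.df <- data.frame(
  sex = c("male", "female", "female", "male", "female"),
  child = c("no", "yes", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

pieces <- list()
for (i in colnames(toy.df)) {
  # Count how often each level occurs in column i
  counts <- table(toy.df[[i]])
  pieces[[i]] <- data.frame(
    Level = names(counts),
    Feature_Name = i,
    Count_of_Records = as.vector(counts),
    stringsAsFactors = FALSE
  )
}

# Stack the per-column pieces into one frequency profile
toy.profile <- do.call(rbind, pieces)
row.names(toy.profile) <- NULL
print(toy.profile)
```

Each pass through the loop produces a small data frame with identical column names, which is what allows the pieces to be combined with rbind at the end.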
Final Comments
In this blog we have seen how a for loop can be used in real-world scenarios. A for loop can be used to change column names in a data frame or list, and to iterate through the column values in a data frame. We have also looked at how a regular expression can be used to correct names in a data frame and a list by replacement. There will be a detailed blog on regular expressions in the weeks to come.
Link to Previous R Blogs
Blog 1 - Vectors, Matrices, Lists and Data Frame in R https://mlmadeeasy.blogspot.com/2019/12/2datatypesr.html
Blog 2 - Operators in R https://mlmadeeasy.blogspot.com/2019/12/blog-2-operators-in-r.html
What's the difference between these two lines of code for renaming columns?
colnames(iris)[which(colnames(iris) == "Species.Name")]
colnames(iris[which(colnames(iris) == "Species.Name")])
1. colnames(iris)[which(colnames(iris) == "Species.Name")]
which(colnames(iris) == "Species.Name") gives the position of the "Species.Name" column. Let's say it is 3.
colnames(iris)[3] gives the name of the column at the 3rd position, which is "Species.Name".
2. colnames(iris[which(colnames(iris) == "Species.Name")])
which(colnames(iris) == "Species.Name") gives the position of the "Species.Name" column. Let's say it is 3.
iris[3] selects the column at index position 3, which is "Species.Name". The result of this expression is a data frame with "Species.Name" as its only column.
So colnames(iris[which(colnames(iris) == "Species.Name")]) also gives "Species.Name".
Also, neither of the two lines renames a column as written.
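A minimal sketch of the comparison, assuming a copy of iris whose fifth column has been renamed to "Species.Name" purely for illustration (the built-in iris has no such column). It shows both forms returning the same name, and how assigning through the first form renames the column. The names `iris.copy`, `pos`, `name.1` and `name.2` are hypothetical.

```r
# Work on a copy so the built-in iris data set is left untouched
iris.copy <- iris
colnames(iris.copy)[5] <- "Species.Name"  # hypothetical name used in the question

pos <- which(colnames(iris.copy) == "Species.Name")

# Form 1: index the vector of column names
name.1 <- colnames(iris.copy)[pos]

# Form 2: subset the data frame first, then ask for its column names
name.2 <- colnames(iris.copy[pos])

print(c(name.1, name.2))

# Assigning through form 1 renames the column
colnames(iris.copy)[pos] <- "Species"
print(colnames(iris.copy))
```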