Friday, December 20, 2019

Blog 3- Loops in R


Loops are used in almost all programming languages. They let us run through each element of, say, a vector or a list, or through each column of a data frame. In this blog we will look at just the ‘for’ loop, as it suffices for most problems. We will look at a few scenarios where a for loop is used in R.

For Loop: General Syntax

for (val in sequence) {

statement
}
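
As a minimal illustration of this syntax, the loop below visits each element of a numeric vector and keeps a running total. The vector head.counts is a made-up example and is not part of the data used later in this blog.

```r
# Hypothetical vector used only to illustrate the syntax
head.counts <- c(100, 20, 5)

total <- 0
for (val in head.counts) {
  # val takes the value of each element in turn
  total <- total + val
}

total
## [1] 125
```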

Loop used with a Data Frame

Most tasks in R involve processing a data frame. Given below are a few instances where a for loop is used with a data frame. Let's create a data frame mimicking a scenario at a manufacturing plant.

dep.data <- data.frame(
 
  X.Dept_name = c("Production","Finance","HR","Quality Control","Marketting","Sales"),
  X.Head_count=c(100,20,5,10,40,70),
  X.Avg_salary = c(623.3,515.2,611.0,729.0,843.25,790.50) ,
  X.Incentive_given=c("Yes","No","No","Yes","Yes","Yes"),
  stringsAsFactors = FALSE
                      )

dep.data
##       X.Dept_name X.Head_count X.Avg_salary X.Incentive_given
## 1      Production          100       623.30               Yes
## 2         Finance           20       515.20                No
## 3              HR            5       611.00                No
## 4 Quality Control           10       729.00               Yes
## 5      Marketting           40       843.25               Yes
## 6           Sales           70       790.50               Yes


Calculating Basic Stats for Each Column

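l1<-list()
for(i in colnames(dep.data)){
  
  if(class(dep.data[,i]) %in% c("character","logical")){
    
    # Categorical column: store the most frequent value
    l1[[i]]<-names(sort(table(dep.data[[i]]),decreasing=TRUE)[1])
    
  }else if(class(dep.data[,i]) %in% c("numeric","integer")){
    
    # Numeric column: store the mean, ignoring missing values
    l1[[i]]<-mean(dep.data[,i],na.rm=TRUE)
    
  }
  # Columns of any other type are skipped
  
}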

# Print the list
l1
## $X.Dept_name
## [1] "Finance"
## 
## $X.Head_count
## [1] 40.83333
## 
## $X.Avg_salary
## [1] 685.375
## 
## $X.Incentive_given
## [1] "Yes"
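
The loop above checks column types with class(). For some column types class() returns more than one value (a date-time column has class c("POSIXct", "POSIXt")), so the if condition ends up longer than one element, which recent versions of R reject. Below is a sketch of the same loop using is.character(), is.logical() and is.numeric() instead. The data frame demo.df and its Updated column are hypothetical and exist only to show a column that gets skipped.

```r
# Hypothetical data frame with a date-time column added
demo.df <- data.frame(
  Head_count = c(100, 20, 5),
  Dept_name = c("HR", "HR", "Sales"),
  Updated = as.POSIXct(c("2019-12-01", "2019-12-02", "2019-12-03"), tz = "UTC"),
  stringsAsFactors = FALSE
)

stats <- list()
for (i in colnames(demo.df)) {
  column <- demo.df[[i]]
  if (is.character(column) || is.logical(column)) {
    # Categorical column: most frequent value
    stats[[i]] <- names(sort(table(column), decreasing = TRUE))[1]
  } else if (is.numeric(column)) {
    # Numeric column: mean, ignoring missing values
    stats[[i]] <- mean(column, na.rm = TRUE)
  }
  # Any other type, such as the date-time column, is skipped
}

stats
```

Each is.*() test always returns a single TRUE or FALSE, so the condition stays valid whatever type the column has.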

Using for loop to correct the column names in the Data Frame

We can see that there is an ‘X.’ prefix at the beginning of each column name. This can happen when the data is imported from a csv file and a column name is not a valid R name, for example because it starts with a digit. For instance, if one of the columns in the csv file is named ‘2012’, R reads it as ‘X2012’. In the section below we will look at how to deal with this scenario. We will use a regular expression to handle it.
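
To see where such a prefix comes from, here is a small sketch that reads a csv held in a string. The headers "2012" and "2013" are made up for illustration. By default read.csv passes the headers through make.names(), which prepends an "X" to names that start with a digit. Setting check.names = FALSE keeps the headers as written.

```r
# A tiny csv held in a string; the headers are hypothetical
csv.text <- "2012,2013\n10,20\n30,40"

# Default behaviour: headers are converted to valid R names
default.df <- read.csv(text = csv.text)
colnames(default.df)
## [1] "X2012" "X2013"

# check.names = FALSE keeps the headers exactly as written
raw.df <- read.csv(text = csv.text, check.names = FALSE)
colnames(raw.df)
## [1] "2012" "2013"
```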

# Load stringr, installing it first if it is missing
if(!require("stringr")){
  
  install.packages("stringr")
  library(stringr)
}

# Let's output the column names
colnames(dep.data)
## [1] "X.Dept_name"       "X.Head_count"      "X.Avg_salary"     
## [4] "X.Incentive_given"

We will use the base R gsub function to replace the ‘X.’ prefix with an empty string. In this way we remove the prefix from the column names. We will store the result in the new.names vector and assign it back to colnames(dep.data).

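# Remove the literal "X." prefix: "^" anchors the match to the start of the name
# and "\\." matches a real dot rather than any character
new.names<-gsub(pattern="^X\\.",replacement ="",x=colnames(dep.data))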
colnames(dep.data)<-new.names
head(dep.data)
##         Dept_name Head_count Avg_salary Incentive_given
## 1      Production        100     623.30             Yes
## 2         Finance         20     515.20              No
## 3              HR          5     611.00              No
## 4 Quality Control         10     729.00             Yes
## 5      Marketting         40     843.25             Yes
## 6           Sales         70     790.50             Yes
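
The gsub call above is vectorised, so it handles all the names in one go. For comparison, here is a sketch of the same renaming written as an explicit for loop. It builds a small stand-in data frame (demo.df, hypothetical) so that it runs on its own. The anchored pattern "^X\\." is an assumption made here: it removes a literal "X." only at the start of a name, whereas an unanchored "X." would also match an X followed by any character elsewhere in a name.

```r
# Small stand-in data frame with the same kind of prefixed column names
demo.df <- data.frame(
  X.Dept_name = c("HR", "Sales"),
  X.Head_count = c(5, 70),
  stringsAsFactors = FALSE
)

for (i in seq_along(colnames(demo.df))) {
  # Strip a leading literal "X." from the i-th column name
  colnames(demo.df)[i] <- sub("^X\\.", "", colnames(demo.df)[i])
}

# The prefix is gone from both names
colnames(demo.df)
```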



Updating names in the list

In a similar way we will update the names in the list l1

names(l1)<-new.names
names(l1)
## [1] "Dept_name"       "Head_count"      "Avg_salary"      "Incentive_given"


Frequency Profiling

Before proceeding with any analysis, we need to calculate certain statistics that describe each variable/column, such as its central tendency. If the variable is categorical in nature, the frequency count of each level is widely reported. We will take a dataset that ships with the Ecdat package and create a frequency profile summary out of it.

# Load dplyr, installing it first if it is missing
if(!require("dplyr")){
  
  install.packages("dplyr")
  library(dplyr)
}


# Ecdat library for importing the dataset

if(!require("Ecdat")){
  
  install.packages("Ecdat")
  library(Ecdat)
}

data(MedExp)
df<-MedExp
head(df)
##         med lc idp      lpi fmde physlim ndisease    health     linc
## 1  62.07547  0 yes 6.907755    0      no 13.73189      good 9.528776
## 2   0.00000  0 yes 6.907755    0      no 13.73189 excellent 9.528776
## 3  27.76280  0 yes 6.907755    0      no 13.73189 excellent 9.528776
## 4 290.58220  0 yes 6.907755    0      no 13.73189      good 9.528776
## 5   0.00000  0 yes 6.109248    0      no 13.73189      good 8.538699
## 6   2.39521  0 yes 6.109248    0     yes 13.00000      good 8.538699
##       lfam educdec      age    sex child black
## 1 1.386294      12 43.87748   male    no    no
## 2 1.386294      12 17.59138   male   yes    no
## 3 1.386294      12 15.49966 female   yes    no
## 4 1.386294      12 44.14305 female    no    no
## 5 1.098612      12 14.54962 female   yes    no
## 6 1.098612      12 16.28268 female   yes    no


Let's list the column names to see which of them are categorical.

colnames(df)
##  [1] "med"      "lc"       "idp"      "lpi"      "fmde"     "physlim" 
##  [7] "ndisease" "health"   "linc"     "lfam"     "educdec"  "age"     
## [13] "sex"      "child"    "black"

The following columns are categorical in nature

  • idp
  • physlim
  • health
  • sex
  • child
  • black


Let's use the above columns to create a frequency profile.

required.columns<-c("idp","physlim","health",
                    "sex","child","black")
ls.interim<-list()
for(i in required.columns){
  
  # Count the records at each level of column i;
  # .data[[i]] refers to that column inside the piped data frame
  temp.df<-df%>%
    mutate(Feature_Name=i)%>%
    group_by(Level=.data[[i]],Feature_Name)%>%
    summarise(Count_of_Records=n())%>%
    ungroup()
  
  ls.interim[[i]]<-temp.df
    
}

freq.profile<-do.call(rbind.data.frame,ls.interim)
row.names(freq.profile)<-NULL
head(freq.profile)
## # A tibble: 6 x 3
##   Level     Feature_Name Count_of_Records
##   <fct>     <chr>                   <int>
## 1 no        idp                      4115
## 2 yes       idp                      1459
## 3 no        physlim                  4657
## 4 yes       physlim                   917
## 5 excellent health                   3017
## 6 good      health                   2034
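
The same loop-and-bind pattern also works without dplyr. The sketch below uses base R's table() on a small made-up data frame. toy.df and its values are hypothetical and are not taken from MedExp.

```r
# Hypothetical data frame with two categorical columns
toy.df <- data.frame(
  sex = c("male", "female", "female", "male", "female"),
  child = c("no", "yes", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

profile.list <- list()
for (col in colnames(toy.df)) {
  # table() counts how often each level occurs in the column
  counts <- table(toy.df[[col]])
  profile.list[[col]] <- data.frame(
    Level = names(counts),
    Feature_Name = col,
    Count_of_Records = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

# Stack the per-column summaries into one data frame
toy.profile <- do.call(rbind, profile.list)
row.names(toy.profile) <- NULL
toy.profile
```

Each pass through the loop produces one small data frame with the same three columns, which is what makes the final rbind step straightforward.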

Final Comments

In this blog we have seen how a for loop can be used in real scenarios. Mostly, a for loop can be used to change column names in a data frame or a list and to iterate through the columns of a data frame. We have also looked at how a regular expression can be used to correct the names in a data frame and a list by replacement. There will be a detailed blog on regular expressions in the weeks to come.

Comments:

  1. What's the difference between these two lines of code for renaming columns?
    colnames(iris)[which(colnames(iris) == "Species.Name")]
    colnames(iris[which(colnames(iris) == "Species.Name")])

    Replies
    1. 1. colnames(iris)[which(colnames(iris) == "Species.Name")]
      which(colnames(iris) == "Species.Name") gives the position of the "Species.Name" column. Let's say it is 3.

      colnames(iris)[3] gives the name of the column at the 3rd position, which is "Species.Name".

      2. colnames(iris[which(colnames(iris) == "Species.Name")])

      which(colnames(iris) == "Species.Name") again gives the position of the "Species.Name" column. Let's say it is 3.

      iris[3] selects the column at index position 3, which is "Species.Name". The result of this expression is a data frame with "Species.Name" as its only column.

      So colnames(iris[which(colnames(iris) == "Species.Name")]) also gives "Species.Name".

  2. Also, neither of the above two lines of code actually renames a column.
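
As the last comment notes, both lines only read a column name. A rename needs an assignment. Here is a minimal sketch, assuming the goal is to rename the Species column of the built-in iris data to "Species.Name" (the built-in dataset has no "Species.Name" column to begin with). The copy named flowers is a hypothetical name used so that iris itself is left untouched.

```r
# Work on a copy so the built-in iris dataset is not modified
flowers <- iris

# Find the position of the column to rename
idx <- which(colnames(flowers) == "Species")

# Assign into the vector of column names at that position
colnames(flowers)[idx] <- "Species.Name"

colnames(flowers)
```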
