Thursday, December 26, 2019

Blog 4- Indexing in R

Indexing in R

Once we store data in an object such as a vector, a list or a data frame, it becomes important to understand how to traverse the object and extract values from it. Most of the time the extraction depends on whether a particular condition is true or false. For this we will focus on a function called ‘which’, which is widely used to retrieve values from an object based on a given condition. It is important to note that indexing in R starts from 1.

Accessing values within a vector

Let's create a vector and try accessing its values. We can retrieve values from a vector by placing an index inside the single square bracket operator “[]”.

s = c("aa", "bb", "cc", "dd", "ee") 
s[3]
[1] "cc"

The square bracket operation returns a vector, here a character vector of length 1. We can check its type with class(s[3]).

class(s[3])
[1] "character"
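As a quick check of the 1-based rule, here is a minimal sketch using the same vector s. Note that index 0 is not an error in R; it simply selects nothing.

```r
s <- c("aa", "bb", "cc", "dd", "ee")

first <- s[1]   # the first element sits at index 1, not 0
zero  <- s[0]   # index 0 selects nothing: an empty character vector
one   <- s[3]   # a single value is still a vector, of length 1

print(first)
print(length(zero))
print(length(one))
```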


Extracting more than one value from a vector

s[c(1,3)]
[1] "aa" "cc"

We can see that the values at index positions 1 and 3 have been extracted.

Using negative indices within square brackets

People who have worked in Python know that negative indexing is used there as much as positive indexing, to count from the end. In R it has a different meaning: a negative index drops the element at that position from the result.

s[-3]
[1] "aa" "bb" "dd" "ee"

We can see that ‘cc’ was in the third position. After executing s[-3], the result no longer contains ‘cc’; s itself is unchanged unless the result is assigned back.
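The sketch below illustrates two details of negative indexing, using the same vector s: the original vector is not modified, and positive and negative indices cannot be mixed in one call. The names dropped, s2 and mixed are only for illustration.

```r
s <- c("aa", "bb", "cc", "dd", "ee")

dropped <- s[-3]   # a new vector without the third element
print(length(s))   # s still has all five elements

# To actually change a vector, the result has to be assigned back
s2 <- s
s2 <- s2[-3]

# Mixing positive and negative indices raises an error, caught here
mixed <- tryCatch(s[c(-1, 2)], error = function(e) "error")
print(mixed)
```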

Out of Range Index

What happens if we try to access an element at an index that does not exist? Let's find out.

s[10]
[1] NA

We get NA rather than an error.
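Because an out-of-range index quietly returns NA, it can help to check the index against length() first. The following is a small sketch; value_at() is a hypothetical helper written for this post, not a base R function.

```r
s <- c("aa", "bb", "cc", "dd", "ee")

# value_at() is a hypothetical helper, not part of base R
value_at <- function(x, i, default = "missing") {
  if (i >= 1 && i <= length(x)) x[i] else default
}

out_of_range <- s[10]        # NA, no error raised
print(is.na(out_of_range))
print(value_at(s, 10))       # falls back to the default
print(value_at(s, 2))        # a valid index returns the element
```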

Let's remove the first 3 elements from s

s[-1:-3]
[1] "dd" "ee"


Removing the last element from s

s[-length(s)]
[1] "aa" "bb" "cc" "dd"
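The same results can be reached in a few equivalent ways. As a sketch, head() and tail() from base R take a negative or positive count and may read more clearly than index arithmetic.

```r
s <- c("aa", "bb", "cc", "dd", "ee")

without_first3 <- s[-(1:3)]     # same result as s[-1:-3]
without_last   <- head(s, -1)   # same result as s[-length(s)]
last_two       <- tail(s, 2)    # keep only the last two elements

print(without_first3)
print(without_last)
print(last_two)
```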


Indexing with a data frame

In a data frame, we can select rows, columns or both. So essentially we will be looking at ways to extract a subset of rows and/or a subset of columns. Let's declare a data frame.

dep.data <- data.frame(
  X.Dept_name = c("Production","Finance","HR","Quality Control","Marketting","Sales"),
  X.Head_count = c(100,20,5,10,40,70),
  X.Avg_salary = c(623.3,515.2,611.0,729.0,843.25,790.50),
  X.Incentive_given = c("Yes","No","No","Yes","Yes","Yes"),
  stringsAsFactors = FALSE
)

dep.data
      X.Dept_name X.Head_count X.Avg_salary X.Incentive_given
1      Production          100       623.30               Yes
2         Finance           20       515.20                No
3              HR            5       611.00                No
4 Quality Control           10       729.00               Yes
5      Marketting           40       843.25               Yes
6           Sales           70       790.50               Yes


Get the element at row 1, column 3

Here we supply the row number as the row index and the column number as the column index.

dep.data[1,3]
[1] 623.3

This can also be done by supplying the row number as the row index and the column name as the column index.

dep.data[1,"X.Avg_salary"]
[1] 623.3

Get rows 1 and 2, and only column 2

Here we supply a range of row numbers as the row index and the column number as the column index.

dep.data[1:2,2]
[1] 100  20
dep.data[c(1:2),2]
[1] 100  20

This can also be done by supplying the row numbers as the row index and the column name as the column index.

dep.data[1:2,"X.Head_count"]
[1] 100  20


Get rows 1 and 2, and columns 1 and 2

Here we supply an integer vector as the row index and a character vector of column names as the column index.

dep.data[1:2,c("X.Dept_name","X.Head_count")]
  X.Dept_name X.Head_count
1  Production          100
2     Finance           20
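One detail worth noticing in the outputs above: selecting a single column returns a plain vector, while selecting several columns returns a data frame. The sketch below shows the drop = FALSE argument of the bracket operator, which keeps the data frame structure even for one column. The data frame dept and its column names are a small stand-in made up for this example.

```r
# A small stand-in data frame; the column names are illustrative
dept <- data.frame(
  name       = c("Production", "Finance", "HR"),
  head_count = c(100, 20, 5),
  stringsAsFactors = FALSE
)

as_vector <- dept[1:2, 2]                 # one column: simplified to a vector
as_frame  <- dept[1:2, 2, drop = FALSE]   # one column: still a data frame

print(class(as_vector))
print(class(as_frame))
```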


Indexing with Boolean Vector

Boolean (logical) vectors are also widely used to extract values from an object in R.

v <- c(1,4,4,3,2,2,3)
v > 2
[1] FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE

This gives a logical vector that is TRUE where v > 2 and FALSE otherwise.


Supplying Boolean Vector at index position

Apart from integer values, we can also supply Boolean values as the index. Only the positions marked TRUE are returned.

v[c(TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE)]
[1] 1 4
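In practice the logical vector is rarely typed by hand; it usually comes from a condition. A minimal sketch with the same vector v, combining the two steps above and also showing that a shorter logical index is recycled:

```r
v <- c(1, 4, 4, 3, 2, 2, 3)

above_two <- v[v > 2]   # keep values where the condition is TRUE
n_above   <- sum(v > 2) # TRUE counts as 1, so this counts the matches

# A shorter logical vector is recycled: TRUE, FALSE, TRUE, FALSE, ...
every_other <- v[c(TRUE, FALSE)]

print(above_two)
print(n_above)
print(every_other)
```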


Using the ‘$’ sign in a data frame to extract a single column

dep.data$X.Dept_name
[1] "Production"      "Finance"         "HR"              "Quality Control"
[5] "Marketting"      "Sales"          


Using the ‘[[]]’ operator in a data frame to extract a single column

This is the technique I prefer while doing data manipulation. Both ‘$’ and ‘[[]]’ yield a vector, but ‘[[]]’ also accepts a column name held in a variable, which makes it convenient inside functions and alongside dplyr code.

dep.data[["X.Dept_name"]]
[1] "Production"      "Finance"         "HR"              "Quality Control"
[5] "Marketting"      "Sales"          
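The practical difference shows up when the column name is stored in a variable. The sketch below uses a small made-up data frame dept; the variable name wanted is just for illustration.

```r
dept <- data.frame(
  name       = c("Production", "Finance", "HR"),
  head_count = c(100, 20, 5),
  stringsAsFactors = FALSE
)

wanted <- "head_count"         # column name held in a variable

by_brackets <- dept[[wanted]]  # looks up the column named "head_count"
by_dollar   <- dept$wanted     # looks for a column literally named "wanted"

print(by_brackets)
print(is.null(by_dollar))
```

Since dept has no column called "wanted", the ‘$’ form returns NULL instead of the values.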


‘which’ function in R

The which function returns the positions of the elements in a vector that fulfil a particular condition. It can be read simply as ‘give me the index positions of the elements which fulfil a certain condition’.

x <- c(1,5,8,4,6) 

# Positions of elements having a value greater than 3
which(x>3)
[1] 2 3 4 5
# Values of the elements where this condition is true
x[which(x>3)]
[1] 5 8 4 6
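For a vector without missing values, x[x>3] and x[which(x>3)] give the same result. They differ when the data contains NA, as the following sketch shows; it also covers one caution when negating the result of which(). The vector here is a variation of x with an NA added for illustration.

```r
x <- c(1, 5, NA, 4, 6)

with_logical <- x[x > 3]         # NA in the condition yields NA in the result
with_which   <- x[which(x > 3)]  # which() keeps only positions that are TRUE

# Caution: when nothing matches, which() returns integer(0),
# and x[-integer(0)] selects nothing instead of keeping everything
none_match <- which(x > 100)
dropped    <- x[-none_match]

print(with_logical)
print(with_which)
print(length(dropped))
```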


Practical Use Case

Let's say there is a data frame with 13 columns, and we want to create another data frame that has all the columns of the original except two. In such a case it would be laborious to write out the names of the remaining 11 columns. Instead, we can use the ‘which’ function. Let us look at the built-in datasets that ship with the dplyr package and use one of them, starwars.

# Install dplyr if it is missing, then load it
if (!require("dplyr")) {
  install.packages("dplyr")
  library(dplyr)
}

data(package = "dplyr")
df<-starwars
colnames(df)
 [1] "name"       "height"     "mass"       "hair_color" "skin_color"
 [6] "eye_color"  "birth_year" "gender"     "homeworld"  "species"   
[11] "films"      "vehicles"   "starships" 


Let's say we want to remove the species and homeworld columns and store the rest of the data in another data frame, df.interim.

pos<-which(!colnames(df) %in% c("species","homeworld"))
colnames(df)[pos]
 [1] "name"       "height"     "mass"       "hair_color" "skin_color"
 [6] "eye_color"  "birth_year" "gender"     "films"      "vehicles"  
[11] "starships" 

We will now create a data frame using ‘pos’

df.interim<-df[,colnames(df)[pos]]
colnames(df.interim)
 [1] "name"       "height"     "mass"       "hair_color" "skin_color"
 [6] "eye_color"  "birth_year" "gender"     "films"      "vehicles"  
[11] "starships" 
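The same idea can be sketched without dplyr, using the built-in mtcars data set so that it runs in a plain R session. This is only an illustration of the technique: the columns vs and am are chosen arbitrarily, and the logical-index and setdiff() variants are alternatives to the which() approach above, not a replacement for it.

```r
df <- mtcars                  # built-in data set with 11 columns
unwanted <- c("vs", "am")     # columns to leave out, chosen arbitrarily

# Approach from the post: positions of the columns to keep
pos <- which(!colnames(df) %in% unwanted)
df.interim <- df[, pos]

# Alternative 1: index with the logical vector directly
df.logical <- df[, !colnames(df) %in% unwanted]

# Alternative 2: compute the wanted names with setdiff()
df.named <- df[, setdiff(colnames(df), unwanted)]

print(colnames(df.interim))
```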


Final Comments

In this blog we have seen how indexing works for various objects in R. The which function is mostly used with data frames, where a large number of intermediate data frames are created on the way to the final results.

