Sunday, March 15, 2020

Blog 15: String vs Factors


Introduction

In this blog we will look at how to handle character strings and factors. We will cover the following topics:

  • Difference Between Factor and Character Data Type
  • Difference Between Factor and Numeric Data Type
  • Issues while writing the factor data to drive
  • Handle Numeric data stored as factor
  • Functions in R to handle both of them


Installing the libraries: dplyr, tidyr and stringr

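# require() loads the package when it is installed and returns FALSE otherwise,
# so we only install (and then load) a package when it is missing
if(!require("dplyr")){
  install.packages("dplyr")
  library(dplyr)
}

if(!require("tidyr")){
  install.packages("tidyr")
  library(tidyr)
}

if(!require("stringr")){
  install.packages("stringr")
  library(stringr)
}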


Difference Between Factor and Character

Let's look at the basic difference between character and factor. We will create a character vector and see how it can be converted into a factor.

x<-c("A","B","A","B","A","B")
x
[1] "A" "B" "A" "B" "A" "B"
class(x)
[1] "character"

If we check the class of x, it gives character.

Let us now convert it into a factor variable

y<-as.factor(x)
y
[1] A B A B A B
Levels: A B

As we can see, y is now a factor: the values print without quotes, and a Levels line lists the unique values within y. In other words, the as.factor function turns a variable (character, numeric or date) into a nominal variable. Internally, the result is stored as integer codes that point to these levels.
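
To make the internal representation concrete, the short sketch below inspects the pieces of a factor. It rebuilds the same vector under the names x_demo and y_demo, which are chosen only for this illustration.

```r
# Illustrative names; not part of the original example
x_demo <- c("A", "B", "A", "B", "A", "B")
y_demo <- as.factor(x_demo)

class(y_demo)        # "factor"
levels(y_demo)       # the unique labels: "A" "B"
as.integer(y_demo)   # the integer codes: 1 2 1 2 1 2
typeof(y_demo)       # "integer", the underlying storage type
```

The labels live in levels(), and the vector itself only holds the positions of those labels.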

Difference Between Factor and Numeric

Let us now create a numeric vector x3 and then convert it into a factor. We will check what happens when we compare the two.

x3<-c(100:105)
x3
[1] 100 101 102 103 104 105
x4<-as.factor(x3)

x3==x4
[1] TRUE TRUE TRUE TRUE TRUE TRUE
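
Every comparison is TRUE because == on a factor compares the level labels, not the stored codes, so the two vectors only look interchangeable. A minimal sketch of what differs underneath (n_demo and f_demo are illustrative names, not from the example above):

```r
# Illustrative names; same values as x3 and x4
n_demo <- 100:105
f_demo <- as.factor(n_demo)

all(n_demo == f_demo)       # TRUE: the labels "100".."105" match the numbers
identical(n_demo, f_demo)   # FALSE: one is an integer vector, the other a factor
levels(f_demo)              # the labels are character strings
as.integer(f_demo)          # the codes are 1 2 3 4 5 6, not the original values
```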


Issues while writing the factor data to drive

Let's create a data frame with x, y, x3 and x4 as columns and write it to a csv file. We will then read it back to see the difference.

df<-data.frame(C1=x,
               C2=y,
               C3=x3,
               C4=x4,
               stringsAsFactors = FALSE)

str(df)
'data.frame':   6 obs. of  4 variables:
 $ C1: chr  "A" "B" "A" "B" ...
 $ C2: Factor w/ 2 levels "A","B": 1 2 1 2 1 2
 $ C3: int  100 101 102 103 104 105
 $ C4: Factor w/ 6 levels "100","101","102",..: 1 2 3 4 5 6
write.csv(df,"dummyfile.csv",row.names = FALSE)

We know that C4 holds numeric values but is stored as a factor. If we convert it directly into numeric we will run into issues: we get 1,2,3,4,5,6 instead of 100,101,102,103,104,105. This happens because as.numeric returns the factor's internal integer codes instead of the values shown in C4.

df2<-read.csv("dummyfile.csv",stringsAsFactors = F)

as.numeric(df[["C4"]])
[1] 1 2 3 4 5 6
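
Reading the file back also changes the column types, since a csv file stores only the values and not the factor structure. The sketch below shows this round trip on a small data frame; it writes to a temporary file (tmp_file, a name introduced here) rather than dummyfile.csv, and df_demo and df_back are likewise illustrative.

```r
# Illustrative data frame with two factor columns
df_demo <- data.frame(C2 = factor(c("A", "B", "A")),
                      C4 = factor(c(100, 101, 102)))

# Write to a temporary csv file and read it back
tmp_file <- tempfile(fileext = ".csv")
write.csv(df_demo, tmp_file, row.names = FALSE)
df_back <- read.csv(tmp_file, stringsAsFactors = FALSE)

sapply(df_demo, class)   # both columns are factors before writing
sapply(df_back, class)   # after reading: C2 is character, C4 is integer
```

So a factor of numbers that is written and read back this way returns as a plain numeric column, while the in-memory factor still needs explicit conversion.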


Handle Numeric data stored as factor

There is a workaround for the above scenario. Let's look at the example below.

as.numeric(as.character(df[["C4"]]))
[1] 100 101 102 103 104 105

The idea is to first convert the factor to character and then to numeric. In this way we retain the values in the C4 vector.
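
An equivalent approach, described in R's help page for factor, converts the levels once and then indexes them by the factor codes. The sketch below compares the three conversions on a small factor (f_vals is an illustrative name):

```r
# Illustrative factor; its levels are "100", "103", "105"
f_vals <- factor(c(100, 105, 100, 103))

as.numeric(f_vals)                   # codes: 1 3 1 2
as.numeric(as.character(f_vals))     # values: 100 105 100 103
as.numeric(levels(f_vals))[f_vals]   # same values, converting each level only once
```

Both of the last two lines recover the original numbers; the levels-based form does less work when a long factor has few distinct levels.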


Functions in R to handle both of them

There are certain functions and arguments in R that control string to factor conversion. data.frame, read.csv, read.table and other similar functions have a stringsAsFactors argument, which can be set to the following values:

  • stringsAsFactors = TRUE: all character vectors/strings are stored as the factor data type
  • stringsAsFactors = FALSE: all character vectors/strings are stored as the character data type
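
The effect of the two settings can be checked directly with data.frame. This is a small sketch with illustrative names (letters_demo, df_fac, df_chr). Passing the argument explicitly is a reasonable habit, because its default changed from TRUE to FALSE in R 4.0.0.

```r
# Illustrative character vector
letters_demo <- c("A", "B", "A")

# Same data, built with each setting of stringsAsFactors
df_fac <- data.frame(C1 = letters_demo, stringsAsFactors = TRUE)
df_chr <- data.frame(C1 = letters_demo, stringsAsFactors = FALSE)

class(df_fac$C1)   # "factor"
class(df_chr$C1)   # "character"
```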


Final Comments

Factor and character data look almost the same when printed, but a factor is stored as integer codes with level labels. One needs to understand the nuances of handling them, as they become important when converting factors to numbers and when using the different file read and write functions.

