String Vs Factors
Parag Verma
Introduction
In this blog we will look at how to handle character strings and factors in R. We will look at the following topics:
- Difference Between Factor and Character Data Type
- Difference Between Factor and Numeric Data Type
- Issues while writing the factor data to drive
- Handle Numeric data stored as factor
- Functions in R to handle both of them
Installing the libraries: dplyr, tidyr and stringr
# Install each package if it is missing, then load it.
# (In the original form a freshly installed package was never loaded.)
if(!require("dplyr")){
  install.packages("dplyr")
  library(dplyr)
}
if(!require("tidyr")){
  install.packages("tidyr")
  library(tidyr)
}
if(!require("stringr")){
  install.packages("stringr")
  library(stringr)
}
Difference Between Factor and Character
Let's look at the basic difference between character and factor. We will create a character vector and see how it can be converted into a factor.
x<-c("A","B","A","B","A","B")
x
[1] "A" "B" "A" "B" "A" "B"
class(x)
[1] "character"
If we check the class of x, it gives character.
Let us now convert it into a factor variable
y<-as.factor(x)
y
[1] A B A B A B
Levels: A B
As we can see, y prints without quotes and is followed by a Levels line listing the unique values within y; calling class(y) returns factor. In other words, the as.factor function changes a variable (character, numeric or date) into a nominal (categorical) variable. Internally, each element is stored as an integer code that points to one of the levels.
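To make the storage of a factor more concrete, here is a small sketch that inspects y with a few base R functions. It rebuilds x and y so that it runs on its own; nothing beyond base R is assumed.

```r
# Rebuild the factor from the example above so this block is self-contained.
x <- c("A", "B", "A", "B", "A", "B")
y <- as.factor(x)

class(y)       # "factor"
levels(y)      # the unique labels: "A" "B"
nlevels(y)     # number of levels: 2
as.integer(y)  # the underlying integer codes: 1 2 1 2 1 2
typeof(y)      # "integer", since each element is stored as a code
```

The labels "A" and "B" are kept only once, in the levels attribute, while the vector itself holds the codes 1 and 2.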
Difference Between Factor and Numeric
Let us now create a numeric vector x3 and then convert it into a factor. We will check what happens then.
x3<-c(100:105)
x3
[1] 100 101 102 103 104 105
x4<-as.factor(x3)
x3==x4
[1] TRUE TRUE TRUE TRUE TRUE TRUE
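The comparison returns TRUE everywhere because == matches the numbers against the factor's labels, but that does not make x4 numeric. The sketch below, using only base R, shows where the two vectors actually differ; the name x4_plus_one is introduced here purely for illustration.

```r
x3 <- c(100:105)
x4 <- as.factor(x3)

is.numeric(x3)   # TRUE
is.numeric(x4)   # FALSE: a factor is not treated as a number
levels(x4)       # the labels are character strings: "100" "101" ...
as.integer(x4)   # the stored codes: 1 2 3 4 5 6
sum(x3)          # 615

# Arithmetic on a factor is not meaningful: R returns NA with a warning,
# which is suppressed here to keep the output clean.
x4_plus_one <- suppressWarnings(x4 + 1)
x4_plus_one      # NA NA NA NA NA NA
```

So x4 looks like x3 when printed or compared, yet it cannot be used in calculations until it is converted back.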
Issues while writing the factor data to drive
Let's create a data frame with x, y, x3 and x4 as columns and write it as a csv file. We will then read it back to see the difference.
df<-data.frame(C1=x,
               C2=y,
               C3=x3,
               C4=x4,
               stringsAsFactors = FALSE)
str(df)
'data.frame': 6 obs. of 4 variables:
$ C1: chr "A" "B" "A" "B" ...
$ C2: Factor w/ 2 levels "A","B": 1 2 1 2 1 2
$ C3: int 100 101 102 103 104 105
$ C4: Factor w/ 6 levels "100","101","102",..: 1 2 3 4 5 6
write.csv(df,"dummyfile.csv",row.names = FALSE)
We know that C4 holds numeric values but is stored as a factor. If we convert it directly into numeric we will run into ISSUES: we will get 1,2,3,4,5,6 instead of 100,101,102,103,104,105. This is because the internal level codes are converted into numeric data instead of the values within C4.
df2<-read.csv("dummyfile.csv",stringsAsFactors = FALSE)
as.numeric(df[["C4"]])
[1] 1 2 3 4 5 6
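To see what the write and read round trip does to each column, the following sketch compares the column classes before and after. It rebuilds the data frame so that it runs on its own, and it writes to a temporary file (an assumption made here to avoid touching dummyfile.csv).

```r
# Same columns as df above, rebuilt so the block is self-contained.
df <- data.frame(C1 = c("A", "B", "A", "B", "A", "B"),
                 C2 = factor(c("A", "B", "A", "B", "A", "B")),
                 C3 = 100:105,
                 C4 = factor(100:105),
                 stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".csv")  # temporary path, used only for this sketch
write.csv(df, tmp, row.names = FALSE)
df2 <- read.csv(tmp, stringsAsFactors = FALSE)
unlink(tmp)                        # clean up the temporary file

sapply(df, class)   # classes before writing: C2 and C4 are factors
sapply(df2, class)  # classes after reading the file back
```

A csv file stores only text, so the factor information is not preserved: after reading back, C2 comes in as character and C4 as a plain number.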
Handle Numeric data stored as factor
There is a workaround for the above scenario. Let's look at the example below.
as.numeric(as.character(df[["C4"]]))
[1] 100 101 102 103 104 105
The idea is to first convert the factor to character and then ultimately to numeric. In this way we retain the values in the C4 vector.
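An alternative idiom, mentioned in R's own help page for factor, converts the levels once and then indexes them by the factor. The sketch below compares both approaches on a small made-up factor f whose values are deliberately not in sorted order; f, via_character and via_levels are illustrative names, not part of the example above.

```r
# A factor holding numbers as labels, in unsorted order.
f <- factor(c("105", "100", "103", "100"))

# Approach from the text: factor -> character -> numeric.
via_character <- as.numeric(as.character(f))

# Alternative: convert the (few) levels once, then index by the factor codes.
via_levels <- as.numeric(levels(f))[f]

via_character                          # 105 100 103 100
via_levels                             # 105 100 103 100
identical(via_character, via_levels)   # TRUE
as.numeric(f)                          # 3 1 2 1, the codes, not the values
```

Both approaches recover the original values; the second only has to convert each distinct level once, which can matter for long vectors with few levels.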
Functions in R to handle both of them
There are certain functions/arguments in R that can be used to handle string to factor conversion. data.frame, read.csv, read.table and other similar functions have a stringsAsFactors argument (default FALSE since R 4.0.0, TRUE in earlier versions) which can be set to the following values.
- stringsAsFactors=TRUE, then all the character vectors/strings are stored as Factor data type
- stringsAsFactors=FALSE, then all the character vectors/strings are stored as Character data type
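The effect of the two settings can be checked with a minimal sketch using data.frame. The data frame names df_chr and df_fct are made up for this illustration.

```r
# Same input, two settings of stringsAsFactors.
df_chr <- data.frame(C1 = c("A", "B", "A"), stringsAsFactors = FALSE)
df_fct <- data.frame(C1 = c("A", "B", "A"), stringsAsFactors = TRUE)

class(df_chr$C1)   # "character"
class(df_fct$C1)   # "factor"
levels(df_fct$C1)  # "A" "B"
```

Setting the argument explicitly, rather than relying on the default, keeps the behaviour the same across R versions.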
Final Comments
Factor and character data types look almost the same when printed, but a factor is stored as integer codes with a set of levels. One needs to understand the nuances of how to handle them, as they become important when we use different file read and write functions.
Link to Previous R Blogs
List of Datasets for Practise
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html