Handling NA (Missing Values) in R
Parag Verma
27th Dec, 2019
NA Values: What Are They?
All missing values in R are represented as NA. NA is of special interest in R as there are a lot of inbuilt functions to handle it. We will look at some examples of how to handle NA values within a vector and a data frame.
NA values within a vector
Let's create a vector and insert an NA value
s = c(1,2,3,NA)
s
[1] 1 2 3 NA
‘s’ is a numeric vector with a single NA value. Let's check the class of ‘s’ to see the impact of NA on the data type
class(s)
[1] "numeric"
There is no impact of NA on the class of ‘s’
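As a small sketch of why the class is unchanged: NA on its own is a logical constant, but it takes on the type of the vector it is placed in. The vectors below are made up purely for illustration.

```r
# NA by itself is logical, but it adopts the type of its vector
class(NA)              # "logical"
class(c(1, 2, NA))     # "numeric"
class(c("a", NA))      # "character"
class(c(TRUE, NA))     # "logical"

# R also provides typed NA constants
class(NA_integer_)     # "integer"
class(NA_character_)   # "character"
```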
Get the total Count of NA in ‘s’
The function is.na() checks for the presence of NA. Its result is a logical vector with TRUE at the indices where NA values are present and FALSE elsewhere
is.na(s)
[1] FALSE FALSE FALSE TRUE
If we take the sum of is.na(s), we get the total number of occurrences of NA
sum(is.na(s))
[1] 1
Get the Index position of NA in ‘s’
Here we can use the ‘which’ function
which(is.na(s))
[1] 4
This will be useful when we are trying to replace/impute the NA within the vector.
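A minimal sketch of such a replacement is shown here. The fill value 0 is only a placeholder chosen for illustration; in practice the right value depends on the data.

```r
s <- c(1, 2, 3, NA)

# positions of the missing values
na_idx <- which(is.na(s))

# work on a copy and overwrite the NA positions with a placeholder value
s_imputed <- s
s_imputed[na_idx] <- 0
s_imputed
```

Working on a copy keeps the original vector available for comparison.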
Mathematical Operation on ‘s’
We will now look at the impact of NA on vector operations
t<-c(10,NA,12,NA)
Let's add ‘s’ to ‘t’
s + t
[1] 11 NA 15 NA
We can draw the following inferences
- NA added to a number gives NA
- NA added to a NA gives NA
The example uses addition, but the same generally holds for the other mathematical operators as well
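To illustrate with a few other operators, here is a short sketch using the same vector. It also shows that comparisons propagate NA, and one exception to the rule: any value raised to the power 0 is 1 in R, even NA.

```r
s <- c(1, 2, 3, NA)

s * 2      # 2 4 6 NA
s - 1      # 0 1 2 NA
s > 2      # FALSE FALSE TRUE NA

# comparing with NA gives NA, so test for missing values with is.na(), not ==
NA == NA   # NA

# exception: anything to the power 0 is 1
NA^0       # 1
```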
Inbuilt Mathematical function on ‘s’
Applying a built-in arithmetic function such as mean() on ‘s’
mean(s)
[1] NA
The result is NA. To get the mean without taking NA into account, we pass the argument na.rm=T to the mean function. This specifically asks the function to remove the NA values from s before computing the mean
mean(s,na.rm=T)
[1] 2
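The same argument is accepted by several other built-in summary functions. A brief sketch on the same vector:

```r
s <- c(1, 2, 3, NA)

sum(s)                    # NA
sum(s, na.rm = TRUE)      # 6
max(s, na.rm = TRUE)      # 3
median(s, na.rm = TRUE)   # 2
sd(s, na.rm = TRUE)       # 1
```

Writing TRUE in full instead of T is a little safer, because T is an ordinary variable that can be reassigned.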
Logical Operation on ‘s’
s1<-c(T,NA,F,NA,NA)
t1<-c(T,F,F,NA,T)
Let's AND ‘s1’ with ‘t1’
s1 & t1
[1] TRUE FALSE FALSE NA NA
We can draw the following inferences
- NA AND F gives F
- NA AND NA gives NA
- NA AND T gives NA
Let's OR ‘s1’ with ‘t1’
s1 | t1
[1] TRUE NA FALSE NA TRUE
We can draw the following inferences
- NA OR F gives NA
- NA OR NA gives NA
- NA OR T gives T
Let's NOT ‘s1’
!s1
[1] FALSE NA TRUE NA NA
We can draw the following inferences
- Not on NA gives NA
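The inferences above can be summarised as full truth tables. The sketch below builds them with outer(); the variable names are chosen only for this illustration. The pattern is that the result is NA only when the unknown value could change the answer: NA & FALSE is FALSE whatever the NA stands for, and NA | TRUE is TRUE whatever it stands for.

```r
vals <- c(TRUE, FALSE, NA)
lab  <- c("TRUE", "FALSE", "NA")

# outer() applies the operator to every pair of values (row value, column value)
and_table <- outer(vals, vals, "&")
or_table  <- outer(vals, vals, "|")
dimnames(and_table) <- list(lab, lab)
dimnames(or_table)  <- list(lab, lab)

and_table
or_table
```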
Practical Use Case: NA and Data Frames
In a data frame, we can select rows, columns, or both. So essentially we will be looking at ways to extract a set of rows and/or a subset of columns. Let's declare a data frame
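# install dplyr if it is missing, then load it
if (!require("dplyr")) {
  install.packages("dplyr")
  library(dplyr)
}
# list the datasets that ship with dplyr; starwars is one of them
data(package = "dplyr")
df <- starwars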
colnames(df)
[1] "name" "height" "mass" "hair_color" "skin_color"
[6] "eye_color" "birth_year" "gender" "homeworld" "species"
[11] "films" "vehicles" "starships"
head(df[,1:4])
# A tibble: 6 x 4
  name           height  mass hair_color
  <chr>           <int> <dbl> <chr>
1 Luke Skywalker    172    77 blond
2 C-3PO             167    75 <NA>
3 R2-D2              96    32 <NA>
4 Darth Vader       202   136 none
5 Leia Organa       150    49 brown
6 Owen Lars         178   120 brown, grey
We can see NAs in the hair_color column; other columns, such as gender, contain them as well. Let us create a small report highlighting the NA count for each variable
l1 <- list()
for (i in colnames(df)) {
  # count the NA values in column i
  Total_Count <- sum(is.na(df[, i]))
  temp.df <- data.frame(Variable = i, 'Sum of NA' = Total_Count, stringsAsFactors = FALSE)
  l1[[i]] <- temp.df
}
# stack the one-row data frames into a single report
df_NA <- do.call(rbind.data.frame, l1)
row.names(df_NA) <- NULL
df_NA
Variable Sum.of.NA
1 name 0
2 height 6
3 mass 28
4 hair_color 5
5 skin_color 0
6 eye_color 0
7 birth_year 44
8 gender 3
9 homeworld 10
10 species 5
11 films 0
12 vehicles 0
13 starships 0
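For data frames with ordinary (non-list) columns, the same kind of report can be sketched without an explicit loop. The example below uses a small made-up data frame named toy, which is hypothetical and not part of the starwars data, so that it runs without any extra packages.

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(
  name   = c("a", "b", "c", "d"),
  height = c(172, NA, 96, NA),
  colour = c("blond", NA, NA, "none"),
  stringsAsFactors = FALSE
)

# is.na() on a data frame gives a logical matrix;
# colSums() then counts the TRUE values in each column
na_counts <- colSums(is.na(toy))

toy_NA <- data.frame(
  Variable  = names(na_counts),
  Sum.of.NA = as.vector(na_counts),
  stringsAsFactors = FALSE
)
toy_NA
```

The for loop in the post builds the same two-column layout one variable at a time; this version simply does the counting in a single vectorised step.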
NA Imputations
Based on the above NA summary, it is clear that we need to replace the NAs with suitable values before deriving any summary insights from the data. First, let us identify the columns for which imputation/replacement needs to be done
required.columns<-df_NA[df_NA$Sum.of.NA > 0,][['Variable']]
required.columns
[1] "height" "mass" "hair_color" "birth_year" "gender"
[6] "homeworld" "species"
Height, mass and birth_year are numeric columns while the others are categorical in nature. The logic that we create should take this into account: numeric columns get the mean of the observed values, and categorical columns get the most frequent value
for (j in required.columns) {
  missing <- is.na(df[[j]])
  if (is.numeric(df[[j]])) {
    # numeric column: replace NA with the mean of the observed values
    df[[j]][missing] <- mean(df[[j]], na.rm = TRUE)
  } else {
    # categorical column: replace NA with the most frequent value
    df[[j]][missing] <- names(sort(table(df[[j]]), decreasing = TRUE))[1]
  }
}
df now has all its NA values replaced, depending upon whether a column was numeric or categorical in nature. We can check this using the piece of code below
l1_Check <- list()
for (i in colnames(df)) {
  # count the NA values left in column i after imputation
  Total_Count <- sum(is.na(df[, i]))
  temp.df <- data.frame(Variable = i, 'Sum of NA' = Total_Count, stringsAsFactors = FALSE)
  l1_Check[[i]] <- temp.df
}
df_NA_Check <- do.call(rbind.data.frame, l1_Check)
row.names(df_NA_Check) <- NULL
df_NA_Check
Variable Sum.of.NA
1 name 0
2 height 0
3 mass 0
4 hair_color 0
5 skin_color 0
6 eye_color 0
7 birth_year 0
8 gender 0
9 homeworld 0
10 species 0
11 films 0
12 vehicles 0
13 starships 0
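The impute-then-check workflow above can also be sketched as a small reusable function. Both impute_column() and the toy data frame are hypothetical names introduced here for illustration, and the sketch assumes every column has at least one observed (non-NA) value.

```r
# Hypothetical helper: impute a single column
impute_column <- function(x) {
  missing <- is.na(x)
  if (!any(missing)) return(x)   # nothing to replace
  if (is.numeric(x)) {
    # numeric: mean of the observed values
    x[missing] <- mean(x, na.rm = TRUE)
  } else {
    # categorical: most frequent observed value
    x[missing] <- names(sort(table(x), decreasing = TRUE))[1]
  }
  x
}

# Hypothetical toy data frame, used only for illustration
toy <- data.frame(
  height = c(170, NA, 180, NA),
  colour = c("brown", NA, "brown", "blond"),
  stringsAsFactors = FALSE
)

# apply the helper to every column, keeping the data frame structure
toy[] <- lapply(toy, impute_column)
toy

# quick check that no NA values remain
any(is.na(toy))   # FALSE
```

Mean and mode imputation are simple defaults; whether they suit a given column is a judgment call that depends on the data.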
Final Comments
In this blog we have seen how we can analyse the NA values within a vector and a data frame in R. There are a lot of built-in functions in R that help us get the total count of NAs and compute summary stats such as the mean in their presence. We also saw how we can use for loops to create a summary of variables with their NA counts, and how to do imputations.
Link to Previous R Blogs
Blog 1 - Vectors, Matrices, Lists and Data Frame in R https://mlmadeeasy.blogspot.com/2019/12/2datatypesr.html
Blog 2 - Operators in R https://mlmadeeasy.blogspot.com/2019/12/blog-2-operators-in-r.html
Blog 3 - Loops in R https://mlmadeeasy.blogspot.com/2019/12/blog-3-loops-in-r.html
Blog 4 - Indexing in R https://mlmadeeasy.blogspot.com/2019/12/blog-4-indexing-in-r.html
List of Datasets for Practise https://hofmann.public.iastate.edu/data_in_r_sortable.html