Machine Learning Made Easy: Blog 5: Handling NA in R

Handling NA (Missing Values) in R

NA Values: What Are They?

All missing values in R are represented as NA. NA is of special interest in R as there are a lot of inbuilt functions to handle it. We will look at some examples of how to handle NA values within a vector and a data frame.

NA values within a vector

Let's create a vector and insert an NA value

s = c(1,2,3,NA) 
s

[1]  1  2  3 NA

‘s’ is a numeric vector with a single NA value. Let's look at the class of ‘s’ to check the impact of NA on the data type

class(s)

[1] "numeric"

There is no impact of NA on the class of ‘s’
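The reason is that NA is coerced to the type of the vector that holds it. A small sketch of this behaviour, using made-up vectors that are not part of the original example:

```r
# A bare NA is a logical constant...
class(NA)                # "logical"

# ...but inside a vector it takes on the type of the other elements
class(c(1, 2, NA))       # "numeric"
class(c("a", "b", NA))   # "character"
class(c(1L, 2L, NA))     # "integer"
```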

Get the total Count of NA in ‘s’

The function is.na() checks for the presence of NA. Its result is a logical vector with TRUE at the indices where NA values are present and FALSE elsewhere

is.na(s)

[1] FALSE FALSE FALSE  TRUE

If we take the sum of is.na(s), we get the total number of occurrences of NA

sum(is.na(s))

[1] 1

Get the Index position of NA in ‘s’

Here we can use the ‘which’ function

which(is.na(s))

[1] 4

This will be useful when we are trying to replace/impute the NA within the vector.
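As a minimal sketch of that idea, the positions returned by which() can be used to overwrite the NA entries. Replacing with 0 is only an illustrative choice, and s_filled is a name introduced for this example so that ‘s’ itself stays unchanged.

```r
s <- c(1, 2, 3, NA)

# Work on a copy so the original vector is left untouched
s_filled <- s

# Overwrite the NA positions with a replacement value (0 is an arbitrary choice)
s_filled[which(is.na(s_filled))] <- 0
s_filled
```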

Mathematical Operation on ‘s’

We will now look at the impact of NA on vector operations

t<-c(10,NA,12,NA)

Let's add ‘s’ to ‘t’

s + t

[1] 11 NA 15 NA

We can draw the following inferences

NA added to a number gives NA
NA added to a NA gives NA

The example uses addition, but the same behaviour applies to the other mathematical operators as well
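For instance, NA propagates in the same way through multiplication and comparisons. The sketch below redefines ‘s’ and ‘t’ with the same values so that it runs on its own.

```r
s <- c(1, 2, 3, NA)
t <- c(10, NA, 12, NA)

s * t      # 10 NA 36 NA
s > 2      # FALSE FALSE TRUE NA
NA == NA   # NA, so use is.na() rather than == to test for missing values
```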

Inbuilt Mathematical function on ‘s’

Applying a built-in arithmetic function such as mean() on ‘s’

mean(s)

[1] NA

The result is NA. To get the mean without taking NA into account, we need to pass the argument na.rm=T to mean(). Here we are explicitly asking the function to remove the NA values from ‘s’ before computing the mean

mean(s,na.rm=T)

[1] 2
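Several other built-in summary functions accept the same na.rm argument. A short sketch on the same vector:

```r
s <- c(1, 2, 3, NA)

sum(s)                    # NA
sum(s, na.rm = TRUE)      # 6
max(s, na.rm = TRUE)      # 3
median(s, na.rm = TRUE)   # 2
```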

Logical Operation on ‘s’

s1<-c(T,NA,F,NA,NA)
t1<-c(T,F,F,NA,T)

Let's AND ‘s1’ with ‘t1’

s1 & t1

[1]  TRUE FALSE FALSE    NA    NA

We can draw the following inferences

NA AND F gives F
NA AND NA gives NA
NA AND T gives NA

Let's OR ‘s1’ with ‘t1’

s1 | t1

[1]  TRUE    NA FALSE    NA  TRUE

We can draw the following inferences

NA OR F gives NA
NA OR NA gives NA
NA OR T gives T

Let's NOT ‘s1’

!s1

[1] FALSE    NA  TRUE    NA    NA

We can draw the following inferences

NOT NA gives NA
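One way to read these results is to treat NA as "unknown": the result is NA only when the unknown value could change the answer. The sketch below builds the full AND and OR truth tables with outer(); vals, labels, and_table and or_table are names introduced here for illustration.

```r
vals <- c(TRUE, FALSE, NA)
labels <- c("TRUE", "FALSE", "NA")

# Apply the operator to every pair of values (rows: left operand, columns: right operand)
and_table <- outer(vals, vals, "&")
or_table <- outer(vals, vals, "|")

dimnames(and_table) <- list(labels, labels)
dimnames(or_table) <- list(labels, labels)

and_table
or_table
```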

Practical Use Case: NA and Data Frames

In a data frame, we can select rows, columns, or both. So essentially we will be looking at ways to extract a subset of rows and/or a subset of columns. Let's declare a data frame

if(!require("dplyr")){
  
  # dplyr is not installed yet: install it, then attach it
  install.packages("dplyr")
  library(dplyr)
}

data(package = "dplyr")
df<-starwars
colnames(df)

 [1] "name"       "height"     "mass"       "hair_color" "skin_color"
 [6] "eye_color"  "birth_year" "gender"     "homeworld"  "species"   
[11] "films"      "vehicles"   "starships"

head(df[,1:4])

# A tibble: 6 x 4
  name           height  mass hair_color 
  <chr>           <int> <dbl> <chr>      
1 Luke Skywalker    172    77 blond      
2 C-3PO             167    75 <NA>       
3 R2-D2              96    32 <NA>       
4 Darth Vader       202   136 none       
5 Leia Organa       150    49 brown      
6 Owen Lars         178   120 brown, grey

We can see that NAs are present in hair_color; columns not shown here, such as gender, contain them as well. Let us create a small report showing the NA count for each variable

l1<-list()
for(i in colnames(df)){
  
  Total_Count<-sum(is.na(df[,i]))
  temp.df<-data.frame(Variable=i,'Sum of NA'=Total_Count,stringsAsFactors = F)
  l1[[i]]<-temp.df
  
}

df_NA<-do.call(rbind.data.frame,l1)
row.names(df_NA)<-NULL
df_NA

     Variable Sum.of.NA
1        name         0
2      height         6
3        mass        28
4  hair_color         5
5  skin_color         0
6   eye_color         0
7  birth_year        44
8      gender         3
9   homeworld        10
10    species         5
11      films         0
12   vehicles         0
13  starships         0
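The same kind of report can also be produced without an explicit loop. The sketch below uses a small made-up data frame (toy) rather than starwars so that it runs on its own; sapply() applies the NA count to each column in turn.

```r
# A small made-up data frame, used only for illustration
toy <- data.frame(
  height = c(172, NA, 96, 202),
  hair_color = c("blond", NA, NA, "none"),
  name = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Count the NA values column by column
na_counts <- sapply(toy, function(column) sum(is.na(column)))

# Arrange the counts in the same Variable / Sum.of.NA layout as above
toy_NA <- data.frame(
  Variable = names(na_counts),
  Sum.of.NA = unname(na_counts),
  stringsAsFactors = FALSE
)
toy_NA
```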

NA Imputations

Based on the above summary, it is clear that we need to replace the NAs with suitable values before deriving any summary insights from the data. We first identify the columns that need imputation/replacement

required.columns<-df_NA[df_NA$Sum.of.NA > 0,][['Variable']]
required.columns

[1] "height"     "mass"       "hair_color" "birth_year" "gender"    
[6] "homeworld"  "species"

Height, mass and birth_year are numeric columns, while the others are categorical in nature. The logic we create should take this into account: numeric columns are filled with the column mean, and categorical columns with the most frequent value

for(j in required.columns){
  
  # rows where column j is missing
  missing.rows<-is.na(df[[j]])
  
  if(is.numeric(df[[j]])){
    
    # numeric column: replace NA with the mean of the observed values
    df[[j]][missing.rows]<-mean(df[[j]],na.rm=T)
    
  }else{
    
    # categorical column: replace NA with the most frequent value (the mode)
    df[[j]][missing.rows]<-names(sort(table(df[[j]]),decreasing=T)[1])
    
  }
  
}

df now has all of its NA values replaced, using a rule that depends on whether the column is numeric or categorical. We can check this using the piece of code below

l1_Check<-list()
for(i in colnames(df)){
  
  Total_Count<-sum(is.na(df[,i]))
  temp.df<-data.frame(Variable=i,'Sum of NA'=Total_Count,stringsAsFactors = F)
  l1_Check[[i]]<-temp.df
  
}

df_NA_Check<-do.call(rbind.data.frame,l1_Check)
row.names(df_NA_Check)<-NULL
df_NA_Check

     Variable Sum.of.NA
1        name         0
2      height         0
3        mass         0
4  hair_color         0
5  skin_color         0
6   eye_color         0
7  birth_year         0
8      gender         0
9   homeworld         0
10    species         0
11      films         0
12   vehicles         0
13  starships         0
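The numeric and categorical rules can also be wrapped in a small helper so that the imputation is reusable. This is only a sketch: impute_column and the toy data frame are names introduced here, and mean/mode replacement is the same simple strategy used above rather than a recommendation for every data set.

```r
# Hypothetical helper: fill NA with the mean (numeric) or the mode (otherwise)
impute_column <- function(column) {
  missing <- is.na(column)
  if (!any(missing)) {
    return(column)
  }
  if (is.numeric(column)) {
    column[missing] <- mean(column, na.rm = TRUE)
  } else {
    # table() ignores NA by default, so the mode is taken over observed values
    column[missing] <- names(sort(table(column), decreasing = TRUE))[1]
  }
  column
}

# A small made-up data frame, used only for illustration
toy <- data.frame(
  mass = c(77, NA, 32, 136),
  hair_color = c("brown", NA, "brown", "none"),
  stringsAsFactors = FALSE
)

# Apply the helper to every column while keeping the data frame structure
toy[] <- lapply(toy, impute_column)
toy

# Total number of NA values left; 0 means nothing remains to impute
sum(is.na(toy))
```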

Final Comments

In this blog we have seen how to analyse NA values within a vector and a data frame in R. R has many built-in functions that help us count NAs and compute summary statistics such as the mean in their presence. We also saw how to use for loops to build a per-variable summary of NA counts, and a simple method for imputation.