Saturday, June 20, 2020

Blog 27: Introduction to NLP in R

Natural Language Processing


Introduction

In this blog we will look at how to handle text data in R: which libraries are used to manipulate text and how to derive summary statistics from it. We will also walk through some standard steps used in most NLP projects. These steps are:

  • Tokenization
  • Lemmatization/Stemming
  • Removing Stop words
  • n-gram profile
  • Identify Important features using Term Frequency
  • Create Document Term Matrix (DTM)


Installing the libraries: tidytext and textstem, along with the dplyr, tidyr and stringr packages

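# Packages used in this post; textstem provides lemmatize_strings()
package.name <- c("tidytext", "textstem", "dplyr", "tidyr", "stringr")

for (i in package.name) {
  # Install the package only if it cannot be loaded
  if (!require(i, character.only = TRUE)) {
    install.packages(i)
  }
  library(i, character.only = TRUE)
}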


Create a sample text data set

Let's create sample text data to understand key concepts better.

string_txt <- c("Roger Federer is undoubtedly the Greatest tennis player of all times",
                "His legacy is not in the number of grand slam championships",
                " he has won.",
                "He will defintely be remembered for the longevity of his career",
                " and how he was able to take care of his body over the years",
                "His return in 2017 and winning the Autralian open against his",
                " arch rival Nadal is considered to be a modern day spectacle",
                "The only thing left to achieve is the elusive",
                " Olympic gold in Tennis singles")




# In order to analyze this we need to convert it into a data frame
text_df<-data.frame(line=1:length(string_txt),text=string_txt,stringsAsFactors = F)
text_df
  line                                                                 text
1    1 Roger Federer is undoubtedly the Greatest tennis player of all times
2    2          His legacy is not in the number of grand slam championships
3    3                                                          he has won.
4    4      He will defintely be remembered for the longevity of his career
5    5          and how he was able to take care of his body over the years
6    6        His return in 2017 and winning the Autralian open against his
7    7          arch rival Nadal is considered to be a modern day spectacle
8    8                        The only thing left to achieve is the elusive
9    9                                       Olympic gold in Tennis singles


Step 1: Tokenization

Tokenization means breaking text into small chunks so that it can be analyzed properly. The unnest_tokens function in the tidytext library does this for us. By default it also lowercases the words and strips punctuation, as the output below shows

token.df<-text_df %>%
  unnest_tokens(word, text)

row.names(token.df)<-NULL
head(token.df)
  line        word
1    1       roger
2    1     federer
3    1          is
4    1 undoubtedly
5    1         the
6    1    greatest

We can see that the text has been broken into individual chunks. These chunks are known as tokens. The line column holds the row number of the original text, which can be used to group frequency-related metrics at row level. Next we will look at how to convert different forms of the same word to a base or root form using lemmatization
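Before moving on, here is a minimal base-R sketch of what word tokenization roughly does: lowercase the text, strip punctuation, and split on whitespace. The helper name simple_tokenize is made up for this illustration and is not part of tidytext; unnest_tokens relies on the tokenizers package and handles many more cases than this.

```r
# Rough approximation of word tokenization in base R (illustration only)
simple_tokenize <- function(x) {
  x <- tolower(x)                      # normalise case
  x <- gsub("[[:punct:]]+", " ", x)    # replace punctuation with spaces
  strsplit(trimws(x), "\\s+")          # split each string on whitespace
}

# One of the sample lines from this post
tokens <- simple_tokenize(" he has won.")
print(tokens[[1]])
```

The result is a list with one character vector of tokens per input string, which is the same information unnest_tokens stores as one row per token.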


Step 2: Lemmatization

token.df<-text_df %>%
  unnest_tokens(word, text)%>%
  mutate(word2=lemmatize_strings(word, dictionary = lexicon::hash_lemmas))


head(token.df%>%filter(word2 %in% c("win")))
  line    word word2
1    3     won   win
2    6 winning   win

We can see that won and winning have both been converted to win. The benefit is that words with the same root get a single representation, which makes aggregation more reliable. Note that lemmatization is expensive in terms of time, so use it with care on huge datasets
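Conceptually, dictionary-based lemmatization is a lookup from a word form to its lemma. The sketch below mimics that idea with a tiny hand-made named vector. The names lemma_lookup and lemmatize_simple are hypothetical and exist only for this illustration; the real lexicon::hash_lemmas table is far larger, and lemmatize_strings does more than a plain lookup.

```r
# Tiny hand-made lemma dictionary (assumption: for illustration only)
lemma_lookup <- c(won = "win", winning = "win", championships = "championship")

# Replace each word by its lemma when the dictionary has an entry,
# otherwise keep the word unchanged
lemmatize_simple <- function(words, lookup) {
  hits <- lookup[words]                      # NA where no entry exists
  unname(ifelse(is.na(hits), words, hits))
}

lemmas <- lemmatize_simple(c("won", "winning", "tennis"), lemma_lookup)
print(lemmas)
```

Words without a dictionary entry pass through untouched, which is why a misspelled token keeps its misspelling after lemmatization.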


Step 3: Removing Stop Words

Before we create summary stats, words that carry little meaning should be removed from the text corpus. Words such as the, a, there etc. are not necessary for the analysis. The tidytext package ships with a built-in stop word dictionary (stop_words); however, we can also create one of our own to remove tokens/words not required for the analysis

token.df<-text_df %>%
  unnest_tokens(word, text)%>%
  mutate(word2=lemmatize_strings(word, dictionary = lexicon::hash_lemmas))%>%
  select(-word)%>%
  rename(word=word2)%>%
  anti_join(stop_words)

head(token.df)
  line        word
1    1       roger
2    1     federer
3    1 undoubtedly
4    1      tennis
5    1      player
6    1        time


As can be seen, words such as the, is etc. have been removed. Let's say we also want to remove the word legacy. This can be done by creating our own stop word dictionary as shown below

custom_words<-c("legacy")
Stopword_custom<-data.frame(word=custom_words,stringsAsFactors = F)%>%
  cbind("lexicon"="SMART")%>%
  rbind.data.frame(stop_words)

head(Stopword_custom)
    word lexicon
1 legacy   SMART
2      a   SMART
3    a's   SMART
4   able   SMART
5  about   SMART
6  above   SMART

The word legacy has been added to the stop word dictionary, and now we can use this to remove legacy from the text corpus


token.df<-text_df %>%
  unnest_tokens(word, text)%>%
  mutate(word2=lemmatize_strings(word, dictionary = lexicon::hash_lemmas))%>%
  select(-word)%>%
  rename(word=word2)%>%
  anti_join(Stopword_custom)
Joining, by = "word"
head(token.df)
  line        word
1    1       roger
2    1     federer
3    1 undoubtedly
4    1      tennis
5    1      player
6    1        time

Legacy has been removed from the word column. It appeared on line 2, so the first six rows shown above, all from line 1, are unchanged
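Stop word removal here is simply an anti-join: keep every token that has no match in the stop word table. The sketch below shows this on two small made-up data frames (tokens_demo and stops_demo are illustrative names, not objects from this post). Passing by = "word" explicitly also avoids the Joining, by = "word" message.

```r
library(dplyr)

# Made-up tokens and a made-up custom stop word table
tokens_demo <- data.frame(line = c(1, 1, 2, 2),
                          word = c("roger", "is", "legacy", "slam"),
                          stringsAsFactors = FALSE)
stops_demo <- data.frame(word = c("is", "legacy"),
                         lexicon = "custom",
                         stringsAsFactors = FALSE)

# Keep only the tokens that do not appear in the stop word table
kept <- anti_join(tokens_demo, stops_demo, by = "word")
print(kept)
```

Labelling custom entries with their own lexicon value, such as "custom", makes it easy to tell them apart from the built-in entries later.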

Step 4: n-gram profile

Once the data is tokenised and properly set up, let's calculate some summary stats from it


Step 4.a: Unigram

# Count of tokens per row
unigram.df<-token.df%>%
  unnest_tokens(unigram, word, token = "ngrams", n = 1)%>%
  group_by(line,unigram)%>%
  summarise(Total_Count=n())

head(unigram.df)
# A tibble: 6 x 3
# Groups:   line [1]
   line unigram     Total_Count
  <int> <chr>             <int>
1     1 federer               1
2     1 player                1
3     1 roger                 1
4     1 tennis                1
5     1 time                  1
6     1 undoubtedly           1
# Count of tokens overall
unigram.total.df<-token.df%>%
  unnest_tokens(unigram, word, token = "ngrams", n = 1)%>%
  group_by(unigram)%>%
  summarise(Total_Count=n())%>%
  arrange(desc(Total_Count))

head(unigram.total.df)
# A tibble: 6 x 2
  unigram   Total_Count
  <chr>           <int>
1 tennis              2
2 win                 2
3 2017                1
4 achieve             1
5 arch                1
6 autralian           1


Step 4.b: Bigram

# Count of tokens per row
bigram.df<-token.df%>%
  unnest_tokens(bigram, word, token = "ngrams", n = 2)%>%
  group_by(line,bigram)%>%
  summarise(Total_Count=n())

head(bigram.df)
# A tibble: 6 x 3
# Groups:   line [2]
   line bigram              Total_Count
  <int> <chr>                     <int>
1     1 federer undoubtedly           1
2     1 player time                   1
3     1 roger federer                 1
4     1 tennis player                 1
5     1 undoubtedly tennis            1
6     2 grand slam                    1
# Count of tokens overall
bigram.total.df<-token.df%>%
  unnest_tokens(bigram, word, token = "ngrams", n = 2)%>%
  group_by(bigram)%>%
  summarise(Total_Count=n())%>%
  arrange(desc(Total_Count))

head(bigram.total.df)
# A tibble: 6 x 2
  bigram             Total_Count
  <chr>                    <int>
1 2017 win                     1
2 achieve elusive              1
3 arch rival                   1
4 care body                    1
5 day spectacle                1
6 defintely remember           1
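One caveat, stated as an assumption to check against your installed version: the n-gram code in this step relies on unnest_tokens gluing the one-word rows of each line back together before forming n-grams. The tidytext documentation describes a change in the default behaviour of the collapse argument in newer releases, so the same code may produce NA values there. A sketch that should not depend on that default is to rebuild one string per line first and then tokenize. The tokens_demo data frame below is made up for illustration.

```r
library(dplyr)
library(tidytext)

# Made-up token table in the same shape as token.df (one word per row)
tokens_demo <- data.frame(line = c(1L, 1L, 1L, 2L, 2L),
                          word = c("roger", "federer", "tennis", "grand", "slam"),
                          stringsAsFactors = FALSE)

# Rebuild one text string per line so that n-grams never span two lines
line_text <- tokens_demo %>%
  group_by(line) %>%
  summarise(text = paste(word, collapse = " ")) %>%
  ungroup()

# Form bigrams within each rebuilt line
bigrams <- line_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
print(bigrams)
```

From here the per-line and overall counts can be computed with group_by and summarise exactly as in the post.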


Step 4.c: Trigram

# Count of tokens per row
trigram.df<-token.df%>%
  unnest_tokens(trigram, word, token = "ngrams", n = 3)%>%
  group_by(line,trigram)%>%
  summarise(Total_Count=n())

head(trigram.df)
# A tibble: 6 x 3
# Groups:   line [2]
   line trigram                    Total_Count
  <int> <chr>                            <int>
1     1 federer undoubtedly tennis           1
2     1 roger federer undoubtedly            1
3     1 tennis player time                   1
4     1 undoubtedly tennis player            1
5     2 grand slam championship              1
6     2 numb grand slam                      1
# Count of tokens overall
trigram.total.df<-token.df%>%
  unnest_tokens(trigram, word, token = "ngrams", n = 3)%>%
  group_by(trigram)%>%
  summarise(Total_Count=n())%>%
  arrange(desc(Total_Count))

head(trigram.total.df)
# A tibble: 6 x 2
  trigram                      Total_Count
  <chr>                              <int>
1 <NA>                                   2
2 2017 win autralian                     1
3 arch rival nadal                       1
4 defintely remember longevity           1
5 federer undoubtedly tennis             1
6 gold tennis single                     1
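The <NA> row in the trigram counts most likely comes from lines that have fewer than three tokens left after stop word removal (line 3, for example, is reduced to the single token win). The base-R sketch below shows how n-grams are built from a token vector and why a short line yields NA. make_ngrams is a hypothetical helper written for this illustration, not a tidytext function.

```r
# Build n-grams from a character vector of tokens (illustration only)
make_ngrams <- function(tokens, n) {
  # Not enough tokens to form even one n-gram
  if (length(tokens) < n) return(NA_character_)
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

line1 <- c("roger", "federer", "undoubtedly", "tennis", "player", "time")
bigrams_line1 <- make_ngrams(line1, 2)
trigram_short <- make_ngrams("win", 3)
print(bigrams_line1)
print(trigram_short)
```

This is why the feature extraction in the next step filters with !is.na(Features) before selecting features.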


Step 5: Identify important features

We combine the unigram, bigram and trigram counts into a single data frame so that we can filter on the Total_Count variable and extract important features

ngram.df<-rbind.data.frame(unigram.total.df%>%rename(Features=unigram),
                           bigram.total.df%>%rename(Features=bigram),
                           trigram.total.df%>%rename(Features=trigram))

head(ngram.df)
# A tibble: 6 x 2
  Features  Total_Count
  <chr>           <int>
1 tennis              2
2 win                 2
3 2017                1
4 achieve             1
5 arch                1
6 autralian           1

Let's say we decide that n-grams with a frequency count of more than 1 are important, in the sense that they represent the text data at hand. We can extract such n-grams using the code below

features<-ngram.df%>%
  filter(Total_Count > 1)%>%
  filter(!is.na(Features))%>%
  select(Features)
# %>%
#   pull(Features)

features
# A tibble: 2 x 1
  Features
  <chr>   
1 tennis  
2 win     


Step 6: Creation of Document Term Matrix

Once features have been shortlisted, we can create the document term matrix, where each row represents a text record and each column represents an identified feature. Essentially, we are converting unstructured text data to a structured format

feature.df<-rbind.data.frame(unigram.df%>%rename(Features=unigram),
                             bigram.df%>%rename(Features=bigram),
                             trigram.df%>%rename(Features=trigram))%>%
  inner_join(features,by="Features")


head(feature.df)
# A tibble: 4 x 3
# Groups:   line [4]
   line Features Total_Count
  <int> <chr>          <int>
1     1 tennis             1
2     3 win                1
3     6 win                1
4     9 tennis             1

You can see that in the process of mapping the text data to Features, we have lost rows 2, 4, 5, 7 and 8. To avoid dropping records, let's add a “dummy” feature for every line to the above data frame

feature.df<-rbind.data.frame(unigram.df%>%rename(Features=unigram),
                             bigram.df%>%rename(Features=bigram),
                             trigram.df%>%rename(Features=trigram))%>%
  rbind.data.frame(data.frame(line=1:nrow(text_df),Features="dummy",Total_Count=1))%>%
  inner_join(rbind.data.frame(features,"dummy"),by="Features")

feature.df
# A tibble: 13 x 3
# Groups:   line [9]
    line Features Total_Count
   <int> <chr>          <dbl>
 1     1 tennis             1
 2     3 win                1
 3     6 win                1
 4     9 tennis             1
 5     1 dummy              1
 6     2 dummy              1
 7     3 dummy              1
 8     4 dummy              1
 9     5 dummy              1
10     6 dummy              1
11     7 dummy              1
12     8 dummy              1
13     9 dummy              1

Now we can use the spread function to convert the above data frame from long to wide (or from unpivoted to pivoted)

DTM<-feature.df%>%
  spread(Features,"Total_Count",fill=0)%>%
  select(-dummy)

DTM
# A tibble: 9 x 3
# Groups:   line [9]
   line tennis   win
  <int>  <dbl> <dbl>
1     1      1     0
2     2      0     0
3     3      0     1
4     4      0     0
5     5      0     0
6     6      0     1
7     7      0     0
8     8      0     0
9     9      1     0
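The spread function still works, but tidyr documents it as superseded by pivot_wider. As an alternative sketch, assuming a tidyr version that provides pivot_wider, complete can fill in the missing lines directly so that no dummy feature is needed. The feature_long data frame is a made-up stand-in for the matched features shown earlier in this step.

```r
library(dplyr)
library(tidyr)

# Stand-in for the matched features: only lines 1, 3, 6 and 9 are present
feature_long <- data.frame(line = c(1L, 3L, 6L, 9L),
                           Features = c("tennis", "win", "win", "tennis"),
                           Total_Count = 1L,
                           stringsAsFactors = FALSE)

dtm_alt <- feature_long %>%
  # Add every line/feature combination, with a count of 0 where it is missing
  complete(line = 1:9, Features, fill = list(Total_Count = 0L)) %>%
  # One row per line, one column per feature
  pivot_wider(names_from = Features, values_from = Total_Count)
print(dtm_alt)
```

The hard-coded 1:9 plays the role of 1:nrow(text_df) in the post, so every original line keeps a row even when it contains none of the features.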


Final Comments

We saw how text data can be cleaned, summarised and converted into a document term matrix. The resulting data set is now much like any other structured data set, to which a suitable ML algorithm can be applied. The steps shown in this blog recur in most text projects and so form a foundation for NLP in R
