Natural Language Processing
Parag Verma
Introduction
In this blog we will look at how to handle text data in R: which libraries are used to manipulate text data and derive summary statistics from it. Along with this, we will also look at some standard steps that are used in most NLP related projects. These steps can be divided into the following types
- Tokenization
- Lemmatization/Stemming
- Removing Stop words
- n-gram profile
- Identify Important Features using Term Frequency
- Create Document Term Matrix (DTM)
Installing the libraries: tidytext and textstem, along with the dplyr, tidyr and stringr packages
package.name <- c("tidytext","textstem","dplyr","tidyr","stringr")
for(i in package.name){
  if(!require(i, character.only = TRUE)){
    install.packages(i)
  }
  library(i, character.only = TRUE)
}
Create a sample text data set
Let's create a sample text data set to understand the key concepts better.
string_txt <- c("Roger Federer is undoubtedly the Greatest tennis player of all times",
"His legacy is not in the number of grand slam championships",
" he has won.",
"He will defintely be remembered for the longevity of his career",
" and how he was able to take care of his body over the years",
"His return in 2017 and winning the Autralian open against his",
" arch rival Nadal is considered to be a modern day spectacle",
"The only thing left to achieve is the elusive",
" Olympic gold in Tennis singles")
# In order to analyze this we need to convert it into a data frame
text_df<-data.frame(line=1:length(string_txt),text=string_txt,stringsAsFactors = F)
text_df
line text
1 1 Roger Federer is undoubtedly the Greatest tennis player of all times
2 2 His legacy is not in the number of grand slam championships
3 3 he has won.
4 4 He will defintely be remembered for the longevity of his career
5 5 and how he was able to take care of his body over the years
6 6 His return in 2017 and winning the Autralian open against his
7 7 arch rival Nadal is considered to be a modern day spectacle
8 8 The only thing left to achieve is the elusive
9 9 Olympic gold in Tennis singles
Step 1: Tokenization
Tokenization is the process of breaking text into small chunks so that it can be analyzed properly. The unnest_tokens function in the tidytext library can help us do that
token.df<-text_df %>%
unnest_tokens(word, text)
row.names(token.df)<-NULL
head(token.df)
line word
1 1 roger
2 1 federer
3 1 is
4 1 undoubtedly
5 1 the
6 1 greatest
We can see that the text has been broken into individual chunks. These chunks are known as tokens. Note that unnest_tokens also lowercases the tokens and strips punctuation by default. There is a column for the row number, named line, which can be used for grouping frequency related metrics at the row level. Next we will look at how to convert different forms of the same word to the base or root form by using Lemmatization.
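Before moving on, here is a quick, purely illustrative sketch of the kind of row-level metric the line column enables, namely counting tokens per line with dplyr's count (the column name Token_Count is arbitrary):
# Illustrative only: number of tokens in each line
token.df %>%
  count(line, name = "Token_Count")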
Step 2: Lemmatization
token.df<-text_df %>%
unnest_tokens(word, text)%>%
mutate(word2=lemmatize_strings(word, dictionary = lexicon::hash_lemmas))
head(token.df%>%filter(word2 %in% c("win")))
line word word2
1 3 won win
2 6 winning win
We can see that won and winning have been converted into the win form. The benefit of this is that there is a single representation of words with the same root, which helps to aggregate the data in a better way. It should also be noted that lemmatization is an expensive process in terms of time, and hence one should be careful when using it on huge datasets
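As mentioned in the introduction, stemming is the cruder but much faster alternative to lemmatization. A rough, illustrative comparison using textstem's stem_strings (sample_words is an arbitrary vector chosen only for this sketch):
# Illustrative only: stemming vs lemmatization on a few words
sample_words <- c("won", "winning", "championships", "longevity")
stem_strings(sample_words)       # rule-based stems, often truncated word forms
lemmatize_strings(sample_words)  # dictionary-based root forms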
Step 3: Removing Stop Words
Before we go ahead and create some summary stats, meaningless words should be removed from the text corpus. Words such as the, a, there, etc. should be removed as they are not necessary for the analysis. The tidytext package ships with a built-in stop word dictionary (the stop_words data set); however, we can also create one of our own in order to remove tokens/words not required for the analysis
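As a quick, illustrative peek at what the built-in dictionary looks like (stop_words in tidytext combines the SMART, snowball and onix lexicons):
# Illustrative only: inspect the built-in stop word dictionary
head(stop_words)
stop_words %>% count(lexicon)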
token.df<-text_df %>%
unnest_tokens(word, text)%>%
mutate(word2=lemmatize_strings(word, dictionary = lexicon::hash_lemmas))%>%
select(-word)%>%
rename(word=word2)%>%
anti_join(stop_words)
head(token.df)
line word
1 1 roger
2 1 federer
3 1 undoubtedly
4 1 tennis
5 1 player
6 1 time
As can be seen, words such as the, is, etc. have been removed. Let's say we want to remove the word legacy as well. This can be done by creating our own stop word dictionary as shown below
custom_words<-c("legacy")
Stopword_custom<-data.frame(word=custom_words,stringsAsFactors = F)%>%
cbind("lexicon"="SMART")%>%
rbind.data.frame(stop_words)
head(Stopword_custom)
word lexicon
1 legacy SMART
2 a SMART
3 a's SMART
4 able SMART
5 about SMART
6 above SMART
The word legacy has been added to the stop word dictionary, and now we can use this to remove legacy from the text corpus
token.df<-text_df %>%
unnest_tokens(word, text)%>%
mutate(word2=lemmatize_strings(word, dictionary = lexicon::hash_lemmas))%>%
select(-word)%>%
rename(word=word2)%>%
anti_join(Stopword_custom)
Joining, by = "word"
head(token.df)
line word
1 1 roger
2 1 federer
3 1 undoubtedly
4 1 tennis
5 1 player
6 1 time
As you can see, legacy has been removed from the word column
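As an aside, a lighter-weight alternative (just a sketch, with token.alt.df as an arbitrary name) is to drop the custom words directly with a filter instead of building a second stop word dictionary:
# Illustrative alternative: remove custom words with a simple filter
token.alt.df <- text_df %>%
  unnest_tokens(word, text) %>%
  mutate(word = lemmatize_strings(word, dictionary = lexicon::hash_lemmas)) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% custom_words)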
Step 4: n-gram Profile
Once the data is tokenized and properly set up, let's calculate some summary stats from it
Step 4.a: Unigram
# Count of tokens per row
unigram.df<-token.df%>%
unnest_tokens(unigram, word, token = "ngrams", n = 1)%>%
group_by(line,unigram)%>%
summarise(Total_Count=n())
head(unigram.df)
# A tibble: 6 x 3
# Groups: line [1]
line unigram Total_Count
<int> <chr> <int>
1 1 federer 1
2 1 player 1
3 1 roger 1
4 1 tennis 1
5 1 time 1
6 1 undoubtedly 1
# Count of tokens overall
unigram.total.df<-token.df%>%
unnest_tokens(unigram, word, token = "ngrams", n = 1)%>%
group_by(unigram)%>%
summarise(Total_Count=n())%>%
arrange(desc(Total_Count))
head(unigram.total.df)
# A tibble: 6 x 2
unigram Total_Count
<chr> <int>
1 tennis 2
2 win 2
3 2017 1
4 achieve 1
5 arch 1
6 autralian 1
Step 4.b: Bigram
# Count of tokens per row
bigram.df<-token.df%>%
unnest_tokens(bigram, word, token = "ngrams", n = 2)%>%
group_by(line,bigram)%>%
summarise(Total_Count=n())
head(bigram.df)
# A tibble: 6 x 3
# Groups: line [2]
line bigram Total_Count
<int> <chr> <int>
1 1 federer undoubtedly 1
2 1 player time 1
3 1 roger federer 1
4 1 tennis player 1
5 1 undoubtedly tennis 1
6 2 grand slam 1
# Count of tokens overall
bigram.total.df<-token.df%>%
unnest_tokens(bigram, word, token = "ngrams", n = 2)%>%
group_by(bigram)%>%
summarise(Total_Count=n())%>%
arrange(desc(Total_Count))
head(bigram.total.df)
# A tibble: 6 x 2
bigram Total_Count
<chr> <int>
1 2017 win 1
2 achieve elusive 1
3 arch rival 1
4 care body 1
5 day spectacle 1
6 defintely remember 1
Step 4.c: Trigram
# Count of tokens per row
trigram.df<-token.df%>%
unnest_tokens(trigram, word, token = "ngrams", n = 3)%>%
group_by(line,trigram)%>%
summarise(Total_Count=n())
head(trigram.df)
# A tibble: 6 x 3
# Groups: line [2]
line trigram Total_Count
<int> <chr> <int>
1 1 federer undoubtedly tennis 1
2 1 roger federer undoubtedly 1
3 1 tennis player time 1
4 1 undoubtedly tennis player 1
5 2 grand slam championship 1
6 2 numb grand slam 1
# Count of tokens overall
trigram.total.df<-token.df%>%
unnest_tokens(trigram, word, token = "ngrams", n = 3)%>%
group_by(trigram)%>%
summarise(Total_Count=n())%>%
arrange(desc(Total_Count))
head(trigram.total.df)
# A tibble: 6 x 2
trigram Total_Count
<chr> <int>
1 <NA> 2
2 2017 win autralian 1
3 arch rival nadal 1
4 defintely remember longevity 1
5 federer undoubtedly tennis 1
6 gold tennis single 1
Step 5: Identify Important Features
We combine the unigram, bigram and trigram counts into a single data frame so that we can apply a filter on the Total_Count variable and extract the important features. Note that the <NA> trigrams in the output above come from lines containing fewer than three tokens; these are filtered out below
ngram.df<-rbind.data.frame(unigram.total.df%>%rename(Features=unigram),
bigram.total.df%>%rename(Features=bigram),
trigram.total.df%>%rename(Features=trigram))
head(ngram.df)
# A tibble: 6 x 2
Features Total_Count
<chr> <int>
1 tennis 2
2 win 2
3 2017 1
4 achieve 1
5 arch 1
6 autralian 1
Let's say we decide that the n-grams with a frequency count of more than 1 are important, in the sense that they represent the text data at hand. We can extract such n-grams using the code below
features<-ngram.df%>%
filter(Total_Count > 1)%>%
filter(!is.na(Features))%>%
select(Features)
# %>%
# pull(Features)
features
# A tibble: 2 x 1
Features
<chr>
1 tennis
2 win
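A raw frequency threshold is the simplest criterion. A common refinement, sketched here purely for illustration using tidytext's bind_tf_idf on the per-line unigram counts, is to weight terms by tf-idf so that words frequent in one line but rare overall stand out:
# Illustrative only: tf-idf weighting of the per-line unigram counts
unigram.tfidf <- unigram.df %>%
  bind_tf_idf(unigram, line, Total_Count) %>%
  arrange(desc(tf_idf))
head(unigram.tfidf)
For the remainder of this blog we stick with the simple frequency-based features extracted above.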
Step 6: Creation of Document Term Matrix
Once the features have been shortlisted, we can go ahead and create the document term matrix, where each row represents a text record and the columns represent the features identified. Essentially, we are converting unstructured text data into a structured format
feature.df<-rbind.data.frame(unigram.df%>%rename(Features=unigram),
bigram.df%>%rename(Features=bigram),
trigram.df%>%rename(Features=trigram))%>%
inner_join(features,by="Features")
head(feature.df)
# A tibble: 4 x 3
# Groups: line [4]
line Features Total_Count
<int> <chr> <int>
1 1 tennis 1
2 3 win 1
3 6 win 1
4 9 tennis 1
You can see that in the process of mapping the text data to the Features, we have lost rows 2, 4, 5, 7 and 8. In order to avoid dropping records, let's add a “dummy” feature for every line to the above data frame
feature.df<-rbind.data.frame(unigram.df%>%rename(Features=unigram),
bigram.df%>%rename(Features=bigram),
trigram.df%>%rename(Features=trigram))%>%
rbind.data.frame(data.frame(line=1:nrow(text_df),Features="dummy",Total_Count=1))%>%
inner_join(rbind.data.frame(features,"dummy"),by="Features")
feature.df
# A tibble: 13 x 3
# Groups: line [9]
line Features Total_Count
<int> <chr> <dbl>
1 1 tennis 1
2 3 win 1
3 6 win 1
4 9 tennis 1
5 1 dummy 1
6 2 dummy 1
7 3 dummy 1
8 4 dummy 1
9 5 dummy 1
10 6 dummy 1
11 7 dummy 1
12 8 dummy 1
13 9 dummy 1
Now we can use the spread function to convert the above data frame from long to wide (or from unpivoted to pivoted)
DTM<-feature.df%>%
spread(Features,"Total_Count",fill=0)%>%
select(-dummy)
DTM
# A tibble: 9 x 3
# Groups: line [9]
line tennis win
<int> <dbl> <dbl>
1 1 1 0
2 2 0 0
3 3 0 1
4 4 0 0
5 5 0 0
6 6 0 1
7 7 0 0
8 8 0 0
9 9 1 0
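As a side note, spread has been superseded in recent versions of tidyr. An equivalent sketch using pivot_wider (argument names as in recent tidyr releases; DTM_alt is an arbitrary name and should match DTM above) would be:
# Illustrative alternative using pivot_wider instead of spread
DTM_alt <- feature.df %>%
  pivot_wider(names_from = Features, values_from = Total_Count, values_fill = 0) %>%
  select(-dummy)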
Final Comments
We saw how text data can be easily cleaned, summarised and converted into a document term matrix. The resulting data set is now much like any other data set, and a suitable ML algorithm can be applied to it. The steps shown in this blog are the most important ones, as they are repeated in almost all text projects and are hence the foundation of NLP in R
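Purely as an illustration of that last point (the choice of kmeans and the variable names here are arbitrary, and a two-feature DTM built from nine sentences is far too small for a meaningful model), the DTM can be handed to a standard algorithm like this:
# Illustrative only: convert the DTM to a plain matrix and fit a toy model
dtm_matrix <- DTM %>%
  ungroup() %>%
  select(-line) %>%
  as.matrix()
set.seed(42)
kmeans(dtm_matrix, centers = 2)  # toy clustering of the 9 text records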
Link to Previous R Blogs
List of Datasets for Practice
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html