Term Frequency Inverse Document Frequency
Parag Verma
Introduction
In this blog we will look at how to use the TF-IDF metric to analyse text. We will take the text example from the previous blog to understand the key differences between the Term Frequency and the Term Frequency Inverse Document Frequency approaches. Broadly, we will look at the following topics
- What is TF-IDF
- Calculating TF-IDF on text data
- Creating the Document Term Matrix(DTM)
Installing the libraries: tidytext and textstem along with the dplyr, tidyr and stringr packages
package.name <- c("tidytext", "textstem", "dplyr", "tidyr", "stringr")
for(i in package.name){
  if(!require(i, character.only = T)){
    install.packages(i)
  }
  library(i, character.only = T)
}
What is Term Frequency Inverse Document Frequency (TF-IDF)
It is a metric calculated by multiplying the term frequency (TF) of a word by its inverse document frequency (IDF). IDF decreases the weight of commonly used terms and increases the importance of words that are rarely used in the text data. TF * IDF (TF-IDF) is therefore nothing but the frequency of a term adjusted for how rarely it is used.
We already know the formula for TF. Let's look at how to calculate IDF.
IDF is equal to \(\mathrm{ln}(n/n_{word})\), where
- \(n\) is the total number of documents (rows of text) in the data set
- \(n_{word}\) is the number of documents in which the word appears
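To make the formula concrete, here is a minimal Python sketch of the IDF calculation (the helper name `idf` is purely illustrative; the R workflow below uses tidytext's bind_tf_idf instead):

```python
import math

def idf(n_docs, n_docs_with_word):
    """Inverse document frequency: ln(n / n_word)."""
    return math.log(n_docs / n_docs_with_word)

# A word appearing in 1 of 9 documents gets a high weight,
# while a word appearing in all 9 documents gets zero weight.
rare = idf(9, 1)    # ln(9/1), roughly 2.197
common = idf(9, 9)  # ln(9/9) = 0.0
```

Notice that a term present in every document contributes nothing, which is exactly how IDF suppresses common words.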
Create a sample text data set
Let's create a sample text data set to understand the key concepts better.
string_txt <- c("Roger Federer is undoubtedly the Greatest tennis player of all times",
"His legacy is not in the number of grand slam championships",
" he has won.",
"He will defintely be remembered for the longevity of his career",
" and how he was able to take care of his body over the years",
"His return in 2017 and winning the Autralian open against his",
" arch rival Nadal is considered to be a modern day spectacle",
"The only thing left to achieve is the elusive",
" Olympic gold in Tennis singles")
# In order to analyze this we need to convert it into a data frame
text_df <- data.frame(line = 1:length(string_txt), text = string_txt, stringsAsFactors = F)
text_df
line text
1 1 Roger Federer is undoubtedly the Greatest tennis player of all times
2 2 His legacy is not in the number of grand slam championships
3 3 he has won.
4 4 He will defintely be remembered for the longevity of his career
5 5 and how he was able to take care of his body over the years
6 6 His return in 2017 and winning the Autralian open against his
7 7 arch rival Nadal is considered to be a modern day spectacle
8 8 The only thing left to achieve is the elusive
9 9 Olympic gold in Tennis singles
Step 1: Tokenization, lemmatization and removing stop words
custom_words <- c("legacy")
Stopword_custom <- data.frame(word = custom_words, stringsAsFactors = F) %>%
  cbind("lexicon" = "SMART") %>%
  rbind.data.frame(stop_words)

token.df <- text_df %>%
  unnest_tokens(word, text) %>%
  mutate(word2 = lemmatize_strings(word, dictionary = lexicon::hash_lemmas)) %>%
  select(-word) %>%
  rename(word = word2) %>%
  anti_join(Stopword_custom, by = "word")
head(token.df)
line word
1 1 roger
2 1 federer
3 1 undoubtedly
4 1 tennis
5 1 player
6 1 time
We can see that the text has been broken into individual chunks. These chunks are known as tokens. There is a column named line holding the row number, which can be used for grouping frequency-related metrics at the row level. We have also applied lemmatization and stop word removal to standardise the text data.
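For intuition, the tokenization and stop word removal steps can be roughly sketched in Python (a toy illustration: naive whitespace splitting, a small made-up stop word list that includes our custom word "legacy", and no lemmatization, so the result differs slightly from the tidytext/textstem pipeline above):

```python
# Toy stop word list (illustrative, not the SMART lexicon), plus the custom word.
STOP_WORDS = {"is", "the", "of", "all", "his", "legacy"}

def tokenize(line):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = [w.strip(".,").lower() for w in line.split()]
    return [w for w in tokens if w and w not in STOP_WORDS]

tokens = tokenize("Roger Federer is undoubtedly the Greatest tennis player of all times")
# -> ['roger', 'federer', 'undoubtedly', 'greatest', 'tennis', 'player', 'times']
```

A real pipeline would also lemmatize ("times" to "time") and use a full stop word lexicon, which is what the R code above does.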
Step 2: Calculating TF-IDF for unigrams
tf_idf.unigram <- token.df %>%
  group_by(line, word) %>%
  summarise(Total_Count = n()) %>%
  bind_tf_idf(word, line, Total_Count)
head(tf_idf.unigram,8)
# A tibble: 8 x 6
# Groups: line [2]
line word Total_Count tf idf tf_idf
<int> <chr> <int> <dbl> <dbl> <dbl>
1 1 federer 1 0.167 2.20 0.366
2 1 player 1 0.167 2.20 0.366
3 1 roger 1 0.167 2.20 0.366
4 1 tennis 1 0.167 1.50 0.251
5 1 time 1 0.167 2.20 0.366
6 1 undoubtedly 1 0.167 2.20 0.366
7 2 championship 1 0.25 2.20 0.549
8 2 grand 1 0.25 2.20 0.549
Let's take line 1 and go through some of the values of tf and idf for the word 'federer':
- TF: There are 6 words in line 1 and each appears only once, so the term frequency of each word is 1/6, which is around 0.167.
- IDF: There are a total of 9 documents (rows of text data) and 'federer' appears only in the first one, so ln(9/1) is around 2.197.
- TF-IDF: 0.167 x 2.197 gives around 0.366.
Similar inferences can be made for the other words as well.
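The arithmetic for 'federer' can be verified in a few lines of Python (variable names are illustrative):

```python
import math

n_docs = 9           # 9 rows of text data
words_in_line1 = 6   # tokens left in line 1 after stop word removal
docs_with_federer = 1

tf = 1 / words_in_line1                      # 1/6, roughly 0.167
idf = math.log(n_docs / docs_with_federer)   # ln(9/1), roughly 2.197
tf_idf = tf * idf                            # roughly 0.366
```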
Step 2.b: Calculating TF-IDF for unigrams, bigrams and trigrams
Here we will calculate TF-IDF scores individually for unigrams, bigrams and trigrams, and then combine all three results together.
unigram.df <- token.df %>%
  unnest_tokens(features, word, token = "ngrams", n = 1) %>%
  group_by(line, features) %>%
  summarise(Total_Count = n()) %>%
  bind_tf_idf(features, line, Total_Count)

bigram.df <- token.df %>%
  unnest_tokens(features, word, token = "ngrams", n = 2) %>%
  group_by(line, features) %>%
  summarise(Total_Count = n()) %>%
  bind_tf_idf(features, line, Total_Count)

trigram.df <- token.df %>%
  unnest_tokens(features, word, token = "ngrams", n = 3) %>%
  group_by(line, features) %>%
  summarise(Total_Count = n()) %>%
  bind_tf_idf(features, line, Total_Count)

ngram.df <- rbind.data.frame(unigram.df, bigram.df, trigram.df) %>%
  arrange(desc(tf_idf))
head(ngram.df,20)
# A tibble: 20 x 6
# Groups: line [7]
line features Total_Count tf idf tf_idf
<int> <chr> <int> <dbl> <dbl> <dbl>
1 5 care body 1 1 2.20 2.20
2 8 leave achieve elusive 1 1 2.20 2.20
3 3 win 1 1 1.50 1.50
4 5 body 1 0.5 2.20 1.10
5 5 care 1 0.5 2.20 1.10
6 8 achieve elusive 1 0.5 2.20 1.10
7 8 leave achieve 1 0.5 2.20 1.10
8 2 grand slam championship 1 0.5 2.20 1.10
9 2 numb grand slam 1 0.5 2.20 1.10
10 4 defintely remember longevity 1 0.5 2.20 1.10
11 4 remember longevity career 1 0.5 2.20 1.10
12 6 2017 win autralian 1 0.5 2.20 1.10
13 6 return 2017 win 1 0.5 2.20 1.10
14 9 gold tennis single 1 0.5 2.20 1.10
15 9 olympic gold tennis 1 0.5 2.20 1.10
16 8 achieve 1 0.333 2.20 0.732
17 8 elusive 1 0.333 2.20 0.732
18 8 leave 1 0.333 2.20 0.732
19 2 grand slam 1 0.333 2.20 0.732
20 2 numb grand 1 0.333 2.20 0.732
As you can see, the custom stop word 'legacy' does not appear anywhere in the features column.
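For intuition, the n-gram construction itself can be sketched in Python (the `ngrams` helper is illustrative and mirrors what unnest_tokens with token = "ngrams" produces; the tokens for line 8 are taken from the output above):

```python
def ngrams(tokens, n):
    """Consecutive n-grams joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["leave", "achieve", "elusive"]  # line 8 after lemmatization/stop word removal
bigrams = ngrams(tokens, 2)   # ['leave achieve', 'achieve elusive']
trigrams = ngrams(tokens, 3)  # ['leave achieve elusive']
```

These match the line 8 features seen in the ngram.df output.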
Step 3: Creating the DTM
Let's use a tf-idf value of more than 2 for feature selection.
features <- ngram.df %>%
  ungroup() %>%
  filter(tf_idf > 2) %>%
  filter(!is.na(features)) %>%
  select(features)
features
# A tibble: 2 x 1
features
<chr>
1 care body
2 leave achieve elusive
Once the features have been shortlisted, we can go ahead and create the document term matrix, where each row represents a text record and each column represents one of the identified features. Essentially, we are converting unstructured text data into a structured format.
feature.df <- ngram.df %>%
  select(line, features, tf_idf) %>%
  inner_join(features, by = "features")
head(feature.df)
# A tibble: 2 x 3
# Groups: line [2]
line features tf_idf
<int> <chr> <dbl>
1 5 care body 2.20
2 8 leave achieve elusive 2.20
You can see that in the process of mapping the text data to the features, we have lost rows 1, 2, 3, 4, 6, 7 and 9. In order to avoid dropping records, let's add a "dummy" feature for every line to the above data frame.
feature.df <- ngram.df %>%
  select(line, features, tf_idf) %>%
  rbind.data.frame(data.frame(line = 1:nrow(text_df), features = "dummy", tf_idf = 1)) %>%
  inner_join(rbind.data.frame(features, "dummy"), by = "features")
feature.df
# A tibble: 11 x 3
# Groups: line [9]
line features tf_idf
<int> <chr> <dbl>
1 5 care body 2.20
2 8 leave achieve elusive 2.20
3 1 dummy 1
4 2 dummy 1
5 3 dummy 1
6 4 dummy 1
7 5 dummy 1
8 6 dummy 1
9 7 dummy 1
10 8 dummy 1
11 9 dummy 1
Now we can use the spread function to convert the above data frame from long to wide format (or from unpivoted to pivoted).
DTM <- feature.df %>%
  spread(features, "tf_idf", fill = 0) %>%
  select(-dummy)
DTM
# A tibble: 9 x 3
# Groups: line [9]
line `care body` `leave achieve elusive`
<int> <dbl> <dbl>
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 5 2.20 0
6 6 0 0
7 7 0 0
8 8 0 2.20
9 9 0 0
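The long-to-wide reshaping that spread performs can be sketched in Python with a plain dictionary (a toy stand-in, zero-filling missing cells directly instead of adding dummy rows; the triples are the non-dummy rows of feature.df above):

```python
# (line, feature, tf_idf) triples, as in feature.df above
long_df = [(5, "care body", 2.20), (8, "leave achieve elusive", 2.20)]
all_lines = range(1, 10)  # 9 rows of text
feats = sorted({f for _, f, _ in long_df})

scores = {(l, f): s for l, f, s in long_df}
# One row per line, one column per feature, 0 where the feature is absent.
dtm = {l: {f: scores.get((l, f), 0.0) for f in feats} for l in all_lines}
# dtm[5]['care body'] is 2.2; every other cell in rows 1-4, 6-7 and 9 is 0.0
```

The `fill = 0` argument to spread plays the same role as the `0.0` default here.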
Final Comments
We saw how text data can be easily analysed using the TF-IDF metric by running through a simple example and understanding the calculation behind the metric. The next blog will focus on a use case of using a topic model to divide text data into meaningful topics.
Link to Previous R Blogs
List of Datasets for Practice
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html