Term Frequency Inverse Document Frequency
Parag Verma
Introduction
In this blog we will look at how to use the TF-IDF metric to analyse text. We will take the text example from the previous blog to understand the key differences between the Term Frequency and the Term Frequency Inverse Document Frequency approaches. Broadly, we will look at the following topics
- What is TF-IDF
- Calculating TF-IDF on text data
- Creating the Document Term Matrix(DTM)
Installing the libraries: tidytext and textstem along with the dplyr, tidyr and stringr packages
package.name <- c("tidytext", "textstem", "dplyr", "tidyr", "stringr")
for(i in package.name){
  if(!require(i, character.only = T)){
    install.packages(i)
  }
  library(i, character.only = T)
}
What is Term Frequency Inverse Document Frequency (TF-IDF)
It is a metric calculated by multiplying the term frequency (TF) of a word by its inverse document frequency (IDF). IDF decreases the weight of commonly used terms and increases the importance of words that are rarely used in the text data. TF * IDF (TF-IDF) is therefore nothing but the frequency of a term adjusted for how rarely it is used.
We already know the formula for TF. Let's look at how to calculate IDF.
IDF is equal to \(\mathrm{ln}(n/n_{word})\), where
- \(n\) is the total number of documents (rows of text) in the data set
- \(n_{word}\) is the number of documents in which the word appears
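To make the formula concrete, here is a minimal Python sketch of the IDF calculation (the helper name `idf` is purely illustrative; the R workflow below uses tidytext's bind_tf_idf instead):

```python
import math

def idf(n_docs, n_docs_with_word):
    """Inverse document frequency: ln(n / n_word)."""
    return math.log(n_docs / n_docs_with_word)

# A word appearing in 1 of 9 documents gets a high weight,
# while a word appearing in all 9 documents gets zero weight.
rare = idf(9, 1)    # ln(9/1), roughly 2.197
common = idf(9, 9)  # ln(9/9) = 0.0
```

Notice that a term present in every document contributes nothing, which is exactly how IDF suppresses common words.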
Create a sample text data set
Let's create a sample text data set to understand the key concepts better.
string_txt <- c("Roger Federer is undoubtedly the Greatest tennis player of all times",
"His legacy is not in the number of grand slam championships",
" he has won.",
"He will defintely be remembered for the longevity of his career",
" and how he was able to take care of his body over the years",
"His return in 2017 and winning the Autralian open against his",
" arch rival Nadal is considered to be a modern day spectacle",
"The only thing left to achieve is the elusive",
" Olympic gold in Tennis singles")
# In order to analyze this we need to convert it into a data frame
text_df <- data.frame(line = 1:length(string_txt), text = string_txt, stringsAsFactors = F)
text_df
line text
1 1 Roger Federer is undoubtedly the Greatest tennis player of all times
2 2 His legacy is not in the number of grand slam championships
3 3 he has won.
4 4 He will defintely be remembered for the longevity of his career
5 5 and how he was able to take care of his body over the years
6 6 His return in 2017 and winning the Autralian open against his
7 7 arch rival Nadal is considered to be a modern day spectacle
8 8 The only thing left to achieve is the elusive
9 9 Olympic gold in Tennis singles
Step 1: Tokenization, lemmatization and removing stop words
custom_words <- c("legacy")
Stopword_custom <- data.frame(word = custom_words, stringsAsFactors = F) %>%
  cbind("lexicon" = "SMART") %>%
  rbind.data.frame(stop_words)

token.df <- text_df %>%
  unnest_tokens(word, text) %>%
  mutate(word2 = lemmatize_strings(word, dictionary = lexicon::hash_lemmas)) %>%
  select(-word) %>%
  rename(word = word2) %>%
  anti_join(Stopword_custom, by = "word")
head(token.df)
line word
1 1 roger
2 1 federer
3 1 undoubtedly
4 1 tennis
5 1 player
6 1 time
We can see that the text has been broken into individual chunks. These chunks are known as tokens. There is a column named line holding the row number, which can be used for grouping frequency-related metrics at the row level. We have also applied lemmatization and stop word removal to standardise the text data.
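For intuition, the tokenization and stop word removal steps can be roughly sketched in Python (a toy illustration: naive whitespace splitting, a small made-up stop word list that includes our custom word "legacy", and no lemmatization, so the result differs slightly from the tidytext/textstem pipeline above):

```python
# Toy stop word list (illustrative, not the SMART lexicon), plus the custom word.
STOP_WORDS = {"is", "the", "of", "all", "his", "legacy"}

def tokenize(line):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = [w.strip(".,").lower() for w in line.split()]
    return [w for w in tokens if w and w not in STOP_WORDS]

tokens = tokenize("Roger Federer is undoubtedly the Greatest tennis player of all times")
# -> ['roger', 'federer', 'undoubtedly', 'greatest', 'tennis', 'player', 'times']
```

A real pipeline would also lemmatize ("times" to "time") and use a full stop word lexicon, which is what the R code above does.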
Step 2: Calculating TF-IDF for unigrams
tf_idf.unigram <- token.df %>%
  group_by(line, word) %>%
  summarise(Total_Count = n()) %>%
  bind_tf_idf(word, line, Total_Count)
head(tf_idf.unigram,8)
# A tibble: 8 x 6
# Groups: line [2]
line word Total_Count tf idf tf_idf
<int> <chr> <int> <dbl> <dbl> <dbl>
1 1 federer 1 0.167 2.20 0.366
2 1 player 1 0.167 2.20 0.366
3 1 roger 1 0.167 2.20 0.366
4 1 tennis 1 0.167 1.50 0.251
5 1 time 1 0.167 2.20 0.366
6 1 undoubtedly 1 0.167 2.20 0.366
7 2 championship 1 0.25 2.20 0.549
8 2 grand 1 0.25 2.20 0.549
Let's take line 1 and go through some of the values of tf and idf for the word 'federer':
- TF: There are 6 words in line 1 and each appears only once, so the term frequency of each word is 1/6, which is around 0.167.
- IDF: There are a total of 9 documents (rows of text data) and 'federer' appears only in the first one, so ln(9/1) is around 2.197.
- TF-IDF: 0.167 x 2.197 gives around 0.366.
Similar inferences can be made for the other words as well.
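The arithmetic for 'federer' can be verified in a few lines of Python (variable names are illustrative):

```python
import math

n_docs = 9           # 9 rows of text data
words_in_line1 = 6   # tokens left in line 1 after stop word removal
docs_with_federer = 1

tf = 1 / words_in_line1                      # 1/6, roughly 0.167
idf = math.log(n_docs / docs_with_federer)   # ln(9/1), roughly 2.197
tf_idf = tf * idf                            # roughly 0.366
```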
Step 2.b: Calculating TF-IDF for unigrams, bigrams and trigrams
Here we will calculate TF-IDF scores individually for unigrams, bigrams and trigrams, and then combine all three results together.
unigram.df <- token.df %>%
  unnest_tokens(features, word, token = "ngrams", n = 1) %>%
  group_by(line, features) %>%
  summarise(Total_Count = n()) %>%
  bind_tf_idf(features, line, Total_Count)

bigram.df <- token.df %>%
  unnest_tokens(features, word, token = "ngrams", n = 2) %>%
  group_by(line, features) %>%
  summarise(Total_Count = n()) %>%
  bind_tf_idf(features, line, Total_Count)

trigram.df <- token.df %>%
  unnest_tokens(features, word, token = "ngrams", n = 3) %>%
  group_by(line, features) %>%
  summarise(Total_Count = n()) %>%
  bind_tf_idf(features, line, Total_Count)

ngram.df <- rbind.data.frame(unigram.df, bigram.df, trigram.df) %>%
  arrange(desc(tf_idf))
head(ngram.df,20)
# A tibble: 20 x 6
# Groups: line [7]
line features Total_Count tf idf tf_idf
<int> <chr> <int> <dbl> <dbl> <dbl>
1 5 care body 1 1 2.20 2.20
2 8 leave achieve elusive 1 1 2.20 2.20
3 3 win 1 1 1.50 1.50
4 5 body 1 0.5 2.20 1.10
5 5 care 1 0.5 2.20 1.10
6 8 achieve elusive 1 0.5 2.20 1.10
7 8 leave achieve 1 0.5 2.20 1.10
8 2 grand slam championship 1 0.5 2.20 1.10
9 2 numb grand slam 1 0.5 2.20 1.10
10 4 defintely remember longevity 1 0.5 2.20 1.10
11 4 remember longevity career 1 0.5 2.20 1.10
12 6 2017 win autralian 1 0.5 2.20 1.10
13 6 return 2017 win 1 0.5 2.20 1.10
14 9 gold tennis single 1 0.5 2.20 1.10
15 9 olympic gold tennis 1 0.5 2.20 1.10
16 8 achieve 1 0.333 2.20 0.732
17 8 elusive 1 0.333 2.20 0.732
18 8 leave 1 0.333 2.20 0.732
19 2 grand slam 1 0.333 2.20 0.732
20 2 numb grand 1 0.333 2.20 0.732
As you can see, the custom stop word 'legacy' does not appear anywhere in the features column.
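For intuition, the n-gram construction itself can be sketched in Python (the `ngrams` helper is illustrative and mirrors what unnest_tokens with token = "ngrams" produces; the tokens for line 8 are taken from the output above):

```python
def ngrams(tokens, n):
    """Consecutive n-grams joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["leave", "achieve", "elusive"]  # line 8 after lemmatization/stop word removal
bigrams = ngrams(tokens, 2)   # ['leave achieve', 'achieve elusive']
trigrams = ngrams(tokens, 3)  # ['leave achieve elusive']
```

These match the line 8 features seen in the ngram.df output.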
Step 3: Creating the DTM
Let's use a tf-idf value of more than 2 for feature selection.
features <- ngram.df %>%
  ungroup() %>%
  filter(tf_idf > 2) %>%
  filter(!is.na(features)) %>%
  select(features)
features
# A tibble: 2 x 1
features
<chr>
1 care body
2 leave achieve elusive
Once the features have been shortlisted, we can go ahead and create the document term matrix, where each row represents a text record and each column represents one of the identified features. Essentially, we are converting unstructured text data into a structured format.
feature.df <- ngram.df %>%
  select(line, features, tf_idf) %>%
  inner_join(features, by = "features")
head(feature.df)
# A tibble: 2 x 3
# Groups: line [2]
line features tf_idf
<int> <chr> <dbl>
1 5 care body 2.20
2 8 leave achieve elusive 2.20
You can see that in the process of mapping the text data to the features, we have lost rows 1, 2, 3, 4, 6, 7 and 9. In order to avoid dropping records, let's add a "dummy" feature for every line to the above data frame.
feature.df <- ngram.df %>%
  select(line, features, tf_idf) %>%
  rbind.data.frame(data.frame(line = 1:nrow(text_df), features = "dummy", tf_idf = 1)) %>%
  inner_join(rbind.data.frame(features, "dummy"), by = "features")
feature.df
# A tibble: 11 x 3
# Groups: line [9]
line features tf_idf
<int> <chr> <dbl>
1 5 care body 2.20
2 8 leave achieve elusive 2.20
3 1 dummy 1
4 2 dummy 1
5 3 dummy 1
6 4 dummy 1
7 5 dummy 1
8 6 dummy 1
9 7 dummy 1
10 8 dummy 1
11 9 dummy 1
Now we can use the spread function to convert the above data frame from long to wide format (or from unpivoted to pivoted).
DTM <- feature.df %>%
  spread(features, "tf_idf", fill = 0) %>%
  select(-dummy)
DTM
# A tibble: 9 x 3
# Groups: line [9]
line `care body` `leave achieve elusive`
<int> <dbl> <dbl>
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 5 2.20 0
6 6 0 0
7 7 0 0
8 8 0 2.20
9 9 0 0
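The long-to-wide reshaping that spread performs can be sketched in Python with a plain dictionary (a toy stand-in, zero-filling missing cells directly instead of adding dummy rows; the triples are the non-dummy rows of feature.df above):

```python
# (line, feature, tf_idf) triples, as in feature.df above
long_df = [(5, "care body", 2.20), (8, "leave achieve elusive", 2.20)]
all_lines = range(1, 10)  # 9 rows of text
feats = sorted({f for _, f, _ in long_df})

scores = {(l, f): s for l, f, s in long_df}
# One row per line, one column per feature, 0 where the feature is absent.
dtm = {l: {f: scores.get((l, f), 0.0) for f in feats} for l in all_lines}
# dtm[5]['care body'] is 2.2; every other cell in rows 1-4, 6-7 and 9 is 0.0
```

The `fill = 0` argument to spread plays the same role as the `0.0` default here.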
Final Comments
We saw how text data can be easily analysed using the TF-IDF metric by running through a simple example and understanding the calculation behind the metric. The next blog will focus on a use case of using a topic model to divide text data into meaningful topics.
Link to Previous R Blogs
List of Datasets for Practice
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html