Sunday, June 28, 2020
Book Review: The Race of My Life: An Autobiography
My rating: 4 of 5 stars
No matter what hand life deals you, you should always strive to stay positive and face life head-on. This is the gist of Milkha Singh's life. From a childhood torn apart by Partition, through life in the army, to becoming Asia's number one athlete, Milkha Singh gives us the strength and positivity to combat the mundane existence of ordinary life. Hope and despair are deeply entwined in his early years, when he tries to pick himself up from the ruins of Partition. Having drifted through most of his childhood and youth, he found a sense of direction only when he joined the army. There he was recognised and lauded for his sporting ability and the laurels it brought to his regiment.
Fast forward to 1956, and he is part of the Indian Olympic contingent to Australia. Having never been exposed to the brilliance of world-class athletes, he failed to make a mark. It was after the 1956 Olympics that he set himself a clear goal of winning the 400 metres race. En route, he won a lot of races, including those at the National Games and the Commonwealth Games, where he set an altogether new record. Going into the 1960 Olympics, most anticipated that Milkha Singh would win the 400 metres event and better the all-time record. The fairytale eventually ended with him finishing 4th. After 1960, his life as a runner took a back seat and faded into oblivion. After his voluntary retirement from the army, he joined the Punjab government as a sports administrator with a vision to identify and nurture bright sportsmen from a very young age. During this period, he married Nimmi and settled down in Chandigarh. Even after his retirement, he continued to push for an overhaul of the sporting infrastructure and a result-oriented model for coaching staff.
From the various stages of his life, one can learn to never give up and to pick oneself up after every setback. It is through the waxing and waning of life that one has to wade and make a mark.
Saturday, June 20, 2020
Blog 28: Analysing text using TF-IDF
Term Frequency Inverse Document Frequency
Parag Verma
Introduction
In this blog we will look at how to use the TF-IDF metric to analyse text. We will take the text example from the previous blog to understand the key differences between the Term Frequency and the Term Frequency Inverse Document Frequency approaches. Broadly, we will look at the following topics:
- What is TF-IDF
- Calculating TF-IDF on text data
- Creating the Document Term Matrix(DTM)
Installing the libraries: tidytext and textstem along with the dplyr, tidyr and stringr packages
package.name<-c("tidytext","textstem","dplyr","tidyr","stringr")
for(i in package.name){
if(!require(i,character.only = T)){
install.packages(i)
}
library(i,character.only = T)
}
What is Term Frequency Inverse Document Frequency (TF-IDF)
It is a metric calculated by multiplying the term frequency (TF) of a word by its inverse document frequency (IDF). IDF decreases the weight of commonly used terms and increases the importance of words that appear in few documents. TF * IDF (TF-IDF) is simply the frequency of a term adjusted for how rarely it is used across documents.
We already know the formula for TF. Let's look at how to calculate IDF.
IDF is equal to \(\mathrm{ln}(n/n_{word})\) where
n is the total number of documents (records) in the data set
\(n_{word}\) is the total number of documents in which the word appears
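As a quick numeric check of the formula, here is a minimal sketch in base R. The helper name idf and the example counts are illustrative assumptions and not part of tidytext; the only fact relied on is that log() in R is the natural logarithm.

```r
# Hypothetical helper (not a tidytext function): IDF as ln(n / n_word)
idf <- function(n, n_word) {
  log(n / n_word)  # log() in R is the natural logarithm
}

n_docs <- 9                  # illustrative number of documents
doc_counts <- c(1, 2, 3, 9)  # number of documents that contain the word
idf_values <- idf(n_docs, doc_counts)

# Rarer words (small n_word) receive a larger weight
print(round(idf_values, 3))
```

A word that appears in every document gets an IDF of ln(1) = 0, so its TF-IDF is zero regardless of how often it occurs.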
Create a sample text data set
Let's create sample text data to understand the key concepts better.
string_txt <- c("Roger Federer is undoubtedly the Greatest tennis player of all times",
"His legacy is not in the number of grand slam championships",
" he has won.",
"He will defintely be remembered for the longevity of his career",
" and how he was able to take care of his body over the years",
"His return in 2017 and winning the Autralian open against his",
" arch rival Nadal is considered to be a modern day spectacle",
"The only thing left to achieve is the elusive",
" Olympic gold in Tennis singles")
# In order to analyze this we need to convert it into a data frame
text_df<-data.frame(line=1:length(string_txt),text=string_txt,stringsAsFactors = F)
text_df
line text
1 1 Roger Federer is undoubtedly the Greatest tennis player of all times
2 2 His legacy is not in the number of grand slam championships
3 3 he has won.
4 4 He will defintely be remembered for the longevity of his career
5 5 and how he was able to take care of his body over the years
6 6 His return in 2017 and winning the Autralian open against his
7 7 arch rival Nadal is considered to be a modern day spectacle
8 8 The only thing left to achieve is the elusive
9 9 Olympic gold in Tennis singles
Step 1: Tokenization, Lemmatization and Removing Stop Words
# Add a custom stop word to the stop_words lexicon shipped with tidytext
custom_words<-c("legacy")
Stopword_custom<-data.frame(word=custom_words,stringsAsFactors = F)%>%
  cbind("lexicon"="SMART")%>%
  rbind.data.frame(stop_words)
# Tokenize into words, lemmatize each token, then drop the stop words
token.df<-text_df %>%
  unnest_tokens(word, text)%>%
  mutate(word2=lemmatize_strings(word, dictionary = lexicon::hash_lemmas))%>%
  select(-word)%>%
  rename(word=word2)%>%
  anti_join(Stopword_custom,by="word")
head(token.df)
line word
1 1 roger
2 1 federer
3 1 undoubtedly
4 1 tennis
5 1 player
6 1 time
We can see that the text has been broken into individual chunks. These chunks are known as tokens. The column named line holds the row number of the original text and can be used to group frequency-related metrics at row level. We have also used lemmatization and stop word removal to standardise the text data.
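To make the three operations concrete, here is a rough base R sketch applied to line 1 only. It is a simplified stand-in, not what unnest_tokens or lemmatize_strings do internally: the split pattern, the two-entry lemma_map and the five-word mini_stop list are all hypothetical and were chosen only to cover this one sentence.

```r
line1 <- "Roger Federer is undoubtedly the Greatest tennis player of all times"

# Tokenization (simplified): lower-case, then split on anything
# that is not a letter or a digit
tokens <- strsplit(tolower(line1), "[^a-z0-9]+")[[1]]
tokens <- tokens[tokens != ""]

# Lemmatization (hypothetical two-entry dictionary for this sentence)
lemma_map <- c(greatest = "great", times = "time")
hit <- tokens %in% names(lemma_map)
tokens[hit] <- unname(lemma_map[tokens[hit]])

# Stop word removal (hypothetical mini list, not the full stop_words table)
mini_stop <- c("is", "the", "great", "of", "all")
tokens <- tokens[!tokens %in% mini_stop]

print(tokens)
```

The six tokens that remain are the same six shown for line 1 in the output above.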
Step 2: Calculating TF-IDF for unigrams
tf_idf.unigram<-token.df%>%
group_by(line,word)%>%
summarise(Total_Count=n())%>%
bind_tf_idf(word, line, Total_Count)
head(tf_idf.unigram,8)
# A tibble: 8 x 6
# Groups: line [2]
line word Total_Count tf idf tf_idf
<int> <chr> <int> <dbl> <dbl> <dbl>
1 1 federer 1 0.167 2.20 0.366
2 1 player 1 0.167 2.20 0.366
3 1 roger 1 0.167 2.20 0.366
4 1 tennis 1 0.167 1.50 0.251
5 1 time 1 0.167 2.20 0.366
6 1 undoubtedly 1 0.167 2.20 0.366
7 2 championship 1 0.25 2.20 0.549
8 2 grand 1 0.25 2.20 0.549
Let's take line 1 and go through the values of tf and idf for the word ‘federer’:
- TF: There are 6 words left in line 1 after stop word removal and each appears only once. Hence the term frequency for each is 1/6, which is around 0.167.
- IDF: There are a total of 9 documents (rows of text data) and federer appears only in the first one, so ln(9/1) is around 2.20.
- TF-IDF: 0.167 x 2.20 gives around 0.366.
Similar inferences can be made for the other words as well.
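The same arithmetic can be reproduced by hand in base R. This is a sketch that takes its inputs from the output above rather than from token.df: the six tokens of line 1, the 9 documents, and document counts of 1 for federer and 2 for tennis (the count for tennis is inferred from its idf of 1.50, which equals ln(9/2)).

```r
line1_tokens <- c("roger", "federer", "undoubtedly", "tennis", "player", "time")
n_docs <- 9

# Number of documents containing each word (read off the output above)
docs_with_word <- c(federer = 1, tennis = 2)

# Term frequency: occurrences in the line divided by tokens in the line
tf <- sapply(names(docs_with_word), function(w) {
  sum(line1_tokens == w) / length(line1_tokens)
})

# Inverse document frequency and the final score
idf <- log(n_docs / docs_with_word)
tf_idf <- tf * idf

print(round(cbind(tf, idf, tf_idf), 3))
```

Because tennis appears in two documents instead of one, its idf and therefore its tf_idf are lower than those of federer, even though both words have the same term frequency in line 1.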
Step 2.b: Calculating TF-IDF for unigrams, bigrams and trigrams
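Here we will calculate TF-IDF scores separately for unigrams, bigrams and trigrams and then combine the three results. Note that the bigram and trigram steps rely on unnest_tokens collapsing the words of each line back into one string before forming n-grams; depending on the tidytext version, the collapse argument may need to be set explicitly for this to happen.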
unigram.df<-token.df%>%
unnest_tokens(features, word, token = "ngrams", n = 1)%>%
group_by(line,features)%>%
summarise(Total_Count=n())%>%
bind_tf_idf(features, line, Total_Count)
bigram.df<-token.df%>%
unnest_tokens(features, word, token = "ngrams", n = 2)%>%
group_by(line,features)%>%
summarise(Total_Count=n())%>%
bind_tf_idf(features, line, Total_Count)
trigram.df<-token.df%>%
unnest_tokens(features, word, token = "ngrams", n = 3)%>%
group_by(line,features)%>%
summarise(Total_Count=n())%>%
bind_tf_idf(features, line, Total_Count)
ngram.df<-rbind.data.frame(unigram.df,bigram.df,trigram.df)%>%
arrange(desc(tf_idf))
head(ngram.df,20)
# A tibble: 20 x 6
# Groups: line [7]
line features Total_Count tf idf tf_idf
<int> <chr> <int> <dbl> <dbl> <dbl>
1 5 care body 1 1 2.20 2.20
2 8 leave achieve elusive 1 1 2.20 2.20
3 3 win 1 1 1.50 1.50
4 5 body 1 0.5 2.20 1.10
5 5 care 1 0.5 2.20 1.10
6 8 achieve elusive 1 0.5 2.20 1.10
7 8 leave achieve 1 0.5 2.20 1.10
8 2 grand slam championship 1 0.5 2.20 1.10
9 2 numb grand slam 1 0.5 2.20 1.10
10 4 defintely remember longevity 1 0.5 2.20 1.10
11 4 remember longevity career 1 0.5 2.20 1.10
12 6 2017 win autralian 1 0.5 2.20 1.10
13 6 return 2017 win 1 0.5 2.20 1.10
14 9 gold tennis single 1 0.5 2.20 1.10
15 9 olympic gold tennis 1 0.5 2.20 1.10
16 8 achieve 1 0.333 2.20 0.732
17 8 elusive 1 0.333 2.20 0.732
18 8 leave 1 0.333 2.20 0.732
19 2 grand slam 1 0.333 2.20 0.732
20 2 numb grand 1 0.333 2.20 0.732
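As you can see, the custom stop word legacy has been removed: it does not appear in any of the features for line 2. Note also that the lemmatizer has turned "number" into "numb", so it is worth checking the lemmatized output before relying on it.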
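To see where the n-gram scores come from, here is a small base R sketch for line 8. The helper make_ngrams is hypothetical and only illustrates the sliding-window idea; it is not how tidytext builds n-grams internally. The three tokens and the document count of 9 are taken from the output above.

```r
# Hypothetical helper: slide a window of size n over a token vector
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) {
    paste(tokens[i:(i + n - 1)], collapse = " ")
  }, character(1))
}

line8_tokens <- c("leave", "achieve", "elusive")
bigrams  <- make_ngrams(line8_tokens, 2)
trigrams <- make_ngrams(line8_tokens, 3)

# Each n-gram occurs once in line 8 and in no other line,
# so tf = 1 / (number of n-grams in the line) and idf = ln(9 / 1)
idf_unique <- log(9 / 1)
tf_idf_bigram  <- (1 / length(bigrams))  * idf_unique
tf_idf_trigram <- (1 / length(trigrams)) * idf_unique

print(bigrams)
print(trigrams)
print(round(c(bigram = tf_idf_bigram, trigram = tf_idf_trigram), 2))
```

This also explains why short lines dominate the ranking: a line with a single trigram gives that trigram a term frequency of 1, so its score equals the full idf.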
Step 3: Creating the DTM
Let's use a tf-idf value of more than 2 for feature selection.
features<-ngram.df%>%
ungroup()%>%
filter(tf_idf > 2)%>%
filter(!is.na(features))%>%
select(features)
features
# A tibble: 2 x 1
features
<chr>
1 care body
2 leave achieve elusive
Once the features have been shortlisted, we can go ahead and create the document term matrix, where each row represents a text record and each column represents one of the identified features. Essentially, we are converting unstructured text data into a structured format.
feature.df<-ngram.df%>%
select(line,features,tf_idf)%>%
inner_join(features,by="features")
head(feature.df)
# A tibble: 2 x 3
# Groups: line [2]
line features tf_idf
<int> <chr> <dbl>
1 5 care body 2.20
2 8 leave achieve elusive 2.20
You can see that in the process of mapping the text data to the features, we have lost lines 1, 2, 3, 4, 6, 7 and 9. To avoid dropping records, let's add a “dummy” feature for every line to the above data frame.
feature.df<-ngram.df%>%
select(line,features,tf_idf)%>%
rbind.data.frame(data.frame(line=1:nrow(text_df),features="dummy",tf_idf=1))%>%
inner_join(rbind.data.frame(features,"dummy"),by="features")
feature.df
# A tibble: 11 x 3
# Groups: line [9]
line features tf_idf
<int> <chr> <dbl>
1 5 care body 2.20
2 8 leave achieve elusive 2.20
3 1 dummy 1
4 2 dummy 1
5 3 dummy 1
6 4 dummy 1
7 5 dummy 1
8 6 dummy 1
9 7 dummy 1
10 8 dummy 1
11 9 dummy 1
Now we can use the spread function to convert the above data frame from long to wide (or from unpivoted to pivoted).
DTM<-feature.df%>%
spread(features,"tf_idf",fill=0)%>%
select(-dummy)
DTM
# A tibble: 9 x 3
# Groups: line [9]
line `care body` `leave achieve elusive`
<int> <dbl> <dbl>
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 5 2.20 0
6 6 0 0
7 7 0 0
8 8 0 2.20
9 9 0 0
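As an alternative to the dummy feature, the same matrix can be built by starting from the full set of line numbers and filling in only the known scores. The sketch below uses base R only; the two feature rows are copied from the output above (with 2.20 written out as ln(9)), and the object names are illustrative.

```r
# Selected features in long format, copied from the output above
feature_long <- data.frame(
  line     = c(5L, 8L),
  features = c("care body", "leave achieve elusive"),
  tf_idf   = c(log(9), log(9)),
  stringsAsFactors = FALSE
)

all_lines <- 1:9                        # every text record, matched or not
feature_names <- unique(feature_long$features)

# Start with an all-zero matrix: one row per line, one column per feature
dtm <- matrix(0,
              nrow = length(all_lines),
              ncol = length(feature_names),
              dimnames = list(as.character(all_lines), feature_names))

# Fill in the scores using (row, column) index pairs
idx <- cbind(match(feature_long$line, all_lines),
             match(feature_long$features, feature_names))
dtm[idx] <- feature_long$tf_idf

print(round(dtm, 2))
```

Because the rows come from all_lines and not from the joined data, no record is dropped and no placeholder column has to be removed afterwards.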
Final Comments
We saw how text data can be analysed using the TF-IDF metric by running through a simple example and understanding the calculation behind the metric. The next blog will focus on a use case around using a topic model to divide text data into meaningful topics.
List of Datasets for Practise
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html