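In this post we will look at a metric other than Term Frequency for generating features: TF-IDF (Term Frequency-Inverse Document Frequency). It is mathematically represented as:
TF-IDF = TF * IDF, where
TF = Number of times a given token appears in a document / Total words in the document
IDF = log(Total number of documents / Number of documents in which the token is present)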
For instance, suppose my unstructured data set looks like this:
- Virat Kohli is captain of Indian team
- Virat is a great batsman
- Virat will break all batting records
Let's say we want to calculate the TF-IDF value of 'Virat' for Sentence 1.
TF = Number of times a given token appears in a document / Total words in the document
= 1/7
≈ 0.1429
IDF = log10(Total number of documents / Number of documents in which the token is present)
= log10(3/3)
= 0
So TF-IDF = 0.1429 * 0
= 0
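The calculation above can be checked with a few lines of Python. This is only a minimal sketch: it assumes tokens are lowercased and split on whitespace, and the variable names are illustrative rather than taken from the notebook.

```python
import math

# The three example sentences from the post, treated as three documents.
docs = [
    "Virat Kohli is captain of Indian team",
    "Virat is a great batsman",
    "Virat will break all batting records",
]

# Assumption: a token is a lowercased, whitespace-separated word.
tokenized = [doc.lower().split() for doc in docs]

token = "virat"
sentence_1 = tokenized[0]

# TF: occurrences of the token in the document / total words in the document.
tf = sentence_1.count(token) / len(sentence_1)

# IDF: log10(total documents / documents that contain the token).
docs_with_token = sum(1 for words in tokenized if token in words)
idf = math.log10(len(tokenized) / docs_with_token)

tf_idf = tf * idf
print(round(tf, 4), idf, tf_idf)
```

Because 'Virat' occurs in all three sentences, the IDF term is log10(1) = 0, which forces the product to zero regardless of the TF value.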
As can be seen, if a token is present in every document, its TF-IDF value is zero, and if it is present in most documents the value will be close to zero. The TF-IDF score is higher for tokens that appear in fewer documents. Let's calculate the TF-IDF value of the token 'great' for Sentence 2.
TF = Number of times a given token appears in a document / Total words in the document
= 1/5
= 0.20
IDF = log10(Total number of documents / Number of documents in which the token is present)
= log10(3/1)
≈ 0.477
So TF-IDF = 0.20 * 0.477
≈ 0.095
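To see how IDF separates common tokens from rare ones across the whole example, here is a rough sketch that computes the base-10 IDF of every token in the three sentences. It again assumes lowercased, whitespace-split tokens, and the names `vocabulary` and `idf` are my own choices for illustration.

```python
import math

docs = [
    "Virat Kohli is captain of Indian team",
    "Virat is a great batsman",
    "Virat will break all batting records",
]

# Assumption: lowercase and split on whitespace; a set per document is
# enough here because IDF only asks whether a token is present.
tokenized = [set(doc.lower().split()) for doc in docs]
vocabulary = sorted(set().union(*tokenized))

idf = {}
for token in vocabulary:
    # Number of documents in which the token is present.
    doc_freq = sum(1 for words in tokenized if token in words)
    idf[token] = math.log10(len(docs) / doc_freq)

# Print tokens from most common (lowest IDF) to rarest (highest IDF).
for token, value in sorted(idf.items(), key=lambda kv: kv[1]):
    print(f"{token:10s} {value:.3f}")
```

In this small corpus 'virat' gets an IDF of 0, 'is' (present in two sentences) gets log10(3/2), and tokens that occur in a single sentence, such as 'great', get log10(3).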
Since 'great' is not present in Sentences 1 and 3, its TF-IDF score for those sentences is zero. So if we make 'great' a feature, the resulting data frame will look like this:
Sentence   Great
1          0
2          0.095
3          0
In the above example I have converted an unstructured data set into a structured data set. This can be used for any further analysis.
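The 'great' column can be reproduced with a small helper. This is a sketch under the same assumptions as the hand calculation (base-10 log, lowercased whitespace tokens); the function name `tf_idf` and its signature are hypothetical, not the function from the notebook.

```python
import math

def tf_idf(token, doc_words, all_docs):
    """TF-IDF of one token for one document (base-10 log).

    Returns 0.0 when the token appears in no document, to avoid
    dividing by zero in the IDF term.
    """
    docs_with_token = sum(1 for words in all_docs if token in words)
    if docs_with_token == 0:
        return 0.0
    tf = doc_words.count(token) / len(doc_words)
    idf = math.log10(len(all_docs) / docs_with_token)
    return tf * idf

docs = [
    "Virat Kohli is captain of Indian team",
    "Virat is a great batsman",
    "Virat will break all batting records",
]
tokenized = [doc.lower().split() for doc in docs]

# One value per sentence: this is the 'great' feature column.
great_column = [round(tf_idf("great", words, tokenized), 3) for words in tokenized]
print(great_column)  # [0.0, 0.095, 0.0]
```

A list like `great_column` can be used directly as one column of a pandas data frame, with one row per sentence.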
In this blog we will look at the following in detail:
- TF-IDF Basics
- Create a function that can calculate the TF-IDF values
The following libraries will be used:
- Pandas
- nltk
- string
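As a rough idea of what such a function could look like, here is a minimal sketch that uses only the standard library. It is not the implementation from the notebook: plain `str.split` stands in for an nltk tokenizer so the example runs without extra downloads, punctuation is stripped with `string.punctuation`, and the names `tokenize` and `tf_idf_table` are hypothetical.

```python
import math
import string

def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace.

    Assumption: this simple split stands in for an nltk tokenizer.
    """
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def tf_idf_table(documents):
    """Return one {token: tf-idf} dict per document (base-10 log)."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for words in tokenized for tok in words})
    n_docs = len(tokenized)
    # Number of documents in which each token is present.
    doc_freq = {
        tok: sum(1 for words in tokenized if tok in words) for tok in vocabulary
    }
    rows = []
    for words in tokenized:
        row = {}
        for tok in vocabulary:
            tf = words.count(tok) / len(words) if words else 0.0
            idf = math.log10(n_docs / doc_freq[tok])
            row[tok] = tf * idf
        rows.append(row)
    return rows

docs = [
    "Virat Kohli is captain of Indian team",
    "Virat is a great batsman",
    "Virat will break all batting records",
]
rows = tf_idf_table(docs)
print(round(rows[1]["great"], 3))
```

Each dict in `rows` has one key per vocabulary token, so passing the list to `pandas.DataFrame(rows)` would give a structured table with one row per sentence and one column per token.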
Download link: https://drive.google.com/drive/folders/1q-mvC336C2pp6mhcNedncdEiRKMG1K_f?usp=sharing
Download the ipynb file and the HTML version to understand the flow.
Links to my previous blogs on NLP:
Blog 1: https://mlmadeeasy.blogspot.com/2019/04/introduction-to-nlp-part-1-tokenization.html
Blog 2: https://mlmadeeasy.blogspot.com/2019/04/introduction-to-nlp-part-2-regular.html
Blog 3: https://mlmadeeasy.blogspot.com/2019/05/introduction-to-nlp-part-3-bag-of-words.htm