Tuesday, May 14, 2019

Introduction to NLP Part 4: Term Frequency Inverse Document Frequency (TF-IDF)


In this post we will look at a metric other than Term Frequency for generating features: Term Frequency Inverse Document Frequency (TF-IDF). It is mathematically represented as:

TF-IDF = TF * IDF, where 

TF = Number of times a given token appears in a document / Total number of words in the document

IDF = log(Total number of documents / Number of documents in which the token is present), where the log is taken to base 10 in this post
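
The formulas above can be sketched in a few lines of Python. This is only a minimal illustration and not necessarily the function from the notebook: the function names, the representation of a document as a list of tokens, and the choice to return 0 for a token that appears in no document are all assumptions made here.

```python
import math

def term_frequency(token, document):
    """Fraction of the document's tokens that equal `token` (document must be non-empty)."""
    return document.count(token) / len(document)

def inverse_document_frequency(token, documents):
    """log10(total documents / documents that contain `token`)."""
    containing = sum(1 for doc in documents if token in doc)
    if containing == 0:
        # The formula is undefined for a token that appears nowhere;
        # returning 0.0 is an assumption of this sketch.
        return 0.0
    return math.log10(len(documents) / containing)

def tf_idf(token, document, documents):
    """TF-IDF = TF * IDF for one token in one document."""
    return term_frequency(token, document) * inverse_document_frequency(token, documents)

# Toy data, made up for illustration only
docs = [["red", "apple"], ["green", "apple"]]
print(tf_idf("apple", docs[0], docs))  # token is in every document, so the score is 0.0
print(tf_idf("red", docs[0], docs))    # 0.5 * log10(2/1)
```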

For instance, suppose my unstructured data set looks like this:

  1. Virat Kohli is captain of Indian team
  2. Virat is a great batsman
  3. Virat will break all batting records

Let's say we want to calculate the TF-IDF value of 'Virat' for Sentence 1.
TF   =  Number of times a given token appears in a document/Total words in the document
       = 1/7
       = 0.1429

IDF = log(Total Number of Documents/Total documents in which the token is present)
       = log(3/3)
       = 0

So TF-IDF= 0.1429 * 0
                 = 0

As can be seen, if a token is present in every document, its TF-IDF value is zero, and more generally the value shrinks toward zero as the token appears in more documents. The TF-IDF score is higher for tokens that appear in fewer documents. Let's calculate the TF-IDF value for the token 'great' from Sentence 2.

TF   =  Number of times a given token appears in a document/Total words in the document
       = 1/5
       = 0.20

IDF = log(Total Number of Documents/Total documents in which the token is present)
       = log(3/1), using base 10
       = 0.477

So TF-IDF= 0.20 * 0.477
                 = 0.095
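
As a quick check, the two hand calculations can be reproduced in Python. This is a sketch under a few assumptions: tokens are obtained by splitting on whitespace, matching is case-sensitive, and the helper name `tf_idf` is made up for this example.

```python
import math

sentences = [
    "Virat Kohli is captain of Indian team",
    "Virat is a great batsman",
    "Virat will break all batting records",
]
# Whitespace tokenization is an assumption of this sketch
documents = [s.split() for s in sentences]

def tf_idf(token, document, documents):
    """TF * IDF with a base-10 log; assumes the token occurs in at least one document."""
    tf = document.count(token) / len(document)
    doc_freq = sum(1 for doc in documents if token in doc)
    idf = math.log10(len(documents) / doc_freq)
    return tf * idf

virat_score = tf_idf("Virat", documents[0], documents)
great_score = tf_idf("great", documents[1], documents)

print(round(virat_score, 3))  # 0.0
print(round(great_score, 3))  # 0.095
```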

Since 'great' is not present in Sentences 1 and 3, its TF-IDF score for those sentences is zero. So if we use 'great' as a feature, the resulting data frame will look like this:

Sentence    Great
1           0
2           0.095
3           0
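
The same idea extends to every token, giving one feature column per word. The sketch below builds such a table using only the standard library. Lowercasing the text, splitting on whitespace, and the names `vocabulary` and `rows` are assumptions made for this illustration.

```python
import math

sentences = [
    "Virat Kohli is captain of Indian team",
    "Virat is a great batsman",
    "Virat will break all batting records",
]
# Lowercasing and whitespace tokenization are assumptions of this sketch
documents = [s.lower().split() for s in sentences]
vocabulary = sorted({token for doc in documents for token in doc})

def tf_idf(token, document, documents):
    """TF * IDF with a base-10 log; every vocabulary token occurs in at least one document."""
    tf = document.count(token) / len(document)
    doc_freq = sum(1 for doc in documents if token in doc)
    return tf * math.log10(len(documents) / doc_freq)

# One row per sentence, one column (dictionary key) per token
rows = [
    {token: round(tf_idf(token, doc, documents), 3) for token in vocabulary}
    for doc in documents
]

great_column = [row["great"] for row in rows]
print(great_column)  # [0.0, 0.095, 0.0]
```

If pandas is installed, passing `rows` to `pandas.DataFrame` should give a data frame with one row per sentence and one column per token.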

In the above example I have converted an unstructured data set into a structured one, which can then be used for further analysis.

In this blog post we will look at the following in detail:
  1. TF-IDF Basics
  2. Create a function that can calculate the TF-IDF values 
The following libraries will be used:
  1. Pandas
  2. nltk
  3. string
Download Link: https://drive.google.com/drive/folders/1q-mvC336C2pp6mhcNedncdEiRKMG1K_f?usp=sharing

Download the ipynb file and the HTML version to follow the flow.

Links to my previous blogs on NLP:

Blog 1: https://mlmadeeasy.blogspot.com/2019/04/introduction-to-nlp-part-1-tokenization.html

Blog 2: https://mlmadeeasy.blogspot.com/2019/04/introduction-to-nlp-part-2-regular.html


Blog 3: https://mlmadeeasy.blogspot.com/2019/05/introduction-to-nlp-part-3-bag-of-words.htm


