Sunday, May 26, 2019

Introduction to NLP Blog 5: Creating Document Term Matrix (DTM)

In this post we look at creating a Document Term Matrix (DTM) from features generated using Term Frequency (TF) and/or Term Frequency-Inverse Document Frequency (TF-IDF). The result is a data frame, called the Document Term Matrix, that can be used for any subsequent analysis. It has the following properties:

  1. First column will represent Document ID (Doc_ID)
  2. Other columns will represent the Features selected as a result of profiling activity
  3. Values in the cells will represent either
    1. Term Frequency (TF) 
    2. Term Frequency-Inverse Document Frequency(TF-IDF) 
In the blog, I cover how to create functions that take the following inputs:
  1. Name of the data frame
  2. Column containing the text data
  3. The ngram parameter, given as 1, 2, 3, ...
    1. A list can also be passed to ngram. If I pass [1,2], the function creates a summary of unigrams and bigrams
The functions generate a Document ID (row) level summary of tokens as either frequencies or TF-IDF values. The following functions have been created:
  1. Tokenize the text
  2. Create frequency profile at Document_ID level
  3. Create TF-IDF profile at Document_ID level
  4. Clean the text for any punctuation
I have also tested the functions to see how they fare when the number of records becomes large (~100k). The functions performed reasonably well.
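
The flow described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the notebook's actual code: the names `clean_text`, `tokenize` and `build_dtm` are hypothetical, the input is a plain list of strings rather than a data frame column, and the cells hold raw term frequencies only.

```python
import string
from collections import Counter


def clean_text(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def tokenize(text, n=1):
    """Split cleaned text on whitespace and return its n-grams as strings."""
    words = clean_text(text).split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_dtm(docs, ngram=(1,)):
    """Return one row per document: Doc_ID plus a count for every feature."""
    counts = [
        Counter(tok for n in ngram for tok in tokenize(doc, n))
        for doc in docs
    ]
    # Every token seen in any document becomes a column.
    features = sorted(set().union(*counts))
    return [
        {"Doc_ID": doc_id, **{f: c[f] for f in features}}
        for doc_id, c in enumerate(counts, start=1)
    ]


# Hypothetical sample documents; ngram=[1, 2] asks for unigrams and bigrams.
docs = [
    "Virat Kohli is captain of Indian team",
    "Virat is a great batsman",
    "Virat will break all batting records",
]
dtm = build_dtm(docs, ngram=[1, 2])
```

Each element of `dtm` is a dictionary keyed by `Doc_ID` and the feature names, so the list can be handed to `pandas.DataFrame(dtm)` to get the data frame layout described above.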



Download the ipynb file, the html version and the csv file to understand the flow.






Tuesday, May 14, 2019

Introduction to NLP Part 4: Term Frequency Inverse Document Frequency (TF-IDF)


In this post we look at a metric other than Term Frequency for generating features: Term Frequency-Inverse Document Frequency (TF-IDF). It is mathematically represented as:

TF-IDF = TF * IDF, where 

TF = Number of times a given token appears in a document/Total words in the document

IDF = log(Total Number of Documents/Total documents in which the token is present), using log base 10

For instance, if my unstructured data set looks like:

  1. Virat Kohli is captain of Indian team
  2. Virat is a great batsman
  3. Virat will break all batting records

Let's say we want to calculate the TF-IDF value of 'Virat' for Sentence 1.
TF   =  Number of times a given token appears in a document/Total words in the document
       = 1/7
       = 0.1429

IDF = log(Total Number of Documents/Total documents in which the token is present)
       = log(3/3)
       = 0

So TF-IDF= 0.1429 * 0
                 = 0

As can be seen, if a token is present in every document, its TF-IDF value is zero, and the value stays close to zero for tokens that appear in most documents. The TF-IDF score is higher for tokens that appear in fewer documents. Let's calculate the TF-IDF value for the token 'great' in Sentence 2.

TF   =  Number of times a given token appears in a document/Total words in the document
       = 1/5
       = 0.20

IDF = log(Total Number of Documents/Total documents in which the token is present)
       = log(3/1) (base 10)
       = 0.477

So TF-IDF= 0.20 * 0.477
                 = 0.095

Since 'great' is not present in Sentences 1 and 3, its TF-IDF score for those sentences is zero. So if we make 'great' a feature, the resulting data frame will look like:

Great
0
0.095
0

In the above example I have converted an unstructured data set into a structured data set. This can be used for any further analysis.
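
The arithmetic above can be checked with a few lines of Python. This is only a sketch under stated assumptions: the helper names `tf`, `idf` and `tf_idf` are made up for illustration, tokens are lowercased and split on whitespace, and the logarithm is base 10 to match the worked example.

```python
import math

docs = [
    "Virat Kohli is captain of Indian team",
    "Virat is a great batsman",
    "Virat will break all batting records",
]
# Lowercase and split on whitespace; no punctuation handling in this sketch.
tokenized = [doc.lower().split() for doc in docs]


def tf(token, words):
    """Occurrences of the token divided by total words in the document."""
    return words.count(token) / len(words)


def idf(token, all_docs):
    """log10(total documents / documents containing the token)."""
    containing = sum(1 for words in all_docs if token in words)
    # A token that appears nowhere gets 0.0 here to avoid dividing by zero.
    return math.log10(len(all_docs) / containing) if containing else 0.0


def tf_idf(token, words, all_docs):
    return tf(token, words) * idf(token, all_docs)


# TF-IDF of 'virat' in Sentence 1 and the 'great' column for all sentences.
virat_score = tf_idf("virat", tokenized[0], tokenized)
great_column = [round(tf_idf("great", words, tokenized), 3) for words in tokenized]
```

Here `virat_score` comes out as 0.0 and `great_column` as `[0.0, 0.095, 0.0]`, matching the values worked out by hand.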

In the blog we will look at the following in detail:
  1. TF-IDF Basics
  2. Create a function that can calculate the TF-IDF values 
The following libraries will be used:
  1. Pandas
  2. nltk
  3. string
Download Link: https://drive.google.com/drive/folders/1q-mvC336C2pp6mhcNedncdEiRKMG1K_f?usp=sharing


Download the ipynb file and html version to understand the flow

Links to my previous blogs on NLP:

Blog 1: https://mlmadeeasy.blogspot.com/2019/04/introduction-to-nlp-part-1-tokenization.html

Blog 2: https://mlmadeeasy.blogspot.com/2019/04/introduction-to-nlp-part-2-regular.html


Blog 3: https://mlmadeeasy.blogspot.com/2019/05/introduction-to-nlp-part-3-bag-of-words.htm



Sunday, May 5, 2019

Introduction to NLP Part 3: Bag of Words using Term Frequency (TF)


In this post we look at the bag of words concept in Python. Bag of words is used to convert unstructured data into structured data by creating features (similar to columns in a structured data frame). It uses frequency as the metric to generate the data frame. For instance, if my unstructured data set looks like:

  1. Virat Kohli is captain of Indian team
  2. Virat is a great batsman
  3. Virat will break all batting records

Let's say we take 'Virat' as a feature. Virat appears once in each of the three sentences. So my structured data frame will look like

Virat
1
1
1

In the above example I have converted an unstructured data set into a structured data set. This can be used for any further analysis.
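
As a rough sketch of how such a frequency column could be produced, the snippet below counts tokens per sentence with the standard library's `collections.Counter`. Lowercasing and whitespace splitting are simplifying assumptions, and the variable names are illustrative only.

```python
from collections import Counter

docs = [
    "Virat Kohli is captain of Indian team",
    "Virat is a great batsman",
    "Virat will break all batting records",
]

# One Counter per sentence: token -> number of occurrences in that sentence.
counts = [Counter(doc.lower().split()) for doc in docs]

# A feature column is the count of that token in each sentence.
# Counter returns 0 for tokens that are absent.
virat_column = [c["virat"] for c in counts]
great_column = [c["great"] for c in counts]
```

With these sentences `virat_column` is `[1, 1, 1]`, as in the table above, while `great_column` is `[0, 1, 0]`.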

In the blog we will look at the following in detail:
  1. Ngrams
    • Unigrams
    • Bigrams
    • Trigrams
  2. Create data frame for ngrams
The following libraries will be used:
  1. Pandas
  2. nltk
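
To make the ngram idea concrete, here is a small standard-library sketch. The function name `ngrams` and the sample sentence are chosen for illustration; nltk provides a comparable helper, `nltk.util.ngrams`, which the notebook may use instead.

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


words = "Virat is a great batsman".split()

unigrams = ngrams(words, 1)   # single words
bigrams = ngrams(words, 2)    # pairs of adjacent words
trigrams = ngrams(words, 3)   # triples of adjacent words
```

A five-word sentence yields five unigrams, four bigrams and three trigrams. For example, the first bigram is `('Virat', 'is')` and the last trigram is `('a', 'great', 'batsman')`.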

Download the ipynb file and html version to understand the flow
