Sunday, May 26, 2019

Introduction to NLP Blog 5: Creating Document Term Matrix (DTM)

In this post we look at creating a Document Term Matrix (DTM) using the features generated from Term Frequency (TF) and/or Term Frequency-Inverse Document Frequency (TF-IDF). The result is a data frame that can be used for any subsequent analysis. This data frame is called the Document Term Matrix. It has the following properties:

  1. First column will represent Document ID (Doc_ID)
  2. Other columns will represent the Features selected as a result of profiling activity
  3. Values in the cells will represent either
    1. Term Frequency (TF) 
    2. Term Frequency-Inverse Document Frequency(TF-IDF) 
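
To make the layout concrete, here is a minimal sketch of a TF-based DTM built with the standard library only. The toy corpus, the feature list, and the name `build_dtm` are assumptions made for illustration and are not taken from the notebook; each row is a dict that could be handed to `pandas.DataFrame` to get the data frame described above.

```python
from collections import Counter

# Hypothetical toy corpus: Doc_ID -> text (not the CSV that accompanies the post).
docs = {1: "the cat sat", 2: "the dog sat on the mat"}

# Features assumed to have been selected during the profiling activity.
features = ["the", "sat", "dog"]

def build_dtm(docs, features):
    """Return one row per document: Doc_ID first, then one TF count per feature."""
    rows = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        row = {"Doc_ID": doc_id}
        # Counter returns 0 for features that do not occur in the document.
        row.update({feature: counts[feature] for feature in features})
        rows.append(row)
    return rows

dtm = build_dtm(docs, features)
for row in dtm:
    print(row)
```

Swapping the raw counts for TF-IDF weights changes only the cell values; the Doc_ID column and the feature columns stay the same.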
In the blog, I cover how to create functions that take the following inputs:
  1. Name of the data frame
  2. Column containing the text data
  3. The ngram parameter, given as 1, 2, 3, ...
    1. A list can also be passed to ngram. For example, if I pass [1,2], the function will create a summary of both unigrams and bigrams
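
The ngram behaviour can be sketched as follows. This is a simplified illustration, not the notebook's code: the function name `tokenize` and its signature are hypothetical, it works on a single string rather than a data frame column, and it splits on whitespace only.

```python
def tokenize(text, ngram=1):
    """Return the n-grams of `text` for one n or for a list of n values."""
    # Accept a single integer (e.g. 2) or a list of integers (e.g. [1, 2]).
    sizes = [ngram] if isinstance(ngram, int) else list(ngram)
    words = text.lower().split()
    tokens = []
    for n in sizes:
        # Slide a window of n words across the text; no n-grams if the text is shorter than n.
        tokens.extend(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return tokens

unigrams_and_bigrams = tokenize("the cat sat", ngram=[1, 2])
print(unigrams_and_bigrams)
```

Applying such a function to every row of the text column gives the per-document token lists that the frequency and TF-IDF summaries are built from.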
The functions generate a Document ID (row-wise) summary of tokens in the form of frequency or TF-IDF. The following functions have been created:
  1. Tokenize the text
  2. Create frequency profile at Document_ID level
  3. Create TF-IDF profile at Document_ID level
  4. Clean the text for any punctuation
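
A rough sketch of how the cleaning, frequency, and TF-IDF steps might fit together is shown below. The function names are hypothetical, only unigrams are handled to keep it short, and the weighting used (term count divided by document length, times the natural log of N over document frequency) is one common TF-IDF variant chosen as an assumption; the notebook may use a different formula.

```python
import math
import string
from collections import Counter

def clean_text(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def frequency_profile(docs):
    """Map each Doc_ID to a Counter of its (unigram) tokens."""
    return {doc_id: Counter(clean_text(text).split()) for doc_id, text in docs.items()}

def tfidf_profile(docs):
    """Map each Doc_ID to a dict of token -> TF-IDF weight (assumed variant)."""
    freq = frequency_profile(docs)
    n_docs = len(freq)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter()
    for counts in freq.values():
        doc_freq.update(counts.keys())
    profile = {}
    for doc_id, counts in freq.items():
        total = sum(counts.values())
        profile[doc_id] = {
            token: (count / total) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        }
    return profile

# Hypothetical two-document corpus for illustration.
sample = {1: "The cat sat.", 2: "The dog sat!"}
freq = frequency_profile(sample)
tfidf = tfidf_profile(sample)
```

With this variant, a token that appears in every document gets a weight of zero, which is why TF-IDF tends to push very common words out of the feature set.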
I have also tested the functions to see how they fare when the number of records becomes large (~100k). The functions performed fairly well.



Download the ipynb file, the HTML version, and the CSV file to understand the flow.





