Sunday, May 5, 2019

Introduction to NLP Part 3: Bag of Words using Term Frequency (TF)


In this post we look at the bag of words concept in Python. Bag of words converts unstructured text into structured data by creating features (similar to columns in a structured data frame). It uses term frequency, the number of times a term occurs in each document, as the metric to populate the data frame. For instance, suppose my unstructured data set looks like this:

  1. Virat Kohli is captain of Indian team
  2. Virat is a great batsman
  3. Virat will break all batting records

Let's say we take 'Virat' as a feature. Virat appears once in each of the three sentences, so my structured data frame will look like this:

Virat
1
1
1

In the above example I have converted an unstructured data set into a structured one, with one row per sentence and one column per feature. This can be used for any further analysis.
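The same table can be reproduced in a few lines. This is only a minimal sketch, assuming simple whitespace tokenization and exact, case-sensitive matching on the word; the variable names (`sentences`, `feature`, `tf_frame`) are made up for illustration.

```python
import pandas as pd

# The three example sentences from above.
sentences = [
    "Virat Kohli is captain of Indian team",
    "Virat is a great batsman",
    "Virat will break all batting records",
]

# Term frequency: how many times the feature word occurs in each sentence.
# Assumption: words are separated by whitespace and matched exactly.
feature = "Virat"
counts = [sentence.split().count(feature) for sentence in sentences]

# One row per sentence, one column per feature.
tf_frame = pd.DataFrame({feature: counts})
print(tf_frame)
```

Adding more features simply means adding more columns, one per word.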

In this post we will look at the following in detail:
  1. Ngrams
    • Unigrams
    • Bigrams
    • Trigrams
  2. Create a data frame for ngrams

The following libraries will be used:
  1. Pandas
  2. nltk
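As a rough sketch of how the two libraries fit together, the snippet below builds term-frequency data frames for unigrams, bigrams and trigrams from the three example sentences. It assumes lowercasing and whitespace tokenization; the helper `ngram_frame` is a hypothetical name used here for illustration and may differ from what the notebook does. `ngrams` comes from `nltk.util`.

```python
from collections import Counter

import pandas as pd
from nltk.util import ngrams

sentences = [
    "Virat Kohli is captain of Indian team",
    "Virat is a great batsman",
    "Virat will break all batting records",
]


def ngram_frame(texts, n):
    """Return a data frame with one row per text and one column per n-gram."""
    rows = []
    for text in texts:
        # Assumption: lowercase and split on whitespace instead of a full tokenizer.
        tokens = text.lower().split()
        grams = [" ".join(gram) for gram in ngrams(tokens, n)]
        rows.append(dict(Counter(grams)))
    # N-grams missing from a sentence come out as NaN; store them as 0 counts.
    return pd.DataFrame(rows).fillna(0).astype(int)


unigrams = ngram_frame(sentences, 1)
bigrams = ngram_frame(sentences, 2)
trigrams = ngram_frame(sentences, 3)
print(bigrams)
```

Each column is one n-gram and each cell is its frequency in that sentence, so the bigram "virat kohli" is counted only in the first row.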

Download the ipynb file and the HTML version to follow the complete flow.
