In this post we will look at the bag-of-words concept in Python. Bag of words converts unstructured text into structured data by creating features (similar to columns in a structured data frame), using word frequency as the value of each feature. For instance, suppose my unstructured data set looks like this:
- Virat Kohli is captain of Indian team
- Virat is a great batsman
- Virat will break all batting records
Let's say we take 'Virat' as a feature. 'Virat' appears once in each of the three sentences, so my structured data frame will look like this:
Virat
1
1
1
In the above example I have converted an unstructured data set into a structured data set. This can be used for any further analysis.
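The single-feature example above can be reproduced with a few lines of pandas. This is only a minimal sketch: the variable names are my own, and splitting on whitespace is an assumed, simplistic way to tokenize the sentences.

```python
import pandas as pd

sentences = [
    "Virat Kohli is captain of Indian team",
    "Virat is a great batsman",
    "Virat will break all batting records",
]

# The word we treat as a feature (column) in the structured data.
feature = "Virat"

# Assumption: plain whitespace splitting is enough for these sentences.
# Count how many times the feature word occurs in each sentence.
counts = [sentence.split().count(feature) for sentence in sentences]

# One row per sentence, one column per feature.
df = pd.DataFrame({feature: counts})
print(df)
```

Each row corresponds to one sentence, and the 'Virat' column holds the count for that sentence.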
In this post we will look at the following in detail:
- Ngrams
- Unigrams
- Bigrams
- Trigrams
- Create data frame for ngrams
The following libraries will be used:
- Pandas
- nltk
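To give a rough idea of where this is heading, here is a sketch of building unigram, bigram and trigram data frames with the two libraries listed above. The helper `ngram_frame` is a hypothetical name of my own, and lowercasing plus whitespace splitting is an assumed preprocessing step; the notebook may do it differently.

```python
from collections import Counter

import pandas as pd
from nltk.util import ngrams

sentences = [
    "Virat Kohli is captain of Indian team",
    "Virat is a great batsman",
    "Virat will break all batting records",
]


def ngram_frame(texts, n):
    """Return a data frame with one row per text and one column per n-gram.

    Hypothetical helper for illustration only.
    """
    rows = []
    for text in texts:
        # Assumption: lowercase the text and split on whitespace.
        tokens = text.lower().split()
        # nltk yields tuples of n tokens; join them into readable column names.
        grams = [" ".join(gram) for gram in ngrams(tokens, n)]
        rows.append(dict(Counter(grams)))
    # An n-gram that does not occur in a sentence gets a count of 0.
    return pd.DataFrame(rows).fillna(0).astype(int)


unigrams = ngram_frame(sentences, 1)
bigrams = ngram_frame(sentences, 2)
trigrams = ngram_frame(sentences, 3)

print(bigrams)
```

Because the tokens are lowercased here, the feature from the earlier example shows up as the column 'virat' in the unigram frame.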
Download the ipynb file and the HTML version to follow the flow.