Machine Learning Made Easy

Tuesday, May 14, 2019

Introduction to NLP Part 4: Term Frequency Inverse Document Frequency (TF-IDF)

In this post we will look at a metric other than Term Frequency for generating features: Term Frequency Inverse Document Frequency (TF-IDF). It is defined as:

TF-IDF = TF * IDF, where

TF = (Number of times a given token appears in a document) / (Total words in the document)

IDF = log(Total number of documents / Number of documents in which the token is present)

For instance, if my unstructured data set looks like:

Virat Kohli is captain of Indian team
Virat is a great batsman
Virat will break all batting records

Let's say we want to calculate the TF-IDF value of 'Virat' for Sentence 1.
TF = Number of times a given token appears in a document/Total words in the document
= 1/7
= 0.1429

IDF = log(Total Number of Documents/Total documents in which the token is present)
= log(3/3)
= 0

So TF-IDF = 0.1429 * 0
= 0

As can be seen, a token that is present in every document has a TF-IDF value of zero, and a token that appears across many documents scores close to zero. The TF-IDF score is higher for tokens that appear in fewer documents. Let's calculate the TF-IDF value of the token 'great' for Sentence 2.

TF = Number of times a given token appears in a document/Total words in the document
= 1/5
= 0.20

IDF = log(Total Number of Documents/Total documents in which the token is present)
= log(3/1) (base 10)
= 0.477

So TF-IDF= 0.20 * 0.477
= 0.095

Since 'great' is not present in Sentences 1 and 3, its TF-IDF score for those sentences is zero. So if we use 'great' as a feature, the resulting data frame will look like:

Sentence      Great
Sentence 1    0
Sentence 2    0.095
Sentence 3    0

In the above example I have converted an unstructured data set into a structured data set. This can be used for any further analysis.
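The calculation above can be sketched as a small function. This is a minimal illustration rather than the code from the attached notebook: it uses only the Python standard library, and the helper names (tokenize, tf, idf, tf_idf) as well as the lowercase-and-split tokenization are assumptions made for the example.

```python
import math

# The three example sentences from this post, used as the document collection.
documents = [
    "Virat Kohli is captain of Indian team",
    "Virat is a great batsman",
    "Virat will break all batting records",
]

def tokenize(text):
    # Assumption: lowercase and split on whitespace, no punctuation handling.
    return text.lower().split()

def tf(token, tokens):
    # Occurrences of the token divided by the total words in the document.
    return tokens.count(token) / len(tokens)

def idf(token, tokenized_docs):
    # log10(total documents / documents that contain the token).
    containing = sum(1 for tokens in tokenized_docs if token in tokens)
    if containing == 0:
        return 0.0  # assumption: a token absent from every document scores zero
    return math.log10(len(tokenized_docs) / containing)

def tf_idf(token, doc_index, docs):
    tokenized = [tokenize(d) for d in docs]
    token = token.lower()
    return tf(token, tokenized[doc_index]) * idf(token, tokenized)

# One 'great' feature value per sentence, rounded to three decimals.
great_scores = [round(tf_idf("great", i, documents), 3) for i in range(len(documents))]
virat_score = tf_idf("Virat", 0, documents)

print(great_scores)  # [0.0, 0.095, 0.0]
print(virat_score)   # 0.0
```

The list of 'great' scores is the feature column from the worked example, and 'Virat' scores zero because it occurs in all three sentences.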

In this blog we will look at the following in detail:

TF-IDF Basics
Create a function that can calculate the TF-IDF values

The following libraries will be used:

Pandas
nltk
string

Download Link: https://drive.google.com/drive/folders/1q-mvC336C2pp6mhcNedncdEiRKMG1K_f?usp=sharing

Download the ipynb file and html version to understand the flow

Links to my previous blogs on NLP:

Blog 1: https://mlmadeeasy.blogspot.com/2019/04/introduction-to-nlp-part-1-tokenization.html

Blog 2: https://mlmadeeasy.blogspot.com/2019/04/introduction-to-nlp-part-2-regular.html

Blog 3: https://mlmadeeasy.blogspot.com/2019/05/introduction-to-nlp-part-3-bag-of-words.htm

Sunday, May 5, 2019

Introduction to NLP Part 3: Bag of Words using Term Frequency (TF)

In this post we will look at the bag of words concept in Python. Bag of words is used to convert unstructured data into structured data by creating features (similar to columns in a structured data frame). It uses term frequency as the metric for generating the data frame. For instance, if my unstructured data set looks like:

Virat Kohli is captain of Indian team
Virat is a great batsman
Virat will break all batting records

Let's say we take 'Virat' as a feature. 'Virat' appears once in each of the three sentences, so my structured data frame will look like:

Sentence      Virat
Sentence 1    1
Sentence 2    1
Sentence 3    1

In the above example I have converted an unstructured data set into a structured data set. This can be used for any further analysis.
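As a rough sketch of the idea, the counts for a few chosen features can be built with the standard library alone. The notebook for this post uses pandas and nltk; the feature list and the whitespace tokenization below are assumptions made for illustration.

```python
from collections import Counter

sentences = [
    "Virat Kohli is captain of Indian team",
    "Virat is a great batsman",
    "Virat will break all batting records",
]

def term_counts(sentence):
    # Assumption: lowercase and split on whitespace to get tokens.
    return Counter(sentence.lower().split())

# Hypothetical choice of features (columns) for the example.
features = ["virat", "great"]

# One row per sentence, one count per feature.
rows = [[term_counts(s)[f] for f in features] for s in sentences]

for i, row in enumerate(rows, start=1):
    print(f"Sentence {i}", dict(zip(features, row)))
```

Each inner list is one row of the structured data set, so the 'virat' column is 1 for every sentence while 'great' is 1 only for Sentence 2.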

In this blog we will look at the following in detail:

Ngrams

Unigrams
Bigrams
Trigrams

Create data frame for ngrams
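The unigrams, bigrams and trigrams listed above can be illustrated with a short helper. This is a plain-Python sketch and not the notebook's code; the function name ngrams is chosen for the example (nltk also ships an ngrams utility in nltk.util).

```python
def ngrams(tokens, n):
    # Slide a window of size n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Virat is a great batsman".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
# [('Virat', 'is'), ('is', 'a'), ('a', 'great'), ('great', 'batsman')]
print(trigrams)
# [('Virat', 'is', 'a'), ('is', 'a', 'great'), ('a', 'great', 'batsman')]
```

Counting how often each tuple occurs per sentence gives the same kind of frequency table as for single words, with n-grams as the columns.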

The following libraries will be used:

Pandas
nltk

Download Link: https://drive.google.com/drive/folders/1IO0_ZLRDQha8xDImM7_s9GJ6H5riCyuP?usp=sharing

Download the ipynb file and html version to understand the flow

Thursday, April 18, 2019

Introduction to NLP Part 2: Regular Expression in Python

A regular expression is a sequence of characters that defines a search pattern for text. Regular expressions are often used to extract information from both structured and unstructured text corpora. Almost all programming languages have a well-defined library of functions for this purpose. In this blog we will look at some of the common functions used in Python, along with some scenario-based use cases. The broad objectives of this blog are to:

Get familiar with the functions used for search
Explore the 're' library
Use the expressions on a list and a data frame to

Search text
Replace text
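The search and replace objectives above can be tried on a plain list of strings with a few functions from the 're' library. The sample strings and the digit pattern below are made up for illustration and are not taken from the notebook.

```python
import re

# Hypothetical sample texts.
texts = [
    "Order 123 shipped on 2019-04-18",
    "No digits here",
    "Refund for order 77",
]

pattern = r"\d+"  # one or more digits

# re.search returns a match object for the first match, or None if there is none.
has_number = [bool(re.search(pattern, t)) for t in texts]

# re.findall returns every non-overlapping match as a list of strings.
numbers = [re.findall(pattern, t) for t in texts]

# re.sub replaces every match with the given replacement string.
masked = [re.sub(pattern, "#", t) for t in texts]

print(has_number)  # [True, False, True]
print(numbers)     # [['123', '2019', '04', '18'], [], ['77']]
print(masked)      # ['Order # shipped on #-#-#', 'No digits here', 'Refund for order #']
```

The same calls can be applied to a column of a data frame by running them over each value in the column.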

Link to extract python(ipynb) file:

https://drive.google.com/file/d/1G87XQbALi-EU6koFdz4u2MuFdQ_xm7hY/view?usp=sharing

Link to extract the html version:
https://drive.google.com/file/d/1EOudA7eL1Rk0TyeUwiqvGSRUgD0DpRWX/view?usp=sharing

Saturday, April 6, 2019

Introduction to NLP Part 1: Tokenization, Lemmatization and Stop Word Removal

In this post we will look at how to handle text data in Python. Any text analysis activity has three main components:

Tokenization
Lemmatization/Stemming
Stop Word Removal
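To make the three steps concrete, here is a deliberately simplified sketch in plain Python. It does not use nltk: the stop word list is a tiny hand-picked set and naive_stem is a crude suffix stripper invented for this example, so the output will differ from what nltk's tokenizers, stemmers and lemmatizers produce.

```python
import re

# Assumption: a tiny hand-picked stop word list, for illustration only.
STOP_WORDS = {"is", "a", "of", "the", "will", "all"}

def tokenize(text):
    # Tokenization: lowercase the text and keep runs of letters.
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    # Stemming (crude): strip a trailing "ing" or "s" if at least 3 letters remain.
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = tokenize(text)
    stems = [naive_stem(t) for t in tokens]
    # Stop word removal: drop very common words that carry little meaning.
    return [t for t in stems if t not in STOP_WORDS]

result = preprocess("Virat will break all batting records")
print(result)  # ['virat', 'break', 'batt', 'record']
```

The stem 'batt' shows why a stemmer can produce non-words, which is one reason lemmatization is listed as an alternative.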

We will work through a small text example and see how to perform the above three steps using the nltk library. I ran all the operations after downloading the full set of nltk resources with the following line of code:

nltk.download()

This line is not included in the attached Python notebook or html version, but it is advisable to run it after import nltk. nltk.download() can take some time (a few hours) to download all the relevant packages to your machine. After that you can run the entire Python script.

Download Link: https://drive.google.com/drive/folders/12LrZTI5qT-vzz6ce5dpXZ2ucdUsfa9S_?usp=sharing

Download the ipynb file and html version to understand the flow

Friday, March 15, 2019

Handling JSON in Python

In this post we will look at scenarios where we have to work with JavaScript Object Notation (JSON). The nature of the task and the resources are summarized in the following sections:

Problem Statement: We are often in a situation where we need to extract information from a JSON object before it can be used further. This is typical when building tools that process information, when extracting information from another source, or when exchanging information between systems. The aim is to understand JSON and how it can be converted into structured data using Python.

Data Download Link:

Download example1 and example2 from this link

https://drive.google.com/drive/folders/1xjw6a17zCaa8sYj8ZerXv7SmGeuuEeVq?usp=sharing

Learning Objectives:

Handling JSON

Understanding JSON object
Using json library
Converting it into a table (data frame)
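The objectives above can be sketched with the standard json library. The JSON string below is a made-up example and not one of the downloadable example files; the key names (team, players, name, runs) are hypothetical.

```python
import json

# Hypothetical JSON document with one nested list of records.
raw = """
{
  "team": "Team A",
  "players": [
    {"name": "Player 1", "runs": 52},
    {"name": "Player 2", "runs": 17}
  ]
}
"""

# json.loads parses a JSON string into Python dicts and lists.
data = json.loads(raw)

# Flatten the nested list into one flat row (dict) per player.
rows = [{"team": data["team"], **player} for player in data["players"]]

for row in rows:
    print(row)
# {'team': 'Team A', 'name': 'Player 1', 'runs': 52}
# {'team': 'Team A', 'name': 'Player 2', 'runs': 17}
```

A list of flat dictionaries like rows can be passed to pandas.DataFrame to get the table form.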

Please download the jupyter file from the following link

Jupyter File link: https://drive.google.com/drive/folders/1xjw6a17zCaa8sYj8ZerXv7SmGeuuEeVq?usp=sharing
Download Handling Json.ipynb file from this link

Link to Html version:
https://drive.google.com/drive/folders/1xjw6a17zCaa8sYj8ZerXv7SmGeuuEeVq?usp=sharing

Saturday, February 9, 2019

Pandas Complete Guide

This blog is a collection of all the code snippets related to pandas. It covers the following topics:

Introduction to Series: https://mlmadeeasy.blogspot.com/2019/02/introduction-to-series.html
Hierarchical Indexing: https://mlmadeeasy.blogspot.com/2019/02/hierarchical-indexing.html
Combining and Merging Data sets: https://mlmadeeasy.blogspot.com/2019/02/combining-and-merging-datasets.html
Pivoting using Pandas: https://mlmadeeasy.blogspot.com/2019/02/pivoting-using-pandas.html
Frequency Profiling using Pandas: https://mlmadeeasy.blogspot.com/2019/02/frequency-profiling-using-pandas.html
Data Manipulation in Python Part 1: https://mlmadeeasy.blogspot.com/2019/01/data-manipulation-with-python.html
Data Manipulation in Python Part 2 (Retail Case Study): https://mlmadeeasy.blogspot.com/2019/01/data-manipulation-with-python-part-2.html

Frequency Profiling using Pandas

This blog is an introduction to frequency profiling, that is, counting how often each value occurs in a column. It is commonplace among data engineers who do ETL, and it is also a common first step in exploratory data analysis.
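A minimal sketch of a frequency profile, using only the standard library, is shown below. The column values are made up for illustration; in pandas the equivalent counts come from Series.value_counts().

```python
from collections import Counter

# Hypothetical values from one column of a data set.
values = ["north", "south", "north", "east", "north", "south"]

counts = Counter(values)
total = len(values)

# (value, count, percentage of rows), most frequent value first.
profile = [
    (value, n, round(100 * n / total, 1))
    for value, n in counts.most_common()
]

for value, n, pct in profile:
    print(f"{value:<6} {n:>3} {pct:>5}%")
```

Here 'north' accounts for 3 of the 6 rows (50.0%), 'south' for 2 (33.3%) and 'east' for 1 (16.7%).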

The blog consists of the following attachments:

Jupyter Notebook https://drive.google.com/file/d/1mL3GoVQjGTTUeWMmwDKtytGNRF0evE6D/view?usp=sharing
Html doc: https://drive.google.com/file/d/1ExuLhp1CgCjE0DCTR_G09LnD63I_Txp4/view?usp=sharing