Tuesday, January 15, 2019

Data manipulation with Python Part 2

In this post we will look at another data manipulation scenario. The nature of the task, as well as the resources, is summarized in the following sections:

Problem Statement: ABC is a sports accessories company that sells sports gear across the globe. The data has fields such as Revenue, Quantity, Gross Margin, Order Method, Time and Country, spread from 2012 to 2014 across 4 quarters (Q1 through Q4). Based on the global outlook and growth forecasts made by economists at ABC, the company has decided to sell a total of 211,555,475 units in 2015. However, this is an overall number, and the Product Managers for individual countries don't know how it drills down to individual products. Your job as a Business Analyst is to help the Product Managers get those numbers so that they can plan effectively.

Data Download Link

Learning Objectives:
  1. Using Pandas for data manipulation
    • Getting records and column names
    • Handle NA
    • Using Groupby to sum up numbers at different levels
  2. Using 'apply' function on certain columns
  3. Using lambda along with apply to get Percentage Share
  4. % formatting 
Please download the Jupyter notebook from the following link. A rough sketch of the ideas covered is shown below.
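As a quick illustration of objectives 2-4, here is a minimal sketch. The column names ('Retailer country', 'Product', 'Quantity') and the file name are assumptions based on the data description in the problem statement; the actual notebook may differ.

import pandas as pd

# Sketch: drill the 2015 company-wide target down to each country-product
# combination in proportion to its historical share of units sold.
df = pd.read_csv("Gross_Margin_Data.csv")

# Historical units at the country-product level (groupby + sum)
qty = df.groupby(['Retailer country', 'Product'])['Quantity'].sum().reset_index()

# Percentage share of the global total, then the 2015 allocation
TARGET_2015 = 211555475
qty['Share'] = qty['Quantity'] / qty['Quantity'].sum()
qty['Planned_2015'] = (qty['Share'] * TARGET_2015).round().astype(int)

# apply + lambda for percentage formatting
qty['Share_pct'] = qty['Share'].apply(lambda s: '{:.2%}'.format(s))

qty.sort_values('Planned_2015', ascending=False).head()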

Saturday, January 5, 2019

Data Manipulation with Python

In any Data Science or Business Analytics project, exploring the data is an important aspect. It helps us dive deep into the data to get numerical summaries and uncover hidden trends that are otherwise obscure at the start. Most of the time, a thorough analysis of the data is sufficient to facilitate decision making. Hence it is imperative to know the basic tools to analyse data and interpret results. Given the high adoption of Python as a preferred Data Science tool, we will look at a small case study about a product-based company and how Python can be used to answer questions that arise in a typical data exploration exercise.

The data considered here consists of Quantity sold, Revenue and Gross Margin information for various sports products sold in different countries. Other features are given in the form of product information, order method and type of retailer. The data is spread across 4 quarters from 2012 to 2014.

To fully understand the data, the following things will be explored

Task 0: How many missing/NA/Null values are present in the data

Task 1: Which order method is most widely used

Task 2: Which product has the maximum units sold on an overall basis

Task 3: Which country has the maximum Gross Margin per unit Quantity

Task 4: Which product under the 'Web' order method has the highest units sold for 2012

Task 5: Is this also true for 2013 and 2014

Task 6: Which product has seen the maximum decrease in Revenue from 2012 to 2014

Task 7: In the Product line column, Personal Accessories has seen a decline in Gross Margin

# Identify which product has contributed most to this

# Identify which country has contributed most to this

Task 8: Identify the top products for each country with respect to Revenue from 2012 to 2014


Download the file: https://docs.google.com/spreadsheets/d/1aRD2-xZrxcARSHZAi-93fWQBuhr6D8TOgkBaT_umkvg/edit#gid=165274715

# importing pandas
# setting up directory

import pandas as pd
dr="Your computer path where file is saved"

# Reading the csv file. I am using Spyder to analyse the data; the analysis can also be carried out using Anaconda

df=pd.read_csv(dr+"\\Gross_Margin_Data.csv")

# Total rows and column
df.shape # Has 88475 rows and 11 columns
df.columns

# Inspecting first few records
df.head()

Task 0: How many missing/NA/Null values are present in the data
df.isna().sum()

Figure 1: output of df.isna().sum() showing missing values per column

# There are 581 NA entries in the Gross margin column
# These should be replaced by the mean of the Gross Margin column
# so that they don't skew the results of any other analysis

df['Gross margin']=df['Gross margin'].fillna(df['Gross margin'].mean())
df.isna().sum()# Now there are no NA values in Gross Margin column


Figure 2: df.isna().sum() after filling the Gross margin NAs

# Task 1: Which order method is most prevalent
df['Order method type'].value_counts()# Web Method

Figure 3: counts of each order method type

Task 2: Which Product has the maximum units sold
df['Product'].value_counts().head() # Polar Sun tops the count of order records
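# Note: value_counts() counts order records rather than summing units; as a
# cross-check one could also sum the Quantity column directly (a sketch, output not shown here):
df.groupby('Product')['Quantity'].sum().sort_values(ascending=False).head()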

Figure 4: top entries of df['Product'].value_counts()

Task 3: Which country has the max value of Gross Margin per unit Quantity
# Here we need to aggregate the data at a country level

df1=df.groupby('Retailer country')[['Gross margin','Quantity']].sum()
df1.columns # Only Gross margin and Quantity are present as columns

# We need to reset the index of df1 so that Retailer country also becomes a column
df1=df1.reset_index()

# Now creating a column of Gross Margin/Quantity
df1['Unit_Measure']=df1['Gross margin']/df1['Quantity']
df1.sort_values('Unit_Measure',ascending=False).head(10)

Figure 5: top 10 countries by Gross margin per unit Quantity (Unit_Measure)

# We can see that Denmark has the highest value of Gross Margin/Unit Quantity sold

Task 4: Which product under the 'Web' order method has the highest units sold for 2012
Before we start, we first need to know which is the most preferred method of ordering products

df['Order method type'].value_counts() # Web is the most preferred method

# Let's see which product under Web has the most sales

df[df['Order method type']=='Web'].groupby(['Product'])['Quantity'].sum().sort_values(ascending=False).head(5)

Figure 6: top 5 products by Quantity for the Web order method

# The product named Zone is ordered most via the Web method

# Let's verify whether this is also true for the year 2012
df['Year'].head()

df[(df['Order method type']=='Web') & (df['Year']==2012)].groupby(['Year','Product'])['Quantity'].sum().sort_values(ascending=False).head(5)

Figure 7: top 5 products by Quantity for the Web order method in 2012

# For 2012 as well, the product 'Zone' has the most sales

Task 5: Does the product Zone also have the most sales under the Web ordering method in 2013 and 2014?
df2=df[df['Order method type']=='Web'].groupby(['Year','Product'])['Quantity'].sum()

df3=df2.reset_index()
df3=df3.sort_values(['Year','Quantity'],ascending=[True,False])
df3.groupby(['Year']).head(2) # Gives the top 2 products sold for each year under Web Method

Figure 8: top 2 products by Quantity for each year under the Web order method

# As we can see, Zone has the most sales for 2012 and 2013, but for 2014
# Granite Carabiner has the most sales
# So Zone does not have the most Web-order sales across all of 2012 to 2014


Task 6: Which product has seen the max decrease in Revenue from 2012 to 2014
df4=df.groupby(['Product','Year'])['Revenue'].sum().reset_index()
df4.shape

# We need to remove the value corresponding to 2013 as only 2012 and 2014 are required
t_val=[2012,2014]
df5=df4[df4['Year'].isin(t_val)]
df5.shape

# Now let's pivot df5 on the 'Year' column
p = df5.pivot(index='Product', columns='Year', values='Revenue')
p.tail(16)# There are some NaN in p

Figure 9: tail of the pivoted table p showing NaN entries

p.isna().sum()

Figure 10: count of NA values in p by year

# We see that there are certain NA/Null values in p
# These correspond to Product-Year combinations with no entries in df5
# These entries have to be replaced with 0

p[2012]=p[2012].fillna(0)
p[2014]=p[2014].fillna(0)

# Checking whether any NA/Null values remain in p
p.isna().sum()

# Creating another column called Perc_Change
p['Perc_Change']=(p[2014]-p[2012])/p[2012]
p.head()

# Sorting p to get the highest percentage decrease
p.sort_values('Perc_Change').head(5)

Figure 11: products with the largest percentage decrease in Revenue (Perc_Change)

# So Sun Blocker has seen the maximum decrease in Revenue from 2012 to 2014
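# Note: if a product's 2012 Revenue were zero (after the fillna above), Perc_Change
# would be infinite. A small guard, should that come up (a sketch, not part of the original analysis):
import numpy as np
p['Perc_Change'] = np.where(p[2012] != 0, (p[2014] - p[2012]) / p[2012], np.nan)
p.sort_values('Perc_Change').head(5)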

Task 7: In the Product line column, Personal Accessories has seen a decline in Gross Margin

# Task 7a: Identify which product has contributed most to this

# Task 7b: Identify which country has contributed most to this

Task 7a: Identify which product has contributed most to this

# Let's first check the overall percentage decrease from 2013 to 2014
df7=df[(df['Product line']=='Personal Accessories') & (df['Year'].isin([2013,2014]))]
df7.groupby(['Year'])['Gross margin'].sum()

Figure 12: Gross margin totals for Personal Accessories in 2013 and 2014

# We can see that there is approx 35% dip in the gross margin
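# The dip can also be computed explicitly from the yearly totals (a quick sketch):
gm = df7.groupby('Year')['Gross margin'].sum()
print('{:.1%}'.format((gm[2013] - gm[2014]) / gm[2013]))  # roughly 35%, per the figure above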

# The decline in Gross Margin could be due to the following:
# the top products with the maximum Gross Margin in 2013 had a dip in 2014
# Let's check this point

df8=df7.groupby(['Product','Year'])['Gross margin'].sum().reset_index()
df8.head()

Figure 13: first rows of df8 (Gross margin by Product and Year)

df9=df8.pivot(index='Product',columns='Year',values='Gross margin').sort_values(2013,ascending=False)
df9.head()

Figure 14: first rows of df9 (Gross margin pivoted by Year, sorted on 2013)

df10=df9.fillna(0)

# Let's create a new column indicating the percentage decrease
df10['Perc_change']=(df10[2013]-df10[2014])/df10[2013]
df10.head()

Figure 15: first rows of df10 with the Perc_change column

# This gives the products that had the maximum Gross Margin in 2013
# but saw a significant dip (the top 3 products each dipped by more than 50%)

# Let's also check the outlook in comparison to 2012

df11=df[(df['Product line']=='Personal Accessories')]
df11.groupby(['Year'])['Gross margin'].sum()

Figure 16: Gross margin totals for Personal Accessories by year (2012-2014)

# The Gross Margin for 2012 and 2013 is comparable, so there has been a really
# significant drop in Gross Margin in 2014

Task 7b: Identify which country has contributed most to this

df12=df7.groupby(['Retailer country','Year'])['Gross margin'].sum().reset_index()

df13=df12.pivot(index='Retailer country',columns='Year',values='Gross margin').sort_values(2013,ascending=False)

df14=df13.fillna(0)

# Let's create a new column indicating the percentage decrease
df14['Perc_change']=(df14[2013]-df14[2014])/df14[2013]
df14.head()

Figure 17: first rows of df14 (Gross margin by country with Perc_change)

# The top 5 countries have seen a dip in Gross Margin
# Now we are in a position to comment on the products as well as the countries
# that have contributed to the decrease in Gross Margin

Task 8: Identify top products for each country with respect to Revenue from 2012 to 2014

df15=df.groupby(['Retailer country','Product','Year'])['Revenue'].sum().reset_index()
df16=df15.sort_values(['Retailer country','Year','Revenue'],ascending=False)
df17=df16.groupby(['Retailer country','Year']).head(2)
df17.head(20)

Figure 18: top 2 products by Revenue for each country and year (df17)

# df17 can now be exported to a csv file to prepare reports
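# For example (the file name is illustrative):
df17.to_csv(dr + "\\Top_Products_By_Country.csv", index=False)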

Thursday, December 6, 2018

Sensor Data Analysis in Python

Sensors are used in a lot of industrial applications to measure properties of a process, such as temperature, pressure, humidity and density. In manufacturing plants, the output of sensors or transducers is sampled regularly to check for breaks in functionality. Quality standards mandate that parameters such as temperature and pressure be maintained around an optimal point, failing which product benchmarks can be affected. Traditionally, control bands have been used to capture any anomalous change in the value of a metric. This method fails in most applications involving time series data, because serial autocorrelation makes control bands give incorrect results. Since sensor data is time series data, there is a need to look at other alternatives. The aim of this post is to achieve the following:

  1. Look at a typical Sensor Network
  2. Understand Histogram Based Outlier Scoring algorithm (HBOS)
    1. Create function in python for Categorical Data
    2. Create function in python for Numeric Data
  3. Analyze the output of HBOS
  4. Other use cases where HBOS can be used

Part 1: Wireless Sensor Network
This is a group of sensors organized to monitor and record data. The data represents the physical properties of a process, which are generally monitored to ensure seamless operations. The sensors can measure sound, temperature, pressure, fluid flow rate, etc. They are generally integrated with a terminal where the recorded data is stored and analyzed. A typical sensor network can be represented by the following diagram:




The architecture typically has the following parts:

  1. Motes: A sensor node is also known as Mote. It can gather and process information and communicate with the other nodes
  2. Network Managers: People dedicated to the task of managing the network topology and ensuring that it is up and running. They are also responsible for scaling the network up or down based on the requirements of the process, and they manage the exchange of data with the host application
  3. Customer Software: It interacts with the network manager to get sensor data. The software analyses the feed from a sensor network to study the process and other parameters
Part 2: Understanding Histogram Based Outlier Scoring Algorithm (HBOS)
We will explore how to create functions in Python to analyse the sensor data. HBOS scores each record by how common its value is: frequent values (tall histogram bins) get scores close to zero, while rare values receive large negative log scores and stand out as outliers. The code below demonstrates the usage of various libraries and functions in Python.


# Analyzing a Categorical Variable
import numpy as np

import pandas as pd

import math
df=pd.DataFrame(np.array(['A','B','A','A','B']),columns=['Feature'])

# Writing the steps to score this column using HBOS method
df1=(df['Feature'].value_counts()/max(df['Feature'].value_counts())).reset_index()
df['Index_val']=range(0,df.shape[0])
df1.columns=['Feature','Proportion']
df2=pd.merge(df,df1,how='inner',on='Feature').sort_values('Index_val')
df3=pd.concat([df2['Feature'],np.log(df2['Proportion'])],axis=1)['Proportion']

# Function for scoring Categorical Column using HBOS

def Hbos_cat(col_nm):
    df=pd.DataFrame(col_nm,columns=['Feature'])
    df1=(df['Feature'].value_counts()/max(df['Feature'].value_counts())).reset_index()
    df['Index_val']=range(0,df.shape[0])

    df1.columns=['Feature','Proportion']
    df2=pd.merge(df,df1,how='inner',on='Feature').sort_values('Index_val')
    df3=pd.concat([df2['Feature'],np.log(df2['Proportion'])],axis=1)['Proportion']
    return(df3)
    
# The result of the scores will be stored in Hbos_score
Hbos_score=Hbos_cat(df['Feature'])    

# Analyzing a Numeric Variable
# Now we need to create a method for handling numeric column
# lets have the values stored in x
x=[12,11,10,9,8,10,11,14,17,20,50,60,70]

# First: Identify the total records in the numeric variable
N=len(x)
  
# Second: decide the number of bins to divide the data
# k<-round(sqrt(N))
k=math.sqrt(N)

# Next see the value of N/k:records in each group
records_Each_Group=round(N/k)

hb=pd.DataFrame(x)
hb.columns=['x']
hb['ID1']=range(1,len(x)+1)
hb=hb.sort_values('x')

hb['ID2']=range(1,len(x)+1)
hb['Group']=['G'+str(int(x)+1) for x in hb['ID2'] / (records_Each_Group+1) ]
g=hb.groupby('Group')['x']

# Max function for calculating the highest within each group
# https://pandas.pydata.org/pandas-docs/stable/groupby.html
# Link for using group by and apply

def max_f(group):
    return(pd.DataFrame({'original':group,'Highest':group.max()})['Highest'])
    
hb['Highest']=g.apply(max_f)

# Min function for calculating the Lowest within each group

def min_f(group):
    return(pd.DataFrame({'original':group,'Lowest':group.min()})['Lowest'])
    
hb['Lowest']=g.apply(min_f)

# Creating the difference column for Highest and Lowest
# If Highest - Lowest =0, then its value will be equal to 1
# Creating the function using lambda
cond_diff=lambda x,y: x-y if x!=y else 1

# Checking it on dummy data
list(map(cond_diff,[1,2],[1,3]))

hb['Diff_Flag']=list(map(cond_diff,hb['Highest'],hb['Lowest']))

# Now calculating the height of each bin by dividing the records in each group by Diff_Flag
hb['Height']=records_Each_Group/hb['Diff_Flag']

# Calculating the calibrated height based on zero height values

cond_diff2=lambda x: x if x!=0 else 1
hb['Height2']=list(map(cond_diff2,hb['Height']))

# Normalising the Height2 by dividing it by max(Height2)
hb['Normalised_Hieght']=hb['Height2']/max(hb['Height2'])

# Calculating the log of hb['Normalised_Hieght']
hb['hb_Score']=np.log(hb['Normalised_Hieght'])

# Now sorting the data frame based on ID1
hb_final=hb.sort_values('ID1')['hb_Score']

# Function for scoring Numeric Column using HBOS
def Hbos_num(col_nm):
    # First: Identify the total records in the numeric variable
    N=len(col_nm)
   
    # Second: decide the number of bins to divide the data
    # k<-round(sqrt(N))
    k=math.sqrt(N)
 
    # Next see the value of N/k:records in each group
    records_Each_Group=round(N/k)
 
    hb=pd.DataFrame(col_nm)
    hb.columns=['x']
    hb['ID1']=range(1,N+1)
    hb=hb.sort_values('x')
 
    hb['ID2']=range(1,N+1)
    hb['Group']=['G'+str(int(x)+1) for x in hb['ID2'] / (records_Each_Group+1) ]
    g=hb.groupby('Group')['x']

    # Max function for calculating the highest within each group
    # https://pandas.pydata.org/pandas-docs/stable/groupby.html
    # Link for using group by and apply
 
    def max_f(group):
        return(pd.DataFrame({'original':group,'Highest':group.max()})['Highest'])
     
    hb['Highest']=g.apply(max_f)
 
    # Min function for calculating the Lowest within each group
 
    def min_f(group):
        return(pd.DataFrame({'original':group,'Lowest':group.min()})['Lowest'])
     
    hb['Lowest']=g.apply(min_f)
 
    # Creating the difference column for Highest and Lowest
    # If Highest - Lowest =0, then its value will be equal to 1
    # Creating the function using lambda
    cond_diff=lambda x,y: x-y if x!=y else 1
 
    hb['Diff_Flag']=list(map(cond_diff,hb['Highest'],hb['Lowest']))
 
    # Now calculating the height of each bin by dividing the records in each group by Diff_Flag
    hb['Height']=records_Each_Group/hb['Diff_Flag']
 
    # Calculating the calibrated height based on zero height values
 
    cond_diff2=lambda x: x if x!=0 else 1
    hb['Height2']=list(map(cond_diff2,hb['Height']))
 
    # Normalising the Height2 by dividing it by max(Height2)
    hb['Normalised_Hieght']=hb['Height2']/max(hb['Height2'])
 
    # Calculating the log of hb['Normalised_Hieght']
    hb['hb_Score']=np.log(hb['Normalised_Hieght'])
 
    # Now sorting the data frame based on ID1
    hb_final=hb.sort_values('ID1')['hb_Score']
 
    return(hb_final)

# Creating a data frame to verify the functionality 
df=pd.DataFrame(x,columns=['Feature']) 

Hbos_num(df['Feature'])

# The function is working

# Part 3: Analyzing the Hbos Output

# Now importing the sensor data set from the following link
https://docs.google.com/spreadsheets/d/1ADPWzs1s-yNqtqbw-VLuPgPKpKXTAWIYHgO3H6grNO0/edit#gid=1682334191


dr="name of the path"
df=pd.read_csv(dr+"\\humidity.0.csv")
df.columns
df.shape[0]
# Using the Hbos_num function
Hbos_Score=Hbos_num(df['Humidity'])

# Merging the Hbos_score with df
df_new=pd.concat([df,Hbos_Score],axis=1)
df_new.head()
# Looking at the histogram of Hbos_score
import matplotlib.pyplot as plt
plt.hist(df_new['hb_Score'], bins='auto')  # arguments are passed to np.histogram
plt.title("Histogram with 'auto' bins")
plt.show()

# The histogram is shown below

# The records shown in the red dotted oval are of interest to us, as these represent readings having a very high negative HBOS score.

# Analyzing the records with the highest scores
df_new.describe()
df_new.sort_values('hb_Score').head(50)

# We can see that the entries with a score less than -2.6 come up as the top exceptions
df_exceptions=df_new[df_new['hb_Score']<-2.6]
df_exceptions.shape # 24 records

df_normal=df_new[df_new['hb_Score'] >=-2.6]
df_normal.shape # 496 records


# To understand this further, we would need to do some exploratory data analysis on top of it, for example checking the time at which the readings with the highest HBOS scores were made
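# A possible follow-up sketch, assuming the file carries a timestamp column
# (the column name 'Time' below is hypothetical; replace it with the actual field):
exc = df_exceptions.copy()
exc['Hour'] = pd.to_datetime(exc['Time']).dt.hour
exc['Hour'].value_counts()  # which hours produce the most exception readings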

# Part 4: Other Use Cases

  1. Financial Transaction Outlier detection
  2. Sentiment Analysis
  3. Improving Data Quality 



Thursday, November 15, 2018

R Vs Python

Why doesn't Cheteshwar Pujara come in one down in a 50-over match? Or why doesn't Ashwin bowl flighted deliveries in limited-overs matches? Despite being the 'Hitman' of ODIs, Rohit Sharma seldom makes the cut in a five-day contest. Are the required skill sets different for a Test match versus a one-dayer, or for that matter a T20? All of this can be answered with a simple phrase: horses for courses. You pick players that fit the bill, and all selection is made in accordance with the requirements. That's precisely it, isn't it! The same thinking carries over to the choice of software for ML. This blog explores the scenarios that are more conducive to the usage of Python over R and vice versa.
Before we start, a little background on R and Python is necessary in the context of Machine Learning (ML). Python is a general-purpose programming language; in the wake of recent advances in ML, the Python community has contributed several libraries that enable one to play around with data. R, on the other hand, is a statistical programming language, much like Matlab, developed in the first place to cater to the math and statistics community. There is a lot of debate on which language is better and which to prefer for ML. The exasperation is aptly captured in the image below.

The table below highlights the key differences between Python and R with respect to certain commonly used practices in ML. The entries in each cell indicate the library and/or function used to execute the requirement.


Functionality                                  | Python          | R
Data Slicing and Summary                       | pandas          | dplyr, data.table
Visualization                                  | matplotlib      | ggplot
Data Set Repositories                          | NA              | Econometric data: AER library
Linear Models (Regression family)              | scikit-learn    | car, glm
Hyperparameter Tuning                          | GridSearchCV    | makeLearner (mlr)
Natural Language Processing (NLP)              | NLTK, gensim    | tidyverse, topicmodels
Web Scraping                                   | Beautiful Soup  | rvest
Interfacing with other systems (like Outlook)  | pywin32         | RDCOMClient
Read JSON                                      | json            | rjson
Pickling                                       | pickle          | saveRDS, readRDS
Web App (especially for Proof of Concept)      | Django          | R Shiny



Below is an explanation of the contents in the table:

  1. Data Slicing and Summary: Data filtering, sorting, summarization, etc. are required in every ML exercise. In R, one can do this using functions from the dplyr and data.table libraries. The pipe operator (%>%) from dplyr is especially useful as it improves the readability of cascaded operations and helps in debugging. Python, on the other hand, has pandas, which doesn't have a comparable pipe operator, so cascaded operations on data can become unmanageable
  2. Visualization: ggplot and associated libraries in R help create highly useful plots such as histograms, geographical heat maps, and interactive and animated graphs. Python has the matplotlib library for creating graphs but doesn't provide the enhancements that ggplot does
  3. Data Set Repositories: There are a lot of data repositories in R, which users can invoke from several libraries, so one can play around with the data and gain understanding. Some useful repositories include the AER library, which has useful census data. Python, on the other hand, doesn't bundle any
  4. Linear Models: R and Python both have libraries that help in applying regression models. However, there is one aspect where R stands out as a clear winner: treatment of categorical variables. N-1 dummy encoding is automatically taken care of in R, but in Python it is at the discretion of the user (a small sketch follows this list)
  5. Hyperparameter Tuning: Both languages offer extraction of optimal parameters through hyperparameter tuning. However, in Python one can tune more parameters than in R. For instance, in R, for a Random Forest algorithm one can only tune the number of trees, nodes and leaf size, whereas with Python one can also tune the minimum samples per split. More tunable parameters can lead to better accuracy
  6. Natural Language Processing (NLP): Both R and Python have libraries to handle text. A lot of users will vouch for Python here, but having used both, I didn't find any difference between the two
  7. Web Scraping: Python has methods from the Beautiful Soup library to extract any element having an HTML tag; things are clearly and precisely defined in Python. R, however, doesn't offer a one-stop solution for extraction, and a lot of libraries with no clear examples leave much to soul searching
  8. Interfacing with other systems (like Outlook): Since Python is a general-purpose programming language, system integration is quite mature. One can use Python to communicate between two different systems, such as Outlook and a Python terminal, and the protocols that govern such communication already exist. R, on the other hand, doesn't have well-defined functions for this
  9. Read JSON: Python takes less time to read and process a JSON file than R. Also, since the text inside a JSON file resembles a dictionary, using Python to read and parse it makes a lot of sense
  10. Pickling: This can be done in both Python and R
  11. Web App (especially for Proof of Concept): This can be done in both Python and R; however, the time to create an app in R (with Shiny) is less
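
To illustrate point 4, here is a small Python sketch of the manual dummy encoding step (the column and data are illustrative):

import pandas as pd

orders = pd.DataFrame({'Order_Method': ['Web', 'Fax', 'Web', 'Phone']})
# drop_first=True yields N-1 dummy columns, mirroring what R does automatically for factors
dummies = pd.get_dummies(orders['Order_Method'], drop_first=True)
print(dummies)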

Popular Times Popular Times In this blog we will try to scrape the ...