Wednesday, October 24, 2018

Unsupervised Vs Supervised: Battle of ML

Batman vs Superman: who's better? DC fans the world over rack their brains over it. One is a demigod; the other is a master strategist. In the comics and movie adaptations, Batman has beaten the Man of Steel hands down. Despite the Man of Steel's Kryptonian powers and Herculean machismo, the bat vigilante never misses a beat and pips him in every department of combat. This is because adventure stories demand romance, even at the expense of their most powerful characters. Depicting someone who overcomes hurdles and goes the distance brings hope into an otherwise despairing world. There is something beautiful about how an underdog finds his way up the world order. It is indeed poetic to find something so powerless yet so powerful.

This comparison raises an important point: all confrontations are unfair. Even though the Man of Steel and the Dark Knight are part of the same cinematic universe and work side by side in Gotham, comparing them is grossly unjustified. In the context of defending the city against goons, Batman's solutions are practical and seem to work. Superman is brute power and adrenaline, but his fights are mostly against extraterrestrials or his own kind. This makes his powers and methods irrelevant for run-of-the-mill problems.

Even though the passage above is modeled on comic book characters, it holds just as well for understanding which Machine Learning method is most relevant and sought after. In the course of this discussion, we will look at how Unsupervised ML fares against Supervised ML and what problems a typical Data Scientist faces in the workplace.



Course curricula the world over have taught us to tackle structured problems. But problems are seldom structured in nature. That's why there is a huge gap between academia and industry. A person needs to acquire an altogether different set of skills to survive in an industrial setup. Especially in a field like Data Science, which is continuously evolving, a person needs the skills to identify THE PROBLEM and convert it into a structured one. Most of us are given a very open-ended problem where the stakeholder wants to do something with the data. This has been depicted below in the form of a caricature.


In these situations, Supervised Machine Learning algorithms seldom help, because applying them requires a very precisely framed problem, with a set of independent features affecting one or more given targets. For almost all such problems, Unsupervised ML methods provide some respite. These algorithms are all about finding structure in the data: identifying patterns and customer segments, reducing dimensions, mining association rules, and so on. But ease of application also brings in the risk of subjectivity. There are very few diagnostic measures that can be used to ascertain the effectiveness of an Unsupervised ML method. In the absence of metrics like accuracy, MAPE, p-value, etc., the onus of ensuring that the method delivers the desired results really lies with the Data Scientist.
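To make the point about diagnostics concrete, here is a minimal sketch of one of the few measures that does exist for clustering: the silhouette coefficient. It is not mentioned in the discussion above and is offered only as an illustration. The function name `silhouette_score`, the toy customer points and the two labelings are all hypothetical, and the code uses only the Python standard library (3.8+).

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (ranges from -1 to 1)."""
    # Group the points by their cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            mean(dist(p, q) for q in members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Hypothetical toy data: two visibly separate groups of customers.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible groups
bad_labels = [0, 1, 0, 1, 0, 1]   # arbitrary assignment

good = silhouette_score(points, good_labels)
bad = silhouette_score(points, bad_labels)
print(f"good labeling: {good:.2f}, arbitrary labeling: {bad:.2f}")
```

A score close to 1 suggests compact, well-separated clusters, while a score near or below 0 suggests the grouping is not supported by the data. Even so, it only measures geometric separation; whether the segments are useful to the business is still a judgment call for the Data Scientist.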

I am listing some of the widely used Unsupervised ML methods along with the relevant industry and use case.

| Algorithm/Method | Domain | Use Case |
|---|---|---|
| Clustering | Marketing | Identify natural groups within customers to customize marketing campaigns |
| Principal Component Analysis (PCA) | Marketing | Generally used on survey data to reduce the number of variables |
| Market Basket Analysis | Retail | Identify product-based rules and associations between items |
| Multi Collaborative Filtering | Retail | Identify product-based rules and associations between items, and the impact of demographics |
| Topic Modelling | Sales | Identify the category into which a particular purchase falls, based on the item description |
| Density Based Methods | Finance/HR | Identify fraudulent expense reports submitted by employees |
| Histogram-based Outlier Scoring (HBOS) | Finance | Identify anomalous transactions |
| KNN | Retail | Recommend similar items based on user profile |
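
To give a flavour of one entry in the list above, here is a minimal sketch of Market Basket Analysis restricted to pairs of items. It computes the three standard rule measures (support, confidence and lift) with the Python standard library only. The function name `pair_rules`, the `min_support` threshold and the baskets are hypothetical and chosen purely for illustration; a real project would more likely use a full Apriori or FP-Growth implementation.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.4):
    """Return rules 'lhs -> rhs' between item pairs, sorted by lift."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate, fix the pair ordering
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (x, y), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((x, y), (y, x)):
            # confidence: share of baskets with lhs that also contain rhs
            confidence = count / item_counts[lhs]
            # lift: confidence relative to how common rhs is overall
            lift = confidence / (item_counts[rhs] / n)
            rules.append({"rule": (lhs, rhs), "support": support,
                          "confidence": confidence, "lift": lift})
    return sorted(rules, key=lambda r: r["lift"], reverse=True)


# Hypothetical transactions, for illustration only.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["beer", "chips"],
    ["beer", "chips", "bread"],
]

rules = pair_rules(baskets)
for r in rules:
    lhs, rhs = r["rule"]
    print(f"{lhs} -> {rhs}: support={r['support']:.2f}, "
          f"confidence={r['confidence']:.2f}, lift={r['lift']:.2f}")
```

A lift above 1 means the two items appear together more often than their individual popularity would suggest. Which threshold counts as an interesting rule is, again, a subjective choice left to the analyst.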

1 comment:

  1. Does 'reducing the number of variables' in PCA mean selecting significant variables over non-significant ones?
