Thursday, November 15, 2018

R Vs Python

Why doesn't Cheteshwar Pujara come in one down in a 50-over match? Why doesn't Ashwin bowl flighted deliveries in limited-overs matches? Despite being the 'Hitman' of ODIs, Rohit Sharma seldom makes the cut in a five-day contest. Is the skill set required for a Test match different from that for a one-dayer or, for that matter, a T20? All of this can be answered with a simple phrase: horses for courses. You pick players that fit the bill, and every selection is made according to the requirements. It is that simple, isn't it? The same thinking carries over to the choice of software for ML. This blog explores the scenarios that are more conducive to using Python rather than R, and vice versa.
Before we start, a little background on R and Python in light of Machine Learning (ML) is necessary. Python is a general-purpose programming language. In the wake of recent advances in ML, the Python community has contributed several libraries that enable one to play around with data. R, on the other hand, is a statistical programming language, much like Matlab, developed in the first place to cater to the math community. There is a lot of debate about which language is best and which to prefer for ML. The exasperation is aptly shown in the image below.

The table below highlights the key differences between Python and R with respect to certain commonly used practices in ML. Each cell names the library and/or function used to carry out the task. The colored grid indicates the superiority of one language over the other. In case of a tie, both cells are colored.


| Functionality | Python | R |
|---|---|---|
| Data Slicing and Summary | pandas | dplyr, data.table |
| Visualization | matplotlib | ggplot |
| Data Set Repositories | NA | Econometric Data: AER library |
| Linear Models (Regression family) | scikit-learn | car, glm |
| Hyperparameter Tuning | GridSearchCV | makeLearner |
| Natural Language Processing (NLP) | NLTK, gensim | tidyverse, topicmodels |
| Web Scraping | Beautiful Soup | rvest |
| Interfacing with other Systems (like Outlook) | pywin32 | RDCOMClient |
| Read JSON | json | rjson |
| Pickling | pickle | saveRDS, readRDS |
| Web App (Especially for Proof of Concept) | Django | R Shiny |



Below is an explanation of the contents in the table:

  1. Data Slicing and Summary: Data filtering, sorting, summarization, etc. are required in every ML exercise. In R, one can do this using functions from the dplyr and data.table libraries. The pipe operator (%>%) available with dplyr is especially useful, as it helps both the readability of a cascaded operation and debugging. Python, on the other hand, has pandas, which has no dedicated pipe operator; operations are cascaded by chaining method calls, and long cascades can become hard to manage
  2. Visualization: ggplot and its associated libraries in R help create highly useful plots such as histograms, geographical heat maps, and interactive and animated graphs. Python has the matplotlib library for creating graphs, but it doesn't provide the enhancements that ggplot does
  3. Data Set Repositories: There are a lot of data repositories in R, and users can invoke them from several libraries. Thus one can play around with the data and gain understanding. Useful repositories include the AER library, which has useful census data. Python, on the other hand, has no comparable collection
  4. Linear Models: R and Python both have libraries that help in applying regression models. However, there is one aspect where R stands out as a clear winner: the treatment of categorical variables. N-1 encoding is taken care of automatically in R, whereas in Python it is left to the user
  5. Hyperparameter Tuning: Both languages support finding optimal parameters through hyperparameter tuning. However, in Python one can tune a larger number of parameters than in R. For instance, for a Random Forest algorithm in R, one can only tune the number of trees, nodes and leaf size, whereas in Python one can also tune the sample split parameter. Better-tuned parameters can lead to better accuracy
  6. Natural Language Processing (NLP): Both R and Python have libraries to handle text. A lot of users will vouch for Python here, but having used both, I didn't find any difference between the two
  7. Web Scraping: Python has methods in the Beautiful Soup library to extract any element that has an HTML tag. Things are more clearly and precisely defined in Python. R, however, doesn't offer a one-stop solution for extraction; a lot of libraries with no clear examples leave much to soul searching
  8. Interfacing with other Systems (like Outlook): Since Python is a general-purpose programming language, its system integration is quite mature. One can use Python to communicate between two different systems, such as Outlook and a Python terminal; the protocols that govern such communication are already in place. R, on the other hand, doesn't have well-defined functions to do this
  9. Read JSON: Python takes less time than R to read and process a JSON file. Also, since the text inside a JSON file resembles a dictionary, using Python to read and parse it makes a lot of sense
  10. Pickling: This can be done in both Python and R
  11. Web App (Especially for Proof of Concept): This can be done in both Python and R; however, the time to create an app in R is less.
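
To make a few of the Python-side points above concrete (items 4, 9 and 10), here is a minimal sketch that uses only the standard library. The helper `dummy_encode` and the sample values are made up purely for illustration; in practice one would more likely reach for `pandas.get_dummies(..., drop_first=True)` for the encoding step, which is the work R does on its own when a factor is passed to a regression model.

```python
import json
import pickle


def dummy_encode(values):
    """N-1 (dummy) encoding: one 0/1 column per level, except the first.

    The dropped level acts as the baseline, which avoids perfectly
    collinear indicator columns in a linear model.
    (Hypothetical helper, written here only for illustration.)
    """
    levels = sorted(set(values))
    kept = levels[1:]  # the first level in sorted order becomes the baseline
    return [{f"is_{level}": int(v == level) for level in kept} for v in values]


# Item 4: a three-level categorical variable becomes two indicator columns.
cities = ["Pune", "Delhi", "Mumbai", "Delhi"]  # made-up sample values
encoded = dummy_encode(cities)

# Item 9: a JSON object parses straight into a Python dict.
record = json.loads('{"name": "example", "tags": ["a", "b"]}')

# Item 10: pickling round-trips a Python object through bytes.
restored = pickle.loads(pickle.dumps(record))

print(encoded[0])
print(type(record).__name__)
print(restored == record)
```

Here "Delhi" is the baseline level, so a row of all zeros stands for Delhi. The pickle round trip plays the same role that saveRDS and readRDS play in R.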

Thursday, November 1, 2018

Google's view on ML

What do the following names evoke in you: Sunil Gavaskar, Rod Laver, Alfred Nobel, Mahatma Gandhi, John Maynard Keynes, Bill Gates, Elon Musk? Each of them has been a pioneer in his field. Many aspirants have used their life stories as a template to begin their own. When they speak, the whole world lends an ear. Their prowess is so respected that their opinions become axioms. If we talk about world leaders in technology, it would not be incorrect to say that Google has been at the forefront of Machine Learning for years and has used it to build many of its products, such as keyword search, AdWords analytics and Cloud AI. This blog highlights where ML fits as a cog in the Google scheme of things.

From the horse's mouth

The discussion from here on is based on the paper 'Hidden Technical Debt in Machine Learning Systems', published in 2015 by Google employees. It discusses ML as something that incurs significant 'Technical Debt': the effort, cost and infrastructure that go into keeping an ML system up and running during the monitoring phase are huge. Much to our dislike, it categorically stresses that ML code is a tiny part of the entire ensemble. This can be seen in the diagram below.

As seen in the above diagram, ML (the tiny block at the center) seems to be overwhelmed by the presence of other facilitators. But is this depiction oversimplified? Does it mean that ML is just the ghost in the machine? Let's see how we can make sense of this.


  1. Data Collection, Verification, Feature Extraction and Analysis tools are all part of a Machine Learning setup. It is well documented that the type of data governs, to a certain extent, which ML algorithm will be applied to it. What follows is a group of steps including, but not limited to, data cleaning, verification and feature extraction.
  2. Configuration and Serving Infrastructure both come under the purview of Network Architectures. These activities are typically undertaken once, at the start, when the need to create a system dictates proceedings. The system then has to be monitored to ensure it runs seamlessly. An ML system, however, needs to be monitored daily, with changes incorporated as and when requirements are modified.
  3. Monitoring an ML system requires more effort than monitoring a Network. ML demands continuous review to keep pace with changing dynamics, whereas Network compositions are often static and can be upgraded quite easily. Using a tiny block of ML as a placeholder for all these minute activities is not justified.
  4. If we were to remove the ML block, would the setup even make sense? It would be labelled a collection of fancy things that do nothing. It is like assembling a laptop without a processor.
On a lighter note, I think the paper must have been written by Warehousing and Big Data professionals. There is an unending fight between experts who claim to create something tangible and those they dismiss as superficial.

