Hyperparameter tuning of a Model
Parag Verma
Introduction
In this blog we are going to discuss some very important concepts in Machine Learning and Data Science. These are related to how we can optimally tune our model so that it gives the best results. You must have seen that whenever a batsman gets a new bat, he generally knocks it in so that he can get that perfect stroke! In Machine Learning too, we have to identify the set of optimal values, or the SWEET SPOT, for our model. These are the values at which our model plays that perfect stroke.
Step 1: Install Libraries
We will install the mlr library (for creating the model tuning set-up) and the randomForest library for training a random forest model. The dataset for this blog will be the Default dataset available in the ISLR package.
package.name<-c("dplyr","tidyr","randomForest","mlr","ISLR")
for(i in package.name){
  if(!require(i,character.only = T)){
    install.packages(i)
  }
  library(i,character.only = T)
}
# Default dataset from the ISLR package
data(Default)
df<-Default
head(df)
default student balance income
1 No No 729.5265 44361.625
2 No Yes 817.1804 12106.135
3 No No 1073.5492 31767.139
4 No No 529.2506 35704.494
5 No No 785.6559 38463.496
6 No Yes 919.5885 7491.559
It is a simulated data set containing information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt using a random forest. A brief description of the columns is given below:
- default: A categorical variable with levels No and Yes indicating whether the customer defaulted on their debt
- student: A categorical variable with levels No and Yes indicating whether the customer is a student
- balance: The average balance that the customer has remaining on their credit card after making their monthly payment
- income: Income of the customer
Step 2: Splitting into Train(70%) and Test(30%) Dataset
We will split the dataset into train and test and then:
- Fit the model on the train dataset
- Check its accuracy on the test dataset
set.seed(1)
train.sample<-sample(nrow(df),0.70*nrow(df),replace = F)
train.df<-df[train.sample,]
# Use the remaining 30% of rows as the test set, so train and test do not overlap
test.df<-df[-train.sample,]
Next, we check the proportion of defaults within the train and test datasets to ensure uniformity.
train.df%>%
group_by(default)%>%
summarise(Total_Count=n())%>%
ungroup()%>%
mutate(Perc_Val=Total_Count/sum(Total_Count))
# A tibble: 2 x 3
default Total_Count Perc_Val
<fct> <int> <dbl>
1 No 6769 0.967
2 Yes 231 0.033
The No and Yes distribution in train.df is 97% and 3%. Let's check this within test.df.
test.df%>%
group_by(default)%>%
summarise(Total_Count=n())%>%
ungroup()%>%
mutate(Perc_Val=Total_Count/sum(Total_Count))
# A tibble: 2 x 3
default Total_Count Perc_Val
<fct> <int> <dbl>
1 No 2902 0.967
2 Yes 98 0.0327
In test.df as well, the distribution is roughly the same: 97% and 3%. This check basically ensures that the proportion of Yes and No values within the train and test datasets is the same.
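If the proportions had drifted apart, a stratified split would guarantee matching class balance. A minimal sketch using only base R (sampling 70% within each class separately; `strat.idx` and the other names here are illustrative, not part of the blog's code):

```r
# Stratified 70/30 split: sample within each level of default separately
set.seed(1)
strat.idx <- unlist(lapply(split(seq_len(nrow(df)), df$default),
                           function(idx) sample(idx, floor(0.70 * length(idx)))))
strat.train <- df[strat.idx, ]
strat.test  <- df[-strat.idx, ]

# Class proportions now match the full dataset by construction (up to rounding)
prop.table(table(strat.train$default))
```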
Step 3: Hyperparameter Tuning of a Random Forest
Before we create a model, we first have to tune its parameters to get the best results out of it. You can think of how we tune a radio to get to a particular channel.
There is a library in R by the name mlr which is used to tune the parameters of a Random Forest. There are a couple of standard steps for tuning the parameters of the model. They are:
- Initiate a learner
- Set a grid
- Controlling the number of iterations
- Setting up cross validation
- Create a task
- Tune Parameters
The above process can be better understood using the process flow shown below.
Step 3A: Initiate a classification learner
rf<-makeLearner("classif.randomForest",predict.type = "prob")
If the problem is a regression one, then instead of classif.randomForest the learner will be regr.randomForest, and predict.type will be response.
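As a quick sketch of that regression variant (assuming a numeric target variable; `rf_regr` is just an illustrative name):

```r
# Regression counterpart of the classification learner above
rf_regr <- makeLearner("regr.randomForest", predict.type = "response")
```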
Step 3B: Set grid parameters
The parameters that we will vary for the random forest model are the following:
- ntree: Number of trees in the forest
- mtry: Number of variables randomly sampled as candidates at each split
- nodesize: Minimum size of the terminal nodes of the individual trees
During tuning, each parameter will be varied between its lower and upper bound.
rf_param<-makeParamSet(makeIntegerParam("ntree",lower=10,upper=15),
makeIntegerParam("mtry",lower=2,upper=3),
makeIntegerParam("nodesize",lower=10,upper=15))
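These three are not the only options. To see the full list of parameters that can be tuned for this learner, along with their types, bounds, and defaults, mlr provides getParamSet:

```r
# Lists every tunable hyperparameter of the random forest learner
getParamSet(rf)
```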
Step 3C: Controlling the number of iterations
rancontrol<-makeTuneControlRandom(maxit=5)
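makeTuneControlRandom performs a random search, drawing maxit random combinations from the grid above. An alternative, not used in this blog, is an exhaustive grid search, which for the ranges above would evaluate every combination (roughly 6 x 2 x 6 = 72):

```r
# Exhaustive alternative to random search (sketch)
gridcontrol <- makeTuneControlGrid()
```

Random search is usually the cheaper choice when the grid is large.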
Step 3D: Setting up Cross Validation (CV)
For 3-fold cross validation (note that the argument name is iters):
set_cv<-makeResampleDesc("CV",iters=3L)
Step 3E: Create a task
credit_risk_task<-makeClassifTask(data=train.df,target="default")
Step 3F: Tune Parameters
rf_tune<-tuneParams(learner=rf,resampling = set_cv,task=credit_risk_task,
                    par.set=rf_param,control=rancontrol,
                    measures=list(mmce))
rf_tune
Tune result:
Op. pars: ntree=15; mtry=2; nodesize=12
mmce.test.mean=0.0275712
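Rather than re-typing the optimal values by hand, they can be pulled out of the tune result and plugged back into the learner, a sketch using mlr's setHyperPars (`best.pars` and `tuned.rf` are illustrative names):

```r
# rf_tune$x is the named list of optimal values found by the search
best.pars <- rf_tune$x
# A copy of the learner carrying those optimal values
tuned.rf  <- setHyperPars(rf, par.vals = best.pars)
```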
Step 4: Creating a Random Forest Model with the above parameters
set.seed(120) # Setting seed
default_RF = randomForest(default ~ ., data=train.df,importance=T,
ntree = 14,
mtry=3,
nodesize=11)
default_RF
Call:
randomForest(formula = default ~ ., data = train.df, importance = T, ntree = 14, mtry = 3, nodesize = 11)
Type of random forest: classification
Number of trees: 14
No. of variables tried at each split: 3
OOB estimate of error rate: 3.12%
Confusion matrix:
No Yes class.error
No 6694 73 0.01078765
Yes 145 86 0.62770563
Step 5: Calculating Accuracy on Train Data
train.df[["default_train"]]<-predict(default_RF,train.df)
head(train.df)
default student balance income default_train
1017 No No 939.0985 45519.02 No
8004 No Yes 397.5425 22710.87 No
4775 Yes No 1511.6110 53506.94 No
9725 No No 301.3194 51539.95 No
8462 No No 878.4461 29561.78 No
4050 Yes No 1673.4863 49310.33 No
xtab <- table(train.df[["default_train"]], train.df[["default"]])
xtab
No Yes
No 6748 97
Yes 21 134
(xtab[1,1] + xtab[2,2])/sum(xtab)
[1] 0.9831429
We can see that Accuracy in the train data is around 98%.
Step 6: Calculating Accuracy on Test Data
test.df[["default_test"]]<-predict(default_RF,test.df)
head(test.df)
default student balance income default_test
6033 Yes Yes 2086.5362 17893.72 Yes
8482 No Yes 947.8510 22047.92 No
8379 No No 1281.4488 48837.38 No
975 Yes No 1753.0844 48965.35 No
807 No No 418.5399 55002.73 No
1711 No No 954.5982 50139.09 No
xtab <- table(test.df[["default_test"]], test.df[["default"]])
xtab
No Yes
No 2893 53
Yes 9 45
(xtab[1,1] + xtab[2,2])/sum(xtab)
[1] 0.9793333
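Since only about 3% of customers default, overall accuracy can look flattering on its own. The per-class rates can be read off the same confusion table (a sketch; rows of xtab are predictions and columns are actuals, as constructed above):

```r
# Share of actual defaulters the model catches (recall on the Yes class)
sensitivity <- xtab["Yes","Yes"] / sum(xtab[,"Yes"])
# Share of non-defaulters correctly cleared
specificity <- xtab["No","No"] / sum(xtab[,"No"])
c(sensitivity = sensitivity, specificity = specificity)
```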
We can see that Accuracy in the test data is around 98%.
The accuracies on the train and test datasets are quite close to each other. If you look at the accuracy calculation using Logistic Regression, which came out at 88%, we can certainly see that hyperparameter tuning and the use of a Random Forest have improved the accuracy.
In the next blog, we will look at the concept of Bagging, Boosting and how random forest compensates for the shortcomings of a Decision Tree.
Link to Youtube Channel
https://www.youtube.com/playlist?list=PL6ruVo_0cYpV2Otl1jR9iDe6jt1jLXgAq