Hyperparameter tuning of a Model
Parag Verma
Introduction
In this blog we are going to discuss some very important concepts in Machine Learning and Data Science. These are related to how we can optimally tune our model so that it gives the best results. You must have seen that whenever a batsman gets a new bat, he generally knocks it in so that he can get that perfect stroke! In Machine Learning too, we have to identify the set of optimal values, or the SWEET SPOT, for our model. These are the values at which our model plays that perfect stroke.
Step 1: Install Libraries
We will install the mlr library (for creating the model tuning set-up) and the randomForest library for training a random forest model. The dataset for this blog will be the Default dataset available in the ISLR package.
package.name<-c("dplyr","tidyr","randomForest","mlr","ISLR")
for(i in package.name){
  if(!require(i,character.only = T)){
    install.packages(i)
  }
  library(i,character.only = T)
}
# Default dataset from the ISLR package
data(Default)
df<-Default
head(df)
default student balance income
1 No No 729.5265 44361.625
2 No Yes 817.1804 12106.135
3 No No 1073.5492 31767.139
4 No No 529.2506 35704.494
5 No No 785.6559 38463.496
6 No Yes 919.5885 7491.559
It is a simulated data set containing information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt using a random forest. A brief description of the columns is given below:
- default: A categorical variable with levels No and Yes indicating whether the customer defaulted on their debt
- student: A categorical variable with levels No and Yes indicating whether the customer is a student
- balance: The average balance that the customer has remaining on their credit card after making their monthly payment
- income: Income of the customer
Step 2: Splitting into Train(70%) and Test(30%) Dataset
We will split the dataset into train and test and then:
- Fit the model on the train dataset
- Check its accuracy on the test dataset
set.seed(1)
train.sample<-sample(nrow(df),0.70*nrow(df),replace = F)
train.df<-df[train.sample,]
# Use the remaining 30% of rows as the test set, so train and test do not overlap
test.df<-df[-train.sample,]
Next, we check the proportion of defaults within the train and test datasets to ensure uniformity.
train.df%>%
group_by(default)%>%
summarise(Total_Count=n())%>%
ungroup()%>%
mutate(Perc_Val=Total_Count/sum(Total_Count))
# A tibble: 2 x 3
default Total_Count Perc_Val
<fct> <int> <dbl>
1 No 6769 0.967
2 Yes 231 0.033
The No and Yes distribution in train.df is 97% and 3%. Let's check this within test.df.
test.df%>%
group_by(default)%>%
summarise(Total_Count=n())%>%
ungroup()%>%
mutate(Perc_Val=Total_Count/sum(Total_Count))
# A tibble: 2 x 3
default Total_Count Perc_Val
<fct> <int> <dbl>
1 No 2902 0.967
2 Yes 98 0.0327
In test.df as well, the distribution is roughly the same: 97% and 3%. This check basically ensures that the proportion of Yes and No values within the train and test datasets is the same.
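If the proportions had drifted apart, a stratified split would guarantee matching class balance. A minimal sketch using only base R (sampling 70% within each class separately; `strat.idx` and the other names here are illustrative, not part of the blog's code):

```r
# Stratified 70/30 split: sample within each level of default separately
set.seed(1)
strat.idx <- unlist(lapply(split(seq_len(nrow(df)), df$default),
                           function(idx) sample(idx, floor(0.70 * length(idx)))))
strat.train <- df[strat.idx, ]
strat.test  <- df[-strat.idx, ]

# Class proportions now match the full dataset by construction (up to rounding)
prop.table(table(strat.train$default))
```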
Step 3: Hyperparameter Tuning of a Random Forest
Before we create a model, we first have to tune its parameters to get the best results out of it. You can think of how we tune a radio to get to a particular channel.
There is a library in R by the name mlr which is used to tune the parameters of a Random Forest. There are a couple of standard steps for tuning the parameters of the model. They are:
- Initiate a learner
- Set a grid
- Controlling the number of iterations
- Setting up cross validation
- Create a task
- Tune Parameters
The above process can be better understood using the process flow shown below.
Step 3A: Initiate a classification learner
rf<-makeLearner("classif.randomForest",predict.type = "prob")
If the problem is a regression one, then instead of classif.randomForest the learner will be regr.randomForest, and predict.type will be response.
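As a quick sketch of that regression variant (assuming a numeric target variable; `rf_regr` is just an illustrative name):

```r
# Regression counterpart of the classification learner above
rf_regr <- makeLearner("regr.randomForest", predict.type = "response")
```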
Step 3B: Set grid parameters
The parameters that we will vary for the random forest model are the following:
- ntree: Number of trees in the forest
- mtry: Number of variables randomly sampled as candidates at each split
- nodesize: Minimum size of the terminal nodes of the individual trees
During tuning, each parameter will be varied between its lower and upper bound.
rf_param<-makeParamSet(makeIntegerParam("ntree",lower=10,upper=15),
makeIntegerParam("mtry",lower=2,upper=3),
makeIntegerParam("nodesize",lower=10,upper=15))
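These three are not the only options. To see the full list of parameters that can be tuned for this learner, along with their types, bounds, and defaults, mlr provides getParamSet:

```r
# Lists every tunable hyperparameter of the random forest learner
getParamSet(rf)
```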
Step 3C: Controlling the number of iterations
rancontrol<-makeTuneControlRandom(maxit=5)
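makeTuneControlRandom performs a random search, drawing maxit random combinations from the grid above. An alternative, not used in this blog, is an exhaustive grid search, which for the ranges above would evaluate every combination (roughly 6 x 2 x 6 = 72):

```r
# Exhaustive alternative to random search (sketch)
gridcontrol <- makeTuneControlGrid()
```

Random search is usually the cheaper choice when the grid is large.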
Step 3D: Setting up Cross Validation (CV)
For 3-fold cross validation (note that the argument name is iters):
set_cv<-makeResampleDesc("CV",iters=3L)
Step 3E: Create a task
credit_risk_task<-makeClassifTask(data=train.df,target="default")
Step 3F: Tune Parameters
rf_tune<-tuneParams(learner=rf,resampling = set_cv,task=credit_risk_task,
                    par.set=rf_param,control=rancontrol,
                    measures=list(mmce))
rf_tune
Tune result:
Op. pars: ntree=15; mtry=2; nodesize=12
mmce.test.mean=0.0275712
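Rather than re-typing the optimal values by hand, they can be pulled out of the tune result and plugged back into the learner, a sketch using mlr's setHyperPars (`best.pars` and `tuned.rf` are illustrative names):

```r
# rf_tune$x is the named list of optimal values found by the search
best.pars <- rf_tune$x
# A copy of the learner carrying those optimal values
tuned.rf  <- setHyperPars(rf, par.vals = best.pars)
```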
Step 4: Creating a Random Forest Model with the above parameters
set.seed(120) # Setting seed
default_RF = randomForest(default ~ ., data=train.df,importance=T,
ntree = 14,
mtry=3,
nodesize=11)
default_RF
Call:
randomForest(formula = default ~ ., data = train.df, importance = T, ntree = 14, mtry = 3, nodesize = 11)
Type of random forest: classification
Number of trees: 14
No. of variables tried at each split: 3
OOB estimate of error rate: 3.12%
Confusion matrix:
No Yes class.error
No 6694 73 0.01078765
Yes 145 86 0.62770563
Step 5: Calculating Accuracy on Train Data
train.df[["default_train"]]<-predict(default_RF,train.df)
head(train.df)
default student balance income default_train
1017 No No 939.0985 45519.02 No
8004 No Yes 397.5425 22710.87 No
4775 Yes No 1511.6110 53506.94 No
9725 No No 301.3194 51539.95 No
8462 No No 878.4461 29561.78 No
4050 Yes No 1673.4863 49310.33 No
xtab <- table(train.df[["default_train"]], train.df[["default"]])
xtab
No Yes
No 6748 97
Yes 21 134
(xtab[1,1] + xtab[2,2])/sum(xtab)
[1] 0.9831429
We can see that Accuracy in the train data is around 98%.
Step 6: Calculating Accuracy on Test Data
test.df[["default_test"]]<-predict(default_RF,test.df)
head(test.df)
default student balance income default_test
6033 Yes Yes 2086.5362 17893.72 Yes
8482 No Yes 947.8510 22047.92 No
8379 No No 1281.4488 48837.38 No
975 Yes No 1753.0844 48965.35 No
807 No No 418.5399 55002.73 No
1711 No No 954.5982 50139.09 No
xtab <- table(test.df[["default_test"]], test.df[["default"]])
xtab
No Yes
No 2893 53
Yes 9 45
(xtab[1,1] + xtab[2,2])/sum(xtab)
[1] 0.9793333
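Since only about 3% of customers default, overall accuracy can look flattering on its own. The per-class rates can be read off the same confusion table (a sketch; rows of xtab are predictions and columns are actuals, as constructed above):

```r
# Share of actual defaulters the model catches (recall on the Yes class)
sensitivity <- xtab["Yes","Yes"] / sum(xtab[,"Yes"])
# Share of non-defaulters correctly cleared
specificity <- xtab["No","No"] / sum(xtab[,"No"])
c(sensitivity = sensitivity, specificity = specificity)
```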
We can see that Accuracy in the test data is around 98%.
The accuracies on the train and test datasets are quite close to each other. If you look at the accuracy calculation using Logistic Regression, which came out at 88%, we can certainly see that hyperparameter tuning and the use of a Random Forest have improved the accuracy.
In the next blog, we will look at the concept of Bagging, Boosting and how random forest compensates for the shortcomings of a Decision Tree.
Link to Youtube Channel
https://www.youtube.com/playlist?list=PL6ruVo_0cYpV2Otl1jR9iDe6jt1jLXgAq