Machine Learning Made Easy

Credit Risk Modelling using Decision Tree

Introduction

In this blog we discuss the application of a decision tree in the credit risk domain. We will identify the rules that differentiate a defaulter from a non-defaulter. The rpart library will be used to create the decision tree. The aim is also to see how the different parameters of a decision tree can be set up properly, using this case study.

Step 1: Read Data

The Default dataset is part of the ISLR package. The rpart package is imported to create the decision tree, and rpart.plot to draw it. Let's import the dataset and look at the first few records.

package.name <- c("dplyr", "tidyr", "ISLR", "rpart", "rpart.plot")

# Install any package that is missing, then attach it
for (i in package.name) {
  if (!require(i, character.only = TRUE)) {
    install.packages(i)
  }
  library(i, character.only = TRUE)
}


# ISLR package has the credit Default dataset
data(Default)
df<-Default
head(df)

  default student   balance    income
1      No      No  729.5265 44361.625
2      No     Yes  817.1804 12106.135
3      No      No 1073.5492 31767.139
4      No      No  529.2506 35704.494
5      No      No  785.6559 38463.496
6      No     Yes  919.5885  7491.559

It is a simulated data set containing information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt using a decision tree. A brief description of the columns is given below:

default: A categorical variable with levels No and Yes indicating whether the customer defaulted on their debt
student: A categorical variable with levels No and Yes indicating whether the customer is a student
balance: The average balance that the customer has remaining on their credit card after making their monthly payment
income: Income of the customer

Step 2: Splitting into Train(70%) and Test(30%) Dataset

We will split the dataset into train and test and then:

Fit model on train dataset
Check its accuracy on test dataset

set.seed(1)

train.sample<-sample(nrow(df),0.70*nrow(df),replace = F)
train.df<-df[train.sample,]

# Note: drawn independently of train.sample, so some test rows also appear in train.df
test.sample<-sample(nrow(df),0.30*nrow(df),replace = F)
test.df<-df[test.sample,]
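A hold-out set is normally kept disjoint from the training rows, so that test accuracy is measured only on customers the model has never seen. A minimal sketch of that variant is given below, assuming the same seed and 70/30 proportions; negative indexing is a swapped-in technique, not the split used for the outputs in this post, so the counts and accuracies reported here would differ slightly under it.

```r
library(ISLR)

data(Default)
df <- Default

set.seed(1)
train.sample <- sample(nrow(df), 0.70 * nrow(df), replace = FALSE)
train.df <- df[train.sample, ]

# Negative indexing keeps every row that was NOT selected for training
test.df <- df[-train.sample, ]

# The two sets share no rows and together cover the full dataset
length(intersect(rownames(train.df), rownames(test.df)))
nrow(train.df) + nrow(test.df) == nrow(df)
```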

Check the proportion of defaults in the train and test datasets to confirm that both have a similar class balance

train.df%>%
  group_by(default)%>%
  summarise(Total_Count=n())%>%
  ungroup()%>%
  mutate(Perc_Val=Total_Count/sum(Total_Count))

# A tibble: 2 x 3
  default Total_Count Perc_Val
  <fct>         <int>    <dbl>
1 No             6769    0.967
2 Yes             231    0.033

The No and Yes distribution in train.df is 97% and 3%. Let's check this within test.df.

test.df%>%
  group_by(default)%>%
  summarise(Total_Count=n())%>%
  ungroup()%>%
  mutate(Perc_Val=Total_Count/sum(Total_Count))

# A tibble: 2 x 3
  default Total_Count Perc_Val
  <fct>         <int>    <dbl>
1 No             2902   0.967 
2 Yes              98   0.0327

In test.df as well, the distribution is 97% and 3%. This check ensures that the proportion of Yes and No values is the same in the train and test datasets.

Step 3: Training a Decision Tree

Let's create a decision tree and discuss the parameters that are used for setting it up

m1 <- rpart(default ~ ., data = train.df,method = "class")
rpart.plot(m1, type = 3)

Some key points to note:

The formula defines how the dependent variable (default in our case) is related to the rest of the variables in the dataset
The method argument mostly takes two values: "anova" for regression and "class" for classification
The control argument, built with rpart.control(), determines how the branching takes place and how the tree is formed. As an example, rpart.control(minsplit = 30, cp = 0.001) indicates that a node must contain at least 30 observations for a split to be attempted, and that a split must decrease the overall lack of fit (that is, improve the fit) by a factor of at least 0.001. The cp parameter helps to skip splits that are not worthwhile and thus saves computation time. We will discuss this in more detail in the later sections of the blog

Let's now look at the rules generated from the decision tree model

rpart.rules(m1,cover = TRUE)

 default                                                  cover
    0.02 when balance <  1789                               97%
    0.28 when balance is 1789 to 1974 & income <  33212      1%
    0.63 when balance is 1789 to 1974 & income >= 33212      1%
    0.72 when balance >=         1974                        1%

As shown above, there are 4 rules with the following features:

default indicates the probability of default obtained by following the rules mentioned in the branch
cover indicates the total proportion of records present within the branch

Step 4: Checking Accuracy on Train data

predicted_matrix<-predict(m1,train.df)
default_predicted<-colnames(predicted_matrix)[apply(predicted_matrix,1,which.max)]
train.df[["default_predicted"]]<-default_predicted
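The matrix-plus-which.max step above can usually be shortened, because predict() on an rpart classification tree accepts type = "class" and then returns the predicted label directly. A minimal sketch follows; for brevity it fits on the full Default data rather than on train.df, and the object names fit and predicted_class are illustrative only.

```r
library(ISLR)
library(rpart)

data(Default)
fit <- rpart(default ~ ., data = Default, method = "class")

# type = "class" returns the predicted label as a factor,
# instead of the matrix of class probabilities
predicted_class <- predict(fit, Default, type = "class")
head(predicted_class)

# Confusion matrix: rows = predicted, columns = actual
table(predicted = predicted_class, actual = Default$default)
```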

Getting the Accuracy metric

xtab <- as.matrix(table(train.df[["default_predicted"]], train.df[["default"]]))
xtab

     
        No  Yes
  No  6728  137
  Yes   41   94

accuracy<-(xtab[1,1] + xtab[2,2])/sum(xtab)
accuracy

[1] 0.9745714

So on the train dataset, we have an accuracy of 97%. Let's check it on the test dataset

Step 5: Checking Accuracy on Test data

predicted_matrix_test<-predict(m1,test.df)
default_predicted_test<-colnames(predicted_matrix_test)[apply(predicted_matrix_test,1,which.max)]
test.df[["default_predicted_test"]]<-default_predicted_test

Getting the Accuracy metric

xtab <- as.matrix(table(test.df[["default_predicted_test"]], test.df[["default"]]))
xtab

     
        No  Yes
  No  2890   60
  Yes   12   38

accuracy<-(xtab[1,1] + xtab[2,2])/sum(xtab)
accuracy

[1] 0.976

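So on the test dataset we also get an accuracy of about 97%, in line with the train dataset. Bear in mind that about 97% of customers are non-defaulters, so accuracy on its own says little about how well the defaulters are identified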
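With such an imbalanced target it can help to look beyond accuracy. The sketch below re-enters the test confusion matrix printed above (rows are predicted classes, columns are actual classes) and derives recall, precision and the accuracy of always predicting No. It uses base R only, and the variable names are illustrative rather than part of the original analysis.

```r
# Test-set confusion matrix copied from the output above
# (rows = predicted class, columns = actual class)
xtab <- matrix(c(2890, 12, 60, 38), nrow = 2,
               dimnames = list(predicted = c("No", "Yes"),
                               actual    = c("No", "Yes")))

accuracy  <- sum(diag(xtab)) / sum(xtab)                 # 0.976
recall    <- xtab["Yes", "Yes"] / sum(xtab[, "Yes"])     # 38 / 98, about 0.388
precision <- xtab["Yes", "Yes"] / sum(xtab["Yes", ])     # 38 / 50 = 0.76

# Accuracy obtained by predicting "No" for every customer
baseline  <- sum(xtab[, "No"]) / sum(xtab)               # 2902 / 3000, about 0.967

round(c(accuracy = accuracy, recall = recall,
        precision = precision, baseline = baseline), 3)
```

The tree beats the all-No baseline by less than one percentage point of accuracy, while recall shows that it flags 38 of the 98 actual defaulters in the test set.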

Step 6: Parameter description in rpart.control

Now let's discuss one of the most important arguments of the decision tree model, rpart.control. Its parameters and their usage are briefly described below:

minsplit: The minimum number of observations that must be present in a node before a split is attempted (default 20)
cp: The complexity parameter is used to discard splits that do not improve the model enough. If cp is set to 0.01, a split is kept only if it decreases the overall lack of fit by a factor of at least 0.01. For method = "class" the lack of fit is measured by the misclassification error relative to the root node. For method = "anova" (regression), each split must increase the overall R-squared by at least 0.01
xval: The number of cross-validations (default 10). Smaller values such as 3 or 5 are also commonly used
maxdepth: The maximum depth of any node of the tree, with the root node counted as depth 0 (default 30). If its value is 4, the depth of the decision tree will not be greater than 4

There are other parameters as well in rpart.control but the ones that are important in terms of its impact on the structure and fit have been described above.
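Since rpart.control() simply returns a list of settings, calling it directly is a quick way to see the defaults and to confirm what a custom call actually sets. A small sketch (the names defaults and custom are illustrative):

```r
library(rpart)

# Calling rpart.control() with no arguments returns the package defaults
defaults <- rpart.control()
print(defaults[c("minsplit", "cp", "xval", "maxdepth")])

# Override only the parameters of interest; the rest keep their defaults.
# minbucket is not given here, so it is derived as round(minsplit / 3)
custom <- rpart.control(minsplit = 40, cp = 0.05, xval = 5, maxdepth = 5)
print(custom[c("minsplit", "minbucket", "cp", "xval", "maxdepth")])
```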

Now let's see if changing the values within rpart.control improves the accuracy of the model

Step 7: Improvement in fit/change in DT by varying parameters in rpart.control

train.df$default_predicted<-NULL
m2 <- rpart(default ~ ., data = train.df,method = "class",
            control  = rpart.control(minsplit = 40, cp = 0.05, xval=5, maxdepth = 5))
rpart.plot(m2, type = 3)

predicted_matrix<-predict(m2,train.df)
default_predicted<-colnames(predicted_matrix)[apply(predicted_matrix,1,which.max)]
train.df[["default_predicted"]]<-default_predicted

Getting the Accuracy metric

xtab <- as.matrix(table(train.df[["default_predicted"]], train.df[["default"]]))
xtab

     
        No  Yes
  No  6743  163
  Yes   26   68

accuracy<-(xtab[1,1] + xtab[2,2])/sum(xtab)
accuracy

[1] 0.973

Salient features of the above model:

It is less complicated and uses fewer variables for prediction
Drawback: It doesn't use income, which might be a strong predictor
The train accuracy of the two models is nearly the same (97.3% for m2 against 97.5% for m1), so if the objective is to create a simple yet powerful system we should go for m2, otherwise m1
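The comparison above is made on the train dataset. To compare the two trees on unseen data as well, a small helper can be reused for both models. The sketch below is one way to do it; accuracy_of is a hypothetical helper, and the split uses disjoint hold-out rows, which is an assumption and not the split behind the outputs in this post.

```r
library(ISLR)
library(rpart)

data(Default)
set.seed(1)
train_idx <- sample(nrow(Default), 0.70 * nrow(Default))
train_df  <- Default[train_idx, ]
test_df   <- Default[-train_idx, ]   # assumption: disjoint hold-out rows

# Hypothetical helper: share of rows whose predicted class matches the actual one
accuracy_of <- function(model, data) {
  predicted <- predict(model, data, type = "class")
  mean(predicted == data$default)
}

m1 <- rpart(default ~ ., data = train_df, method = "class")
m2 <- rpart(default ~ ., data = train_df, method = "class",
            control = rpart.control(minsplit = 40, cp = 0.05,
                                    xval = 5, maxdepth = 5))

comparison <- c(m1 = accuracy_of(m1, test_df), m2 = accuracy_of(m2, test_df))
print(comparison)
```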

Step 8: Pruning of trees

Pruning (trimming) of a decision tree is done to avoid overfitting. Overfitting happens when the algorithm keeps fitting the training dataset ever more closely, picking up noise as well as signal. This improves accuracy on the train dataset, but the model then performs badly (in terms of accuracy) on the test dataset.

Pruning reduces overfitting by cutting the tree back before it classifies the data down to the record level. In the extreme case, if the number of terminal leaves equals the number of records, the model has clearly been overfitted.

Overfitting can be reduced in the following ways:

Pruning the tree
Removing a variable that is actually an identifier/key and has been mistakenly added to the model
Removing a variable that has a large number of unique values

Let's see how we can prune the decision tree based on our dataset

# Prune m1 at the CP value with the lowest cross-validated error (xerror)
pfit <- prune(m1, cp = m1$cptable[which.min(m1$cptable[, "xerror"]), "CP"])

# plot the pruned tree
rpart.plot(pfit, type = 3)

We can see that the pruned tree is similar to the one obtained through model m2(using rpart.control argument)
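The cp value passed to prune() comes from the model's cptable, which lists one row per candidate tree size. A minimal sketch of how that table can be inspected and how the effect of pruning can be checked is shown below; it refits on the full Default data for brevity, and n_leaves is a hypothetical helper.

```r
library(ISLR)
library(rpart)

data(Default)
set.seed(1)   # the cross-validated errors depend on the random folds
fit <- rpart(default ~ ., data = Default, method = "class")

# One row per candidate tree size: CP, number of splits, relative error,
# cross-validated error (xerror) and its standard deviation (xstd)
printcp(fit)

# Pick the CP with the lowest cross-validated error, then prune to it
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# Hypothetical helper: count the terminal nodes of a tree
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
c(full = n_leaves(fit), pruned = n_leaves(pruned))
```

If the lowest xerror already belongs to the largest tree, pruning leaves the tree unchanged, which is a sign that the default cp has not allowed it to overgrow.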

Final Comments

In this blog we saw a practical application of a decision tree in the credit risk domain. We learnt about its parameters, overfitting, pruning and so on. In the coming blogs we will explore further and discuss the drawbacks of decision trees and how random forests can be used to overcome them

Link to Previous R Blogs

R complete Guide

All other Blog

Link to Youtube Channel

https://www.youtube.com/playlist?list=PL6ruVo_0cYpV2Otl1jR9iDe6jt1jLXgAq

Friday, May 7, 2021

Blog 41: Credit Risk Modelling using Decision Tree

Parag Verma
