Friday, May 7, 2021

Blog 41: Credit Risk Modelling using Decision Tree


Introduction

In this blog we discuss the application of decision trees in the credit risk domain. We will identify the rules that differentiate a defaulter from a non-defaulter. The rpart library will be used to create the decision tree. Using this case study, we will also look at how the different parameters of a decision tree can be set up properly.


Step 1: Read Data

The Default dataset is part of the ISLR package. The rpart package is imported to create the decision tree. Let's import the dataset and look at the first few records.

package.name <- c("dplyr", "tidyr", "ISLR", "rpart", "rpart.plot")

# Install any package that is missing, then attach it
for (i in package.name) {
  if (!require(i, character.only = TRUE)) {
    install.packages(i)
  }
  library(i, character.only = TRUE)
}


# ISLR package has the credit Default dataset
data(Default)
df<-Default
head(df)
  default student   balance    income
1      No      No  729.5265 44361.625
2      No     Yes  817.1804 12106.135
3      No      No 1073.5492 31767.139
4      No      No  529.2506 35704.494
5      No      No  785.6559 38463.496
6      No     Yes  919.5885  7491.559

It is a simulated data set containing information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt using a decision tree. A brief description of the columns is given below:

  • default: A categorical variable with levels No and Yes indicating whether the customer defaulted on their debt
  • student: A categorical variable with levels No and Yes indicating whether the customer is a student
  • balance: The average balance that the customer has remaining on their credit card after making their monthly payment
  • income: Income of the customer


Step 2: Splitting into Train(70%) and Test(30%) Dataset

We will split the dataset into train and test and then:

  • Fit model on train dataset
  • Check its accuracy on test dataset


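set.seed(1)

train.sample<-sample(nrow(df),0.70*nrow(df),replace = F)
train.df<-df[train.sample,]

# Note: the test rows are drawn independently of train.sample, so some rows
# can appear in both sets; df[-train.sample,] would give a disjoint test set
test.sample<-sample(nrow(df),0.30*nrow(df),replace = F)
test.df<-df[test.sample,]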
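
For a hold-out set that shares no rows with the training set, the test rows can be taken as everything that was not sampled for training. The following is a minimal sketch of that idea, not the split used for the results in this blog. The toy data frame toy.df and the names train.idx, toy.train and toy.test are made up for illustration.

```r
# Toy data standing in for df (hypothetical; not the Default dataset)
set.seed(1)
toy.df <- data.frame(id = 1:100, balance = runif(100, 0, 2500))

# Draw the training row indices once
train.idx <- sample(nrow(toy.df), round(0.70 * nrow(toy.df)), replace = FALSE)

# Training rows are the sampled indices; test rows are all remaining rows
toy.train <- toy.df[train.idx, ]
toy.test  <- toy.df[-train.idx, ]

# No row appears in both sets, and together they cover the whole data frame
length(intersect(toy.train$id, toy.test$id))
nrow(toy.train) + nrow(toy.test) == nrow(toy.df)
```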


Checking the proportion of defaults within train and test datasets to check for uniformity

train.df%>%
  group_by(default)%>%
  summarise(Total_Count=n())%>%
  ungroup()%>%
  mutate(Perc_Val=Total_Count/sum(Total_Count))
# A tibble: 2 x 3
  default Total_Count Perc_Val
  <fct>         <int>    <dbl>
1 No             6769    0.967
2 Yes             231    0.033


The No and Yes distribution in train.df is 97% and 3%. Let's check this within test.df.

test.df%>%
  group_by(default)%>%
  summarise(Total_Count=n())%>%
  ungroup()%>%
  mutate(Perc_Val=Total_Count/sum(Total_Count))
# A tibble: 2 x 3
  default Total_Count Perc_Val
  <fct>         <int>    <dbl>
1 No             2902   0.967 
2 Yes              98   0.0327

In test.df as well, the distribution is 97% and 3%. This check ensures that the proportion of Yes and No values is the same within the train and test datasets.
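
With a rare class, a simple random split may not always keep the proportions this close. One option is to sample within each class (a stratified split). The sketch below assumes a made-up data frame toy.df with 3% Yes values; it is an illustration in base R, not the method used above.

```r
# Toy data with roughly the same class balance as Default (hypothetical)
set.seed(1)
toy.df <- data.frame(
  id = 1:1000,
  default = factor(rep(c("No", "Yes"), times = c(970, 30)))
)

# Row indices grouped by class
idx_by_class <- split(seq_len(nrow(toy.df)), toy.df$default)

# Sample 70% of the row indices within each class
train.idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), round(0.70 * length(idx)))]
}), use.names = FALSE)

toy.train <- toy.df[train.idx, ]
toy.test  <- toy.df[-train.idx, ]

# Class proportions in each set
prop.table(table(toy.train$default))
prop.table(table(toy.test$default))
```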


Step 3: Training a Decision Tree

Let's create a decision tree and discuss the parameters that are used for setting it up.

m1 <- rpart(default ~ ., data = train.df,method = "class")
rpart.plot(m1, type = 3)

Some key points to note:

  • The formula defines how the dependent variable (default in our case) is related to the rest of the variables in the dataset

  • The method argument mostly takes two values: anova for regression and class for classification

  • The control argument, set through rpart.control(), determines how the branching will take place and how the tree will be formed. As an example, rpart.control(minsplit=30, cp=0.001) indicates that a node must contain at least 30 observations for a split to be attempted, and that a split must decrease the overall lack of fit (i.e. improve the fit) by a factor of 0.001. The cp parameter helps in skipping splits that are not worthwhile and thus saves computation time. We will discuss this in more detail in the later sections of the blog

Let's now look at the rules generated from the decision tree model.

rpart.rules(m1,cover = TRUE)
 default                                                  cover
    0.02 when balance <  1789                               97%
    0.28 when balance is 1789 to 1974 & income <  33212      1%
    0.63 when balance is 1789 to 1974 & income >= 33212      1%
    0.72 when balance >=         1974                        1%

As shown above, there are 4 rules with the following features:

  • default indicates the probability of default obtained by following the rules mentioned in the branch
  • cover indicates the total proportion of records present within the branch
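
To see how the rules are read, they can be written out by hand as a small function. This is only a sketch: rule_probability is a hypothetical helper, the thresholds are the rounded values printed by rpart.rules, and "balance is 1789 to 1974" is assumed to mean at least 1789 and below 1974.

```r
# Hypothetical helper that hard-codes the four rules printed above
rule_probability <- function(balance, income) {
  if (balance < 1789) {
    return(0.02)
  }
  if (balance >= 1974) {
    return(0.72)
  }
  # Remaining case: balance is between 1789 and 1974
  if (income < 33212) 0.28 else 0.63
}

# First record from head(df): balance 729.5265, income 44361.625
rule_probability(729.5265, 44361.625)   # 0.02

# A made-up customer with a high balance
rule_probability(2100, 30000)           # 0.72
```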


Step 4: Checking Accuracy on Train data

predicted_matrix<-predict(m1,train.df)
default_predicted<-colnames(predicted_matrix)[apply(predicted_matrix,1,which.max)]
train.df[["default_predicted"]]<-default_predicted
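
As an aside, predict() for rpart models also accepts type = "class", which returns the predicted labels directly instead of a probability matrix. The sketch below uses the kyphosis dataset that ships with rpart, purely to keep the example self-contained; fit, prob_matrix and the other names are illustrative.

```r
library(rpart)

# kyphosis ships with rpart; it stands in for train.df here
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Default for classification trees: a matrix of class probabilities
prob_matrix <- predict(fit, kyphosis)
labels_from_matrix <- colnames(prob_matrix)[apply(prob_matrix, 1, which.max)]

# type = "class" returns the predicted labels as a factor
labels_direct <- predict(fit, kyphosis, type = "class")

# Compare the two routes
table(labels_from_matrix, labels_direct)
```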


Getting the Accuracy metric

xtab <- as.matrix(table(train.df[["default_predicted"]], train.df[["default"]]))
xtab
     
        No  Yes
  No  6728  137
  Yes   41   94
accuracy<-(xtab[1,1] + xtab[2,2])/sum(xtab)
accuracy
[1] 0.9745714

So on the train dataset, we have an accuracy of 97%. Let's check it on the test dataset.


Step 5: Checking Accuracy on Test data

predicted_matrix_test<-predict(m1,test.df)
default_predicted_test<-colnames(predicted_matrix_test)[apply(predicted_matrix_test,1,which.max)]
test.df[["default_predicted_test"]]<-default_predicted_test

Getting the Accuracy metric

xtab <- as.matrix(table(test.df[["default_predicted_test"]], test.df[["default"]]))
xtab
     
        No  Yes
  No  2890   60
  Yes   12   38
accuracy<-(xtab[1,1] + xtab[2,2])/sum(xtab)
accuracy
[1] 0.976

On the test dataset we also get an accuracy of about 97%, in line with the train dataset, so the model does not appear to be overfitting.
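
Because defaults are rare, accuracy alone can look high even when few defaulters are caught, so it can help to look at a few more numbers from the same table. The sketch below re-enters the test confusion matrix printed above by hand. The names xtab_test, baseline, recall and precision are introduced only for this illustration.

```r
# Test confusion matrix from the output above (rows = predicted, columns = actual)
xtab_test <- matrix(c(2890, 12, 60, 38), nrow = 2,
                    dimnames = list(predicted = c("No", "Yes"),
                                    actual = c("No", "Yes")))

# Share of correct predictions
accuracy <- sum(diag(xtab_test)) / sum(xtab_test)

# Accuracy of always predicting "No"
baseline <- sum(xtab_test[, "No"]) / sum(xtab_test)

# Share of actual defaulters that the model flags
recall <- xtab_test["Yes", "Yes"] / sum(xtab_test[, "Yes"])

# Share of flagged customers who actually defaulted
precision <- xtab_test["Yes", "Yes"] / sum(xtab_test["Yes", ])

round(c(accuracy = accuracy, baseline = baseline,
        recall = recall, precision = precision), 3)
```

Here the tree finds 38 of the 98 actual defaulters in the test set (recall of about 39%), while always predicting No would already reach about 96.7% accuracy.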


Step 6: Parameter description in rpart.control

Now let's discuss one of the most important arguments in the decision tree model: rpart.control. Its parameters and their usage are briefly described below:

  • minsplit: The minimum number of observations that must be present in a node before a split is attempted

  • cp: The complexity parameter is specified to skip splits that don't result in a worthwhile model improvement. If we specify cp as 0.01, a split is only kept if it decreases the overall lack of fit by a factor of at least 0.01. For method = "class", this roughly means that the misclassification error, measured relative to the root node, should drop by at least 0.01 per split. For method = "anova" (regression), each split should increase the overall R-squared by at least 0.01

  • xval: The number of cross-validations. The rpart default is 10; smaller values such as 3 or 5 are also used

  • maxdepth: The maximum depth of the tree, with the root node counted as depth 0. If its value is 4, then the depth of the decision tree will not be greater than 4

There are other parameters as well in rpart.control, but the ones described above are the most important in terms of their impact on the structure and fit of the tree.
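
The effect of these parameters on the size of the tree can be seen by counting splits. The sketch below uses the kyphosis dataset that ships with rpart rather than train.df, so that it runs on its own. count_splits is a hypothetical helper, and the control values are chosen only to make the effect visible.

```r
library(rpart)

# Hypothetical helper: number of internal (non-leaf) nodes in a fitted tree
count_splits <- function(fit) sum(fit$frame$var != "<leaf>")

f <- Kyphosis ~ Age + Number + Start

# Defaults: minsplit = 20, cp = 0.01, xval = 10, maxdepth = 30
fit_default <- rpart(f, data = kyphosis, method = "class")

# maxdepth = 1 allows only the root node to be split
fit_stump <- rpart(f, data = kyphosis, method = "class",
                   control = rpart.control(maxdepth = 1))

# A large cp demands a big improvement per split, so fewer splits survive
fit_high_cp <- rpart(f, data = kyphosis, method = "class",
                     control = rpart.control(cp = 0.5))

# A large minsplit stops smaller nodes from being split further
fit_big_minsplit <- rpart(f, data = kyphosis, method = "class",
                          control = rpart.control(minsplit = 60))

sapply(list(default = fit_default, stump = fit_stump,
            high_cp = fit_high_cp, big_minsplit = fit_big_minsplit),
       count_splits)
```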

Now let's see if changing the values within rpart.control improves the accuracy of the model.


Step 7: Improvement in fit/change in DT by varying parameters in rpart.control

train.df$default_predicted<-NULL
m2 <- rpart(default ~ ., data = train.df,method = "class",
            control  = rpart.control(minsplit = 40, cp = 0.05, xval=5, maxdepth = 5))
rpart.plot(m2, type = 3)

predicted_matrix<-predict(m2,train.df)
default_predicted<-colnames(predicted_matrix)[apply(predicted_matrix,1,which.max)]
train.df[["default_predicted"]]<-default_predicted

Getting the Accuracy metric

xtab <- as.matrix(table(train.df[["default_predicted"]], train.df[["default"]]))
xtab
     
        No  Yes
  No  6743  163
  Yes   26   68
accuracy<-(xtab[1,1] + xtab[2,2])/sum(xtab)
accuracy
[1] 0.973

Salient features of the above model:

  • It is less complicated: it uses fewer variables for prediction
  • Drawback: It doesn't use income, which might be a strong predictor
  • The train accuracy of the two models is almost the same (97.3% for m2 versus 97.5% for m1), so if the objective is to create a simple yet powerful system, then we should go for m2, otherwise m1


Step 8: Pruning of trees

Pruning (trimming) a decision tree is done to avoid overfitting. Overfitting is an issue that occurs when the algorithm keeps fitting the tree ever more closely to the training dataset. This improves the accuracy on the train dataset, but the model then performs badly (in terms of accuracy) on the test dataset.

Pruning helps to reduce overfitting by cutting the tree back before it classifies the data down to the record level. If the number of terminal leaves of the tree equals the number of records, the model has been overfitted.

Overfitting can be reduced in the following ways:

  • Pruning the tree
  • Removing variable(s) that are actually identifiers/keys and have been mistakenly added to the model
  • Removing a variable that has a large number of unique values

Let's see how we can prune the decision tree based on our dataset.

pfit <- prune(m1, cp = m1$cptable[which.min(m1$cptable[,"xerror"]),"CP"])

# plot the pruned tree
rpart.plot(pfit, type = 3)
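
The cp value passed to prune() above comes from the row of the cptable with the smallest cross-validated error (xerror). The sketch below breaks that lookup into steps on the kyphosis dataset that ships with rpart, used only so that the example runs on its own. The names cp_table, best_row, best_cp and pruned are illustrative.

```r
library(rpart)

# xerror comes from cross-validation, so it varies with the seed
set.seed(1)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# cptable columns: CP, nsplit, rel error, xerror, xstd
cp_table <- fit$cptable
cp_table

# Row with the lowest cross-validated error, and its cp value
best_row <- which.min(cp_table[, "xerror"])
best_cp  <- cp_table[best_row, "CP"]

# Prune the tree back to that complexity
pruned <- prune(fit, cp = best_cp)

# The pruned tree never has more nodes than the original
nrow(pruned$frame) <= nrow(fit$frame)
```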


We can see that the pruned tree is similar to the one obtained through model m2(using rpart.control argument)

Final Comments

In this blog we saw a practical application of decision trees in the credit risk domain. We learnt about their parameters, overfitting, pruning and so on. In the coming blogs, we will continue to explore further and discuss the drawbacks of decision trees and how random forests can be used to overcome them.
