Credit Risk Modelling using Decision Tree
Parag Verma
Introduction
In this blog we are going to discuss the application of a Decision Tree in the credit risk domain. We will identify the rules that differentiate a defaulter from a non-defaulter. The rpart library will be used to create the decision tree. The aim is also to show how the different parameters of a decision tree can be properly set up using this case study.
Step 1: Read Data
The Default dataset is part of the ISLR package. The rpart package is used to create the decision tree. Let's import the dataset and look at the first few records.
# Install (if missing) and load the required packages
package.name <- c("dplyr", "tidyr", "ISLR", "rpart", "rpart.plot")
for(i in package.name){
  if(!require(i, character.only = TRUE)){
    install.packages(i)
  }
  library(i, character.only = TRUE)
}
# ISLR package has the credit Default dataset
data(Default)
df<-Default
head(df)
default student balance income
1 No No 729.5265 44361.625
2 No Yes 817.1804 12106.135
3 No No 1073.5492 31767.139
4 No No 529.2506 35704.494
5 No No 785.6559 38463.496
6 No Yes 919.5885 7491.559
It is a simulated data set containing information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt using a decision tree. A brief description of the columns is given below:
- default: A categorical variable with levels No and Yes indicating whether the customer defaulted on their debt
- student: A categorical variable with levels No and Yes indicating whether the customer is a student
- balance: The average balance remaining on the customer's credit card after making their monthly payment
- income: Income of the customer
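To verify the column types and value ranges described above, one could inspect the data directly (a sketch, output omitted):
# Column types: two factors (default, student) and two numerics (balance, income)
str(df)
# Ranges of balance and income, counts of No/Yes
summary(df)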
Step 2: Splitting into Train(70%) and Test(30%) Dataset
We will split the dataset into train and test and then:
- Fit model on train dataset
- Check its accuracy on test dataset
set.seed(1)
train.sample <- sample(nrow(df), 0.70*nrow(df), replace = F)
train.df <- df[train.sample, ]
# Use the remaining 30% of rows as the test set so that train and test do not overlap
test.df <- df[-train.sample, ]
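As a quick sanity check (a minimal sketch), we can confirm that the two sets have the expected sizes and share no rows:
# Expect 7000, 3000 and 0 overlapping row names
nrow(train.df)
nrow(test.df)
length(intersect(rownames(train.df), rownames(test.df)))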
Let's check the proportion of defaulters within the train and test datasets to confirm that the class distribution is uniform across both.
train.df%>%
group_by(default)%>%
summarise(Total_Count=n())%>%
ungroup()%>%
mutate(Perc_Val=Total_Count/sum(Total_Count))
# A tibble: 2 x 3
default Total_Count Perc_Val
<fct> <int> <dbl>
1 No 6769 0.967
2 Yes 231 0.033
The No/Yes distribution in train.df is roughly 97% and 3%. Let's check the same within test.df.
test.df%>%
group_by(default)%>%
summarise(Total_Count=n())%>%
ungroup()%>%
mutate(Perc_Val=Total_Count/sum(Total_Count))
# A tibble: 2 x 3
default Total_Count Perc_Val
<fct> <int> <dbl>
1 No 2902 0.967
2 Yes 98 0.0327
In test.df as well, the distribution is roughly 97% and 3%. This check ensures that the proportion of Yes and No values is the same within the train and test datasets.
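For a quicker version of the same check, base R offers a compact alternative to the dplyr chain above (a sketch, output omitted):
# Class proportions in one line per dataset
prop.table(table(train.df$default))
prop.table(table(test.df$default))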
Step 3: Training a Decision Tree
Let's create a decision tree and discuss the parameters used to set it up.
m1 <- rpart(default ~ ., data = train.df,method = "class")
rpart.plot(m1, type = 3)
Some key points to note:
The formula defines how the dependent variable (default in our case) is related to the rest of the variables in the dataset.
The method argument typically takes one of two values: anova for regression and class for classification.
The rpart.control argument determines how the branching takes place and how the tree is formed. For example, rpart.control(minsplit=30, cp=0.001) indicates that a node must contain at least 30 observations for a split to be attempted, and that a split must decrease the overall lack of fit (or, equivalently, improve the fit) by a factor of 0.001. The cp parameter thus helps skip splits that are not worthwhile and saves computation time. We will discuss this in more detail in a later section of the blog.
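For instance, the settings quoted above could be passed directly when fitting (a sketch; m_ctrl is a throwaway name used only for illustration):
# Fit a tree with explicit control settings (values taken from the example above)
m_ctrl <- rpart(default ~ ., data = train.df, method = "class",
                control = rpart.control(minsplit = 30, cp = 0.001))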
Let's now look at the rules generated by the decision tree model.
rpart.rules(m1,cover = TRUE)
default cover
0.02 when balance < 1789 97%
0.28 when balance is 1789 to 1974 & income < 33212 1%
0.63 when balance is 1789 to 1974 & income >= 33212 1%
0.72 when balance >= 1974 1%
As shown above, there are 4 rules with the following features:
- default indicates the probability of default obtained by following the rules mentioned in the branch
- cover indicates the total proportion of records present within the branch
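To see these rules in action, we can score a hypothetical customer (a sketch; the feature values below are made up purely for illustration):
# A made-up customer with a high balance, matching the last rule above
new_customer <- data.frame(student = factor("No", levels = c("No", "Yes")),
                           balance = 2100, income = 40000)
predict(m1, new_customer, type = "prob")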
Step 4: Checking Accuracy on Train data
# Class probabilities: one row per record, one column per class ("No", "Yes")
predicted_matrix<-predict(m1,train.df)
# For each record, pick the class with the highest probability
default_predicted<-colnames(predicted_matrix)[apply(predicted_matrix,1,which.max)]
train.df[["default_predicted"]]<-default_predicted
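Note that rpart can also return the class labels directly, which avoids the manual which.max step (an equivalent sketch):
# Built-in class prediction; gives the same labels as the code above
default_predicted <- as.character(predict(m1, train.df, type = "class"))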
Getting the Accuracy metric
xtab <- as.matrix(table(train.df[["default_predicted"]], train.df[["default"]]))
xtab
No Yes
No 6728 137
Yes 41 94
accuracy<-(xtab[1,1] + xtab[2,2])/sum(xtab)
accuracy
[1] 0.9745714
So on the train dataset, we have an accuracy of about 97%. Let's check it on the test dataset.
Step 5: Checking Accuracy on Test data
predicted_matrix_test<-predict(m1,test.df)
default_predicted_test<-colnames(predicted_matrix_test)[apply(predicted_matrix_test,1,which.max)]
test.df[["default_predicted_test"]]<-default_predicted_test
Getting the Accuracy metric
xtab <- as.matrix(table(test.df[["default_predicted_test"]], test.df[["default"]]))
xtab
No Yes
No 2890 60
Yes 12 38
accuracy<-(xtab[1,1] + xtab[2,2])/sum(xtab)
accuracy
[1] 0.976
On the test dataset as well we get an accuracy of about 97%, so the model generalises well. One caveat: since roughly 97% of customers are non-defaulters, a model that always predicts No would score almost as high, so the confusion matrix above is worth reading alongside the headline accuracy.
Step 6: Parameter description in rpart.control
Now let's discuss one of the most important arguments of the decision tree model: rpart.control. The parameters and their usage are briefly described below:
minsplit : the minimum number of observations that must be present in a node for a split to be attempted
cp : the complexity parameter, specified to remove splits that do not improve the model enough. Any split must decrease the overall lack of fit by a factor of cp; for method = "anova" (regression), this means each split must increase the R-squared by at least the value of cp
xval : the number of cross-validations (the rpart default is 10)
maxdepth : the maximum depth of the tree. If its value is 4, the depth of the decision tree will not be greater than 4
There are other parameters in rpart.control as well, but the ones with the biggest impact on the structure and fit of the tree have been described above.
Now let's see whether changing the values within rpart.control improves the accuracy of the model.
Step 7: Improvement in fit/change in the tree by varying parameters in rpart.control
# Drop the prediction column added earlier so that "default ~ ." does not pick it up as a predictor
train.df$default_predicted<-NULL
m2 <- rpart(default ~ ., data = train.df,method = "class",
control = rpart.control(minsplit = 40, cp = 0.05, xval=5, maxdepth = 5))
rpart.plot(m2, type = 3)
predicted_matrix<-predict(m2,train.df)
default_predicted<-colnames(predicted_matrix)[apply(predicted_matrix,1,which.max)]
train.df[["default_predicted"]]<-default_predicted
Getting the Accuracy metric
xtab <- as.matrix(table(train.df[["default_predicted"]], train.df[["default"]]))
xtab
No Yes
No 6743 163
Yes 26 68
accuracy<-(xtab[1,1] + xtab[2,2])/sum(xtab)
accuracy
[1] 0.973
Salient features of the above model:
- It is less complicated and uses fewer variables for prediction
- Drawback: it does not use income, which might be a strong predictor
- The accuracies are practically the same (97.3% vs 97.5% on train), so if the objective is to create a simple yet powerful system we should go for m2, otherwise m1
Step 8: Pruning of trees
Pruning/trimming a decision tree is done primarily to avoid overfitting. Overfitting occurs when the algorithm keeps fitting the training dataset ever more closely: accuracy on the train dataset improves, but the model performs badly (in terms of accuracy) on the test dataset.
Pruning helps reduce overfitting by limiting the growth of the tree before it starts classifying records at an individual level. In the extreme case where the number of terminal leaves equals the number of records, the model has certainly been overfitted.
Overfitting can be reduced by the following:
- Pruning the tree
- Removing variable(s) that are actually identifiers/keys and were mistakenly added to the model
- Removing variables that have a very large number of unique values
Let's see how we can prune the decision tree for our dataset.
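rpart stores the cross-validated error (xerror) for every candidate value of cp in the model's cptable; the prune() call below simply picks the cp with the lowest xerror. To inspect that table first (a sketch, output omitted):
# Complexity table: one row per candidate cp, with cross-validated error
printcp(m1)
# Visualise cross-validated error against cp
plotcp(m1)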
# Prune at the cp value with the lowest cross-validated error (xerror)
pfit<- prune(m1, cp= m1$cptable[which.min(m1$cptable[,"xerror"]),"CP"])
# plot the pruned tree
rpart.plot(pfit, type = 3)
We can see that the pruned tree is similar to the one obtained from model m2 (built using the rpart.control argument).
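To confirm the similarity, we can print the pruned tree's rules the same way we did for m1 (a sketch, output omitted):
rpart.rules(pfit, cover = TRUE)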
Final Comments
In this blog we saw a practical application of the Decision Tree in the credit risk domain. We learnt about its parameters, overfitting, pruning and so on. In coming blogs we will continue to explore further and discuss the drawbacks of decision trees and how random forests can be used to overcome them.
Link to Youtube Channel
https://www.youtube.com/playlist?list=PL6ruVo_0cYpV2Otl1jR9iDe6jt1jLXgAq