Sunday, July 5, 2020

Blog 29: Logistic Regression in R

Logistic Regression


Introduction

Logistic regression is one of the most widely used statistical models and remains one of the most sought-after algorithms because of its close resemblance to linear regression and the ease with which its results can be interpreted. It finds application in almost every domain, including retail, manufacturing, finance and healthcare. In this blog we will look at a credit default problem and a step-by-step approach to executing logistic regression in R.


Step 1: Read Data

The Default dataset is part of the ISLR package. Packages such as ROCR and plotROC are imported to calculate evaluation metrics after the modelling exercise. Let's import the dataset and look at the first few records.

package.name<-c("dplyr","tidyr","stats","ISLR","ggplot2","devtools","corrplot","ROCR",
                "plotROC","e1071","officer","flextable")

for(i in package.name){

  if(!require(i,character.only = T)){

    install.packages(i)
  }
  library(i,character.only = T)

}

# stats library has the regression functions

# ISLR package has the credit Default dataset
data(Default)
df<-Default
head(df)
  default student   balance    income
1      No      No  729.5265 44361.625
2      No     Yes  817.1804 12106.135
3      No      No 1073.5492 31767.139
4      No      No  529.2506 35704.494
5      No      No  785.6559 38463.496
6      No     Yes  919.5885  7491.559

It is a simulated data set containing information on ten thousand customers. The aim is to predict which customers will default on their credit card debt. A brief description of the columns is given below:

  • default: A categorical variable with levels No and Yes indicating whether the customer defaulted on their debt
  • student: A categorical variable with levels No and Yes indicating whether the customer is a student
  • balance: The average balance remaining on the customer's credit card after making their monthly payment
  • income: Income of the customer
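
Before moving on, it is worth confirming the column types R has assigned, since glm() treats factors and numeric columns differently. A quick base-R check (output omitted here):

# Confirm column types: default and student should be factors,
# balance and income numeric
str(df)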


Step 2: Exploratory Data Analysis

Here we will try to analyse the following:

  • Unique values in the ‘default’ and ‘student’ variables
  • Mean value of ‘balance’ and ‘income’ variable across unique values of ‘default’ variable


2a: Unique values in the ‘default’ variable
table(df[["default"]])

  No  Yes 
9667  333 

The proportion of default cases (default = Yes) is 333/(9667+333) ≈ 3%. This is typical of classification problems where the class of interest occurs in a very small percentage of records (as in loan default or anomalous transaction cases).
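
As an aside, the same proportion can be read off directly with prop.table instead of dividing by hand:

# Class proportions straight from the frequency table
round(prop.table(table(df[["default"]])),3)

   No   Yes 
0.967 0.033 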


2b: Unique values in the ‘student’ variable
table(df[["student"]])

  No  Yes 
7056 2944 


2c: Mean value of ‘balance’ and ‘income’ variable across unique values of ‘default’ variable
eda.df<-df%>%
  group_by(default)%>%
  summarise(TotalRecords=n(),
            Balance=mean(balance),
            Income=mean(income))%>%
  gather(key='Variable',value='Mean',Balance:Income)


lapply(c("Balance","Income"),function(x){


p<-ggplot(data=eda.df%>%
     as.data.frame()%>%
         filter(Variable==x), aes(x=default, y=Mean)) +
    geom_bar(stat="identity",fill = "orange")+
    xlab("default")+ylab(paste0("Mean Value of ",x))

return(p)

})
[[1]]
(Bar chart: mean Balance by default status)

[[2]]
(Bar chart: mean Income by default status)
We can see that there is not much difference between the mean levels of Income across the two groups, whereas there is a large difference in the Balance variable. So we would expect balance, rather than income, to drive default.
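
Mean values can hide how much the distributions overlap, so a boxplot of balance by default status is a useful complement to the bar charts above; a minimal ggplot2 sketch:

ggplot(data=df, aes(x=default, y=balance)) +
  geom_boxplot(fill="orange") +
  xlab("default")+ylab("Balance")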


Step 3: Splitting into Train (70%) and Test (30%) Datasets

One should note that the ratio of Yes and No in the default variable should remain almost the same in the test and train datasets.

set.seed(1)
smpl<-sample(nrow(df),0.70*nrow(df),F)
train.df<-df[smpl,]
test.df<-df[-smpl,]

# Proportion in Train Data set
table(train.df$default)[2]/nrow(train.df)
  Yes 
0.033 
# Proportion in Test Data set
table(test.df$default)[2]/nrow(test.df)
  Yes 
0.034 

We can see that the proportion of Yes is essentially the same in the train and test datasets, so the split preserves the class balance.
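
Here the simple random split happens to preserve the class balance. With rarer outcomes or smaller samples, one may prefer to sample within each class explicitly; below is a minimal base-R sketch of such a stratified split (not part of the original workflow):

# Stratified 70/30 split: sample 70% of the row indices within each default level
set.seed(1)
strat.idx<-unlist(lapply(split(seq_len(nrow(df)),df$default),
                         function(i) sample(i,floor(0.70*length(i)))))
strat.train.df<-df[strat.idx,]
strat.test.df<-df[-strat.idx,]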


Step 4: Training Model

m1<-glm(default ~ ., data = train.df, family = "binomial")
summary(m1)

Call:
glm(formula = default ~ ., family = "binomial", data = train.df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4482  -0.1374  -0.0540  -0.0202   3.4027  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.095e+01  5.880e-01 -18.616   <2e-16 ***
studentYes  -7.167e-01  2.886e-01  -2.483    0.013 *  
balance      5.678e-03  2.741e-04  20.715   <2e-16 ***
income       6.273e-06  9.845e-06   0.637    0.524    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2030.3  on 6999  degrees of freedom
Residual deviance: 1073.5  on 6996  degrees of freedom
AIC: 1081.5

Number of Fisher Scoring iterations: 8
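
The coefficients are on the log-odds scale, so exponentiating them turns them into odds ratios, which are easier to read: for instance, each additional unit of balance multiplies the odds of default by exp(0.005678), roughly 1.0057. A quick sketch:

# Coefficients are log-odds; exponentiate to interpret them as odds ratios
round(exp(coef(m1)),6)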


Step 5: Getting the optimal threshold value

Normally the aim of a model-building exercise is to maximise accuracy. However, we also need to be careful about how many false positive (FP) and false negative (FN) cases the model generates, and we would like to keep both small. This can be done by selecting an appropriate threshold: plot sensitivity and specificity against the cutoff and take the point where the two curves intersect as the optimal threshold. At this intersection both sensitivity and specificity are jointly as high as possible, which keeps both FN and FP in check.

# Calculating the predicted probabilities on the training data
train.predict<-predict(m1,train.df,type="response")

# Creating a ROCR prediction object from the probabilities and actual labels
pred_val <-prediction(train.predict ,train.df[["default"]])


# Setting the sensitivity data frame
sens.perf <- performance(pred_val, "sens")
sens.values<-sens.perf@y.values
sens.df<-data.frame(C1=sens.perf@x.values,
                    C2=sens.values,
                    C3="Sensitivity")

colnames(sens.df)<-c("ProbValues","MetricValue","Type")

# Setting up the specificity data frame
spec.perf<-performance(pred_val, "spec")
spec.values<-spec.perf@y.values
spec.df<-data.frame(C1=spec.perf@x.values,
                    C2=spec.values,
                    C3="Specificity")

colnames(spec.df)<-c("ProbValues","MetricValue","Type")

# Combining both data frames and dropping rows with Inf cutoffs
cutoff.df<-rbind.data.frame(sens.df,spec.df)%>%
  filter(ProbValues != "Inf")%>%
  filter(MetricValue != "Inf")

# Plotting sensitivity (blue) and specificity (orange) against the cutoff
sens.df %>% ggplot(aes(ProbValues,MetricValue)) + 
  geom_line(linetype = "solid",color="blue",lwd = 1.5) + 
  geom_line(data=spec.df, aes(ProbValues,MetricValue),color="orange",lwd = 1.5) +
  scale_y_continuous(sec.axis = sec_axis(~., name = "Specificity")) +
  labs(x='Cutoff', y="Sensitivity") +
  theme(axis.title.y.right = element_text(colour = "orange"),
        axis.title.y.left = element_text(colour = "blue"),
        legend.position="none")
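
Rather than eyeballing the intersection, it can be located numerically: sens.df and spec.df share the same grid of cutoffs, so the crossover is simply the cutoff where the two metrics are closest. A small sketch (not part of the original workflow):

# Cutoff where |sensitivity - specificity| is smallest
gap<-abs(sens.df$MetricValue - spec.df$MetricValue)
sens.df$ProbValues[which.min(gap)]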

The point of intersection comes at around 0.04, so we will use 0.04 as the probability cutoff. Now we will calculate evaluation metrics such as accuracy and sensitivity on the test data.

Step 6: Generating the Confusion Matrix

Based on the cut-off probability score of 0.04, let's generate the confusion matrix for the test data.

# Recoding predicted values based on cutoff
cutoff<-0.04

# Calculating the predicted probabilities on the test data set
test.predict<-predict(m1,test.df,type="response")

test.predict.levels<-ifelse(test.predict > cutoff,"Yes","No")
test.predict.levels<-as.factor(test.predict.levels)

# Creating the table
xtab <- table(test.predict.levels, test.df[["default"]])
xtab
                   
test.predict.levels   No  Yes
                No  2585   17
                Yes  313   85
# Lets calculate the accuracy
(2585+85)/nrow(test.df) 
[1] 0.89

It comes out to be 89%, which looks good, but note that with a 3% base rate of default a model that always predicts No would already achieve about 97% accuracy, so accuracy alone is not enough here.
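
Hard-coding cell counts invites transcription slips; accuracy can also be computed directly from the confusion matrix:

# Accuracy straight from the confusion matrix
sum(diag(xtab))/sum(xtab)
[1] 0.89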


Step 7: Calculating Sensitivity, Specificity, Precision, Recall and F-Score based on the confusion matrix

Before calculating the above metrics, let's briefly look at their definitions and importance.

             Predicted
Actual      No     Yes
No          A      C
Yes         B      D

The above table can also be represented in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) as shown below:


             Predicted
Actual      No     Yes
No          TN     FP
Yes         FN     TP


  • Accuracy: (TN+TP)/(TN+TP+FN+FP)

  • Sensitivity: Proportion of positives correctly classified. Also called the true positive rate
    TP/(TP+FN)

  • Specificity: Proportion of negatives correctly classified. Also called the true negative rate
    TN/(TN+FP)

  • Precision: Proportion of predicted positive cases that are actually positive
    TP/(TP+FP)

  • Recall: Same as sensitivity

  • F1 Score: (2 x Precision x Recall)/(Precision + Recall). Important when we want to shortlist the best model among a set of competing models

Based on the above definitions, let's calculate these metrics for our present problem. Note that in xtab the rows are the predicted labels and the columns the actual labels, so TP = xtab[2,2], TN = xtab[1,1], FP = xtab[2,1] and FN = xtab[1,2].

# Creating the table
xtab <- as.matrix(table(test.predict.levels, test.df[["default"]]))
xtab
                   
test.predict.levels   No  Yes
                No  2585   17
                Yes  313   85
# Sensitivity: TP/(TP+FN)
sens<-xtab[2,2]/(xtab[2,2]+xtab[1,2])
sens
[1] 0.8333333
# Specificity: TN/(TN+FP)
spec<-xtab[1,1]/(xtab[1,1]+xtab[2,1])
spec
[1] 0.8919945
# Precision: TP/(TP+FP)
prec<-xtab[2,2]/(xtab[2,2]+xtab[2,1])
prec
[1] 0.2135678
# Recall: Same as sensitivity, i.e. 0.83

# F1 Score: (2 x Precision x Recall)/(Precision+Recall)
2*prec*sens/(prec+sens)
[1] 0.34


Step 8: Plotting the Receiver Operating Characteristic (ROC) curve and Calculating the Area Under the Curve (AUC)

The Receiver Operating Characteristic (ROC) curve is an industry-standard method to assess the performance of a two-level classifier. It is a plot of the true positive rate (sensitivity) against the false positive rate (1 - specificity). A model with high discriminating ability will have high sensitivity and specificity simultaneously, resulting in an ROC curve that bends close to the top left corner of the plot. A model with no discriminatory power will have an ROC curve along the 45 degree diagonal.

Area Under the Curve (AUC): one way to summarise the model's ability to differentiate between Yes and No is to report the AUC. As already mentioned, a model with high discriminatory power has a curve close to the top left corner and hence a high AUC. The following rough rules can be used to interpret AUC:

  • AUC ≈ 0.5: Model with no differentiation power (equivalent to random guessing)
  • 0.5 < AUC < 0.7: Model with modest differentiation power
  • AUC >= 0.7: Model with good differentiation power

With the above rules in mind, we will plot the ROC curve and calculate the AUC for our model.

# Plot the ROC curve

# Points for the 45 degree no-discrimination baseline
baseline.df<-data.frame(x=c(0,0.4,0.8,1),
                        y=c(0,0.4,0.8,1))

perf_val2 <- performance(pred_val, "tpr", "fpr")
improvement.df<-data.frame(x=perf_val2@x.values,
                           y=perf_val2@y.values)
colnames(improvement.df)<-c("x","y")

ggplot(data=baseline.df, aes(x=x, y=y, group=1)) +
  geom_line(linetype = "dashed") +
  geom_line(data=improvement.df, aes(x,y),color="orange",lwd = 1.5) +
  labs(x='FPR', y="TPR")

# Calculating the Area Under the Curve (AUC)
# pred_val was already created in Step 5
perf_val <- performance(pred_val,"auc")
auc <- as.numeric(perf_val@y.values)
auc
[1] 0.9545362

The X axis is labelled FPR, the false positive rate. Expanding 1 - Specificity shows why it equals the false positive rate:

  • 1 - Specificity = 1 - [TN/(TN+FP)]
  • = (TN+FP-TN)/(TN+FP)
  • = FP/(TN+FP)
  • which is exactly the false positive rate

Since the AUC (0.95) is well above 0.5, the model fit is appropriate and it acts as a good classifier.
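
One caveat: the AUC above was computed on the training predictions. The more honest figure uses the held-out test set, along the same lines (output omitted; it will typically be slightly lower):

# AUC on the test set using the same ROCR workflow
pred_test<-prediction(test.predict,test.df[["default"]])
as.numeric(performance(pred_test,"auc")@y.values)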


Step 9: Plotting gain and lift charts

Cumulative gains and lift charts are mostly relevant in marketing scenarios where marketers are trying to reach a target audience with minimum expenditure. They are a graphical representation of the advantage of using a predictive model to choose which customers to contact. The lift chart shows how much more likely we are to reach respondents than if we contacted a random sample of customers. For example, by contacting only the top 10% of customers ranked by the predictive model, we might reach three times as many respondents as with no model.

They lose much of their relevance in credit default models and hence have not been plotted here.
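
Should one still want them, ROCR can generate both charts from the prediction object created in Step 5; a minimal sketch:

# Lift: response rate in the targeted group relative to the overall rate,
# plotted against the rate of positive predictions (rpp)
plot(performance(pred_val,"lift","rpp"), main="Lift chart")

# Cumulative gains: true positive rate against the rate of positive predictions
plot(performance(pred_val,"tpr","rpp"), main="Cumulative gains chart")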


Final Comments

In this blog we discussed a step-by-step approach to performing logistic regression in R on a credit default problem. The evaluation steps shown above apply to any binary classification problem and hence are equally valid for decision trees, random forests, the naive Bayes classifier, etc. This should act as a solid foundation for understanding the nitty-gritty of implementing logistic regression in R.

