Logistic Regression
Parag Verma
Introduction
Logistic Regression is one of the most widely used Statistical models and remains as the most sought after algorithms because of its close resemblance to linear regression and ease of interpretation of results.It finds application in almost all the domain ranging from retail,manufacturing,finance, healthcare, etc. In this blog we will look at a credit default problem and the step and step approach of executing logistic regression in R.
Step 1: Read Data
The Default dataset is part of ISLR package. Packages such as ROCR and plotROC are imported to calculate certain evaluation metrics post modelling exercise. Lets import the dataset and look at the first few records
package.name<-c("dplyr","tidyr","stats","ISLR","ggplot2","devtools","corrplot","ROCR",
"plotROC","e1071","officer","flextable")
for(i in package.name){
if(!require(i,character.only = T)){
install.packages(i)
}
library(i,character.only = T)
}
# stats library has the regression functions
# ISLR package has the credit Default dataset
data(Default)
df<-Default
head(df)
default student balance income
1 No No 729.5265 44361.625
2 No Yes 817.1804 12106.135
3 No No 1073.5492 31767.139
4 No No 529.2506 35704.494
5 No No 785.6559 38463.496
6 No Yes 919.5885 7491.559
It is a simulated data set containing information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt. The brief description of the columns are given below:
- default:A categorical variable with levels No and Yes indicating whether the customer defaulted on their debt
- student:A categorical variable with levels No and Yes indicating whether the customer is a student
- balance :The average balance that the customer has remaining on their credit card after making their monthly payment
- income:Income of customer
Step 2: Exploratory Data Analyses
Here we will try to analyse the following:
- Unique values in the ‘default’ and ‘Student’ variable
- Mean value of ‘balance’ and ‘income’ variable across unique values of ‘default’ variable
2a: Unique values in the ‘default’ variable
table(df[["default"]])
No Yes
9667 333
Proportion of default cases(default=Yes) is 333/(9667+333) ~ 3% This is typically the case for any classification problem where the value of interest is present in a very small percentage(such as in loan default, anomalous transactions cases)
2b: Unique values in the ‘student’ variable
table(df[["student"]])
No Yes
7056 2944
2c: Mean value of ‘balance’ and ‘income’ variable across unique values of ‘default’ variable
eda.df<-df%>%
group_by(default)%>%
summarise(TotalRecords=n(),
Balance=mean(balance),
Income=mean(income))%>%
gather(key='Variable',value='Mean',Balance:Income)
lapply(c("Balance","Income"),function(x){
p<-ggplot(data=eda.df%>%
as.data.frame()%>%
filter(Variable==x), aes(x=default, y=Mean)) +
geom_bar(stat="identity",fill = "orange")+
xlab("default")+ylab(paste0("Mean Value of ",x))
return(p)
})
[[1]]
[[2]]
We can see that there is not much difference between mean levels of Income. On the other hand, there is a significant difference in terms of Balance variable. So we would expect the income attribute to impact default
Step 3: Splitting into Train(70%) and Test(30%) Dataset
One should note that the ratio of Yes and No in the default variable should remain almost the same in the test and train datasets.
set.seed(1)
smpl<-sample(nrow(df),0.70*nrow(df),F)
train.df<-df[smpl,]
test.df<-df[-smpl,]
# Poportion in Train Data set
table(train.df$default)[2]/nrow(train.df)
Yes
0.033
# Poportion in Test Data set
table(train.df$default)[2]/nrow(train.df)
Yes
0.033
We can see that the proportion of Yes in both the test and train dataset in same. Hence these two datasets have been created correctly
Step 4: Training Model
m1<-glm(default ~ ., data = train.df, family = "binomial")
summary(m1)
Call:
glm(formula = default ~ ., family = "binomial", data = train.df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4482 -0.1374 -0.0540 -0.0202 3.4027
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.095e+01 5.880e-01 -18.616 <2e-16 ***
studentYes -7.167e-01 2.886e-01 -2.483 0.013 *
balance 5.678e-03 2.741e-04 20.715 <2e-16 ***
income 6.273e-06 9.845e-06 0.637 0.524
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2030.3 on 6999 degrees of freedom
Residual deviance: 1073.5 on 6996 degrees of freedom
AIC: 1081.5
Number of Fisher Scoring iterations: 8
Step 5: Getting the optimal threshold value
Normally the aim of all model building exercise is to maximise accuracy. However, we need to be careful about how many False positive(FP) and False negative(FN) cases are generated by the model. In such circumstances, we would want to minimise FN and FP. This can be done by selection of an appropriate threshold value which is obtained by plotting the Sensitivity and Specificity plots and taking the intersection of these graphs as the optimal point. The logic behind this approach is that the point of intersection of these graphs represents the maximum value for both Sensitivity as well as Specificity which mimimises FN as well as FP
# Calcualting the predicted values
train.predict<-predict(m1,train.df,type="response")
# Creating a prediction object and then using functions used for slot
pred_val <-prediction(train.predict ,train.df[["default"]])
# Setting the sensitivity data frame
sens.perf <- performance(pred_val, "sens")
sens.values<-sens.perf@y.values
sens.df<-data.frame(C1=sens.perf@x.values,
C2=sens.values,
C3="Sensitivity")
colnames(sens.df)<-c("ProbValues","MetricValue","Type")
# Setting the sensitivity data frame
spec.perf<-performance(pred_val, "spec")
spec.values<-spec.perf@y.values
spec.df<-data.frame(C1=spec.perf@x.values,
C2=spec.values,
C3="Specifisity")
colnames(spec.df)<-c("ProbValues","MetricValue","Type")
# Combining both the data frame
cutoff.df<-rbind.data.frame(sens.df,spec.df)%>%
filter(ProbValues != "Inf")%>%
filter(MetricValue !="Inf")
sens.df %>% ggplot(aes(ProbValues,MetricValue)) +
geom_line(linetype = "solid",color="blue",lwd = 1.5) +
geom_line(data=spec.df, aes(ProbValues,MetricValue,col="red"),color="orange",lwd = 1.5) +
scale_y_continuous(sec.axis = sec_axis(~., name = "Specificity")) +
labs(x='Cutoff', y="Sensitivity") +
theme(axis.title.y.right = element_text(colour = "orange"), legend.position="none") +
theme(axis.title.y.left = element_text(colour = "blue"), legend.position="none")
The point of intersections comes at around 0.04. Hence the cut-off value of probability will be 0.04.Now we will calculate the evaluation metrics such as accuracy, sensitivity etc on test data
Step 6: Generating the Confusion Matrix
Based on the cut-off value of probability score(0.05), let’s generate the confusion matrix for test data
# Recoding predicted values based on cutoff
cutoff<-0.04
# Calcualting the predicted values on test data set
test.predict<-predict(m1,test.df,type="response")
test.predict.levels<-ifelse(test.predict > cutoff,"Yes","No")
test.predict.levels<-as.factor(test.predict.levels)
# Creating the table
xtab <- table(test.predict.levels, test.df[["default"]])
xtab
test.predict.levels No Yes
No 2585 17
Yes 313 85
# Lets calculate the accuracy
(2556+85)/nrow(test.df)
[1] 0.8803333
It comes out to be around 89% which is pretty good
Step 7: Calculating Sensitivity, Spesivity, Precision, Recall and F-Score based on confusion matrix
Before calculating the above metrics, lets briefly look at their definition and importance
Predicted | ||
Actual | No | Yes |
No | A | C |
Yes | B | D |
The abve table can also be represented in terms of true positive(TP), true negative(TN), false positive(FP) and false negative(FN) as shown below
Predicted | ||
Actual | No | Yes |
No | TN | FP |
Yes | FN | TP |
Accuracy: (TN+TP)/(TN+TP+FN+FP)
Sensitivity:Proportion of positive correctly classified.Also called true positive rate
TP/(TP+FN)Specificity:Proportion of Negatives correctly classified.Also called true negative rate
TN/(TN+FP)Precision: Proportion of positive cases predicted accurately by the model
TP/(TP+FP)Recall:Same as sensitivity
F1 Score: (2 X Precision X Recall)/(Precision+Recall) Importnat in cases where we want to shorlist best model among a set of competing models
Based on the above description, lets calculate these metrics for our present problem
# Creating the table
xtab <- as.matrix(table(test.predict.levels, test.df[["default"]]))
xtab
test.predict.levels No Yes
No 2585 17
Yes 313 85
# Sensitivity: TP/TP+FN
sens<-xtab[2,2]/(xtab[2,2]+xtab[2,1])
sens
[1] 0.2135678
# Specificity:TN/(TN+FP)
spec<-xtab[1,1]/(xtab[1,1]+xtab[1,2])
spec
[1] 0.9934666
# Precision: TP/(TP+FN)
prec<-xtab[2,2]/(xtab[2,2]+xtab[1,2])
prec
[1] 0.8333333
# Recall: Same as Sensitivity which is 0.19
# F Score: (2 X Precision X Recall)/(Precision+Recall)
2*sens*spec/(sens+spec)
[1] 0.35156
Step 7: Plotting the Receiver Operating Characteristics(ROC) and Calculating Area under the curve(AUC)
Receiver Operating Characteristics(ROC) curve is an industry wide standard method to assess the performance of the two level classifier model. It is a plot against true positive rate(sensitivity) and false negative rate(1-specificity). A model with high discrimination ability will have high sensitivity and specificity simultaneously, resulting in an ROC curve that goes close to the top left corner of the plot. A model with no discriminatory power will have an ROC curve with a 45 degree diagnal line
Area under the curve(AUC): One way to assess the ability of the model to differentiate between yes and no is to report the AUC. As already mentioned, a model with high discriminatory power will be close to the left corner and hence have a high AUC. The following rules should be followed to derive insights from AUC:
- AUC < 0.5: Model with no differentiation power
- 0.5 =< AUC < 0.7: Model with good differentiation power
- 0.7 =< AUC: Model with best differentiation power
With the above rules in mind, we will plot the ROC and calcualte the AUC for our model
# Plot the ROC curve
baseline.df<-data.frame(x=c(0,0.4,0.8,1),
y=c(0,0.4,0.8,1))
perf_val2 <- performance(pred_val, "tpr", "fpr")
improvement.df<-data.frame(x=perf_val2@x.values,
y=perf_val2@y.values)
colnames(improvement.df)<-c("x","y")
ggplot(data=baseline.df, aes(x=x, y=y, group=1)) +
geom_line(linetype = "dashed")+
geom_line(data=improvement.df, aes(x,y,col="red"),color="orange",lwd = 1.5)+labs(x='Cutoff', y="Sensitivity")+ labs(x='FPR', y="TPR")
# Calculating Area under Curve(AUC)
pred_val <-prediction(train.predict ,train.df[["default"]])
perf_val <- performance(pred_val,"auc")
auc <- as.numeric(perf_val@y.values)
auc
[1] 0.9545362
The X axis is labelled as FPR which is false positive rate. 1- Specificity results in the following calculations:
- 1- Specificity: 1-[TN/(TN+FP)]
- [(TN+FP-TN)/(TN+FP)]
- FP/(TN+FP)
- Hence false positive rate
Since the auc value is greater than 0.5, hence the model fit is appropriate and it acts as a good classifier
Step 8: Plotting gain and lift chart
Cumulative gains and lift chart are mostly relevant in Marketing scenarios where marketers are trying to reach target audience with minimum expenditure.They are a graphical representation of the advantage of using a predictive model to choose which customers to contact. The lift chart shows how much more likely we are to receive respondents than if we contact a random sample of customers. For example, by contacting only 10% of customers based on the predictive model we will reach 3 times as many respondents as if we use no model.
They loose their relevance in credit default models and hence have not been plotted here
Final Comments
In this blog we discussed step by step approach of performing logistic regression in R on a credit default problem. The steps shown above are applicable to any binary classficiation problem and hence are valid for Decistion trees, Random Forest, Naive Bayes classifier etc.This should act as a solid background to understand the nitty gritties of logistic regression’s implementation in R
Link to Previous R Blogs
List of Datasets for Practise
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html
How to treat mixed variables in supervised learning, does scaling of variables improve accuracy?
ReplyDelete