Sunday, July 5, 2020

Blog 29: Logistic Regression in R

Logistic Regression


Introduction

Logistic regression is one of the most widely used statistical models and remains one of the most sought-after algorithms because of its close resemblance to linear regression and the ease with which its results can be interpreted. It finds application in almost every domain, including retail, manufacturing, finance and healthcare. In this blog we will look at a credit default problem and a step-by-step approach to executing logistic regression in R.


Step 1: Read Data

The Default dataset is part of the ISLR package. Packages such as ROCR and plotROC are imported to calculate evaluation metrics after the modelling exercise. Let's import the dataset and look at the first few records.

package.name<-c("dplyr","tidyr","stats","ISLR","ggplot2","devtools","corrplot","ROCR",
                "plotROC","e1071","officer","flextable")

for(i in package.name){

  if(!require(i,character.only = T)){

    install.packages(i)
  }
  library(i,character.only = T)

}

# stats library has the regression functions

# ISLR package has the credit Default dataset
data(Default)
df<-Default
head(df)
  default student   balance    income
1      No      No  729.5265 44361.625
2      No     Yes  817.1804 12106.135
3      No      No 1073.5492 31767.139
4      No      No  529.2506 35704.494
5      No      No  785.6559 38463.496
6      No     Yes  919.5885  7491.559

It is a simulated data set containing information on ten thousand customers. The aim is to predict which customers will default on their credit card debt. A brief description of the columns is given below:

  • default: A categorical variable with levels No and Yes indicating whether the customer defaulted on their debt
  • student: A categorical variable with levels No and Yes indicating whether the customer is a student
  • balance: The average balance remaining on the customer's credit card after making their monthly payment
  • income: Income of the customer
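
Before moving on, it is worth confirming the column types R has assigned, since glm() treats factors and numeric columns differently. A quick base-R check (output omitted here):

# Confirm column types: default and student should be factors,
# balance and income numeric
str(df)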


Step 2: Exploratory Data Analysis

Here we will try to analyse the following:

  • Unique values in the ‘default’ and ‘student’ variables
  • Mean value of ‘balance’ and ‘income’ variable across unique values of ‘default’ variable


2a: Unique values in the ‘default’ variable
table(df[["default"]])

  No  Yes 
9667  333 

The proportion of default cases (default = Yes) is 333/(9667+333) ≈ 3%. This is typical of classification problems where the class of interest occurs in a very small percentage of records (as in loan default or anomalous transaction cases).
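
As an aside, the same proportion can be read off directly with prop.table instead of dividing by hand:

# Class proportions straight from the frequency table
round(prop.table(table(df[["default"]])),3)

   No   Yes 
0.967 0.033 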


2b: Unique values in the ‘student’ variable
table(df[["student"]])

  No  Yes 
7056 2944 


2c: Mean value of ‘balance’ and ‘income’ variable across unique values of ‘default’ variable
eda.df<-df%>%
  group_by(default)%>%
  summarise(TotalRecords=n(),
            Balance=mean(balance),
            Income=mean(income))%>%
  gather(key='Variable',value='Mean',Balance:Income)


lapply(c("Balance","Income"),function(x){


p<-ggplot(data=eda.df%>%
     as.data.frame()%>%
         filter(Variable==x), aes(x=default, y=Mean)) +
    geom_bar(stat="identity",fill = "orange")+
    xlab("default")+ylab(paste0("Mean Value of ",x))

return(p)

})
[[1]]
(Bar chart: mean Balance by default status)

[[2]]
(Bar chart: mean Income by default status)
We can see that there is not much difference between the mean levels of Income across the two groups, whereas there is a large difference in the Balance variable. So we would expect balance, rather than income, to drive default.
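
Mean values can hide how much the distributions overlap, so a boxplot of balance by default status is a useful complement to the bar charts above; a minimal ggplot2 sketch:

ggplot(data=df, aes(x=default, y=balance)) +
  geom_boxplot(fill="orange") +
  xlab("default")+ylab("Balance")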


Step 3: Splitting into Train (70%) and Test (30%) Datasets

One should note that the ratio of Yes and No in the default variable should remain almost the same in the test and train datasets.

set.seed(1)
smpl<-sample(nrow(df),0.70*nrow(df),F)
train.df<-df[smpl,]
test.df<-df[-smpl,]

# Proportion in Train Data set
table(train.df$default)[2]/nrow(train.df)
  Yes 
0.033 
# Proportion in Test Data set
table(test.df$default)[2]/nrow(test.df)
  Yes 
0.034 

We can see that the proportion of Yes is essentially the same in the train and test datasets, so the split preserves the class balance.
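
Here the simple random split happens to preserve the class balance. With rarer outcomes or smaller samples, one may prefer to sample within each class explicitly; below is a minimal base-R sketch of such a stratified split (not part of the original workflow):

# Stratified 70/30 split: sample 70% of the row indices within each default level
set.seed(1)
strat.idx<-unlist(lapply(split(seq_len(nrow(df)),df$default),
                         function(i) sample(i,floor(0.70*length(i)))))
strat.train.df<-df[strat.idx,]
strat.test.df<-df[-strat.idx,]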


Step 4: Training Model

m1<-glm(default ~ ., data = train.df, family = "binomial")
summary(m1)

Call:
glm(formula = default ~ ., family = "binomial", data = train.df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4482  -0.1374  -0.0540  -0.0202   3.4027  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.095e+01  5.880e-01 -18.616   <2e-16 ***
studentYes  -7.167e-01  2.886e-01  -2.483    0.013 *  
balance      5.678e-03  2.741e-04  20.715   <2e-16 ***
income       6.273e-06  9.845e-06   0.637    0.524    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2030.3  on 6999  degrees of freedom
Residual deviance: 1073.5  on 6996  degrees of freedom
AIC: 1081.5

Number of Fisher Scoring iterations: 8
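
The coefficients are on the log-odds scale, so exponentiating them turns them into odds ratios, which are easier to read: for instance, each additional unit of balance multiplies the odds of default by exp(0.005678), roughly 1.0057. A quick sketch:

# Coefficients are log-odds; exponentiate to interpret them as odds ratios
round(exp(coef(m1)),6)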


Step 5: Getting the optimal threshold value

Normally the aim of a model-building exercise is to maximise accuracy. However, we also need to be careful about how many false positive (FP) and false negative (FN) cases the model generates, and we would like to keep both small. This can be done by selecting an appropriate threshold: plot sensitivity and specificity against the cutoff and take the point where the two curves intersect as the optimal threshold. At this intersection both sensitivity and specificity are jointly as high as possible, which keeps both FN and FP in check.

# Calculating the predicted probabilities on the training data
train.predict<-predict(m1,train.df,type="response")

# Creating a ROCR prediction object from the probabilities and actual labels
pred_val <-prediction(train.predict ,train.df[["default"]])


# Setting the sensitivity data frame
sens.perf <- performance(pred_val, "sens")
sens.values<-sens.perf@y.values
sens.df<-data.frame(C1=sens.perf@x.values,
                    C2=sens.values,
                    C3="Sensitivity")

colnames(sens.df)<-c("ProbValues","MetricValue","Type")

# Setting up the specificity data frame
spec.perf<-performance(pred_val, "spec")
spec.values<-spec.perf@y.values
spec.df<-data.frame(C1=spec.perf@x.values,
                    C2=spec.values,
                    C3="Specificity")

colnames(spec.df)<-c("ProbValues","MetricValue","Type")

# Combining both data frames and dropping rows with Inf cutoffs
cutoff.df<-rbind.data.frame(sens.df,spec.df)%>%
  filter(ProbValues != "Inf")%>%
  filter(MetricValue != "Inf")

# Plotting sensitivity (blue) and specificity (orange) against the cutoff
sens.df %>% ggplot(aes(ProbValues,MetricValue)) + 
  geom_line(linetype = "solid",color="blue",lwd = 1.5) + 
  geom_line(data=spec.df, aes(ProbValues,MetricValue),color="orange",lwd = 1.5) +
  scale_y_continuous(sec.axis = sec_axis(~., name = "Specificity")) +
  labs(x='Cutoff', y="Sensitivity") +
  theme(axis.title.y.right = element_text(colour = "orange"),
        axis.title.y.left = element_text(colour = "blue"),
        legend.position="none")
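
Rather than eyeballing the intersection, it can be located numerically: sens.df and spec.df share the same grid of cutoffs, so the crossover is simply the cutoff where the two metrics are closest. A small sketch (not part of the original workflow):

# Cutoff where |sensitivity - specificity| is smallest
gap<-abs(sens.df$MetricValue - spec.df$MetricValue)
sens.df$ProbValues[which.min(gap)]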

The point of intersection comes at around 0.04, so we will use 0.04 as the probability cutoff. Now we will calculate evaluation metrics such as accuracy and sensitivity on the test data.

Step 6: Generating the Confusion Matrix

Based on the cut-off probability score of 0.04, let's generate the confusion matrix for the test data.

# Recoding predicted values based on cutoff
cutoff<-0.04

# Calculating the predicted probabilities on the test data set
test.predict<-predict(m1,test.df,type="response")

test.predict.levels<-ifelse(test.predict > cutoff,"Yes","No")
test.predict.levels<-as.factor(test.predict.levels)

# Creating the table
xtab <- table(test.predict.levels, test.df[["default"]])
xtab
                   
test.predict.levels   No  Yes
                No  2585   17
                Yes  313   85
# Lets calculate the accuracy
(2585+85)/nrow(test.df) 
[1] 0.89

It comes out to be 89%, which looks good, but note that with a 3% base rate of default a model that always predicts No would already achieve about 97% accuracy, so accuracy alone is not enough here.
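
Hard-coding cell counts invites transcription slips; accuracy can also be computed directly from the confusion matrix:

# Accuracy straight from the confusion matrix
sum(diag(xtab))/sum(xtab)
[1] 0.89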


Step 7: Calculating Sensitivity, Specificity, Precision, Recall and F-Score based on the confusion matrix

Before calculating the above metrics, let's briefly look at their definitions and importance.

             Predicted
Actual      No     Yes
No          A      C
Yes         B      D

The above table can also be represented in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) as shown below:


             Predicted
Actual      No     Yes
No          TN     FP
Yes         FN     TP


  • Accuracy: (TN+TP)/(TN+TP+FN+FP)

  • Sensitivity: Proportion of positives correctly classified. Also called the true positive rate
    TP/(TP+FN)

  • Specificity: Proportion of negatives correctly classified. Also called the true negative rate
    TN/(TN+FP)

  • Precision: Proportion of predicted positive cases that are actually positive
    TP/(TP+FP)

  • Recall: Same as sensitivity

  • F1 Score: (2 x Precision x Recall)/(Precision + Recall). Important when we want to shortlist the best model among a set of competing models

Based on the above definitions, let's calculate these metrics for our present problem. Note that in xtab the rows are the predicted labels and the columns the actual labels, so TP = xtab[2,2], TN = xtab[1,1], FP = xtab[2,1] and FN = xtab[1,2].

# Creating the table
xtab <- as.matrix(table(test.predict.levels, test.df[["default"]]))
xtab
                   
test.predict.levels   No  Yes
                No  2585   17
                Yes  313   85
# Sensitivity: TP/(TP+FN)
sens<-xtab[2,2]/(xtab[2,2]+xtab[1,2])
sens
[1] 0.8333333
# Specificity: TN/(TN+FP)
spec<-xtab[1,1]/(xtab[1,1]+xtab[2,1])
spec
[1] 0.8919945
# Precision: TP/(TP+FP)
prec<-xtab[2,2]/(xtab[2,2]+xtab[2,1])
prec
[1] 0.2135678
# Recall: Same as sensitivity, i.e. 0.83

# F1 Score: (2 x Precision x Recall)/(Precision+Recall)
2*prec*sens/(prec+sens)
[1] 0.34


Step 8: Plotting the Receiver Operating Characteristic (ROC) curve and Calculating the Area Under the Curve (AUC)

The Receiver Operating Characteristic (ROC) curve is an industry-standard method to assess the performance of a two-level classifier. It is a plot of the true positive rate (sensitivity) against the false positive rate (1 - specificity). A model with high discriminating ability will have high sensitivity and specificity simultaneously, resulting in an ROC curve that bends close to the top left corner of the plot. A model with no discriminatory power will have an ROC curve along the 45 degree diagonal.

Area Under the Curve (AUC): one way to summarise the model's ability to differentiate between Yes and No is to report the AUC. As already mentioned, a model with high discriminatory power has a curve close to the top left corner and hence a high AUC. The following rough rules can be used to interpret AUC:

  • AUC ≈ 0.5: Model with no differentiation power (equivalent to random guessing)
  • 0.5 < AUC < 0.7: Model with modest differentiation power
  • AUC >= 0.7: Model with good differentiation power

With the above rules in mind, we will plot the ROC curve and calculate the AUC for our model.

# Plot the ROC curve

# Points for the 45 degree no-discrimination baseline
baseline.df<-data.frame(x=c(0,0.4,0.8,1),
                        y=c(0,0.4,0.8,1))

perf_val2 <- performance(pred_val, "tpr", "fpr")
improvement.df<-data.frame(x=perf_val2@x.values,
                           y=perf_val2@y.values)
colnames(improvement.df)<-c("x","y")

ggplot(data=baseline.df, aes(x=x, y=y, group=1)) +
  geom_line(linetype = "dashed") +
  geom_line(data=improvement.df, aes(x,y),color="orange",lwd = 1.5) +
  labs(x='FPR', y="TPR")

# Calculating the Area Under the Curve (AUC)
# pred_val was already created in Step 5
perf_val <- performance(pred_val,"auc")
auc <- as.numeric(perf_val@y.values)
auc
[1] 0.9545362

The X axis is labelled FPR, the false positive rate. Expanding 1 - Specificity shows why it equals the false positive rate:

  • 1 - Specificity = 1 - [TN/(TN+FP)]
  • = (TN+FP-TN)/(TN+FP)
  • = FP/(TN+FP)
  • which is exactly the false positive rate

Since the AUC (0.95) is well above 0.5, the model fit is appropriate and it acts as a good classifier.
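
One caveat: the AUC above was computed on the training predictions. The more honest figure uses the held-out test set, along the same lines (output omitted; it will typically be slightly lower):

# AUC on the test set using the same ROCR workflow
pred_test<-prediction(test.predict,test.df[["default"]])
as.numeric(performance(pred_test,"auc")@y.values)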


Step 9: Plotting gain and lift charts

Cumulative gains and lift charts are mostly relevant in marketing scenarios where marketers are trying to reach a target audience with minimum expenditure. They are a graphical representation of the advantage of using a predictive model to choose which customers to contact. The lift chart shows how much more likely we are to reach respondents than if we contacted a random sample of customers. For example, by contacting only the top 10% of customers ranked by the predictive model, we might reach three times as many respondents as with no model.

They lose much of their relevance in credit default models and hence have not been plotted here.
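
Should one still want them, ROCR can generate both charts from the prediction object created in Step 5; a minimal sketch:

# Lift: response rate in the targeted group relative to the overall rate,
# plotted against the rate of positive predictions (rpp)
plot(performance(pred_val,"lift","rpp"), main="Lift chart")

# Cumulative gains: true positive rate against the rate of positive predictions
plot(performance(pred_val,"tpr","rpp"), main="Cumulative gains chart")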


Final Comments

In this blog we discussed a step-by-step approach to performing logistic regression in R on a credit default problem. The evaluation steps shown above apply to any binary classification problem and hence are equally valid for decision trees, random forests, the naive Bayes classifier, etc. This should act as a solid foundation for understanding the nitty-gritty of implementing logistic regression in R.

