Wednesday, April 1, 2020

Blog 20: Pickling in R

Pickling


Introduction

In every modelling exercise, we train a model and then evaluate it on a test data set. Many of us have come across situations where we built a model in the previous quarter and now need to refresh it with the latest batch of data. In this blog we will look at how to deal with such a scenario. We will use saveRDS and readRDS, which play a role similar to Python's pickle functions: they write a single R object to a file and restore it later. We will look at cigarette consumption data and regress sales on price.


Installing the libraries: dplyr, tidyr and Ecdat

# require() loads the package if it is already installed;
# otherwise install it and then load it
if(!require("dplyr")){
  install.packages("dplyr")
  library(dplyr)
}

if(!require("tidyr")){
  install.packages("tidyr")
  library(tidyr)
}

# Ecdat provides the Cigar (cigarette consumption) data set
if(!require("Ecdat")){
  install.packages("Ecdat")
  library(Ecdat)
}

data(Cigar)
df<-Cigar
colnames(df)
[1] "state" "year"  "price" "pop"   "pop16" "cpi"   "ndi"   "sales" "pimin"


Running a Regression Model

model<-lm(sales ~ price,df)
summary(model)

Call:
lm(formula = sales ~ price, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-70.680 -15.211  -3.646   8.626 164.759 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 139.7345     1.5213   91.85   <2e-16 ***
price        -0.2298     0.0189  -12.16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 29.46 on 1378 degrees of freedom
Multiple R-squared:  0.09688,   Adjusted R-squared:  0.09623 
F-statistic: 147.8 on 1 and 1378 DF,  p-value: < 2.2e-16

Let's now save it to the working directory using saveRDS.


Saving the Model for later recall

saveRDS(model,"model.rds")


Reading the saved model using readRDS

m1<-readRDS("model.rds")
summary(m1)

Call:
lm(formula = sales ~ price, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-70.680 -15.211  -3.646   8.626 164.759 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 139.7345     1.5213   91.85   <2e-16 ***
price        -0.2298     0.0189  -12.16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 29.46 on 1378 degrees of freedom
Multiple R-squared:  0.09688,   Adjusted R-squared:  0.09623 
F-statistic: 147.8 on 1 and 1378 DF,  p-value: < 2.2e-16
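Comparing the two summaries by eye works, but the round trip can also be checked in code. The sketch below is illustrative only: it uses a small made-up data frame (hypothetical values, same column names as the Cigar data) so that it runs without Ecdat, and it writes to a temporary file rather than the working directory.

```r
# Hypothetical stand-in for the Cigar data: made-up values, same column names
set.seed(42)
df_demo <- data.frame(price = runif(100, 20, 200))
df_demo$sales <- 140 - 0.23 * df_demo$price + rnorm(100, sd = 5)

# Fit and save the model, as in the post
model <- lm(sales ~ price, data = df_demo)
path <- tempfile(fileext = ".rds")  # temporary file, an assumption for this demo
saveRDS(model, path)

# Restore it and confirm the coefficients survived the round trip
m1 <- readRDS(path)
same_coefs <- isTRUE(all.equal(coef(model), coef(m1)))
print(same_coefs)
```

Checking the coefficients rather than the whole object keeps the comparison simple; the same idea extends to fitted values or residuals if needed.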


m1 can now be used with predict() to make predictions on the incoming data set.
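As a minimal sketch of that prediction step, assuming the incoming data arrives as a data frame with a price column: the training data and the new_batch values below are made up for illustration, and a temporary file stands in for model.rds.

```r
# Hypothetical training data standing in for Cigar (made-up values)
set.seed(42)
train <- data.frame(price = runif(100, 20, 200))
train$sales <- 140 - 0.23 * train$price + rnorm(100, sd = 5)

# Fit and save once
model <- lm(sales ~ price, data = train)
path <- tempfile(fileext = ".rds")  # stand-in for "model.rds"
saveRDS(model, path)

# Later (possibly in a new R session): restore the model and score new data
m1 <- readRDS(path)
new_batch <- data.frame(price = c(50, 100, 150))  # hypothetical incoming prices
preds <- predict(m1, newdata = new_batch)
print(preds)
```

The new data frame must contain every predictor named in the model formula, here just price.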

Final Comments

This is a useful technique that lets us save a fitted model and reuse it later during the model refresh phase. It is also a convenient way to share model results with stakeholders.
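One way the refresh itself might look, sketched under assumptions: the saved model is restored, its formula is reused, and the model is refitted on the old and new batches combined. Both batches below are made-up data, and refitting on the combined data is only one possible refresh strategy, not necessarily the one used in any given project.

```r
# Hypothetical "previous quarter" batch (made-up values)
set.seed(1)
old_batch <- data.frame(price = runif(100, 20, 200))
old_batch$sales <- 140 - 0.23 * old_batch$price + rnorm(100, sd = 5)

# Model built and saved last quarter
path <- tempfile(fileext = ".rds")  # stand-in for "model.rds"
saveRDS(lm(sales ~ price, data = old_batch), path)

# This quarter: restore the model and bring in the latest batch
m1 <- readRDS(path)
new_batch <- data.frame(price = runif(50, 20, 200))
new_batch$sales <- 135 - 0.20 * new_batch$price + rnorm(50, sd = 5)

# Refit with the same formula on all available data, then overwrite the file
combined <- rbind(old_batch, new_batch)
refreshed <- lm(formula(m1), data = combined)
saveRDS(refreshed, path)
print(coef(refreshed))
```

Reusing formula(m1) keeps the refreshed model's specification in step with the saved one, so only the data changes between quarters.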

