Pickling
Parag Verma
Introduction
In every modelling exercise, we train a model and then evaluate it on a test data set. Many of us will have come across situations where we built a model in the previous quarter and now need to refresh it with the latest batch of data. In this blog we will look at how to deal with such a scenario. We will use saveRDS and readRDS, which play a role similar to the pickling functions in Python: they write a single R object to a file and restore it later. We will look at cigarette consumption data and regress sales on price.
Installing the libraries: dplyr, tidyr and Ecdat
# Install each package if it is missing, then load it
if(!require("dplyr")){
install.packages("dplyr")
library(dplyr)
}
if(!require("tidyr")){
install.packages("tidyr")
library(tidyr)
}
# Ecdat provides the Cigar data set
if(!require("Ecdat")){
install.packages("Ecdat")
library(Ecdat)
}
data(Cigar)
df<-Cigar
colnames(df)
[1] "state" "year" "price" "pop" "pop16" "cpi" "ndi" "sales" "pimin"
Running a Regression Model
model<-lm(sales ~ price,df)
summary(model)
Call:
lm(formula = sales ~ price, data = df)
Residuals:
Min 1Q Median 3Q Max
-70.680 -15.211 -3.646 8.626 164.759
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 139.7345 1.5213 91.85 <2e-16 ***
price -0.2298 0.0189 -12.16 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 29.46 on 1378 degrees of freedom
Multiple R-squared: 0.09688, Adjusted R-squared: 0.09623
F-statistic: 147.8 on 1 and 1378 DF, p-value: < 2.2e-16
Let's now save the fitted model to the working directory using saveRDS.
Saving the Model for later recall
saveRDS(model,"model.rds")
Reading the saved model using readRDS
m1<-readRDS("model.rds")
summary(m1)
Call:
lm(formula = sales ~ price, data = df)
Residuals:
Min 1Q Median 3Q Max
-70.680 -15.211 -3.646 8.626 164.759
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 139.7345 1.5213 91.85 <2e-16 ***
price -0.2298 0.0189 -12.16 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 29.46 on 1378 degrees of freedom
Multiple R-squared: 0.09688, Adjusted R-squared: 0.09623
F-statistic: 147.8 on 1 and 1378 DF, p-value: < 2.2e-16
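The two summaries above look identical, and that can also be checked programmatically. The following is a minimal sketch, assuming the Ecdat package is already installed; it fits the same regression, writes it to a temporary file (an assumption made here so the sketch leaves the working directory untouched) and compares the restored object with the original.

```r
library(Ecdat)   # assumes Ecdat is installed, as in the setup step
data(Cigar)

# Same regression as in the post
model <- lm(sales ~ price, data = Cigar)

# Temporary path used only for this sketch
path <- tempfile(fileext = ".rds")
saveRDS(model, path)
m1 <- readRDS(path)

# The restored object keeps its class and its fitted coefficients
same_class <- identical(class(model), class(m1))
same_coef  <- isTRUE(all.equal(coef(model), coef(m1)))
print(c(same_class = same_class, same_coef = same_coef))
```

Because readRDS returns the object rather than restoring it under its old name, the loaded model can be assigned to any variable, here m1.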
m1 can now be used to make predictions on an incoming data set, for example with predict().
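As a rough illustration of that step, the sketch below scores a small batch of new observations with the restored model. The data frame new_batch and its price values are hypothetical and chosen only for illustration; the temporary file path is likewise an assumption of this sketch.

```r
library(Ecdat)   # assumes Ecdat is installed, as in the setup step
data(Cigar)

# Fit, save and restore the model as in the post
model <- lm(sales ~ price, data = Cigar)
path <- tempfile(fileext = ".rds")
saveRDS(model, path)
m1 <- readRDS(path)

# Hypothetical incoming batch: only the predictor column is required
new_batch <- data.frame(price = c(30, 60, 90))

# Point predictions with 95% prediction intervals (columns fit, lwr, upr)
preds <- predict(m1, newdata = new_batch, interval = "prediction")
print(cbind(new_batch, preds))
```

The column names in the new data must match the predictor names used in the model formula, which is why new_batch has a column called price.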
Final Comments
This is a useful hack that lets us save a fitted model and reuse it later during the model refresh phase. It is also a convenient way to share a fitted model with stakeholders.
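One possible way to carry out such a refresh is sketched below. It is not the only approach: it assumes the refresh simply refits the same formula on a larger data set, and it imitates "last quarter" by training on the earlier half of the years in Cigar. The names old_batch, refreshed_data, m_old and m_new are hypothetical.

```r
library(Ecdat)   # assumes Ecdat is installed, as in the setup step
data(Cigar)

# Imitate an earlier modelling round using only the older half of the years
cutoff <- median(Cigar$year)
old_batch <- subset(Cigar, year <= cutoff)
path <- tempfile(fileext = ".rds")
saveRDS(lm(sales ~ price, data = old_batch), path)

# Later: restore the saved model and refit the same formula on all the data
m_old <- readRDS(path)
refreshed_data <- Cigar
m_new <- update(m_old, data = refreshed_data)

# Compare the coefficients before and after the refresh
print(rbind(old = coef(m_old), refreshed = coef(m_new)))
```

update() reuses the call stored in the saved model, so the formula does not have to be retyped when the new batch arrives.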
List of Datasets for Practise
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html