Demand Forecasting using Xgboost
Parag Verma
27th Dec, 2022
Introduction
All industries sell products, services or both. In the case of products, there is always a constant question around how much to produce. Successfully anticipating demand removes uncertainty from business operations and helps streamline them. Its impact on total revenue and profit is also profound, as predicting demand removes guesswork from daily operations, thereby reducing cost and increasing service levels.
To get a hang of things, we will look at how to leverage liquor sales data and create a forecasting model. We will use an xgboost model for forecasting, as it can readily incorporate multiple categorical and numeric features and often performs well compared to classical approaches such as ARIMA, SARIMA and ARIMAX.
For our analysis, we will be using the liquor_sales data set from the ialiquor package.
Step 0: Installing libraries
package.name<-c("dplyr","lubridate","openxlsx","tidyr",
                "ialiquor","stringr",
                "mlr", # for hyper parameter tuning
                "xgboost")
for(i in package.name){
  if(!require(i,character.only = T)){
    install.packages(i)
  }
  library(i,character.only = T)
}
Step 1: Importing the dataset
Importing the liquor data set
data("liquor_sales")
df<-liquor_sales
head(df)
# A tibble: 6 × 10
year year_month county popul…¹ type categ…² state…³ state…⁴ bottl…⁵
<dbl> <dttm> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 2015 2015-01-01 00:00:00 adair 7145 vodka 100 pr… 254. 381. 54
2 2015 2015-01-01 00:00:00 adair 7145 other americ… 54 81 6
3 2015 2015-01-01 00:00:00 adair 7145 liqu… americ… 89.0 134. 18
4 2015 2015-01-01 00:00:00 adair 7145 cock… americ… 182. 277. 26
5 2015 2015-01-01 00:00:00 adair 7145 liqu… americ… 346. 519. 36
6 2015 2015-01-01 00:00:00 adair 7145 gin americ… 99.6 149. 24
# … with 1 more variable: volume <dbl>, and abbreviated variable names
# ¹population, ²category, ³state_cost, ⁴state_revenue, ⁵bottles_sold
The attributes are as follows:
- year: Year of sales
- year_month: Time of sale (year and month)
- county: Name of the county
- population: Total population of the county
- type: Type of liquor
- category: Category of liquor
- state_cost: Total cost to the state for preparing the liquor
- state_revenue: Total revenue obtained by selling the volume of liquor
- bottles_sold: Number of bottles sold
- volume: Total volume sold
For this blog, we will focus on predicting the total volume sold and try to create a model using the rest of the variables available in the data set.
If you are unable to import the data, you can download the liquor_sales data set from my GitHub repository.
Step 2: Freq profiling
If you look at the data set, there are only a few fields that we can leverage for forecasting. To predict volume for future months, we would need the future values of the independent variables. Variables such as cost, revenue and population are non-deterministic and hence can't be used for model building. Instead, we will focus on time features such as year and month, and categorical attributes such as county and type.
Let's look at the distribution of these variables using freq profiling.
# Converting year month to categorical data for profiling purposes
df$year_month2<-str_c("_",df$year_month)
# Independent variables
nm<-c("year","year_month2","county","type","category")
# Dependent variables
dv<-"volume"
df.interim<-df%>%
  select(all_of(c(nm,dv)))
head(df.interim)
# A tibble: 6 × 6
year year_month2 county type category volume
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 2015 _2015-01-01 adair vodka 100 proof vodka 58.5
2 2015 _2015-01-01 adair other american alcohol 4.5
3 2015 _2015-01-01 adair liqueur american amaretto 19.5
4 2015 _2015-01-01 adair cocktail american cocktails 45.5
5 2015 _2015-01-01 adair liqueur american cordials & liqueur 27
6 2015 _2015-01-01 adair gin american dry gins 15
Let's look at the total number of records for each level of a variable, along with the total volume across it.
l1<-lapply(nm,function(x){
  z<-df%>%
    select(all_of(c(x,dv)))
  colnames(z)[1]<-"Level"
  z1<-z%>%
    group_by(Level)%>%
    summarise(Total=n(),
              Total_Volume=sum(volume))%>%
    ungroup()%>%
    mutate(Feature=x)%>%
    select(Feature,Level,everything())
  return(z1)
})
df.final<-do.call(rbind.data.frame,l1)
head(df.final)
# A tibble: 6 × 4
Feature Level Total Total_Volume
<chr> <chr> <int> <dbl>
1 year 2015 59064 19609391.
2 year 2016 53129 19528006.
3 year 2017 43487 20670621.
4 year 2018 43830 21868138.
5 year 2019 43961 22262339.
6 year 2020 37265 19985800.
write.xlsx(df.final,"Freq_profile.xlsx")
Now, if you pivot the data from the excel output and create charts, they will look something like the following.
There is variation in total liquor consumption across years, with volume peaking around 2018 and 2019. We should definitely take year into our model as it is indicative of the trend in consumption.
Monthly demand alternates through cycles of increase and decrease, and more or less settles around the 1700k mark as shown in the graph. This variable is also a good indicator of seasonality.
Also, if you look closely, the date values correspond to the first of every month. Hence, taking week information from the year_month field will not make sense, as the weeks we would get out of it would be 1, 5, 9, 13 and so on; week values would not be available for all 52 weeks of a given year.
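For instance, applying lubridate's week() (the package is already loaded) to a few first-of-month dates shows the gaps in the week numbers (a quick illustrative check):
# Week numbers for first-of-month dates: only ~12 distinct values per year
week(as.Date(c("2017-01-01","2017-02-01","2017-03-01","2017-04-01")))
# [1]  1  5  9 13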
Most of the consumption happens in the top 10 counties. Hence, for our model, we can just focus on the top 10 counties.
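A quick way to verify this from the data itself (a small sketch; the same numbers can be read off the Freq_profile.xlsx pivot):
# Top 10 counties by total volume sold
df%>%
  group_by(county)%>%
  summarise(Total_Volume = sum(volume))%>%
  arrange(desc(Total_Volume))%>%
  head(10)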
Whisky and vodka sell the most. To develop the model for a single product, let's focus on vodka initially. The modelling concepts that we develop for vodka can be extended to whisky as well.
Step 3: Derive Seasonality: An illustration
Seasonality is something that repeats regularly in each time period. In our case, it could be a pattern that repeats every year. We have already seen from the plot of year_month that volume goes through a cycle of increase and decrease. What we intend to do here is derive that pattern for every county for a single product (vodka). For illustration, let's look at polk county and vodka.
polk county and Vodka
polk.df<-df%>%
select(year,year_month,county,category,volume)%>%
mutate(Month_Nm = month(year_month))%>% # extracting the month info
filter(county %in% c("polk"),
category=="american vodka")%>%
arrange(year,Month_Nm)%>%
filter(year != 2016) # removing 2016 data so that we start from Jan 2017
head(polk.df)
# A tibble: 6 × 6
year year_month county category volume Month_Nm
<dbl> <dttm> <chr> <chr> <dbl> <dbl>
1 2017 2017-01-01 00:00:00 polk american vodka 64247. 1
2 2017 2017-02-01 00:00:00 polk american vodka 73150. 2
3 2017 2017-03-01 00:00:00 polk american vodka 77927. 3
4 2017 2017-04-01 00:00:00 polk american vodka 81923. 4
5 2017 2017-05-01 00:00:00 polk american vodka 85994. 5
6 2017 2017-06-01 00:00:00 polk american vodka 102798. 6
We will convert the volume into a time series and then decompose it to get the trend, seasonality and random components.
# Converting volume into a time series
ts_val<-ts(polk.df$volume,frequency = 12)
# Decomposing into trend, seasonality and random components
plot(decompose(ts_val))
Observations from the Plot:
- The seasonality component repeats for each time period
- There is an upward trend in volume. We will use the year column as a proxy for trend
- We do not want the observed and random components in our model
# Extracting the seasonality component from the decomposed part
ts_seasonal<-as.numeric(decompose(ts_val)$seasonal)
# Normalizing the component so that it becomes a ratio instead of absolute values
seas_factor<-ts_seasonal/max(ts_seasonal)
seas_factor
[1] -0.7946703 -0.6616139 -0.3358038 -0.1644492 1.0000000 0.4525524
[7] 0.2120918 0.9767894 -0.6340173 0.0751423 0.1862966 -0.3123180
[13] -0.7946703 -0.6616139 -0.3358038 -0.1644492 1.0000000 0.4525524
[19] 0.2120918 0.9767894 -0.6340173 0.0751423 0.1862966 -0.3123180
[25] -0.7946703 -0.6616139 -0.3358038 -0.1644492 1.0000000 0.4525524
[31] 0.2120918 0.9767894 -0.6340173 0.0751423 0.1862966 -0.3123180
[37] -0.7946703 -0.6616139 -0.3358038 -0.1644492 1.0000000 0.4525524
[43] 0.2120918 0.9767894 -0.6340173 0.0751423
Adding the seasonality column to the data frame for polk:
polk.df%>%
mutate(Seasonality=seas_factor)
# A tibble: 46 × 7
year year_month county category volume Month_Nm Seasonality
<dbl> <dttm> <chr> <chr> <dbl> <dbl> <dbl>
1 2017 2017-01-01 00:00:00 polk american vodka 64247. 1 -0.795
2 2017 2017-02-01 00:00:00 polk american vodka 73150. 2 -0.662
3 2017 2017-03-01 00:00:00 polk american vodka 77927. 3 -0.336
4 2017 2017-04-01 00:00:00 polk american vodka 81923. 4 -0.164
5 2017 2017-05-01 00:00:00 polk american vodka 85994. 5 1
6 2017 2017-06-01 00:00:00 polk american vodka 102798. 6 0.453
7 2017 2017-07-01 00:00:00 polk american vodka 84088. 7 0.212
8 2017 2017-08-01 00:00:00 polk american vodka 98003. 8 0.977
9 2017 2017-09-01 00:00:00 polk american vodka 81482. 9 -0.634
10 2017 2017-10-01 00:00:00 polk american vodka 89401. 10 0.0751
# … with 36 more rows
We saw how we were able to create the seasonality column for one county and one product. Now the goal is to create the seasonality column for more counties.
Step 4: Derive Seasonality for more counties
cnty<-c('polk','linn','scott',
'johnson','black hawk','woodbury',
'dubuque','story','pottawatta',
'dallas','cerro gord','dickinson',
'clinton','des moines','lee',
'webster','muscatine',
'marshall','warren','wapello')
# Running through data for various counties to get seasonality
l2<-list()
l2<-lapply(cnty,function(x){
  county.df<-df%>%
    select(year,year_month,county,category,volume)%>%
    mutate(Month_Nm = month(year_month))%>% # extracting the month info
    filter(county %in% x,
           category=="american vodka")%>%
    arrange(year,Month_Nm)%>%
    filter(year != 2016) # removing 2016 data so that we start from Jan 2017
  # Converting volume into a time series
  ts_val<-ts(county.df$volume,frequency = 12)
  # Decomposing into trend, seasonality and random components
  plot(decompose(ts_val))
  # Extracting the seasonality component from the decomposed part
  ts_seasonal<-as.numeric(decompose(ts_val)$seasonal)
  # Normalizing the component so that it becomes a ratio instead of absolute values
  seas_factor<-ts_seasonal/max(ts_seasonal)
  seas_df<-county.df%>%
    mutate(Seasonality=seas_factor)
  return(seas_df)
})
seas.final.df<-do.call(rbind.data.frame,l2)
head(seas.final.df)
# A tibble: 6 × 7
year year_month county category volume Month_Nm Seasonality
<dbl> <dttm> <chr> <chr> <dbl> <dbl> <dbl>
1 2017 2017-01-01 00:00:00 polk american vodka 64247. 1 -0.795
2 2017 2017-02-01 00:00:00 polk american vodka 73150. 2 -0.662
3 2017 2017-03-01 00:00:00 polk american vodka 77927. 3 -0.336
4 2017 2017-04-01 00:00:00 polk american vodka 81923. 4 -0.164
5 2017 2017-05-01 00:00:00 polk american vodka 85994. 5 1
6 2017 2017-06-01 00:00:00 polk american vodka 102798. 6 0.453
Now, if you look closely at the seas.final.df data frame, there are text values in it. However, for modelling with xgboost, all the data should be in numeric form since the model can't understand text. Hence we will convert the text columns into numeric columns using one-hot encoding.
Step 5: Converting text values into Numeric data
interim.df<- seas.final.df%>%
  select(-year_month)%>%
  gather(key="Attributes",value = "Values",county:category)%>%
  mutate(Count=1,
         Index=1:n())%>%
  spread(Values,Count,fill=0)%>%
  select(-Attributes)%>%
  group_by(year,Month_Nm,volume,Seasonality)%>%
  summarise(across(everything(),list(sum)))%>%
  select(-'american vodka_1',-Index_1) # as the column just has a single value and won't add much info to the model
`summarise()` has grouped output by 'year', 'Month_Nm', 'volume'. You can
override using the `.groups` argument.
head(interim.df)
# A tibble: 6 × 24
# Groups: year, Month_Nm, volume [6]
year Month_Nm volume Season…¹ black…² cerro…³ clint…⁴ dalla…⁵ des m…⁶ dicki…⁷
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2017 1 1817. -0.215 0 0 0 0 0 0
2 2017 1 2656. -0.725 0 0 0 0 0 0
3 2017 1 3007. -0.124 0 0 0 0 0 0
4 2017 1 3217. -0.288 0 0 0 0 0 0
5 2017 1 3275. -0.863 0 0 0 0 0 0
6 2017 1 3861. -0.811 0 0 0 0 1 0
# … with 14 more variables: dubuque_1 <dbl>, johnson_1 <dbl>, lee_1 <dbl>,
# linn_1 <dbl>, marshall_1 <dbl>, muscatine_1 <dbl>, polk_1 <dbl>,
# pottawatta_1 <dbl>, scott_1 <dbl>, story_1 <dbl>, wapello_1 <dbl>,
# warren_1 <dbl>, webster_1 <dbl>, woodbury_1 <dbl>, and abbreviated variable
# names ¹Seasonality, ²`black hawk_1`, ³`cerro gord_1`, ⁴clinton_1,
# ⁵dallas_1, ⁶`des moines_1`, ⁷dickinson_1
We see that we now have individual columns for each of the counties, and the data in the table is all numeric. It can now be used for modelling purposes.
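As an aside, the same kind of one-hot encoding can also be produced with base R's model.matrix(). The sketch below encodes only county (category has a single value here and is dropped anyway); the resulting column names will differ slightly from the gather/spread version above:
# One-hot encode county with model.matrix (the -1 drops the intercept column)
county_dummies <- model.matrix(~ county - 1, data = seas.final.df)
interim.alt <- cbind(seas.final.df%>%
                       select(year, Month_Nm, volume, Seasonality),
                     county_dummies)
head(interim.alt)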
Step 6: Introduction to Xgboost Model
Useful references:
- https://www.geeksforgeeks.org/xgboost/
- https://analyticsindiamag.com/xgboost-internal-working-to-make-decision-trees-and-deduce-predictions/
Xgboost builds successive decision trees in series, where each subsequent tree tries to reduce the errors (residuals) left by the previous trees. The final prediction is obtained by combining the outputs of all the trees, each scaled by the learning rate.
This is different from a random forest model, where the trees are built independently and their outputs are averaged at the end.
Before we go into an illustration, we should take note of a few parameters that are used in the model. These are:
- \(\lambda\): the regularization parameter
- \(\eta\) (eta): the learning rate, which controls how quickly the model converges (let's say it is 0.3)
- \(\gamma\): the threshold used for automatic tree pruning
Let's look at a small example to understand this.
Let's say we are trying to predict IQ using xgboost. As the first decision tree, let's use the mean IQ as the estimate. The mean IQ from the above table is around 30. Let's call this model M0.
Hence the error, which is the difference between the actual IQ and the mean IQ, can be represented as below.
Let's calculate the Similarity Score for M0, which is equal to \((\text{sum of residuals})^2 / (N + \lambda)\), where N is the total number of residuals. To start, let's set \(\lambda\) to 0:
Similarity Score = (-10 + 4 + 8)^2 / (3 + 0) = 4/3 = 1.33
Now let's create the subsequent decision tree and call it M1.
Let's use Age as the splitting variable and split at Age > 10. As a consequence, the decision tree will look as shown below.
Let's calculate the Similarity Score for each of these branches:
Branch to the left: (-10)^2/1 = 100
Branch to the right: (4 + 8)^2/2 = 144/2 = 72
The Gain as a result of the split is:
Gain = (100 + 72) - 1.33 = 170.67
Now let's say that the threshold for creating a branch is 130 (this is the value specified by the \(\gamma\) pruning parameter). Branching will only take place if the gain is greater than \(\gamma\). Since in our case the gain (170.67) is greater than 130, the branching will take place. Higher values of the pruning parameter prevent branching and prune the tree.
Now, the interesting part about xgboost is that each successive tree takes the residuals as its dependent variable. Hence, the new predictions are:
For the left branch: old prediction + learning rate * average residual = 30 + 0.3*(-10) = 30 - 3 = 27
For the right branch, first record: old prediction + learning rate * average residual = 30 + 0.3*6 = 30 + 1.8 = 31.8
For the right branch, second record: old prediction + learning rate * average residual = 30 + 0.3*6 = 30 + 1.8 = 31.8
Now the new table will become
You can see that the residuals have decreased in comparison to model M0.
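To make the arithmetic above concrete, here is a minimal R sketch of the similarity score, gain and updated predictions for the toy IQ example (assuming \(\lambda\) = 0 and a learning rate of 0.3):
# Residuals of model M0 (actual IQ minus the mean prediction of 30)
resid_m0 <- c(-10, 4, 8)
lambda <- 0
eta <- 0.3
# Similarity score: (sum of residuals)^2 / (number of residuals + lambda)
similarity <- function(res) sum(res)^2 / (length(res) + lambda)
root_score  <- similarity(resid_m0)        # 1.33
left_score  <- similarity(resid_m0[1])     # (-10)^2/1 = 100 (Age <= 10)
right_score <- similarity(resid_m0[2:3])   # (4+8)^2/2 = 72  (Age > 10)
gain <- (left_score + right_score) - root_score   # 170.67
# New predictions after the split
new_pred_left  <- 30 + eta * mean(resid_m0[1])    # 27
new_pred_right <- 30 + eta * mean(resid_m0[2:3])  # 31.8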
We will now look at how to tune the parameters of an xgboost model. First, we will divide the data set into train and test.
Step 7: Train and Test Dataset
# Cleaning column names
nm<-gsub("_1","",colnames(interim.df))
colnames(interim.df)<-nm
nm2<-gsub(" ","",colnames(interim.df))
colnames(interim.df)<-nm2
set.seed(1)
smpl<-sample(1:nrow(interim.df),0.70*nrow(interim.df),replace = F)
# Create the training dataset
train.df<-interim.df[smpl,]%>%
  as.data.frame()%>%
  as.matrix()
train.df2<-train.df[,colnames(train.df) != "volume"]
class(train.df)
[1] "matrix" "array"
# Create the test dataset
test.df<-as.matrix(interim.df[-smpl,])
test.df2<-test.df[,colnames(test.df) != "volume"]
y_train <- train.df%>%
as.data.frame()%>%
select(volume)%>%
pull(volume)%>%
as.numeric()
y_test <- test.df%>%
as.data.frame()%>%
select(volume)%>%
pull(volume)%>%
as.numeric()
dim(train.df2) # 644 records
[1] 644 23
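As a side note, xgboost expects its features as a numeric matrix, which is why the data was converted with as.matrix() above. The features and label can also be bundled into an xgb.DMatrix, the package's native data structure (a small sketch; we keep using the plain matrices below):
# Optional: bundle features and label into xgboost's native data structure
dtrain <- xgb.DMatrix(data = train.df2, label = y_train)
dtest  <- xgb.DMatrix(data = test.df2, label = y_test)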
Step 8: Hyper parameter tuning
Before we create the model, we first have to tune its parameters to get the best results out of it. You can think of it as tuning a radio to get to a particular channel.
The mlr library in R can be used to tune the parameters of most ML models. There are a couple of standard steps for tuning the parameters of a model. They are:
- Initiate a learner
- Set a grid
- Controlling the number of iterations
- Setting up cross validation
- Create a task
- Tune Parameters
Step 8A: Initiate a regression learner
xgb_learner <- makeLearner(
  "regr.xgboost",
  predict.type = "response",
  par.vals = list(
    objective = "reg:squarederror",
    eval_metric = "rmse",
    nrounds = 200
  )
)
Since this is a regression problem, we have used the regr.xgboost learner with predict.type equal to "response".
Step 8B: Set grid parameters
xgb_params <- makeParamSet(
  # stop adding trees if there is no improvement for this many rounds
  makeIntegerParam("early_stopping_rounds", lower = 5, upper = 15),
  # the number of trees in the model (each one built sequentially)
  makeIntegerParam("nrounds", lower = 100, upper = 500),
  # maximum depth of each tree (controls the number of splits)
  makeIntegerParam("max_depth", lower = 5, upper = 10),
  # learning rate ("shrinkage") - prevents overfitting
  makeNumericParam("eta", lower = .001, upper = .25)
)
Step 8C: Controlling the number of iterations
control <- makeTuneControlRandom(maxit = 100)
Step 8D: Setting up Cross Validation (CV)
# Create a description of the resampling plan
resample_desc <- makeResampleDesc("CV", iters = 5)
Step 8E: Create a task
forecast_data_task<-makeRegrTask(data=train.df%>%
as.data.frame(),target="volume")
Step 8F: Tune Parameters
tuned_params <- tuneParams(
  learner = xgb_learner,
  task = forecast_data_task,
  resampling = resample_desc,
  par.set = xgb_params,
  control = control
)
[Tune] Started tuning learner regr.xgboost for parameter set:
Type len Def Constr Req Tunable Trafo
early_stopping_rounds integer - - 5 to 15 - TRUE -
nrounds integer - - 100 to 500 - TRUE -
max_depth integer - - 5 to 10 - TRUE -
eta numeric - - 0.001 to 0.25 - TRUE -
With control class: TuneControlRandom
Imputation value: Inf
[Tune-x] 1: early_stopping_rounds=7; nrounds=276; max_depth=9; eta=0.0874
[Tune-y] 1: mse.test.mean=9759633.5783240; time: 0.2 min
[Tune-x] 2: early_stopping_rounds=6; nrounds=340; max_depth=10; eta=0.0978
[Tune-y] 2: mse.test.mean=9919169.3346103; time: 0.3 min
[Tune-x] 3: early_stopping_rounds=6; nrounds=423; max_depth=5; eta=0.191
[Tune-y] 3: mse.test.mean=9131655.5997733; time: 0.1 min
[Tune-x] 4: early_stopping_rounds=6; nrounds=442; max_depth=8; eta=0.149
[Tune-y] 4: mse.test.mean=9381837.0573715; time: 0.2 min
[Tune-x] 5: early_stopping_rounds=14; nrounds=447; max_depth=5; eta=0.118
[Tune-y] 5: mse.test.mean=9061322.5056555; time: 0.1 min
[Tune-x] 6: early_stopping_rounds=8; nrounds=430; max_depth=8; eta=0.11
[Tune-y] 6: mse.test.mean=9722901.3475809; time: 0.2 min
[Tune-x] 7: early_stopping_rounds=5; nrounds=199; max_depth=7; eta=0.0287
[Tune-y] 7: mse.test.mean=9405569.1843093; time: 0.1 min
[Tune-x] 8: early_stopping_rounds=9; nrounds=195; max_depth=10; eta=0.0575
[Tune-y] 8: mse.test.mean=10109152.9012836; time: 0.1 min
[Tune-x] 9: early_stopping_rounds=13; nrounds=387; max_depth=5; eta=0.12
[Tune-y] 9: mse.test.mean=9051325.2836154; time: 0.1 min
[Tune-x] 10: early_stopping_rounds=6; nrounds=205; max_depth=6; eta=0.0469
[Tune-y] 10: mse.test.mean=8706463.4047389; time: 0.1 min
[Tune-x] 11: early_stopping_rounds=5; nrounds=474; max_depth=5; eta=0.102
[Tune-y] 11: mse.test.mean=8919214.0957784; time: 0.1 min
[Tune-x] 12: early_stopping_rounds=13; nrounds=142; max_depth=7; eta=0.0841
[Tune-y] 12: mse.test.mean=8938006.7405959; time: 0.1 min
[Tune-x] 13: early_stopping_rounds=13; nrounds=315; max_depth=9; eta=0.2
[Tune-y] 13: mse.test.mean=9644585.0316658; time: 0.2 min
[Tune-x] 14: early_stopping_rounds=7; nrounds=286; max_depth=6; eta=0.137
[Tune-y] 14: mse.test.mean=9299234.3889818; time: 0.1 min
[Tune-x] 15: early_stopping_rounds=15; nrounds=159; max_depth=7; eta=0.197
[Tune-y] 15: mse.test.mean=9351119.0973083; time: 0.1 min
[Tune-x] 16: early_stopping_rounds=11; nrounds=316; max_depth=10; eta=0.193
[Tune-y] 16: mse.test.mean=9675316.8635050; time: 0.3 min
[Tune-x] 17: early_stopping_rounds=9; nrounds=476; max_depth=8; eta=0.143
[Tune-y] 17: mse.test.mean=9588756.3034094; time: 0.3 min
[Tune-x] 18: early_stopping_rounds=6; nrounds=255; max_depth=10; eta=0.0561
[Tune-y] 18: mse.test.mean=10065097.1353350; time: 0.2 min
[Tune-x] 19: early_stopping_rounds=7; nrounds=213; max_depth=5; eta=0.0459
[Tune-y] 19: mse.test.mean=8586267.9754261; time: 0.1 min
[Tune-x] 20: early_stopping_rounds=8; nrounds=358; max_depth=10; eta=0.0187
[Tune-y] 20: mse.test.mean=9830458.6477461; time: 0.2 min
[Tune-x] 21: early_stopping_rounds=5; nrounds=116; max_depth=7; eta=0.0638
[Tune-y] 21: mse.test.mean=9000524.7231902; time: 0.0 min
[Tune-x] 22: early_stopping_rounds=6; nrounds=445; max_depth=5; eta=0.138
[Tune-y] 22: mse.test.mean=9211781.8761617; time: 0.1 min
[Tune-x] 23: early_stopping_rounds=10; nrounds=224; max_depth=9; eta=0.187
[Tune-y] 23: mse.test.mean=9340147.3757327; time: 0.2 min
[Tune-x] 24: early_stopping_rounds=15; nrounds=245; max_depth=7; eta=0.114
[Tune-y] 24: mse.test.mean=9511333.8930402; time: 0.1 min
[Tune-x] 25: early_stopping_rounds=12; nrounds=312; max_depth=5; eta=0.216
[Tune-y] 25: mse.test.mean=8966000.4567943; time: 0.1 min
[Tune-x] 26: early_stopping_rounds=11; nrounds=461; max_depth=10; eta=0.225
[Tune-y] 26: mse.test.mean=9849430.1026096; time: 0.3 min
[Tune-x] 27: early_stopping_rounds=12; nrounds=435; max_depth=7; eta=0.158
[Tune-y] 27: mse.test.mean=9449230.0497988; time: 0.2 min
[Tune-x] 28: early_stopping_rounds=9; nrounds=488; max_depth=6; eta=0.245
[Tune-y] 28: mse.test.mean=9543746.5124885; time: 0.2 min
[Tune-x] 29: early_stopping_rounds=12; nrounds=464; max_depth=9; eta=0.104
[Tune-y] 29: mse.test.mean=10074564.2291761; time: 0.3 min
[Tune-x] 30: early_stopping_rounds=15; nrounds=159; max_depth=5; eta=0.145
[Tune-y] 30: mse.test.mean=8409795.4878290; time: 0.0 min
[Tune-x] 31: early_stopping_rounds=10; nrounds=216; max_depth=6; eta=0.152
[Tune-y] 31: mse.test.mean=9338310.5182170; time: 0.1 min
[Tune-x] 32: early_stopping_rounds=9; nrounds=216; max_depth=7; eta=0.206
[Tune-y] 32: mse.test.mean=9111410.5612743; time: 0.1 min
[Tune-x] 33: early_stopping_rounds=7; nrounds=495; max_depth=8; eta=0.208
[Tune-y] 33: mse.test.mean=9486996.2727286; time: 0.3 min
[Tune-x] 34: early_stopping_rounds=7; nrounds=409; max_depth=5; eta=0.129
[Tune-y] 34: mse.test.mean=9038940.1075297; time: 0.1 min
[Tune-x] 35: early_stopping_rounds=14; nrounds=495; max_depth=7; eta=0.118
[Tune-y] 35: mse.test.mean=9376826.7168538; time: 0.2 min
[Tune-x] 36: early_stopping_rounds=9; nrounds=467; max_depth=7; eta=0.236
[Tune-y] 36: mse.test.mean=9388504.9499630; time: 0.2 min
[Tune-x] 37: early_stopping_rounds=6; nrounds=326; max_depth=9; eta=0.00114
[Tune-y] 37: mse.test.mean=315762911.3693092; time: 0.1 min
[Tune-x] 38: early_stopping_rounds=11; nrounds=248; max_depth=5; eta=0.169
[Tune-y] 38: mse.test.mean=9010457.1250567; time: 0.1 min
[Tune-x] 39: early_stopping_rounds=11; nrounds=138; max_depth=7; eta=0.0255
[Tune-y] 39: mse.test.mean=11772205.7411346; time: 0.0 min
[Tune-x] 40: early_stopping_rounds=13; nrounds=427; max_depth=7; eta=0.00164
[Tune-y] 40: mse.test.mean=177810567.9447607; time: 0.1 min
[Tune-x] 41: early_stopping_rounds=11; nrounds=337; max_depth=10; eta=0.0991
[Tune-y] 41: mse.test.mean=10377173.8212600; time: 0.3 min
[Tune-x] 42: early_stopping_rounds=15; nrounds=202; max_depth=5; eta=0.132
[Tune-y] 42: mse.test.mean=8712549.6542992; time: 0.1 min
[Tune-x] 43: early_stopping_rounds=9; nrounds=309; max_depth=5; eta=0.15
[Tune-y] 43: mse.test.mean=9374192.4683403; time: 0.1 min
[Tune-x] 44: early_stopping_rounds=12; nrounds=320; max_depth=8; eta=0.0748
[Tune-y] 44: mse.test.mean=9541312.5051237; time: 0.2 min
[Tune-x] 45: early_stopping_rounds=8; nrounds=341; max_depth=6; eta=0.229
[Tune-y] 45: mse.test.mean=9379919.8475423; time: 0.1 min
[Tune-x] 46: early_stopping_rounds=15; nrounds=399; max_depth=7; eta=0.233
[Tune-y] 46: mse.test.mean=9131112.9625156; time: 0.2 min
[Tune-x] 47: early_stopping_rounds=11; nrounds=257; max_depth=6; eta=0.18
[Tune-y] 47: mse.test.mean=9323293.8399238; time: 0.1 min
[Tune-x] 48: early_stopping_rounds=15; nrounds=377; max_depth=10; eta=0.0579
[Tune-y] 48: mse.test.mean=10170544.1343835; time: 0.3 min
[Tune-x] 49: early_stopping_rounds=11; nrounds=183; max_depth=8; eta=0.242
[Tune-y] 49: mse.test.mean=9636149.5644085; time: 0.1 min
[Tune-x] 50: early_stopping_rounds=13; nrounds=181; max_depth=7; eta=0.207
[Tune-y] 50: mse.test.mean=9518563.4519001; time: 0.1 min
[Tune-x] 51: early_stopping_rounds=9; nrounds=258; max_depth=5; eta=0.0348
[Tune-y] 51: mse.test.mean=8506692.6637379; time: 0.1 min
[Tune-x] 52: early_stopping_rounds=12; nrounds=125; max_depth=10; eta=0.0293
[Tune-y] 52: mse.test.mean=10249077.1249290; time: 0.1 min
[Tune-x] 53: early_stopping_rounds=11; nrounds=110; max_depth=5; eta=0.0479
[Tune-y] 53: mse.test.mean=10886102.2646692; time: 0.0 min
[Tune-x] 54: early_stopping_rounds=13; nrounds=434; max_depth=9; eta=0.0709
[Tune-y] 54: mse.test.mean=9872622.3394087; time: 0.3 min
[Tune-x] 55: early_stopping_rounds=12; nrounds=484; max_depth=8; eta=0.108
[Tune-y] 55: mse.test.mean=10033723.8212178; time: 0.3 min
[Tune-x] 56: early_stopping_rounds=12; nrounds=169; max_depth=6; eta=0.0976
[Tune-y] 56: mse.test.mean=8765298.9059857; time: 0.1 min
[Tune-x] 57: early_stopping_rounds=5; nrounds=211; max_depth=9; eta=0.0828
[Tune-y] 57: mse.test.mean=9675908.9168039; time: 0.1 min
[Tune-x] 58: early_stopping_rounds=11; nrounds=310; max_depth=10; eta=0.134
[Tune-y] 58: mse.test.mean=10057024.4673807; time: 0.2 min
[Tune-x] 59: early_stopping_rounds=6; nrounds=411; max_depth=5; eta=0.177
[Tune-y] 59: mse.test.mean=9137584.3335131; time: 0.1 min
[Tune-x] 60: early_stopping_rounds=5; nrounds=413; max_depth=9; eta=0.139
[Tune-y] 60: mse.test.mean=9888009.8595234; time: 0.3 min
[Tune-x] 61: early_stopping_rounds=14; nrounds=184; max_depth=6; eta=0.169
[Tune-y] 61: mse.test.mean=9019567.7280693; time: 0.1 min
[Tune-x] 62: early_stopping_rounds=9; nrounds=364; max_depth=5; eta=0.0865
[Tune-y] 62: mse.test.mean=8657003.9123326; time: 0.1 min
[Tune-x] 63: early_stopping_rounds=13; nrounds=239; max_depth=5; eta=0.216
[Tune-y] 63: mse.test.mean=8986833.9177502; time: 0.1 min
[Tune-x] 64: early_stopping_rounds=11; nrounds=464; max_depth=9; eta=0.137
[Tune-y] 64: mse.test.mean=9720535.4019169; time: 0.3 min
[Tune-x] 65: early_stopping_rounds=10; nrounds=239; max_depth=8; eta=0.238
[Tune-y] 65: mse.test.mean=9675939.5121680; time: 0.1 min
[Tune-x] 66: early_stopping_rounds=7; nrounds=142; max_depth=6; eta=0.046
[Tune-y] 66: mse.test.mean=9312890.9201391; time: 0.0 min
[Tune-x] 67: early_stopping_rounds=10; nrounds=421; max_depth=7; eta=0.0452
[Tune-y] 67: mse.test.mean=9292563.8669782; time: 0.2 min
[Tune-x] 68: early_stopping_rounds=9; nrounds=426; max_depth=10; eta=0.0442
[Tune-y] 68: mse.test.mean=10332404.8956508; time: 0.3 min
[Tune-x] 69: early_stopping_rounds=8; nrounds=382; max_depth=10; eta=0.108
[Tune-y] 69: mse.test.mean=10496271.1894498; time: 0.3 min
[Tune-x] 70: early_stopping_rounds=6; nrounds=426; max_depth=9; eta=0.227
[Tune-y] 70: mse.test.mean=9243843.7546483; time: 0.3 min
[Tune-x] 71: early_stopping_rounds=12; nrounds=283; max_depth=5; eta=0.183
[Tune-y] 71: mse.test.mean=9225009.7384530; time: 0.1 min
[Tune-x] 72: early_stopping_rounds=15; nrounds=114; max_depth=5; eta=0.0997
[Tune-y] 72: mse.test.mean=8329617.3171094; time: 0.0 min
[Tune-x] 73: early_stopping_rounds=6; nrounds=145; max_depth=6; eta=0.146
[Tune-y] 73: mse.test.mean=8705292.4750901; time: 0.0 min
[Tune-x] 74: early_stopping_rounds=6; nrounds=298; max_depth=7; eta=0.102
[Tune-y] 74: mse.test.mean=9198731.7818394; time: 0.1 min
[Tune-x] 75: early_stopping_rounds=8; nrounds=270; max_depth=7; eta=0.207
[Tune-y] 75: mse.test.mean=9553370.3012008; time: 0.1 min
[Tune-x] 76: early_stopping_rounds=8; nrounds=275; max_depth=5; eta=0.242
[Tune-y] 76: mse.test.mean=9734527.9340654; time: 0.1 min
[Tune-x] 77: early_stopping_rounds=8; nrounds=111; max_depth=7; eta=0.16
[Tune-y] 77: mse.test.mean=8824551.0876736; time: 0.0 min
[Tune-x] 78: early_stopping_rounds=8; nrounds=494; max_depth=5; eta=0.0329
[Tune-y] 78: mse.test.mean=8394268.7106736; time: 0.1 min
[Tune-x] 79: early_stopping_rounds=7; nrounds=185; max_depth=10; eta=0.087
[Tune-y] 79: mse.test.mean=10225026.2280932; time: 0.1 min
[Tune-x] 80: early_stopping_rounds=9; nrounds=315; max_depth=6; eta=0.102
[Tune-y] 80: mse.test.mean=9109585.7187011; time: 0.1 min
[Tune-x] 81: early_stopping_rounds=9; nrounds=229; max_depth=5; eta=0.243
[Tune-y] 81: mse.test.mean=9370137.1522283; time: 0.1 min
[Tune-x] 82: early_stopping_rounds=6; nrounds=319; max_depth=10; eta=0.216
[Tune-y] 82: mse.test.mean=10259877.1528035; time: 0.3 min
[Tune-x] 83: early_stopping_rounds=11; nrounds=100; max_depth=6; eta=0.218
[Tune-y] 83: mse.test.mean=8880339.1142502; time: 0.0 min
[Tune-x] 84: early_stopping_rounds=10; nrounds=463; max_depth=10; eta=0.0794
[Tune-y] 84: mse.test.mean=10397728.9055980; time: 0.4 min
[Tune-x] 85: early_stopping_rounds=5; nrounds=264; max_depth=9; eta=0.238
[Tune-y] 85: mse.test.mean=9735613.3164361; time: 0.2 min
[Tune-x] 86: early_stopping_rounds=12; nrounds=307; max_depth=9; eta=0.213
[Tune-y] 86: mse.test.mean=9787244.9161651; time: 0.2 min
[Tune-x] 87: early_stopping_rounds=14; nrounds=389; max_depth=7; eta=0.11
[Tune-y] 87: mse.test.mean=9470438.5209672; time: 0.2 min
[Tune-x] 88: early_stopping_rounds=14; nrounds=254; max_depth=9; eta=0.0204
[Tune-y] 88: mse.test.mean=9476739.6403603; time: 0.1 min
[Tune-x] 89: early_stopping_rounds=14; nrounds=106; max_depth=5; eta=0.123
[Tune-y] 89: mse.test.mean=8350854.8763840; time: 0.0 min
[Tune-x] 90: early_stopping_rounds=12; nrounds=343; max_depth=10; eta=0.0776
[Tune-y] 90: mse.test.mean=10408315.5189620; time: 0.3 min
[Tune-x] 91: early_stopping_rounds=10; nrounds=395; max_depth=5; eta=0.171
[Tune-y] 91: mse.test.mean=9375547.4927445; time: 0.1 min
[Tune-x] 92: early_stopping_rounds=14; nrounds=247; max_depth=5; eta=0.13
[Tune-y] 92: mse.test.mean=8958574.7415755; time: 0.1 min
[Tune-x] 93: early_stopping_rounds=6; nrounds=461; max_depth=6; eta=0.156
[Tune-y] 93: mse.test.mean=9340260.2677384; time: 0.2 min
[Tune-x] 94: early_stopping_rounds=8; nrounds=416; max_depth=10; eta=0.172
[Tune-y] 94: mse.test.mean=10602911.5919019; time: 0.3 min
[Tune-x] 95: early_stopping_rounds=5; nrounds=484; max_depth=7; eta=0.136
[Tune-y] 95: mse.test.mean=9354475.2557501; time: 0.2 min
[Tune-x] 96: early_stopping_rounds=5; nrounds=164; max_depth=9; eta=0.0813
[Tune-y] 96: mse.test.mean=9768378.0063813; time: 0.1 min
[Tune-x] 97: early_stopping_rounds=11; nrounds=338; max_depth=5; eta=0.0991
[Tune-y] 97: mse.test.mean=9006327.1094235; time: 0.1 min
[Tune-x] 98: early_stopping_rounds=8; nrounds=273; max_depth=7; eta=0.187
[Tune-y] 98: mse.test.mean=9441931.2334022; time: 0.1 min
[Tune-x] 99: early_stopping_rounds=15; nrounds=187; max_depth=9; eta=0.00784
[Tune-y] 99: mse.test.mean=50989605.8255714; time: 0.1 min
[Tune-x] 100: early_stopping_rounds=15; nrounds=492; max_depth=9; eta=0.167
[Tune-y] 100: mse.test.mean=9521843.6192693; time: 0.3 min
[Tune] Result: early_stopping_rounds=15; nrounds=114; max_depth=5; eta=0.0997 : mse.test.mean=8329617.3171094
Step 9: Getting the optimal parameters
smpl<-0.5 # for sampling during training
stping_rounds<-tuned_params[[3]][[1]]
eta_par<-tuned_params[[3]][[4]]
nrounds_par<-tuned_params[[3]][[2]]
max_depth_par<-tuned_params[[3]][[3]]
txt<-str_c("subsample:",smpl,"_early_stopping_rounds:",stping_rounds,"_max_depth:",
max_depth_par,"_eta:",eta_par,"_nrounds:",nrounds_par)
txt
[1] "subsample:0.5_early_stopping_rounds:15_max_depth:5_eta:0.0997008429991547_nrounds:114"
Step 10: Training the model
xgb.model.best= xgboost(params = list(subsample = smpl),
                        data = train.df2,       # training matrix of features
                        label = y_train,        # output vector to be predicted
                        eval.metric = 'rmse',   # model minimizes Root Mean Squared Error
                        objective = "reg:squarederror", # regression
                        # tuning parameters
                        early_stopping_rounds = stping_rounds,
                        max_depth = max_depth_par,
                        eta = eta_par,
                        nrounds = nrounds_par,
                        verbose = 1
)
[1] train-rmse:22784.155878
Will train until train_rmse hasn't improved in 15 rounds.
[2] train-rmse:20790.280270
[3] train-rmse:18973.552349
[4] train-rmse:17267.477319
[5] train-rmse:15798.425832
[6] train-rmse:14487.459047
[7] train-rmse:13290.274150
[8] train-rmse:12261.317795
[9] train-rmse:11287.366067
[10] train-rmse:10436.551972
[11] train-rmse:9695.970194
[12] train-rmse:9011.618926
[13] train-rmse:8410.848331
[14] train-rmse:7846.456868
[15] train-rmse:7302.792459
[16] train-rmse:6821.362806
[17] train-rmse:6454.191725
[18] train-rmse:6137.399849
[19] train-rmse:5854.895449
[20] train-rmse:5566.010415
[21] train-rmse:5295.302757
[22] train-rmse:5057.472774
[23] train-rmse:4811.764895
[24] train-rmse:4628.713444
[25] train-rmse:4457.816523
[26] train-rmse:4330.500426
[27] train-rmse:4190.613580
[28] train-rmse:4063.126750
[29] train-rmse:3933.762081
[30] train-rmse:3801.123928
[31] train-rmse:3684.602996
[32] train-rmse:3588.009602
[33] train-rmse:3507.782160
[34] train-rmse:3420.027090
[35] train-rmse:3359.482143
[36] train-rmse:3277.035913
[37] train-rmse:3195.524083
[38] train-rmse:3129.046678
[39] train-rmse:3057.615364
[40] train-rmse:3001.290765
[41] train-rmse:2930.848441
[42] train-rmse:2875.330856
[43] train-rmse:2826.674819
[44] train-rmse:2774.621092
[45] train-rmse:2726.212941
[46] train-rmse:2680.299123
[47] train-rmse:2633.844855
[48] train-rmse:2589.091061
[49] train-rmse:2537.431850
[50] train-rmse:2504.764415
[51] train-rmse:2469.064400
[52] train-rmse:2430.357590
[53] train-rmse:2400.879781
[54] train-rmse:2363.350657
[55] train-rmse:2333.978394
[56] train-rmse:2299.104966
[57] train-rmse:2268.021843
[58] train-rmse:2228.533381
[59] train-rmse:2199.257141
[60] train-rmse:2177.466456
[61] train-rmse:2132.963736
[62] train-rmse:2109.978926
[63] train-rmse:2093.805951
[64] train-rmse:2063.849087
[65] train-rmse:2040.386483
[66] train-rmse:2016.678816
[67] train-rmse:1993.706914
[68] train-rmse:1978.022977
[69] train-rmse:1948.613613
[70] train-rmse:1932.409334
[71] train-rmse:1916.209449
[72] train-rmse:1902.606214
[73] train-rmse:1886.405835
[74] train-rmse:1872.948578
[75] train-rmse:1860.896393
[76] train-rmse:1848.343301
[77] train-rmse:1835.585713
[78] train-rmse:1826.229335
[79] train-rmse:1816.928760
[80] train-rmse:1804.911331
[81] train-rmse:1794.877595
[82] train-rmse:1785.904686
[83] train-rmse:1771.771758
[84] train-rmse:1762.537625
[85] train-rmse:1743.674746
[86] train-rmse:1727.092388
[87] train-rmse:1716.174311
[88] train-rmse:1707.267788
[89] train-rmse:1701.654570
[90] train-rmse:1693.814856
[91] train-rmse:1670.058637
[92] train-rmse:1644.101291
[93] train-rmse:1637.279127
[94] train-rmse:1620.925174
[95] train-rmse:1616.062892
[96] train-rmse:1609.234957
[97] train-rmse:1593.808599
[98] train-rmse:1586.103599
[99] train-rmse:1581.036361
[100] train-rmse:1572.868570
[101] train-rmse:1559.990474
[102] train-rmse:1547.864086
[103] train-rmse:1537.398933
[104] train-rmse:1528.824548
[105] train-rmse:1523.308226
[106] train-rmse:1515.679599
[107] train-rmse:1504.320164
[108] train-rmse:1500.319942
[109] train-rmse:1494.758685
[110] train-rmse:1488.237479
[111] train-rmse:1472.578346
[112] train-rmse:1466.255644
[113] train-rmse:1458.498172
[114] train-rmse:1453.800531
Step 11: Checking accuracy on train data
fitted<-xgb.model.best%>%
predict(train.df2)
final.df2<-cbind(train.df2,y_train,fitted)
head(final.df2)
year Month_Nm Seasonality blackhawk cerrogord clinton dallas desmoines
[1,] 2020 6 0.21297876 1 0 0 0 0
[2,] 2019 10 0.05222602 0 0 0 0 0
[3,] 2017 7 1.00000000 0 0 0 0 0
[4,] 2019 2 -0.72872995 0 0 1 0 0
[5,] 2018 12 -0.63933027 0 0 0 0 0
[6,] 2018 3 -0.61333964 0 0 0 0 0
dickinson dubuque johnson lee linn marshall muscatine polk pottawatta
[1,] 0 0 0 0 0 0 0 0 0
[2,] 0 0 0 0 1 0 0 0 0
[3,] 1 0 0 0 0 0 0 0 0
[4,] 0 0 0 0 0 0 0 0 0
[5,] 0 0 0 0 0 0 0 0 0
[6,] 0 0 0 0 1 0 0 0 0
scott story wapello warren webster woodbury y_train fitted
[1,] 0 0 0 0 0 0 23488.39 22384.041
[2,] 0 0 0 0 0 0 40141.11 41048.602
[3,] 0 0 0 0 0 0 9014.65 8287.662
[4,] 0 0 0 0 0 0 5852.30 5302.017
[5,] 0 1 0 0 0 0 9530.67 10278.403
[6,] 0 0 0 0 0 0 31092.09 30795.586
Gathering the county columns back into a single County column and computing the accuracy (1 - mean absolute percentage error) for each county:
train.mape <- final.df2%>%
  as.data.frame()%>%
  ungroup()%>%
  mutate(Mape=(y_train - fitted)/y_train)%>%
  mutate(Mape_Prec=abs(Mape))%>%
  gather(key="County",value = "Values",blackhawk:scott)%>%
  filter(Values==1)%>%
  group_by(County)%>%
  summarise(Accuracy_Train=1 - mean(Mape_Prec))
head(train.mape)
# A tibble: 6 × 2
County Accuracy_Train
<chr> <dbl>
1 blackhawk 0.957
2 cerrogord 0.887
3 clinton 0.895
4 dallas 0.806
5 desmoines 0.905
6 dickinson 0.871
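For a single headline number across all counties, the same absolute percentage errors can simply be averaged over every row rather than per county (a quick sketch using the y_train and fitted objects from above):
# Overall training accuracy, averaging absolute percentage errors over all rows
overall_train_accuracy <- 1 - mean(abs((y_train - fitted)/y_train))
overall_train_accuracy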
Step 12: Checking accuracy on test data
fitted<-xgb.model.best%>%
predict(test.df2)
final.df2<-cbind(test.df2,y_test,fitted)
head(final.df2)
year Month_Nm Seasonality blackhawk cerrogord clinton dallas desmoines
[1,] 2017 1 -0.7251846 0 0 0 0 0
[2,] 2017 1 -0.3864800 0 0 1 0 0
[3,] 2017 1 -0.2766859 0 0 0 1 0
[4,] 2017 1 -0.4690164 0 0 0 0 0
[5,] 2017 1 0.2058128 0 0 0 0 0
[6,] 2017 1 -0.5343577 0 0 0 0 0
dickinson dubuque johnson lee linn marshall muscatine polk pottawatta
[1,] 0 0 0 0 0 1 0 0 0
[2,] 0 0 0 0 0 0 0 0 0
[3,] 0 0 0 0 0 0 0 0 0
[4,] 0 0 0 0 0 0 0 0 0
[5,] 0 0 0 0 0 0 0 0 0
[6,] 0 0 1 0 0 0 0 0 0
scott story wapello warren webster woodbury y_test fitted
[1,] 0 0 0 0 0 0 2655.64 3891.091
[2,] 0 0 0 0 0 0 4618.92 5409.865
[3,] 0 0 0 0 0 0 4842.17 6098.023
[4,] 0 0 0 0 0 1 7790.48 9749.588
[5,] 0 1 0 0 0 0 9406.41 12686.065
[6,] 0 0 0 0 0 0 20782.37 21141.416
# Mape on test data
test.mape <- final.df2%>%
  as.data.frame()%>%
  ungroup()%>%
  mutate(Mape=(y_test - fitted)/y_test)%>%
  mutate(Mape_Prec=abs(Mape))%>%
  gather(key="County",value = "Values",blackhawk:scott)%>%
  filter(Values==1)%>%
  group_by(County)%>%
  summarise(Accuracy_Test=1 - mean(Mape_Prec))
head(test.mape)
# A tibble: 6 × 2
County Accuracy_Test
<chr> <dbl>
1 blackhawk 0.925
2 cerrogord 0.891
3 clinton 0.882
4 dallas 0.656
5 desmoines 0.890
6 dickinson 0.869
Step 13: Comparing train and test accuracy
# Combining Both
final.df<-train.mape%>%
left_join(test.mape,"County")%>%
mutate(Diff=round(Accuracy_Train - Accuracy_Test , 2))%>%
arrange(desc(Diff))
final.df
# A tibble: 15 × 4
County Accuracy_Train Accuracy_Test Diff
<chr> <dbl> <dbl> <dbl>
1 dallas 0.806 0.656 0.15
2 marshall 0.830 0.687 0.14
3 scott 0.944 0.862 0.08
4 johnson 0.933 0.858 0.07
5 linn 0.958 0.898 0.06
6 polk 0.979 0.928 0.05
7 pottawatta 0.944 0.889 0.05
8 blackhawk 0.957 0.925 0.03
9 dubuque 0.928 0.902 0.03
10 lee 0.841 0.824 0.02
11 clinton 0.895 0.882 0.01
12 desmoines 0.905 0.890 0.01
13 muscatine 0.868 0.854 0.01
14 cerrogord 0.887 0.891 0
15 dickinson 0.871 0.869 0