Predicting Organic Matter (MO) with H2O Models from IRDA Data
By Johanie Fournier, agr. in rstats tidyverse
November 1, 2023
H2O is a powerful tool for getting a general idea of how different models perform. Unfortunately, it is updated constantly, which makes H2O-based code hard to maintain in a production setting.
So I mainly use it to find out which models perform best with my data. This saves me a considerable amount of time. From there, I can work on the few best models and write solid code for production.
Let’s see if I can make a template for these H2O tools that will stand the test of time (and constant updates 🤣)!
Get the data
Let’s look at the data for Chaudière-Appalaches.
# Read the prepared dataset from a versioned pin board
board_prepared <- pins::board_folder(path_to_file, versioned = TRUE)
data <- board_prepared %>%
  pins::pin_read("TS_chaudiere_appalaches") %>%
  select(pct_mo, sable_tf, sable, argile, limon, geometry) %>%
  sf::st_as_sf(.) %>%
  # Reduce each geometry to its centroid and keep the coordinates as predictors
  sf::st_centroid() %>%
  mutate(
    longitude = sf::st_coordinates(.)[, 1],
    latitude = sf::st_coordinates(.)[, 2]
  ) %>%
  as.data.frame() %>%
  select(-geometry) %>%
  mutate_all(as.numeric) %>%
  drop_na()
Explore the data
- Organic Matter distribution
data %>%
  select(pct_mo) %>%
  my_num_dist()
# Keep only observations with pct_mo strictly between 0 and 10
data <- data %>%
  filter(pct_mo > 0 & pct_mo < 10)
data %>%
  select(pct_mo) %>%
  my_num_dist()
Let’s see how everything is correlated
data %>%
my_corr_num_graph(data)
## NULL
Build the model
# Parallel processing
doFuture::registerDoFuture()
n_cores <- parallel::detectCores()
future::plan(
  strategy = future::cluster,
  workers = parallel::makeCluster(n_cores)
)
Recipe
# Define recipe
recipe_spec <- recipe(pct_mo ~ ., data = data) %>%
  step_normalize(all_predictors())
# Estimate the recipe and extract the processed training data
tbl_prep <- recipe_spec %>%
  prep() %>%
  juice()
head(tbl_prep)
## # A tibble: 6 × 7
##   sable_tf   sable argile  limon longitude latitude pct_mo
##      <dbl>   <dbl>  <dbl>  <dbl>     <dbl>    <dbl>  <dbl>
## 1   -1.67  -0.0817 -0.458  0.419    -1.02    -0.359   2.38
## 2   -1.67  -0.0817 -0.458  0.419    -1.00    -0.357   2.38
## 3   -1.67  -0.0817 -0.458  0.419    -0.958   -0.359   2.38
## 4   -1.67  -0.0817 -0.458  0.419    -0.950   -0.370   2.38
## 5   -0.930 -0.762   1.26   0.268    -1.23    -0.358   9.44
## 6   -0.172  0.677  -0.515 -0.638    -1.23    -0.356   9.11
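To make the effect of step_normalize() concrete, here is a minimal base-R sketch of the transformation applied to each numeric predictor: subtract the column mean, then divide by the column standard deviation. The toy data frame and the normalize_column() helper are hypothetical and only illustrate the idea; the actual recipe estimates the means and standard deviations in prep() and reuses them whenever the recipe is applied.

```r
# Hypothetical toy predictors (not the IRDA data)
toy <- data.frame(
  sable  = c(20, 35, 50, 65),
  argile = c(10, 15, 30, 45)
)

# Hypothetical helper: center on the mean, scale by the standard deviation
normalize_column <- function(x) (x - mean(x)) / sd(x)

toy_norm <- as.data.frame(lapply(toy, normalize_column))

# Each normalized column now has mean 0 and standard deviation 1
round(colMeans(toy_norm), 10)
sapply(toy_norm, sd)
```

This is why the predictors in the tibble above are centered around zero while pct_mo, the outcome, keeps its original scale.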
CorrelationFunnel
var_select <- tbl_prep %>%
  select(pct_mo) %>%
  correlationfunnel::binarize() %>%
  select(starts_with("pct_mo") & ends_with("_Inf")) %>%
  names()
tbl_prep %>%
  correlationfunnel::binarize() %>%
  correlationfunnel::correlate(var_select) %>%
  correlationfunnel::plot_correlation_funnel() +
  labs(title = "IRDA pct_mo - Correlation Funnel")
H2O Models
library(h2o)
# START H2O CLUSTER
h2o.init()
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 23 hours 55 minutes
## H2O cluster timezone: America/Toronto
## H2O data parsing timezone: UTC
## H2O cluster version: 3.34.0.3
## H2O cluster version age: 2 years and 25 days !!!
## H2O cluster name: H2O_started_from_R_johaniefournier_xex156
## H2O cluster total nodes: 1
## H2O cluster total memory: 2.98 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 4.1.1 (2021-08-10)
# h2o.shutdown(prompt = TRUE)
# H2O DATA PREP
train <- as.h2o(tbl_prep)
# Identify the response column
y <- "pct_mo"
# Identify the predictor columns (every column except the response)
x <- setdiff(names(train), c(y))
# H2O AutoML Training
aml <- h2o.automl(
  y = y,
  x = x,
  training_frame = train,
  project_name = paste0("H20", y),
  max_runtime_secs = 1800,
  max_models = 10,
  exclude_algos = c("DeepLearning"),
  seed = 123
)
## 12:05:29.372: New models will be added to existing leaderboard H20pct_mo@@pct_mo (leaderboard frame=null) with already 21 models.
## 12:06:06.208: StackedEnsemble_BestOfFamily_7_AutoML_6_20231101_120529 [StackedEnsemble best_of_family_1 (built with AUTO metalearner, using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . . Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:06:40.437: StackedEnsemble_BestOfFamily_8_AutoML_6_20231101_120529 [StackedEnsemble best_of_family_2 (built with AUTO metalearner, using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . . Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:06:41.447: StackedEnsemble_AllModels_6_AutoML_6_20231101_120529 [StackedEnsemble all_2 (built with AUTO metalearner, using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . . Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:06:59.560: StackedEnsemble_BestOfFamily_9_AutoML_6_20231101_120529 [StackedEnsemble best_of_family_3 (built with AUTO metalearner, using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . . Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:00.574: StackedEnsemble_AllModels_7_AutoML_6_20231101_120529 [StackedEnsemble all_3 (built with AUTO metalearner, using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . . Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:01.586: StackedEnsemble_BestOfFamily_10_AutoML_6_20231101_120529 [StackedEnsemble best_of_family_4 (built with AUTO metalearner, using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . . Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:02.602: StackedEnsemble_AllModels_8_AutoML_6_20231101_120529 [StackedEnsemble all_4 (built with AUTO metalearner, using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . . Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:03.614: StackedEnsemble_BestOfFamily_11_AutoML_6_20231101_120529 [StackedEnsemble best_of_family_5 (built with AUTO metalearner, using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . . Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:04.634: StackedEnsemble_AllModels_9_AutoML_6_20231101_120529 [StackedEnsemble all_5 (built with AUTO metalearner, using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . . Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:05.651: StackedEnsemble_BestOfFamily_12_AutoML_6_20231101_120529 [StackedEnsemble best_of_family_xgboost (built with xgboost metalearner, using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . . Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:06.671: StackedEnsemble_BestOfFamily_13_AutoML_6_20231101_120529 [StackedEnsemble best_of_family_gbm (built with gbm metalearner, using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . . Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:07.678: StackedEnsemble_AllModels_10_AutoML_6_20231101_120529 [StackedEnsemble all_xgboost (built with xgboost metalearner, using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . . Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:08.688: StackedEnsemble_AllModels_11_AutoML_6_20231101_120529 [StackedEnsemble all_gbm (built with gbm metalearner, using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . . Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:09.710: StackedEnsemble_BestOfFamily_14_AutoML_6_20231101_120529 [StackedEnsemble best_of_family_xglm (built with AUTO metalearner, using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . . Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:10.721: StackedEnsemble_AllModels_12_AutoML_6_20231101_120529 [StackedEnsemble all_xglm (built with AUTO metalearner, using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . . Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:11.737: StackedEnsemble_BestOfFamily_15_AutoML_6_20231101_120529 [StackedEnsemble best_of_family (built with AUTO metalearner, using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . . Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:12.746: StackedEnsemble_Best1000_1_AutoML_6_20231101_120529 [StackedEnsemble best_N (built with AUTO metalearner, using best 1000 non-SE models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . . Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
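The first log line says the new models were added to an existing leaderboard that already held 21 models, and every stacked ensemble then failed because the cross-validation prediction frames could not be found. A plausible explanation (an assumption, not something verified here) is that the same project_name was reused on a cluster that had been running for almost a day. One way to sidestep this in a reusable template is to give each run its own project name. The make_project_name() helper below is hypothetical and uses only base R.

```r
# Response column, as in the AutoML call above
y <- "pct_mo"

# Hypothetical helper: append a timestamp so that every run gets a
# fresh project name, and therefore a fresh leaderboard
make_project_name <- function(response, time = Sys.time()) {
  paste0("H2O_", response, "_", format(time, "%Y%m%d_%H%M%S"))
}

project_name <- make_project_name(y)
project_name
```

The result could then be passed as project_name = project_name in h2o.automl(). Clearing the cluster with h2o.removeAll() before training is another option when the earlier models are no longer needed.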
# H2O AutoML Leaderboard
aml@leaderboard %>%
  as.data.frame() %>%
  mutate_if(is.numeric, round, digits = 3) %>%
  DT::datatable(options = list(pageLength = 16))
Look at the results
All 10 models
model_ids <- as_tibble(aml@leaderboard$model_id)[, 1] %>%
  mutate(n = row_number())
model_ident <- model_ids %>%
  filter(model_id %in% "StackedEnsemble_AllModels_4_AutoML_5_20231031_160025")
model_all <- h2o.getModel(model_ident$model_id)
model_all
## Model Details:
## ==============
##
## H2ORegressionModel: stackedensemble
## Model ID: StackedEnsemble_AllModels_4_AutoML_5_20231031_160025
## Number of Base Models: 10
##
## Base Models (count by algorithm type):
##
## drf gbm glm xgboost
##   2   4   1       3
##
## Metalearner:
##
## Metalearner algorithm: gbm
## Metalearner cross-validation fold assignment:
## Fold assignment scheme: AUTO
## Number of folds: 5
## Fold column: NULL
## Metalearner hyperparameters:
##
##
## H2ORegressionMetrics: stackedensemble
## ** Reported on training data. **
##
## MSE: 0.01119969
## RMSE: 0.1058286
## MAE: 0.04247019
## RMSLE: 0.01331242
## Mean Residual Deviance : 0.01119969
##
##
##
## H2ORegressionMetrics: stackedensemble
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.0525015
## RMSE: 0.229132
## MAE: 0.07341821
## RMSLE: 0.02842482
## Mean Residual Deviance : 0.0525015
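The model ID used above embeds the AutoML run number and a timestamp, so it changes with every run and has to be edited by hand. As a sketch of a more durable lookup, the IDs can be matched by prefix instead. The example below is base R only: leaderboard_ids is a stand-in vector built from IDs shown in this post (its order is illustrative), and first_model_id() is a hypothetical helper.

```r
# Stand-in for the leaderboard's model_id column, best model first.
# In the post, this would come from as.data.frame(aml@leaderboard)$model_id
leaderboard_ids <- c(
  "StackedEnsemble_AllModels_4_AutoML_5_20231031_160025",
  "StackedEnsemble_BestOfFamily_5_AutoML_5_20231031_160025",
  "GBM_4_AutoML_5_20231031_160025"
)

# Hypothetical helper: return the highest-ranked ID that starts with a prefix
first_model_id <- function(ids, prefix) {
  hits <- ids[startsWith(ids, prefix)]
  if (length(hits) == 0) {
    stop("No model ID starts with '", prefix, "'")
  }
  hits[[1]]
}

first_model_id(leaderboard_ids, "StackedEnsemble_AllModels")
first_model_id(leaderboard_ids, "GBM")
```

The returned ID can be handed to h2o.getModel() exactly as in the code above, without touching the script after each new AutoML run.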
model_ident <- model_ids %>%
  filter(model_id %in% "StackedEnsemble_BestOfFamily_5_AutoML_5_20231031_160025")
model_best_of <- h2o.getModel(model_ident$model_id)
model_best_of
## Model Details:
## ==============
##
## H2ORegressionModel: stackedensemble
## Model ID: StackedEnsemble_BestOfFamily_5_AutoML_5_20231031_160025
## Number of Base Models: 5
##
## Base Models (count by algorithm type):
##
## drf gbm glm xgboost
##   2   1   1       1
##
## Metalearner:
##
## Metalearner algorithm: gbm
## Metalearner cross-validation fold assignment:
## Fold assignment scheme: AUTO
## Number of folds: 5
## Fold column: NULL
## Metalearner hyperparameters:
##
##
## H2ORegressionMetrics: stackedensemble
## ** Reported on training data. **
##
## MSE: 0.01249409
## RMSE: 0.1117769
## MAE: 0.04235414
## RMSLE: 0.01408869
## Mean Residual Deviance : 0.01249409
##
##
##
## H2ORegressionMetrics: stackedensemble
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.05346869
## RMSE: 0.231233
## MAE: 0.0735333
## RMSLE: 0.02866697
## Mean Residual Deviance : 0.05346869
model_ident <- model_ids %>%
  filter(model_id %in% "GBM_4_AutoML_5_20231031_160025")
model_gbm <- h2o.getModel(model_ident$model_id)
model_gbm
## Model Details:
## ==============
##
## H2ORegressionModel: gbm
## Model ID: GBM_4_AutoML_5_20231031_160025
## Model Summary:
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1             296                      296              550589        10
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1        10   10.00000         17        392   143.48311
##
##
## H2ORegressionMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.02034233
## RMSE: 0.1426266
## MAE: 0.05560683
## RMSLE: 0.01756616
## Mean Residual Deviance : 0.02034233
##
##
##
## H2ORegressionMetrics: gbm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.0562326
## RMSE: 0.2371341
## MAE: 0.08748792
## RMSLE: 0.03014461
## Mean Residual Deviance : 0.0562326
##
##
## Cross-Validation Metrics Summary:
##                            mean       sd cv_1_valid cv_2_valid cv_3_valid
## mae                    0.087488 0.002646   0.087158   0.089488   0.083020
## mean_residual_deviance 0.056232 0.007035   0.052462   0.063129   0.045919
## mse                    0.056232 0.007035   0.052462   0.063129   0.045919
## r2                     0.987552 0.001681   0.988628   0.985815   0.989767
## residual_deviance      0.056232 0.007035   0.052462   0.063129   0.045919
## rmse                   0.236748 0.015113   0.229045   0.251256   0.214288
## rmsle                  0.030106 0.001708   0.029139   0.031774   0.027593
##                        cv_4_valid cv_5_valid
## mae                      0.088955   0.088818
## mean_residual_deviance   0.058409   0.061242
## mse                      0.058409   0.061242
## r2                       0.987488   0.986064
## residual_deviance        0.058409   0.061242
## rmse                     0.241681   0.247472
## rmsle                    0.031009   0.031014
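The GBM output reports two slightly different cross-validated RMSE values: 0.2371 in the combined holdout metrics and 0.2367 in the mean column of the summary table. A quick base-R check, using only numbers copied from the output above, suggests where the gap comes from: the first is the square root of the overall MSE, while the second is the average of the five per-fold RMSEs.

```r
# Values copied from the GBM output above
cv_mse    <- 0.0562326
fold_rmse <- c(0.229045, 0.251256, 0.214288, 0.241681, 0.247472)

rmse_combined  <- sqrt(cv_mse)      # RMSE of the combined holdout predictions
rmse_fold_mean <- mean(fold_rmse)   # mean of the five per-fold RMSEs

round(c(combined = rmse_combined, fold_mean = rmse_fold_mean), 4)
```

Because the square root is not linear, the two figures need not match exactly. Either one is fine for ranking models as long as the same one is used throughout.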
h2o::h2o.varimp_plot(model_gbm)
# End parallel processing
future::plan(future::sequential)
Session Info
git2r::repository()
## Local: main /Users/johaniefournier/Library/Mobile Documents/com~apple~CloudDocs/ADV/johaniefournier.com
## Remote: main @ origin (https://github.com/Jofou/johaniefournier.com.git)
## Head: [7187110] 2023-11-01: update cheat sheet and H2O
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] h2o_3.34.0.3 embed_0.1.5 jofou.lib_0.0.0.9000
## [4] yardstick_1.2.0 workflowsets_0.1.0 workflows_0.2.4
## [7] tune_0.1.6 rsample_0.1.0 recipes_0.1.17
## [10] parsnip_0.1.7 modeldata_0.1.1 infer_1.0.0
## [13] dials_0.0.10 scales_1.2.1 broom_1.0.4
## [16] tidymodels_0.1.4 lubridate_1.9.2 forcats_1.0.0
## [19] stringr_1.5.0 dplyr_1.1.2 purrr_1.0.1
## [22] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
## [25] ggplot2_3.4.2 tidyverse_2.0.0 knitr_1.36
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-2 class_7.3-19 base64enc_0.1-3
## [4] rstudioapi_0.14 listenv_0.8.0 furrr_0.2.3
## [7] prodlim_2019.11.13 fansi_0.5.0 R.methodsS3_1.8.1
## [10] codetools_0.2-18 splines_4.1.1 zeallot_0.1.0
## [13] jsonlite_1.8.4 R.oo_1.24.0 png_0.1-7
## [16] tfruns_1.5.0 uwot_0.1.10 compiler_4.1.1
## [19] backports_1.4.1 assertthat_0.2.1 Matrix_1.3-4
## [22] fastmap_1.1.0 cli_3.6.1 htmltools_0.5.2
## [25] tools_4.1.1 gtable_0.3.0 glue_1.6.2
## [28] Rcpp_1.0.7 jquerylib_0.1.4 styler_1.7.0
## [31] DiceDesign_1.9 vctrs_0.6.2 blogdown_1.9.4
## [34] iterators_1.0.13 timeDate_3043.102 gower_0.2.2
## [37] xfun_0.30 globals_0.14.0 timechange_0.1.1
## [40] lifecycle_1.0.3 future_1.22.1 MASS_7.3-54
## [43] ipred_0.9-12 hms_1.1.3 parallel_4.1.1
## [46] yaml_2.2.1 reticulate_1.22-9000 keras_2.7.0
## [49] sass_0.4.0 rpart_4.1-15 stringi_1.7.5
## [52] tensorflow_2.7.0 foreach_1.5.1 lhs_1.1.3
## [55] hardhat_1.3.0 lava_1.6.10 rlang_1.1.1
## [58] pkgconfig_2.0.3 bitops_1.0-7 evaluate_0.14
## [61] lattice_0.20-44 tidyselect_1.2.0 parallelly_1.28.1
## [64] magrittr_2.0.3 bookdown_0.24 R6_2.5.1
## [67] generics_0.1.3 pillar_1.9.0 whisker_0.4
## [70] withr_2.5.0 survival_3.2-11 RCurl_1.98-1.5
## [73] nnet_7.3-16 future.apply_1.8.1 crayon_1.4.2
## [76] utf8_1.2.2 tzdb_0.1.2 rmarkdown_2.11
## [79] emo_0.0.0.9000 grid_4.1.1 git2r_0.28.0
## [82] digest_0.6.29 R.cache_0.15.0 R.utils_2.11.0
## [85] GPfit_1.0-8 munsell_0.5.0 bslib_0.3.1