Predicting Organic Matter (MO) with H2O Models from IRDA Data

By Johanie Fournier, agr. in rstats tidyverse

November 1, 2023

H2O is a powerful tool for getting a quick overview of how different models perform. Unfortunately, it is updated constantly, which makes H2O-based code hard to maintain in a production setting.

So I mainly use it to find out which models perform best on my data, which saves me a considerable amount of time. From there, I can focus on the few best models and write solid code for production.

Let’s see if I can build an H2O template that will stand the test of time (and constant updates 🤣)!

Get the data

Let’s look at the data for the Chaudière-Appalaches region.

board_prepared <- pins::board_folder(path_to_file, versioned = TRUE)

data <- board_prepared %>%
  pins::pin_read("TS_chaudiere_appalaches") %>%
  select(pct_mo, sable_tf, sable, argile, limon, geometry) %>%
  sf::st_as_sf(.) %>%
  sf::st_centroid() %>%
  mutate(
    longitude = sf::st_coordinates(.)[, 1],
    latitude = sf::st_coordinates(.)[, 2]
  ) %>%
  as.data.frame() %>%
  select(-geometry) %>%
  mutate_all(as.numeric) %>%
  drop_na()

Explore the data

  • Organic Matter distribution
data %>%
  select(pct_mo) %>%
  my_num_dist()
data <- data %>%
  filter(pct_mo > 0 & pct_mo < 10)
data %>%
  select(pct_mo) %>%
  my_num_dist()
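
my_num_dist() is not a CRAN function (it presumably lives in the jofou.lib package listed in the session info), so the chunk above will not run as-is elsewhere. A rough base R stand-in, assuming a numeric summary and histogram counts are enough, could look like the sketch below. The num_dist_summary() helper and the toy values are hypothetical.

```r
# Hypothetical stand-in for my_num_dist(): numeric summary plus histogram counts
num_dist_summary <- function(x, breaks = 10) {
  x <- x[!is.na(x)]
  h <- hist(x, breaks = breaks, plot = FALSE)
  list(
    n = length(x),
    quantiles = quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1)),
    bins = data.frame(
      lower = head(h$breaks, -1),
      upper = tail(h$breaks, -1),
      count = h$counts
    )
  )
}

# Made-up organic matter percentages, including values the filter drops
toy_mo <- c(0, 1.8, 2.4, 3.1, 3.3, 4.7, 5.2, 9.4, 12.6, 35.0)
kept <- toy_mo[toy_mo > 0 & toy_mo < 10]  # same bounds as the filter above

before <- num_dist_summary(toy_mo)
after <- num_dist_summary(kept)
after$quantiles
```

Comparing before$n and after$n shows how many rows the 0 to 10 % bounds remove.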

Let’s see how everything is correlated

data %>%
  my_corr_num_graph(data)
## NULL
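
my_corr_num_graph() is another personal helper. For readers without it, a minimal base R sketch of the same idea (rank correlations of every numeric column with the response) is shown here. The corr_with_response() name, the choice of Spearman correlation, and the simulated data are all assumptions, not the helper's actual behaviour.

```r
# Hypothetical stand-in: Spearman correlation of each numeric column with the response
corr_with_response <- function(df, response) {
  num <- df[vapply(df, is.numeric, logical(1))]
  cors <- cor(num, method = "spearman", use = "pairwise.complete.obs")
  out <- cors[, response]
  sort(out[names(out) != response], decreasing = TRUE)
}

# Simulated soil data with an invented relationship between clay and organic matter
set.seed(42)
toy <- data.frame(argile = runif(100, 5, 60))
toy$sable <- 100 - toy$argile - runif(100, 0, 30)
toy$pct_mo <- 1 + 0.08 * toy$argile + rnorm(100, sd = 0.3)

res <- corr_with_response(toy, "pct_mo")
res
```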

Build the model

# Parallel processing
doFuture::registerDoFuture()
n_cores <- parallel::detectCores()
future::plan(
  strategy = future::cluster,
  workers  = parallel::makeCluster(n_cores)
)
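
One thing this setup leaves implicit is shutting the workers down. When the cluster is created by hand with parallel::makeCluster(), stopping it is, as far as I know, the caller's job, and switching the plan back to sequential does not necessarily do it. A minimal sketch using only the parallel package, with a hypothetical worker count and a toy task:

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1
n_detected <- detectCores()
if (is.na(n_detected)) n_detected <- 1L

# Assumption: leave one core free and cap at 2 workers for this toy example
n_workers <- max(1L, min(2L, n_detected - 1L))

cl <- makeCluster(n_workers)  # keep a handle so the cluster can be stopped later

squares <- tryCatch(
  unlist(parLapply(cl, 1:4, function(i) i^2)),
  finally = stopCluster(cl)   # always release the workers, even on error
)
squares
```

In the setup above, the equivalent would be to store the result of parallel::makeCluster(n_cores) in a variable, pass that variable as workers, and call parallel::stopCluster() on it once the work is done.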

Recipe

# Define recipe
recipe_spec <- recipe(pct_mo ~ ., data = data) %>%
  step_normalize(all_predictors())

tbl_prep <- recipe_spec %>%
  prep() %>%
  juice()

head(tbl_prep)
## # A tibble: 6 × 7
##   sable_tf   sable argile  limon longitude latitude pct_mo
##      <dbl>   <dbl>  <dbl>  <dbl>     <dbl>    <dbl>  <dbl>
## 1   -1.67  -0.0817 -0.458  0.419    -1.02    -0.359   2.38
## 2   -1.67  -0.0817 -0.458  0.419    -1.00    -0.357   2.38
## 3   -1.67  -0.0817 -0.458  0.419    -0.958   -0.359   2.38
## 4   -1.67  -0.0817 -0.458  0.419    -0.950   -0.370   2.38
## 5   -0.930 -0.762   1.26   0.268    -1.23    -0.358   9.44
## 6   -0.172  0.677  -0.515 -0.638    -1.23    -0.356   9.11
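
step_normalize() centers each predictor on its mean and divides by its standard deviation, both estimated when the recipe is prepped. The same arithmetic in base R, on made-up values (the toy data frame and the normalize() helper are for illustration only):

```r
# Center on the mean and scale by the standard deviation
normalize <- function(x) (x - mean(x)) / sd(x)

# Made-up texture percentages
toy <- data.frame(
  argile = c(12, 25, 31, 18, 44),
  limon  = c(40, 35, 28, 50, 22)
)

toy_norm <- as.data.frame(lapply(toy, normalize))

round(colMeans(toy_norm), 10)  # means are (numerically) zero
sapply(toy_norm, sd)           # standard deviations are one
```

Because the recipe selects all_predictors(), pct_mo itself stays on its original scale, as the printed tibble shows.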

CorrelationFunnel

var_select <- tbl_prep %>%
  select(pct_mo) %>%
  correlationfunnel::binarize() %>%
  select(starts_with("pct_mo") & ends_with("_Inf")) %>%
  names()

tbl_prep %>%
  correlationfunnel::binarize() %>%
  correlationfunnel::correlate(var_select) %>%
  correlationfunnel::plot_correlation_funnel() +
  labs(title = "IRDA pct_mo - Correlation Funnel")
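
As I understand it, binarize() cuts each numeric column into quantile bins and turns every bin into a 0/1 column, and correlate() then correlates each of those columns with the chosen target column. Here the target is the top pct_mo bin (the column whose name ends in _Inf). A rough base R imitation of that idea for a single predictor, using simulated data and a hypothetical top_bin() helper rather than the package's actual implementation:

```r
# Simulated data with an invented positive relationship
set.seed(1)
toy <- data.frame(pct_mo = runif(200, 1, 9))
toy$argile <- 5 * toy$pct_mo + rnorm(200, sd = 5)

# 1 if the value falls in the top quartile, 0 otherwise
top_bin <- function(x) as.integer(x > quantile(x, 0.75))

# Correlation between "high clay" and "high organic matter" indicators
funnel_cor <- cor(top_bin(toy$argile), top_bin(toy$pct_mo))
funnel_cor
```

In the funnel plot, each point is one such bin-to-target correlation, so the features whose bins sit furthest from zero are the ones most associated with high organic matter.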

H2O Models

library(h2o)

# START H2O CLUSTER
h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         23 hours 55 minutes 
##     H2O cluster timezone:       America/Toronto 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.34.0.3 
##     H2O cluster version age:    2 years and 25 days !!! 
##     H2O cluster name:           H2O_started_from_R_johaniefournier_xex156 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   2.98 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 4.1.1 (2021-08-10)
# h2o.shutdown(prompt = TRUE)

# H2O DATA PREP
train <- as.h2o(tbl_prep)
# Identify the response column
y <- "pct_mo"

# Identify the predictor columns (every column except the response)
x <- setdiff(names(train), y)

# H2O AutoML Training
aml <- h2o.automl(
  y = y,
  x = x,
  training_frame = train,
  project_name = paste0("H20", y),
  max_runtime_secs = 1800,
  max_models = 10,
  exclude_algos = c("DeepLearning"),
  seed = 123
)
## 12:05:29.372: New models will be added to existing leaderboard H20pct_mo@@pct_mo (leaderboard frame=null) with already 21 models.
## 12:06:06.208: StackedEnsemble_BestOfFamily_7_AutoML_6_20231101_120529 [StackedEnsemble best_of_family_1 (built with AUTO metalearner, using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:06:40.437: StackedEnsemble_BestOfFamily_8_AutoML_6_20231101_120529 [StackedEnsemble best_of_family_2 (built with AUTO metalearner, using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:06:41.447: StackedEnsemble_AllModels_6_AutoML_6_20231101_120529 [StackedEnsemble all_2 (built with AUTO metalearner, using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:06:59.560: StackedEnsemble_BestOfFamily_9_AutoML_6_20231101_120529 [StackedEnsemble best_of_family_3 (built with AUTO metalearner, using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:00.574: StackedEnsemble_AllModels_7_AutoML_6_20231101_120529 [StackedEnsemble all_3 (built with AUTO metalearner, using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:01.586: StackedEnsemble_BestOfFamily_10_AutoML_6_20231101_120529 [StackedEnsemble best_of_family_4 (built with AUTO metalearner, using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:02.602: StackedEnsemble_AllModels_8_AutoML_6_20231101_120529 [StackedEnsemble all_4 (built with AUTO metalearner, using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:03.614: StackedEnsemble_BestOfFamily_11_AutoML_6_20231101_120529 [StackedEnsemble best_of_family_5 (built with AUTO metalearner, using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:04.634: StackedEnsemble_AllModels_9_AutoML_6_20231101_120529 [StackedEnsemble all_5 (built with AUTO metalearner, using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:05.651: StackedEnsemble_BestOfFamily_12_AutoML_6_20231101_120529 [StackedEnsemble best_of_family_xgboost (built with xgboost metalearner, using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:06.671: StackedEnsemble_BestOfFamily_13_AutoML_6_20231101_120529 [StackedEnsemble best_of_family_gbm (built with gbm metalearner, using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:07.678: StackedEnsemble_AllModels_10_AutoML_6_20231101_120529 [StackedEnsemble all_xgboost (built with xgboost metalearner, using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:08.688: StackedEnsemble_AllModels_11_AutoML_6_20231101_120529 [StackedEnsemble all_gbm (built with gbm metalearner, using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:09.710: StackedEnsemble_BestOfFamily_14_AutoML_6_20231101_120529 [StackedEnsemble best_of_family_xglm (built with AUTO metalearner, using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:10.721: StackedEnsemble_AllModels_12_AutoML_6_20231101_120529 [StackedEnsemble all_xglm (built with AUTO metalearner, using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:11.737: StackedEnsemble_BestOfFamily_15_AutoML_6_20231101_120529 [StackedEnsemble best_of_family (built with AUTO metalearner, using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:07:12.746: StackedEnsemble_Best1000_1_AutoML_6_20231101_120529 [StackedEnsemble best_N (built with AUTO metalearner, using best 1000 non-SE models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
# H2O AutoML Leaderboard
aml@leaderboard %>%
  as.data.frame() %>%
  mutate_if(is.numeric, round, digits = 3) %>%
  DT::datatable(options = list(pageLength = 16))
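
The first log line of the AutoML run above says the new models were added to an existing leaderboard that already held 21 models. H2O keeps one leaderboard per project_name, so rerunning with the same name accumulates models across runs. That carry-over may also be behind the repeated StackedEnsemble failures, although I have not verified it. One way to start from a clean leaderboard, sketched here with a hypothetical make_project_name() helper, is to make the name unique per run:

```r
# Hypothetical helper: project name made unique with a timestamp
make_project_name <- function(y, prefix = "H2O") {
  paste(prefix, y, format(Sys.time(), "%Y%m%d_%H%M%S"), sep = "_")
}

project <- make_project_name("pct_mo")
project
```

Passing project as project_name in h2o.automl() should then give each run its own leaderboard. Calling h2o.removeAll() before training is another option, at the cost of wiping every object on the cluster.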

Look at the results

All 10 models

model_ids <- as_tibble(aml@leaderboard$model_id)[, 1] %>%
  mutate(n = row_number())

model_ident <- model_ids %>%
  filter(model_id %in% "StackedEnsemble_AllModels_4_AutoML_5_20231031_160025")

model_all <- h2o.getModel(model_ident$model_id)
model_all
## Model Details:
## ==============
## 
## H2ORegressionModel: stackedensemble
## Model ID:  StackedEnsemble_AllModels_4_AutoML_5_20231031_160025 
## Number of Base Models: 10
## 
## Base Models (count by algorithm type):
## 
##     drf     gbm     glm xgboost 
##       2       4       1       3 
## 
## Metalearner:
## 
## Metalearner algorithm: gbm
## Metalearner cross-validation fold assignment:
##   Fold assignment scheme: AUTO
##   Number of folds: 5
##   Fold column: NULL
## Metalearner hyperparameters: 
## 
## 
## H2ORegressionMetrics: stackedensemble
## ** Reported on training data. **
## 
## MSE:  0.01119969
## RMSE:  0.1058286
## MAE:  0.04247019
## RMSLE:  0.01331242
## Mean Residual Deviance :  0.01119969
## 
## 
## 
## H2ORegressionMetrics: stackedensemble
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.0525015
## RMSE:  0.229132
## MAE:  0.07341821
## RMSLE:  0.02842482
## Mean Residual Deviance :  0.0525015
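
The model IDs filtered on here are hard-coded and carry the run number and timestamp of one specific AutoML run (AutoML_5, 20231031), so they stop matching as soon as the models are retrained under new IDs. A sketch of a lookup by name pattern instead, assuming the IDs are in leaderboard order (best first). first_match() is a hypothetical helper, and the vector simply reuses the three IDs from this post:

```r
ids <- c(
  "StackedEnsemble_AllModels_4_AutoML_5_20231031_160025",
  "StackedEnsemble_BestOfFamily_5_AutoML_5_20231031_160025",
  "GBM_4_AutoML_5_20231031_160025"
)

# Return the first ID matching the pattern, or NA if nothing matches
first_match <- function(ids, pattern) {
  hits <- grep(pattern, ids, value = TRUE)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

best_all <- first_match(ids, "^StackedEnsemble_AllModels")
best_fam <- first_match(ids, "^StackedEnsemble_BestOfFamily")
best_gbm <- first_match(ids, "^GBM")
```

With a live cluster, ids would come from as.data.frame(aml@leaderboard)$model_id, and each result can be passed to h2o.getModel().
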
model_ident <- model_ids %>%
  filter(model_id %in% "StackedEnsemble_BestOfFamily_5_AutoML_5_20231031_160025")
model_best_of <- h2o.getModel(model_ident$model_id)
model_best_of
## Model Details:
## ==============
## 
## H2ORegressionModel: stackedensemble
## Model ID:  StackedEnsemble_BestOfFamily_5_AutoML_5_20231031_160025 
## Number of Base Models: 5
## 
## Base Models (count by algorithm type):
## 
##     drf     gbm     glm xgboost 
##       2       1       1       1 
## 
## Metalearner:
## 
## Metalearner algorithm: gbm
## Metalearner cross-validation fold assignment:
##   Fold assignment scheme: AUTO
##   Number of folds: 5
##   Fold column: NULL
## Metalearner hyperparameters: 
## 
## 
## H2ORegressionMetrics: stackedensemble
## ** Reported on training data. **
## 
## MSE:  0.01249409
## RMSE:  0.1117769
## MAE:  0.04235414
## RMSLE:  0.01408869
## Mean Residual Deviance :  0.01249409
## 
## 
## 
## H2ORegressionMetrics: stackedensemble
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.05346869
## RMSE:  0.231233
## MAE:  0.0735333
## RMSLE:  0.02866697
## Mean Residual Deviance :  0.05346869
model_ident <- model_ids %>%
  filter(model_id %in% "GBM_4_AutoML_5_20231031_160025")
model_gbm <- h2o.getModel(model_ident$model_id)
model_gbm
## Model Details:
## ==============
## 
## H2ORegressionModel: gbm
## Model ID:  GBM_4_AutoML_5_20231031_160025 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1             296                      296              550589        10
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1        10   10.00000         17        392   143.48311
## 
## 
## H2ORegressionMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  0.02034233
## RMSE:  0.1426266
## MAE:  0.05560683
## RMSLE:  0.01756616
## Mean Residual Deviance :  0.02034233
## 
## 
## 
## H2ORegressionMetrics: gbm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.0562326
## RMSE:  0.2371341
## MAE:  0.08748792
## RMSLE:  0.03014461
## Mean Residual Deviance :  0.0562326
## 
## 
## Cross-Validation Metrics Summary: 
##                            mean       sd cv_1_valid cv_2_valid cv_3_valid
## mae                    0.087488 0.002646   0.087158   0.089488   0.083020
## mean_residual_deviance 0.056232 0.007035   0.052462   0.063129   0.045919
## mse                    0.056232 0.007035   0.052462   0.063129   0.045919
## r2                     0.987552 0.001681   0.988628   0.985815   0.989767
## residual_deviance      0.056232 0.007035   0.052462   0.063129   0.045919
## rmse                   0.236748 0.015113   0.229045   0.251256   0.214288
## rmsle                  0.030106 0.001708   0.029139   0.031774   0.027593
##                        cv_4_valid cv_5_valid
## mae                      0.088955   0.088818
## mean_residual_deviance   0.058409   0.061242
## mse                      0.058409   0.061242
## r2                       0.987488   0.986064
## residual_deviance        0.058409   0.061242
## rmse                     0.241681   0.247472
## rmsle                    0.031009   0.031014
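
The summary table reports a mean RMSE of 0.236748, slightly below the 0.2371341 shown for the combined holdout predictions. The two are computed differently: the first averages the five per-fold RMSE values, while the second takes the square root of the error pooled over all holdout rows. The arithmetic can be checked with the fold values printed above (the pooled figure is approximated here by averaging the fold MSEs, which assumes folds of roughly equal size):

```r
# Per-fold values copied from the cross-validation summary above
fold_rmse <- c(0.229045, 0.251256, 0.214288, 0.241681, 0.247472)
fold_mse  <- c(0.052462, 0.063129, 0.045919, 0.058409, 0.061242)

mean_of_rmse <- mean(fold_rmse)       # what the summary table reports
rmse_of_mean <- sqrt(mean(fold_mse))  # close to the combined holdout RMSE

round(c(mean_of_rmse = mean_of_rmse, rmse_of_mean = rmse_of_mean), 4)
```

This gives about 0.2367 and 0.2371. The average of square roots is never larger than the square root of the average, which is why the table's mean sits a little lower.
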
h2o::h2o.varimp_plot(model_gbm)
# End parallel processing
future::plan(future::sequential)

Session Info

git2r::repository()
## Local:    main /Users/johaniefournier/Library/Mobile Documents/com~apple~CloudDocs/ADV/johaniefournier.com
## Remote:   main @ origin (https://github.com/Jofou/johaniefournier.com.git)
## Head:     [7187110] 2023-11-01: update cheat sheet and H2O
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] h2o_3.34.0.3         embed_0.1.5          jofou.lib_0.0.0.9000
##  [4] yardstick_1.2.0      workflowsets_0.1.0   workflows_0.2.4     
##  [7] tune_0.1.6           rsample_0.1.0        recipes_0.1.17      
## [10] parsnip_0.1.7        modeldata_0.1.1      infer_1.0.0         
## [13] dials_0.0.10         scales_1.2.1         broom_1.0.4         
## [16] tidymodels_0.1.4     lubridate_1.9.2      forcats_1.0.0       
## [19] stringr_1.5.0        dplyr_1.1.2          purrr_1.0.1         
## [22] readr_2.1.4          tidyr_1.3.0          tibble_3.2.1        
## [25] ggplot2_3.4.2        tidyverse_2.0.0      knitr_1.36          
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_2.0-2     class_7.3-19         base64enc_0.1-3     
##  [4] rstudioapi_0.14      listenv_0.8.0        furrr_0.2.3         
##  [7] prodlim_2019.11.13   fansi_0.5.0          R.methodsS3_1.8.1   
## [10] codetools_0.2-18     splines_4.1.1        zeallot_0.1.0       
## [13] jsonlite_1.8.4       R.oo_1.24.0          png_0.1-7           
## [16] tfruns_1.5.0         uwot_0.1.10          compiler_4.1.1      
## [19] backports_1.4.1      assertthat_0.2.1     Matrix_1.3-4        
## [22] fastmap_1.1.0        cli_3.6.1            htmltools_0.5.2     
## [25] tools_4.1.1          gtable_0.3.0         glue_1.6.2          
## [28] Rcpp_1.0.7           jquerylib_0.1.4      styler_1.7.0        
## [31] DiceDesign_1.9       vctrs_0.6.2          blogdown_1.9.4      
## [34] iterators_1.0.13     timeDate_3043.102    gower_0.2.2         
## [37] xfun_0.30            globals_0.14.0       timechange_0.1.1    
## [40] lifecycle_1.0.3      future_1.22.1        MASS_7.3-54         
## [43] ipred_0.9-12         hms_1.1.3            parallel_4.1.1      
## [46] yaml_2.2.1           reticulate_1.22-9000 keras_2.7.0         
## [49] sass_0.4.0           rpart_4.1-15         stringi_1.7.5       
## [52] tensorflow_2.7.0     foreach_1.5.1        lhs_1.1.3           
## [55] hardhat_1.3.0        lava_1.6.10          rlang_1.1.1         
## [58] pkgconfig_2.0.3      bitops_1.0-7         evaluate_0.14       
## [61] lattice_0.20-44      tidyselect_1.2.0     parallelly_1.28.1   
## [64] magrittr_2.0.3       bookdown_0.24        R6_2.5.1            
## [67] generics_0.1.3       pillar_1.9.0         whisker_0.4         
## [70] withr_2.5.0          survival_3.2-11      RCurl_1.98-1.5      
## [73] nnet_7.3-16          future.apply_1.8.1   crayon_1.4.2        
## [76] utf8_1.2.2           tzdb_0.1.2           rmarkdown_2.11      
## [79] emo_0.0.0.9000       grid_4.1.1           git2r_0.28.0        
## [82] digest_0.6.29        R.cache_0.15.0       R.utils_2.11.0      
## [85] GPfit_1.0-8          munsell_0.5.0        bslib_0.3.1