
kn_automl_h2o_regression_r

H2O.ai AutoML (wrapped with R) in KNIME for regression problems

H2O.ai AutoML (wrapped with R) in KNIME for regression problems - a powerful auto-machine-learning framework (https://hub.knime.com/mlauber71/spaces/Public/latest/automl/)
v 1.25

It features various models like Random Forest or XGBoost along with Deep Learning. It has wrappers for R and Python but can also be used from KNIME. The results are written to a folder, and the models are stored in MOJO format so they can be used in KNIME (as well as on a Big Data cluster via Sparkling Water). One major parameter to set is the running time AutoML is given to test various models and to do some hyperparameter optimization. The best model of each run is stored, and some graphics are produced to inspect the results.
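
Since the running time is passed in seconds, a small helper can make the budget easier to read. The following is a minimal sketch in base R; the function name runtime_secs is hypothetical and not part of the workflow, which simply reads the value from the flow variable v_runtime_automl.

```r
# Hypothetical helper (not part of the workflow): convert a time budget
# given in hours and minutes into the seconds expected by max_runtime_secs.
runtime_secs <- function(hours = 0, minutes = 0) {
  stopifnot(is.numeric(hours), is.numeric(minutes), hours >= 0, minutes >= 0)
  as.integer(round(hours * 3600 + minutes * 60))
}

runtime_secs(minutes = 5)   # 300
runtime_secs(hours = 4.5)   # 16200
```

The resulting number is what you would enter in the workflow's integer input for the maximum runtime.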

To run this workflow you have to install Python and H2O.ai as well as R and several packages. Please refer to the green box on the right.
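
Before starting the workflow it can help to check which R packages are still missing. This is a sketch under the assumption that the package list from the workflow's notes (h2o, arrow, ggplot2, lift, hexbin, scales) applies to your setup; the function name missing_packages is made up for illustration.

```r
# Packages named in the workflow's notes; adjust the vector to your setup.
needed <- c("h2o", "arrow", "ggplot2", "lift", "hexbin", "scales")

# Return the subset of packages that cannot be loaded in this R installation.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  message("Missing R packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

For h2o itself, the download page of the H2O documentation describes the recommended installation route, which may differ from a plain CRAN install.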

The results may also be used on Big Data clusters with the help of H2O.ai Sparkling Water (https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_h2o_sparkling_water)

Which output is there to be interpreted

Models are stored in the folder
/model/<full model name>/<model name>.zip
-> as MOJO model format (certain model types cannot be stored and reused, so they are excluded as of now)
/model/<full model name>/<model name>
-> genuine H2O model stored in a folder (can be reused from H2O itself; could also store the Stacked Ensemble models as well as XGBoost)
/model/validate/h2o_list_of_models.csv
-> list of all leading models from the runs with their RMSE (among other things)

--- individual model results
/model/validate/H2O_AutoML_Regression_yyyymmdd_hhmmh.txt
-> capture of a print command describing the winning model
model_table_H2O_AutoML_Regression_yyyymmdd_hhmmh.table
-> a KNIME table with a collection of parameters and information about the model
H2O_AutoML_Regression_yyyymmdd_hhmmh....
-> CSV files containing important information, among these:
- _leaderboard = the list of all tested models in the run
- _model_summary = the characteristics of the winning model (depth
- _variable_importances = !!! check if the variable importance makes sense
H2O_AutoML_Regression_yyyymmdd_hhmmh.xlsx
-> an Excel file containing important information, among these:
- model_eval = a check split up into several numeric bins to see if the model performs well across them
- Bland_Altman = a Bland-Altman plot (experimental)
- all_stat = summary of statistics

---- 4 graphics for each model to have visual support when interpreting the results (needs R)
(for more details see /script/H2O.ai AutoML in KNIME for regression problems.pdf)
model_graph_H2O_AutoML_Regression_yyyymmdd_hhmmh.png
-> two lines set next to each other to represent the deviation in a linear format
model_graph_H2O_AutoML_Regression_yyyymmdd_hhmmh_hexbin.png
-> a hexbin plot giving you a compact idea about the position of prediction (submission) and truth (solution) with regard to big blocks (are the large blocks positioned where you would like them?)
model_graph_H2O_AutoML_Regression_yyyymmdd_hhmmh_parallel_plot.png
-> a parallel plot to see if there is a trend with regard to certain individual numbers
model_graph_H2O_AutoML_Regression_yyyymmdd_hhmmh_bias.png
-> a Bland-Altman plot

R code of the wrapper (R snippet inside KNIME):

    input_table_2 <- knime.in

    # H2O.ai AutoML applied

    # library(arrow)
    library(h2o)

    # ---> EDIT:
    # workpath_r <- "/Users/m_lauber/Dropbox/knime-workspace/kaggle/census_income_bigdata_40/data/"
    # workpath_r <- "C:\\Users\\A9699459\\knime-workspace\\kaggle\\census_income_bigdata_40\\data"
    # workspace_name <- paste0(workpath_r, "workspace_410.RData")
    # setwd(workpath_r) # Set work directory

    # http://biostat.mc.vanderbilt.edu/wiki/pub/Main/ColeBeck/datestimes.pdf
    # timestamp parts for the model name (formatted directly to avoid time zone shifts)
    var_timestamp_day <- format(Sys.Date(), format = "%Y%m%d")
    # print(paste("var_timestamp_day: ", var_timestamp_day))
    var_timestamp_time <- format(Sys.time(), format = "%H%M")
    # print(paste("var_timestamp_time: ", var_timestamp_time))

    # _edit: if you want to have another model name
    var_model_name <- "H2O_AutoML_Regression"
    var_model_name_full <- paste0(var_model_name, "_", var_timestamp_day, "_", var_timestamp_time)
    # print(paste("var_model_name_full: ", var_model_name_full))

    # import the train and test file from local drive
    # v_data_path <- c("/Users/m_lauber/Dropbox/knime-workspace/hub/automl/kn_automl_h2o_classification_r/data/")
    v_data_path <- c(knime.flow.in[["var_path_data"]])

    # input_table_1 <- as.data.frame(read_parquet(paste0(v_data_path, "train.parquet")))
    # input_table_2 <- as.data.frame(read_parquet(paste0(v_data_path, "test.parquet")))

    # _edit:
    # list variable names to be removed / not being used in model
    # this is here to demonstrate how this would be done
    v_remove_variables <- c("Date1", "Location")

    # grab the columns from the 1st dataframe
    # x <- colnames(input_table_1)

    # name the target variable
    y = 'Target'

    v_dropList <- c(v_remove_variables, y)
    # temporary data frame that only has the predictor columns
    df_x_vars <- head(input_table_1[, !colnames(input_table_1) %in% v_dropList])
    x <- colnames(df_x_vars)

    # save and load the working environment
    # save.image(workspace_name)
    # load(workspace_name)

    # h2o.shutdown(prompt = FALSE)
    h2o.init()
    # https://forum.knime.com/t/python-script-and-h2o-data-frames-error-under-windows/21099/4?u=mlauber71
    h2o.no_progress()
    h2o.clusterStatus()

    # When launching nodes, we recommend allocating a total of four times the memory of your data.
    # h2o.init(ice_root=c(workpath_mac))
    # h2o.init(nthreads=20, max_mem_size="12G")

    train <- as.h2o(input_table_1)
    valid <- as.h2o(input_table_2)

    # https://www.h2o.ai/blog/h2o-release-3-24-yates/
    # http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
    # https://www.rdocumentation.org/packages/h2o/versions/3.16.0.2/topics/h2o.automl
    # https://dzone.com/articles/a-trial-run-with-h2o-automl-automatic-machine-lear

    # "GLM", "DeepLearning", "DRF", "GBM", "StackedEnsemble", "XGBoost"
    # "GLM", "DeepLearning", "DRF", "GBM", "StackedEnsemble"
    # "DeepLearning", "StackedEnsemble", "XGBoost"

    # http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/automl/autoh2o.html
    # you could exclude algorithms as they might not be suitable
    # to be used in Big Data environments
    # exclude_algos =["GBM", "GLM", "DeepLearning", "DRF", "StackedEnsemble", "XGBoost"]

    # For binomial classification choose between "AUC", "logloss", "mean_per_class_error", "RMSE", "MSE".
    # For multinomial classification choose between "mean_per_class_error", "logloss", "RMSE", "MSE".
    # For regression choose between "deviance", "RMSE", "MSE", "MAE", "RMLSE".

    # Run AutoML for 60 seconds or
    # 300 = 5 min, 600 = 10 min, 900 = 15 min, 1800 = 30 min, 3600 = 1 hour,
    # 7200 = 2 hours
    # 14400 = 4 hours
    # 16200 = 4.5 hours
    # 18000 = 5 hours
    # 21600 = 6 hours
    # 25200 = 7 hours
    # 28800 = 8 hours
    # 36000 = 10 hours

    # http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/sort_metric.html
    # http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/exclude_algos.html

    # get the maximum runtime from the KNIME workflow
    max_runtime_secs_opts <- knime.flow.in[["v_runtime_automl"]]

    aml <- h2o.automl(x = x, y = y,
                      training_frame = train,
                      validation_frame = valid,
                      balance_classes = FALSE,
                      max_runtime_secs = max_runtime_secs_opts,
                      seed = 1234,
                      sort_metric = "RMSE",
                      stopping_metric = "RMSE",
                      stopping_tolerance = 0.01,
                      stopping_rounds = 25,
                      project_name = var_model_name_full,
                      # exclude_algos = c("GBM", "GLM", "DeepLearning", "DRF", "StackedEnsemble")
                      # exclude_algos = c("DRF", "GLM")
                      exclude_algos = c("DeepLearning", "StackedEnsemble", "XGBoost"))

    # bring the leaderboard to a file
    lb <- aml@leaderboard
    # lb <- h2o.get_leaderboard(aml, extra_columns = "ALL")
    # print(lb)
    tb_leaderboard <- as.data.frame(lb)

    # var_selected_model <- "GBM_1_AutoML_20191214_123545"
    var_selected_model <- aml@leader@model_id
    # print(paste("var_selected_model :", var_selected_model))

    # get the extracted model
    extracted_model <- h2o.getModel(var_selected_model)

    # extract important tables from model to store later
    tb_variable_importances <- h2o.varimp(extracted_model)
    tb_model_summary <- as.data.frame(extracted_model@allparameters)
    # print(tb_variable_importances)

    # var_path_model <- c("/Users/m_lauber/Dropbox/knime-workspace/hub/automl/kn_automl_h2o_classification_r/model/")
    var_path_model <- c(knime.flow.in[["var_path_model"]])

    # var_path_validate <- c("/Users/m_lauber/Dropbox/knime-workspace/hub/automl/kn_automl_h2o_classification_r/model/validate/")
    var_path_validate <- c(knime.flow.in[["var_path_validate"]])

    v_csv_file_variable_importance <- paste0(var_path_validate, var_model_name_full, "_variable_importance.csv")
    write.table(tb_variable_importances, v_csv_file_variable_importance, sep = "|")

    var_model_name_path <- paste0(var_path_model, var_model_name_full)

    # just save as 'normal' H2O Model - you can easily use later again with R
    # h2o.saveModel(aml@leader, path=c(var_loc_mojo_model), force=TRUE)
    h2o.saveModel(extracted_model, path = c(var_model_name_path), force = TRUE)

    # save the best model as MOJO model (seems to be a problem bringing it back from MOJO)
    # it seems you can save an ensemble but not bring it back, but anyway
    h2o.saveMojo(extracted_model, path = c(var_model_name_path), force = TRUE)

    var_model_file_name <- paste0(var_path_model, var_model_name_full, knime.flow.in[["file.separator"]], var_selected_model)
    var_mojo_file_name <- paste0(var_path_model, var_model_name_full, knime.flow.in[["file.separator"]], var_selected_model, ".zip")

    # saved_h2o_model <- h2o.loadModel(var_model_file_name)
    saved_mojo_model <- h2o.import_mojo(var_mojo_file_name)

    # https://www.kaggle.com/sramirez/deep-learning-in-r-with-h2o
    test_prediction <- as.data.frame(h2o.predict(saved_mojo_model, valid, type = 'raw'))

    output_table_2 <- cbind(input_table_2, test_prediction)

    v_csv_file_leaderboard <- paste0(var_path_validate, var_model_name_full, "_leaderboard.csv")
    write.table(tb_leaderboard, v_csv_file_leaderboard, sep = "|")

    v_csv_file_model_summary <- paste0(var_path_validate, var_model_name_full, "_model_summary.csv")
    write.table(tb_model_summary, v_csv_file_model_summary, sep = "|")

    var_txt_summary = paste0(var_path_validate, var_model_name_full, ".txt")
    sink(var_txt_summary)
    print(extracted_model)
    sink()

    # ------------------------------------------------------------------------
    # export the necessary variables as flow variables to KNIME
    # https://forum.knime.com/t/creating-flow-variables-in-r-scripting-nodes/5701/2?u=mlauber71
    # https://forum.knime.com/t/creating-flow-variables-with-r-snippet-works-but/9007?u=mlauber71
    knime.flow.out <- list(var_timestamp_day = var_timestamp_day,
                           var_timestamp_time = var_timestamp_time,
                           var_model_name = var_model_name,
                           var_model_name_full = var_model_name_full,
                           x = paste(c(x), collapse = ', '),
                           y = y,
                           max_runtime_secs_opts = max_runtime_secs_opts,
                           var_path_model = var_path_model,
                           var_mojo_file_name = var_mojo_file_name,
                           v_csv_file_variable_importance = v_csv_file_variable_importance,
                           v_csv_file_leaderboard = v_csv_file_leaderboard,
                           v_csv_file_model_summary = v_csv_file_model_summary,
                           var_txt_summary = var_txt_summary)

    # shut down the H2O cluster in the end
    # h2o.shutdown(prompt = FALSE)

Subfolders to check
/data/
-> contains the original data
/model/
-> contains the stored models in MOJO and H2O format
/model/validate/
-> contains the validations and graphics
/script/
-> 'pure' R code (if you do not wish to use the KNIME wrapper): kn_automl_h2o_regression_r.R
-> a PDF with further information about the methods used: H2O.ai AutoML in KNIME for regression problems.pdf

R setup notes:

    # make sure you have R and the necessary R packages installed, also check out the PDF in /script/
    # Install R alongside KNIME on Windows and MacOS
    # https://forum.knime.com/t/install-r-alongside-knime-on-windows-and-macos/13287
    # R and Rtools
    # https://forum.knime.com/t/how-to-import-tables-from-docx-documents-via-r-snippet/19284/10
    # RServe 1.8.6 on MacOSX
    # https://forum.knime.com/t/installing-rserve-1-8-6-on-macos-10-15-catalina/20909/6?u=mlauber71
    library(h2o)
    # if you wish to use the 'pure' R code and import the data with parquet
    library(arrow)

R packages needed: ggplot2, lift, hexbin, scales
If you use the R wrapper you will need the h2o package, and the arrow package if you plan on using the pure R script in the /script/ subfolder.
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/downloading.html

Inspect the models so far and see the results. This will also give you a quick idea of where you stand and what you would be able to achieve, along with all parameters needed to load the respective model.

House Prices - Advanced Regression Techniques
Predict sales prices and practice feature engineering, RFs, and gradient boosting
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview/evaluation
Metric
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)

Node annotations in the workflow:
- train.table
- test.table
- var_model_name_full
- RMSE ASC
- ^(.*submission|solution).*$
- solution to double
- keep best model
- input_table_2 and main R code wrapper
- input_table_1
- scored output
- create initial Test and Training data (Kaggle House Prices)
- edit: v_runtime_automl, set the maximum runtime of H2O.ai AutoML in seconds
- no paths
- h2o_list_of_models.csv
- Read Variable importance
- Read the MOJO model
- Score the test table; you might also use a third table to validate that has not been used developing the model
- extract parameters from R which have been used to calculate the model
- h2o_list_of_models.csv: append if CSV already exists to collect all model runs
- Model Quality Numeric - Graphics

Nodes used in the workflow: Table Reader, Numeric Scorer, Transpose, Joiner, Constant Value Column, Column Resorter, RowID, Sorter, Column Rename, Column Filter, Math Formula, Row Filter, Add Table To R, Table to R, R to Table, Test Training, Integer Input, collect meta data, Merge Variables, CSV Reader, Table Row to Variable, String to Path (Variable), H2O MOJO Reader, H2O MOJO Predictor (Regression), Variable to Table Row, CSV Writer
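
The R code writes its result tables with write.table(..., sep = "|"), so they are pipe-separated and include row names. The following sketch shows one way to read such a file back in base R. The two-row leaderboard, its model names and its column names are made up for illustration; a real leaderboard comes from aml@leaderboard and has more columns.

```r
# Stand-in for a leaderboard; the real one comes from aml@leaderboard.
tb_leaderboard <- data.frame(
  model_id = c("GBM_1", "DRF_1"),
  rmse     = c(0.12, 0.15),
  stringsAsFactors = FALSE
)

# Write it the way the workflow does: pipe-separated, row names included.
csv_file <- file.path(tempdir(), "H2O_AutoML_Regression_demo_leaderboard.csv")
write.table(tb_leaderboard, csv_file, sep = "|")

# read.table() sees a header with one field fewer than the data rows and
# therefore treats the first column as row names.
lb <- read.table(csv_file, header = TRUE, sep = "|", stringsAsFactors = FALSE)

# Pick the model with the smallest RMSE.
best <- lb[which.min(lb$rmse), ]
print(best$model_id)
```

Inside KNIME, a CSV Reader node configured with "|" as the column delimiter serves the same purpose.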
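
The Kaggle metric quoted above compares the logarithms of predicted and observed prices. As a minimal sketch of that calculation in base R (the function name rmse_log and the prices are invented for illustration, and this is not the workflow's own scoring code):

```r
# RMSE between the logs of predicted and observed values.
rmse_log <- function(predicted, observed) {
  stopifnot(length(predicted) == length(observed),
            all(predicted > 0), all(observed > 0))
  sqrt(mean((log(predicted) - log(observed))^2))
}

# Made-up prices for illustration: a 10% overestimate costs the same
# for a cheap house as for an expensive one.
rmse_log(110000, 100000)
rmse_log(1100000, 1000000)
```

Both calls return the same value, log(1.1), which is the point of taking logs: the error depends on the ratio between prediction and truth, not on the absolute price level.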
