
s_618_h2o_automl_spark

s_618 - H2O.ai AutoML (generic KNIME nodes) in KNIME for classification problems - a powerful automated machine learning framework applied via Sparkling Water on a Big Data system


It features various models such as Random Forest along with Deep Learning. The results are written to a folder and the models are stored in MOJO format to be used in KNIME (as well as on a Big Data cluster via Sparkling Water). One major parameter to set is the running time the model has to test various models and do some hyperparameter optimization. The best model of each round is stored, and some graphics are produced to inspect the results.
To run the validations in this workflow you have to install R with several packages, or use the provided Conda Environment Propagation. Please refer to the green box on the right.

The results may also be used on Big Data clusters with the help of H2O.ai Sparkling Water (https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_h2o_sparkling_water). This is also shown in the s_620 node in this collection:

https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_bigdata_h2o_automl_spark_46

# Run AutoML for 60 seconds or
# 300 = 5 min, 600 = 10 min, 900 = 15 min, 1800 = 30 min, 3600 = 1 hour,
# 7200 = 2 hours, 14400 = 4 hours, 16200 = 4.5 hours, 18000 = 5 hours,
# 21600 = 6 hours, 25200 = 7 hours, 28800 = 8 hours, 36000 = 10 hours
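The runtime values above can also be computed instead of looked up. A minimal Python sketch (the flow variable name v_runtime_automl comes from the workflow; the helper function is my own):

```python
# Convert a human-readable duration into the seconds value expected by
# the workflow flow variable v_runtime_automl (H2O AutoML maximum runtime).
def runtime_seconds(hours: float = 0, minutes: float = 0) -> int:
    return int(hours * 3600 + minutes * 60)

v_runtime_automl = runtime_seconds(hours=1)  # 3600 seconds = 1 hour
```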
Which output is there to be interpreted:

Models are stored in the folder /model/<full model name>/<model name>.zip as MOJO model format (certain model types cannot be stored and reused, so they are excluded as of now).

/model/validate/h2o_list_of_models.csv -> list of all leading models from the runs with their RMSE (among other things)

Individual model results:

/model/validate/model_table_H2O_AutoML_Classification_yyyymmdd_hhmmh.table -> a KNIME table with a collection of parameters and information about the model

H2O_AutoML_Classification_yyyymmdd_hhmmh_* -> CSV files containing important information, among these:
- _leaderboard = the list of all tested models in the run

H2O_AutoML_Classification_yyyymmdd_hhmmh.xlsx -> an Excel file containing important information, among these:
- model_cutoff = check the best cutoff for your business case; depending on the number of different scores you will get the scores rounded to 0.1 or 0.01 (max_cohens_kappa = based on best Cohen's Kappa, max_f_measure = based on best F1 score)
- model_cutoff_overview = compact overview of cut-off results

Four graphics for each model to have visual support when interpreting the results (needs R):

model_graph_H2O_AutoML_Class_yyyymmdd_hhmmh_cutoff.png -> a graphic illustrating the consequences of two possible cut-offs (with statistics). Please note: depending on your business needs you might choose completely different ones.

model_graph_H2O_AutoML_Class_yyyymmdd_hhmmh_roc.png -> a classic ROC (receiver operating characteristic) curve with statistics

model_graph_H2O_AutoML_Class_yyyymmdd_hhmmh_lift.png -> a classic lift curve with statistics
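The model_cutoff sheet reports the cut-off that maximizes Cohen's Kappa or the F1 score. A pure-Python sketch of how such a scan could work (the workflow itself computes this in R / KNIME nodes; the function names are my own):

```python
# Scan candidate cut-offs on a score column and report which one
# maximizes a given metric (Cohen's Kappa or F1).
def confusion(y_true, score, cutoff):
    tp = sum(1 for t, s in zip(y_true, score) if s >= cutoff and t == 1)
    fp = sum(1 for t, s in zip(y_true, score) if s >= cutoff and t == 0)
    fn = sum(1 for t, s in zip(y_true, score) if s < cutoff and t == 1)
    tn = sum(1 for t, s in zip(y_true, score) if s < cutoff and t == 0)
    return tp, fp, fn, tn

def f1(tp, fp, fn, tn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def cohens_kappa(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    po = (tp + tn) / n  # observed agreement
    # chance agreement from the marginals of the confusion matrix
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return (po - pe) / (1 - pe) if pe != 1 else 0.0

def best_cutoff(y_true, score, metric, grid=None):
    grid = grid or [i / 100 for i in range(1, 100)]  # scores rounded to 0.01
    return max(grid, key=lambda c: metric(*confusion(y_true, score, c)))
```

Depending on your business case, the cut-off maximizing Kappa and the one maximizing F1 can differ; the Excel sheet shows both so you can choose.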
The lift curve illustrates how the TOP 10% of your scores are doing compared to the rest.

model_graph_H2O_AutoML_Class_yyyymmdd_hhmmh_ks.png -> two curves illustrating the Kolmogorov-Smirnov goodness-of-fit test

Subfolders to check:

/data/ contains the original data
/model/ contains the stored models in MOJO and H2O format
/model/validate/ contains the validations and graphics
/script/ contains a PDF with further information about the methods used: "H2O.ai AutoML in KNIME for classification problems.pdf"

R environment:

# you can use the Conda Environment Propagation providing an R installation
# make sure you have R and the necessary R packages installed, also check out the pdf in /script/
# https://hub.knime.com/mlauber71/spaces/Public/latest/_r_installation_on_knime_collection~tj5tS_6gYvqOSPlk
# Install R alongside KNIME on Windows and MacOS
# https://forum.knime.com/t/install-r-alongside-knime-on-windows-and-macos/13287
# R and Rtools
# https://forum.knime.com/t/how-to-import-tables-from-docx-documents-via-r-snippet/19284/10
# RServe 1.8.6+ on MacOSX
# https://forum.knime.com/t/installing-rserve-1-8-6-on-macos-10-15-catalina/20909/6?u=mlauber71
# if you wish to use the 'pure' R code and import the data with parquet: library(arrow)

Additional R packages needed: ggplot2, lift, reshape2

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/downloading.html

"The poor man's ML Ops" - store the SQL rules, the regex string to select the final variables, the final model and the evaluation in one folder with a unique ID.

Workflow annotations:

- Do the H2O.ai AutoML on Spark Cluster
- Propagate R environment for KNIME on MacOS with Miniconda / on Windows with Miniforge; configure how to handle the environment (default = just check the names)
- edit: v_runtime_automl - set the maximum runtime of H2O.ai AutoML in SECONDS
- H2O on Spark Cluster / H2O on local machine with column filter ^(?!customer_number$).* (use all columns except customer_number)
- h2o_list_of_models.csv - append if the CSV already exists to collect all model runs
- var_model_path, var_model_name_full, var_leaderboard_path - flow variables holding the model and leaderboard paths
- Row Filter regex ^(.*submission|solution).*$ - keep the solution/submission columns; sort the leaderboard by LogLoss, take the first row (best model) and write the MOJO model
- ../data/data_70_file.parquet and ../data/data_30_file.parquet - the training (70%) and test (30%) data; DROP TABLE IF EXISTS default.data_70; wrapped in Try/Catch error catching
- Score the test table (data_30); you might also use a third table to validate that has not been used developing the model
- REFRESH TABLE default.data_70; / REFRESH TABLE #table# => make sure the Spark environment 'knows' about the table
- pattern_model_search - list all relevant files and select only the files containing the Model ID, e.g. $Location$ LIKE $${Spattern_model_search}$$ => TRUE
- copy files to sub-folder ../model/<model ID>: clear the subfolder if it exists, then create an empty sub-folder
- => create a local big data context; if you encounter any problems, close KNIME and delete all data from the folder /big_data/ and start over

Nodes used: Java Edit Variable (simple), Create H2O Sparkling Water Context, Hive to Spark, Spark to H2O, Integer Input, collect meta data, Merge Variables, RowID, Column Filter, CSV Writer, Column Resorter, H2O MOJO Reader, Binary Classification Inspector, ROC Curve (local), Constant Value Column, Number To String, Column Rename, H2O AutoML Learner, H2O Model to MOJO, H2O MOJO Writer, Row Filter, Table Row to Variable, String to Path (Variable), String Manipulation, Joiner, Model Quality Classification - Graphics, knime_r_environment, knime_r_environment_windows, Parquet Reader, Try (Variable Ports), Catch Errors (Var Ports), DB SQL Executor, DB Table Creator, DB Loader, DB Table Selector, Table to Spark, Spark H2O MOJO Predictor (Classification), Spark to Table, Spark SQL Query, List Files/Folders, Rule-based Row Filter, Path to String, Transfer Files (Table), Delete Files/Folders, Create Folder, Create File/Folder Variables, local big data context create, H2O Local Context, Table to H2O
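The Column Filter regex ^(?!customer_number$).* uses a negative look-ahead to keep every column except the ID column. Its behavior can be checked with Python's re module (the column names below are illustrative, not from the workflow's data set):

```python
import re

# Negative look-ahead: match any column name that is NOT exactly 'customer_number'.
pattern = re.compile(r"^(?!customer_number$).*")

columns = ["customer_number", "age", "income", "target"]  # illustrative names
kept = [c for c in columns if pattern.match(c)]
# 'customer_number' is excluded; all other columns pass the filter
```

Note the anchored `$` inside the look-ahead: a column named, say, customer_number2 would still be kept, because only the exact name customer_number is excluded.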
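The annotation "append if CSV already exists to collect all model runs" for h2o_list_of_models.csv describes a common collect-results pattern. A stdlib sketch (the field names are assumed for illustration; the workflow's actual CSV has more columns):

```python
import csv
import os

def append_model_run(path, row, fieldnames):
    """Append one model-run summary to the collecting CSV,
    writing the header only when the file does not exist yet."""
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if new_file:
            writer.writeheader()
        writer.writerow(row)

# assumed column names; the workflow stores more metrics per run
fields = ["model_id", "logloss", "auc"]
append_model_run("h2o_list_of_models.csv",
                 {"model_id": "GBM_1", "logloss": 0.41, "auc": 0.87}, fields)
```

Appending rather than overwriting is what lets the workflow compare leading models across many AutoML runs in one file.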
