
kn_automl_h2o_classification_python

H2O.ai AutoML (wrapped with Python) in KNIME for classification problems

H2O.ai AutoML (wrapped with Python) in KNIME for classification problems - a powerful auto-machine-learning framework (https://hub.knime.com/mlauber71/spaces/Public/latest/automl/)
v 1.90

It features various models like Random Forest or XGBoost along with Deep Learning. It has wrappers for R and Python and can also be used from KNIME. The results are written to a folder and the models are stored in MOJO format to be used in KNIME (as well as on a Big Data cluster via Sparkling Water). One major parameter to set is the running time AutoML has to test various models and do some hyperparameter optimization. The best model of each round is stored and some graphics are produced to inspect the results.

To run this workflow you have to install Python (https://medium.com/p/2ac217792539) and H2O.ai as well as R (https://medium.com/p/6494a2a498cc) and several packages. Please refer to the green box on the right.

The results may also be used on Big Data clusters with the help of H2O.ai Sparkling Water (https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_h2o_sparkling_water)

Which output is there to be interpreted

Models are stored in the folder /model/

H2O_AutoML_Classification_yyyymmdd_hhmmh_....zip
-> the model in MOJO format (certain model types cannot be stored and reused, so they are excluded as of now)
H2O_AutoML_Classification_yyyymmdd_hhmmh_... (folder)
-> genuine H2O model stored in a folder (can be reused from H2O itself)

/model/validate/h2o_list_of_models.csv
-> list of all leading models from the runs with their RMSE (among other things)

--- individual model results
/model/validate/H2O_AutoML_Classification_yyyymmdd_hhmmh.txt
-> capture of a print command describing the winning model
model_table_H2O_AutoML_Classification_yyyymmdd_hhmmh.table
-> a KNIME table with a collection of parameters and information about the model
H2O_AutoML_Classification_yyyymmdd_hhmmh.xlsx
-> an Excel file containing important information, among these:
- leaderboard = the list of all tested models in the run
- model_summary = the characteristics of the winning model (depth ...)
- variable_importances = !!! check whether the variable importance makes sense
- model_cutoff = check the best cutoff for your business case (depending on the number of different scores you will get the scores rounded to 0.1 or 0.10) (max_cohens_kappa = based on best Cohen's Kappa, max_f_measure = based on best F1 score)
- model_cutoff_overview = compact overview of cut-off results

---- 4 graphics for each model to have visual support when interpreting the results (needs R)
model_graph_H2O_AutoML_Class_yyyymmdd_hhmmh_cutoff.png
-> a graphic illustrating the consequences of two possible cut-offs (with statistics). Please note that depending on your business needs you might choose completely different ones
model_graph_H2O_AutoML_Class_yyyymmdd_hhmmh_roc.png
-> a classic ROC (receiver operating characteristic) curve with statistics
model_graph_H2O_AutoML_Class_yyyymmdd_hhmmh_lift.png
-> a classic lift curve with statistics, illustrating how the top 10% of your scores are doing compared to the rest
model_graph_H2O_AutoML_Class_yyyymmdd_hhmmh_ks.png
-> two curves illustrating the Kolmogorov-Smirnov Goodness-of-Fit Test

Subfolders to check
/data/ contains the original data
/model/ contains the stored models in MOJO and H2O format
/model/validate/ contains the validations and graphics
/script/ contains a Jupyter notebook with a 'pure' Python script (if you do not wish to use the KNIME wrapper): kn_automl_h2o_classification_python.ipynb
and a PDF with further information about the methods used: H2O.ai AutoML in KNIME for classification problems.pdf

Python Script (H2O.ai AutoML):

    import knime.scripting.io as knio

    import contextlib
    import sys
    import time
    import datetime as dt

    import numpy as np # linear algebra
    import os # accessing directory structure
    import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

    input_table_1 = knio.input_tables[0].to_pandas()
    input_table_2 = knio.input_tables[1].to_pandas()

    # http://strftime.org
    var_timestamp_day = "{}".format(time.strftime("%Y%m%d"))
    knio.flow_variables['var_timestamp_day'] = var_timestamp_day
    print("var_timestamp_day: ", var_timestamp_day)

    var_timestamp_time = "{}h".format(time.strftime("%H%M"))
    knio.flow_variables['var_timestamp_time'] = var_timestamp_time
    print("var_timestamp_time: ", var_timestamp_time)

    # _edit: if you want to have another model name
    var_model_name = "H2O_AutoML_Classification"
    knio.flow_variables['var_model_name'] = var_model_name

    var_model_name_full = var_model_name + "_" + var_timestamp_day + "_" + var_timestamp_time
    knio.flow_variables['var_model_name_full'] = var_model_name_full
    print("var_model_name_full: ", var_model_name_full)

    # df_train = input_table_1.copy()
    # df_validate = input_table_2.copy()

    # https://stackoverflow.com/questions/36268749/remove-multiple-items-from-a-python-list-in-just-one-statement
    # _edit:
    # manually enter variables you want to remove
    # this can help if you have a large number of variables you want to keep
    # and a few you want removed
    # the examples are there to demonstrate the usage
    v_remove_variables = {'Date1', 'Location', 'RISK_MM'}

    # grab the columns from the 1st dataframe
    x = input_table_1.columns
    # name the target variable
    y = 'Target'
    # drop the target variable from the list of all variables
    x = x.drop(y)
    # remove all variables you want to have removed from the list
    x = [e for e in x if e not in v_remove_variables]

    # see which variables we have selected in the end
    # print('x = ', x)
    knio.flow_variables['var_x_values'] = x
    # print('y = ', y)
    knio.flow_variables['var_y_values'] = y

    # initiate h2o
    # if it is already running it will connect to the running cluster
    import h2o
    from h2o.automl import H2OAutoML
    h2o.init()
    # https://forum.knime.com/t/python-script-and-h2o-data-frames-error-under-windows/21099/4?u=mlauber71
    h2o.no_progress()

    # import the df data into H2O data system
    train = h2o.H2OFrame(input_table_1.copy())
    valid = h2o.H2OFrame(input_table_2.copy())

    # if it is a classification task make sure the Target is a factor
    train[y] = train[y].asfactor()
    valid[y] = valid[y].asfactor()

    # Run AutoML for 60 seconds or
    # 300 = 5 min, 600 = 10 min, 900 = 15 min, 1800 = 30 min, 3600 = 1 hour,
    # 7200 = 2 hours
    # 14400 = 4 hours
    # 16200 = 4.5 hours
    # 18000 = 5 hours
    # 21600 = 6 hours
    # 25200 = 7 hours
    # 28800 = 8 hours
    # 36000 = 10 hours
    # http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/sort_metric.html
    # http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/exclude_algos.html

    # get the maximum runtime from the KNIME workflow
    max_runtime_secs_opts = knio.flow_variables['v_runtime_automl']

    # print the start time and the projected end time of the run
    var_now = dt.datetime.now()
    var_startmodel_day = "{}".format(var_now.strftime("%Y%m%d"))
    print("var_startmodel_day: ", var_startmodel_day)
    var_startmodel_time = "{}h".format(var_now.strftime("%H%M"))
    print("var_startmodel_time: ", var_startmodel_time)

    v_endtime = var_now + dt.timedelta(seconds=max_runtime_secs_opts)
    var_endmodel_day = "{}".format(v_endtime.strftime("%Y%m%d"))
    print("var_endmodel_day: ", var_endmodel_day)
    var_endmodel_time = "{}h".format(v_endtime.strftime("%H%M"))
    print("var_endmodel_time: ", var_endmodel_time)

    # you could exclude algorithms as they might not be suitable e.g. for export as MOJO files
    # or to be used in Big Data environments
    # exclude_algos = ["GBM", "GLM", "DeepLearning", "DRF", "StackedEnsemble", "XGBoost"]

    # For binomial classification choose between "AUC", "logloss", "mean_per_class_error", "RMSE", "MSE".
    # For multinomial classification choose between "mean_per_class_error", "logloss", "RMSE", "MSE".
    # For regression choose between "deviance", "RMSE", "MSE", "MAE", "RMLSE".
    aml = H2OAutoML(max_runtime_secs = max_runtime_secs_opts,
                    seed = 1234,
                    sort_metric = "AUC",
                    stopping_metric = "AUC",
                    stopping_tolerance = 0.01,
                    stopping_rounds = 25,
                    project_name = var_model_name_full,
                    # exclude_algos = ["GBM", "GLM", "DeepLearning", "DRF", "StackedEnsemble"]
                    # exclude_algos = ["DRF", "GLM"]
                    exclude_algos = ["DeepLearning", "StackedEnsemble", "XGBoost"]
                    )

    # x - all the variables we want to use to explain the target
    # y - target variable - in this case "Target"
    aml.train(x = x, y = y, training_frame = train, validation_frame = valid)

    # View the AutoML Leaderboard
    lb = aml.leaderboard
    tb_leaderboard = lb.as_data_frame(use_pandas=True, header=True)

    # var_selected_model = "GBM_1_AutoML_20191214_123545"
    var_selected_model = aml.leader.model_id
    # print("var_selected_model :", var_selected_model)
    knio.flow_variables['var_selected_model'] = var_selected_model

    # get the extracted model
    extracted_model = h2o.get_model(var_selected_model)

    # extract important tables from model to store later
    tb_variable_importances = extracted_model._model_json['output']['variable_importances'].as_data_frame()
    tb_model_summary = extracted_model._model_json['output']['model_summary'].as_data_frame()
    # print(tb_variable_importances)

    # Export the variable importance list
    # _edit:
    # var_path_validate = "../model/validate/"
    v_csv_file_variable_importance = knio.flow_variables['var_path_validate'] + var_model_name_full + "_variable_importance.csv"
    tb_variable_importances.to_csv(v_csv_file_variable_importance, sep='|', encoding='utf-8')
    knio.flow_variables['v_csv_file_variable_importance'] = v_csv_file_variable_importance

    # predict the validation data with the non-MOJO saved model
    # preds = extracted_model.predict(valid)

    # save the model as generic H2O model
    var_model_name_path = knio.flow_variables['var_path_model'] + var_model_name_full + "_" + var_selected_model
    knio.flow_variables['var_model_name_path'] = var_model_name_path
    model_path = h2o.save_model(model=extracted_model, path=var_model_name_path, force=True)
    # load the model
    # saved_model = h2o.load_model(model_path)

    # save the model as MOJO which you could read back in with KNIME
    var_mojo_file_name = knio.flow_variables['var_path_model'] + var_model_name_full + "_" + var_selected_model + ".zip"
    knio.flow_variables['var_mojo_file_name'] = var_mojo_file_name
    print("var_mojo_file_name: ", var_mojo_file_name)

    # download the MOJO model and reload it
    extracted_model.download_mojo(var_mojo_file_name)
    saved_mojo_model = h2o.import_mojo(var_mojo_file_name)

    # the prediction on the validation dataset will be brought back to KNIME
    output_predict = saved_mojo_model.predict(valid).as_data_frame()

    # some important tables will be stored in an Excel file
    # -------- START Excel-----------------------------------------------------------
    var_xlsx_summary = knio.flow_variables['var_path_validate'] + var_model_name_full + ".xlsx"
    knio.flow_variables['var_xlsx_summary'] = var_xlsx_summary

    raw_data = {'Model_ID': [var_model_name_full],
                'Selected Model Name': [var_selected_model],
                }
    df_id = pd.DataFrame(raw_data, columns = ['Model_ID', 'Selected Model Name'])

    # https://stackoverflow.com/questions/42370977/how-to-save-a-new-sheet-in-an-existing-excel-file-using-pandas/42371251
    # the context manager saves and closes the file
    # (writer.save() is no longer available in recent pandas versions)
    with pd.ExcelWriter(var_xlsx_summary) as writer:
        df_id.to_excel(writer, sheet_name = 'summary')
        tb_leaderboard.to_excel(writer, sheet_name = 'leaderboard')
        tb_model_summary.to_excel(writer, sheet_name = 'model_summary')
        tb_variable_importances.to_excel(writer, sheet_name = 'variable_importances')
    # -------- END Excel-----------------------------------------------------------

    var_txt_summary = knio.flow_variables['var_path_validate'] + var_model_name_full + ".txt"
    knio.flow_variables['var_txt_summary'] = var_txt_summary

    # capture the model summary in a TXT file
    # -------- START summary output to txt -----------------------------------------------------------
    # redirect stdout only temporarily so that later print calls still work
    with open(var_txt_summary, 'w') as f, contextlib.redirect_stdout(f):
        print(extracted_model)
    # -------- END summary output to txt -----------------------------------------------------------

    # ------ store Python package versions in KNIME flow variables
    knio.flow_variables['var_py_version'] = sys.version_info
    knio.flow_variables['var_py_version_pandas'] = pd.__version__
    knio.flow_variables['var_py_version_h2o'] = h2o.__version__

    # 1st output is the leaderboard to see where the automation stands
    # and what alternatives were there
    output_table_1 = tb_leaderboard.copy()

    # 2nd output is the new prediction. Make sure the prediction is saved as Double / Float variable
    output_table_2 = pd.concat([valid.as_data_frame(), output_predict], axis=1)

    knio.output_tables[0] = knio.Table.from_pandas(output_table_1)
    knio.output_tables[1] = knio.Table.from_pandas(output_table_2)

Python environment check:

    # make sure you have Python and the necessary Python packages installed, also check out the pdf in /script/
    # https://docs.knime.com/latest/python_installation_guide/index.html
    import numpy as np # linear algebra
    import os # accessing directory structure
    import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
    print("pandas (pd) version: ", pd.__version__)
    print("numpy (np) version", np.__version__)

    # http://strftime.org
    import time
    import datetime as dt

    # install a specific version
    # conda install -c conda-forge pyarrow=0.15.
    # conda install -c conda-forge pyarrow
    import pyarrow.parquet as pq

    # pip install -f https://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
    import h2o
    print("h2o version", h2o.__version__)

    from pandas import ExcelWriter
    from pandas import ExcelFile
    import sys

Workflow annotations

Inspect the models so far and see the results. This will also give you a quick idea of where you stand and what you would be able to achieve, along with all parameters to load the respective model.

KNIME and R - installation across operating systems - some remarks
https://medium.com/p/6494a2a498cc
Under Apple Silicon currently not all R packages will work with this propagation.

- Propagate Python environment for KNIME on MacOSX with Miniforge / Miniconda: configure how to handle the environment, default = just check the names
- Propagate Python environment for KNIME on MacOSX (Apple Silicon) with Miniforge / Miniconda: configure how to handle the environment, default = just check the names
- Propagate Python environment for KNIME on Windows with Miniforge / Miniconda: configure how to handle the environment, default = just check the names
- Propagate R environment for KNIME on MacOS with Miniconda: configure how to handle the environment, default = just check the names
- Propagate R environment for KNIME on MacOS (Apple Silicon) with Miniforge / Miniconda: configure how to handle the environment, default = just check the names; for Apple Silicon you might currently have to install "RServe" manually
- Propagate R environment for KNIME on Windows with Miniconda: configure how to handle the environment, default = just check the names
- edit: v_runtime_automl, set the maximum runtime of H2O.ai AutoML in SECONDS
- H2O.ai AutoML => will start an H2O cluster via Python in the background; edit: metric used and algorithms excluded
- create initial Test and Training data (Census income classification): train.table, test.table
- var_model_name_full
- ^(.*submission|solution).*$
- Score the test table; you might also use a third table to validate that has not been used developing the model
- solution to string
- you could check out this node
- Read the MOJO model from the /model/ path
- Read the MOJO model
- extract parameters from Python which have been used to calculate the model
- h2o_list_of_models.csv: append if CSV already exists to collect all model runs
- exclude paths
- h2o_list_of_models.csv
- Read Variable importance
- AUC DESC, keep best model
- binary classification models with R (use local R installation) https://medium.com/p/6494a2a498cc
- binary classification models with R

Workflow nodes: conda_environment_kaggle_macosx, Integer Input (legacy), collect meta data, Constant Value Column, Column Resorter, RowID, Column Rename, Column Filter, H2O MOJO Predictor (Classification), ROC Curve (local), Number To String, Binary Classification Inspector, Merge Variables, H2O MOJO Reader, String to Path (Variable), Variable to Table Row, CSV Writer, Column Filter, CSV Reader, Table Row to Variable, CSV Reader, String to Path (Variable), Sorter, Row Filter, Column Filter, H2O MOJO Reader, Test Training, conda_environment_kaggle_windows, Table Reader, Table Reader, knime_r_environment, Model Quality Classification - Graphics (local R), knime_r_environment_windows, Joiner, Python Script, conda_environment_kaggle_apple_silicon, knime_r_environment_apple_silicon, Model Quality Classification - Graphics
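The model_cutoff sheet described above reports a cut-off based on the best F1 score (max_f_measure) and one based on the best Cohen's Kappa (max_cohens_kappa). The following is a minimal sketch of how such cut-offs could be derived from a list of true labels and predicted scores. It is not the code used by the workflow; the function names (confusion_counts, f1_score, cohens_kappa, best_cutoffs) and the 0.1 step are assumptions made for illustration only.

```python
def confusion_counts(y_true, scores, cutoff):
    """Count TP, FP, TN, FN when scores >= cutoff are predicted as class 1."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= cutoff else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def f1_score(tp, fp, tn, fn):
    """F1 = 2*TP / (2*TP + FP + FN); 0.0 if there are no positives at all."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def cohens_kappa(tp, fp, tn, fn):
    """Agreement between prediction and truth, corrected for chance agreement."""
    n = tp + fp + tn + fn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


def best_cutoffs(y_true, scores, step=0.1):
    """Scan cut-offs step, 2*step, ... below 1 and return the best ones.

    Ties are resolved in favour of the lowest cut-off.
    """
    cutoffs = [round(i * step, 10) for i in range(1, round(1 / step))]
    rows = []
    for cutoff in cutoffs:
        counts = confusion_counts(y_true, scores, cutoff)
        rows.append((cutoff, f1_score(*counts), cohens_kappa(*counts)))
    return {
        "max_f_measure": max(rows, key=lambda row: row[1])[0],
        "max_cohens_kappa": max(rows, key=lambda row: row[2])[0],
        "table": rows,
    }


if __name__ == "__main__":
    # made-up labels and scores, only to show the call
    labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
    predicted = [0.05, 0.15, 0.22, 0.35, 0.41, 0.62, 0.38, 0.58, 0.77, 0.91]
    result = best_cutoffs(labels, predicted)
    print("max_f_measure cut-off:   ", result["max_f_measure"])
    print("max_cohens_kappa cut-off:", result["max_cohens_kappa"])
```

Neither of the two cut-offs has to be the right one for a given business case; the table returned alongside them makes it possible to compare other thresholds as well.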
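The lift and KS graphics listed among the outputs summarise how well the scores separate the two classes. As a rough illustration of the underlying numbers (not the R code that produces the graphics), the sketch below computes the lift of the top 10% of scores against the overall positive rate, and the Kolmogorov-Smirnov statistic as the largest gap between the cumulative score distributions of positives and negatives. The function names top_fraction_lift and ks_statistic are hypothetical.

```python
def top_fraction_lift(y_true, scores, fraction=0.1):
    """Positive rate within the top `fraction` of scores divided by the overall rate."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    top_rate = sum(truth for _, truth in ranked[:n_top]) / n_top
    base_rate = sum(y_true) / len(y_true)
    if base_rate == 0:
        raise ValueError("lift is undefined without any positive case")
    return top_rate / base_rate


def ks_statistic(y_true, scores):
    """Largest distance between the empirical score CDFs of the two classes."""
    positives = [s for s, t in zip(scores, y_true) if t == 1]
    negatives = [s for s, t in zip(scores, y_true) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes are needed to compute the KS statistic")
    largest_gap = 0.0
    for threshold in sorted(set(scores)):
        cdf_pos = sum(1 for s in positives if s <= threshold) / len(positives)
        cdf_neg = sum(1 for s in negatives if s <= threshold) / len(negatives)
        largest_gap = max(largest_gap, abs(cdf_pos - cdf_neg))
    return largest_gap


if __name__ == "__main__":
    # made-up labels and scores, only to show the calls
    labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
    predicted = [0.05, 0.15, 0.22, 0.35, 0.41, 0.62, 0.38, 0.58, 0.77, 0.91]
    print("lift of the top 10%:", top_fraction_lift(labels, predicted))
    print("KS statistic:       ", ks_statistic(labels, predicted))
```

A lift well above 1 means the highest-scored cases contain clearly more positives than a random selection would; a KS value close to 1 means the score distributions of the two classes barely overlap.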
