
kn_example_python_xgboost

Use the Python XGBoost package to build a model and deploy it through KNIME Python nodes.

In the subfolder /data/ there is a Jupyter notebook ("kn_example_python_xgboost.ipynb") to experiment with and build XGBoost models.
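The saved model can also be inspected from that notebook. This is a minimal sketch, not part of the workflow itself; the file name "knime_xgboost_model.json" comes from the workflow's flow variable, while the relative path (notebook and model sitting together in /data/) is an assumption:

import xgboost as xgb

# load the model that the KNIME Python Script node saved as JSON (assumed relative path)
booster = xgb.Booster()
booster.load_model("knime_xgboost_model.json")

# inspect the feature names and the gain-based importance scores
print(booster.feature_names)
print(booster.get_score(importance_type="gain"))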


--------
Census Income Data Set
Abstract: Predict whether income exceeds $50K/yr based on census data. Also known as "Adult" dataset.

Extract and prepare the Census Income files for use in KNIME (a rough pandas sketch follows the link below).

https://archive.ics.uci.edu/ml/datasets/census+income
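One possible way to pull and prepare the raw file with pandas before it enters KNIME. The column names follow the UCI documentation; the direct download URL, the 70/30 split, and the parquet file names (train.parquet / test.parquet, matching the files the workflow reads) are assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split

# column names as documented on the UCI page
cols = ["age", "workclass", "fnlwgt", "education", "education-num",
        "marital-status", "occupation", "relationship", "race", "sex",
        "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]

# read the raw comma-separated file (assumed URL), treating "?" as missing
adult = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    names=cols, na_values="?", skipinitialspace=True)

# binary 0/1 target and a string row ID, as expected by the workflow's scripts
adult["Target"] = (adult["income"] == ">50K").astype(int)
adult = adult.drop(columns=["income"])
adult["row_id"] = adult.index.astype(str)

# split once into the train/test parquet files the workflow reads
train, test = train_test_split(adult, test_size=0.3, random_state=42)
train.to_parquet("train.parquet", index=False)
test.to_parquet("test.parquet", index=False)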

https://forum.knime.com/t/saving-xgboost-model-to-pmml-possible-now/45057/5?u=mlauber71

Python Script node -- build the XGBoost model and store it in /data/ as a JSON file:

import knime.scripting.io as knio
import xgboost as xgb
import pandas as pd
from sklearn.model_selection import train_test_split

data = knio.input_tables[0].to_pandas()
var_json_model_file = knio.flow_variables['v_model_json_file']

# exclude columns like IDs, customer numbers etc.
excluded_features = ['row_id']

# define the target variable
label = ['Target']

# features are the columns that go into the model
features = [feat for feat in data.columns if feat not in excluded_features and feat not in label]

num_cols = data[features].select_dtypes(include='number').columns.tolist()
cat_cols = data[features].select_dtypes(exclude='number').columns.tolist()

# XGBoost expects the string columns as pandas categories
data[cat_cols] = data[cat_cols].astype('category')

# training only works when the Target/label is converted to integer
# the Target should be 0/1
data[label] = data[label].astype(int)

X = data[features]
y = data[label]

# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

# convert the data into DMatrix objects that XGBoost can read
# enable_categorical=True is an experimental feature
D_train = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
D_test = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)

# with binary:logistic the output probability seems to have a problem,
# therefore multi-class softprob with only two classes is used
# https://xgboost.readthedocs.io/en/stable/parameter.html
param = {'eta': 0.3, 'max_depth': 7, 'objective': 'multi:softprob', 'num_class': 2}
steps = 1000  # the number of training iterations

# train the model
model = xgb.train(param, D_train, steps)

# save the model as a JSON file
model.save_model(var_json_model_file)

# the commented-out lines would score the test split inside this node;
# KNIME has some problems with various data types
# pred_apply = model.predict(D_test)
# df_pred_apply = pd.DataFrame(pred_apply, columns=['P0', 'P1'])
# the index would have to be reset for this one
# output_table = pd.concat([data, df_pred_apply], axis=1)
# knio.output_tables[0] = knio.Table.from_pandas(output_table)

This is what deployment would look like on new data:

import knime.scripting.io as knio
import xgboost as xgb
import pandas as pd
import numpy as np

# reset the index to integers so the results can be concatenated later
data = knio.input_tables[0].to_pandas()
data = data.reset_index(drop=True)

# the path of the XGBoost model in JSON format
var_json_model_file = knio.flow_variables['v_model_json_file']

excluded_features = ['row_id']
label = ['Target']
features = [feat for feat in data.columns if feat not in excluded_features and feat not in label]

num_cols = data[features].select_dtypes(include='number').columns.tolist()
cat_cols = data[features].select_dtypes(exclude='number').columns.tolist()

data[cat_cols] = data[cat_cols].astype('category')
data[label] = data[label].astype(int)

# import the XGBoost model from the /data/ folder
model_xgb_2 = xgb.Booster()
model_xgb_2.load_model(var_json_model_file)

X2 = data[features].copy()
X2[cat_cols] = X2[cat_cols].astype('category')
D_apply = xgb.DMatrix(X2, enable_categorical=True)

# make the prediction and convert the results to a data frame with two columns P0 and P1
pred_apply = model_xgb_2.predict(D_apply)
df_pred_apply = pd.DataFrame(pred_apply, columns=['P0', 'P1'])

# keep the original Target (as string) and the row ID next to the predictions
data_export = data[['Target', 'row_id']].copy()
data_export[label] = data_export[label].astype(str)
output_table_2 = pd.concat([data_export, df_pred_apply], axis=1)

# cast all numeric columns to float64 so KNIME can handle them
number_cols = output_table_2.select_dtypes(include='number').columns.tolist()
output_table_2[number_cols] = output_table_2[number_cols].astype(np.float64)

knio.output_tables[0] = knio.Table.from_pandas(output_table_2)

The workflow reads train.parquet and test.parquet, builds the XGBoost model in a Python Script node and stores it as "knime_xgboost_model.json" in /data/, then applies that model to the test data twice: once keeping the original "Target" from the test file and once with the Target dropped, as if this were completely new data. For comparison, the same data is run through KNIME's native XGBoost Tree Ensemble Learner/Predictor ("knime_xgboost_model.zip") and a Gradient Boosted Trees model converted to PMML ("knime_model_gbm.pmml"). Binary Classification Inspector nodes evaluate all models, sorted by AUC. Conda environment propagation nodes for Windows, macOS, and Apple Silicon (Miniforge/Miniconda) provide the Python environment, and helper nodes locate the /data/ folder with absolute paths, export the flow variables, and determine the package versions.
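Since the Binary Classification Inspector nodes rank the models by AUC, a similar check can be run on the Python side. This is a minimal sketch, not part of the workflow; it assumes the exported predictions in "jupyter_test_prediction.parquet" contain the string column "Target" and the class-1 probability column "P1" produced by the deployment script above:

import pandas as pd
from sklearn.metrics import roc_auc_score, log_loss

# read the exported predictions (file name from the workflow, content layout assumed)
pred = pd.read_parquet("jupyter_test_prediction.parquet")

y_true = pred["Target"].astype(int)
p1 = pred["P1"]

# AUC on the probability of class 1, plus log loss as a sanity check
print("AUC:", roc_auc_score(y_true, p1))
print("Log loss:", log_loss(y_true, p1))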
