
kn_example_python_xgboost

Use the Python XGBoost package to build a model and deploy it through KNIME Python nodes.

In the subfolder /data/ there is a Jupyter notebook ("kn_example_python_xgboost.ipynb") to experiment with and build XGBoost models.


--------
Census Income Data Set
Abstract: Predict whether income exceeds $50K/yr based on census data. Also known as "Adult" dataset.

Extract and prepare the Census Income files for use in KNIME.

https://archive.ics.uci.edu/ml/datasets/census+income
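
If you want to prepare the raw files outside of KNIME, a minimal pandas sketch could look like the following (assuming "adult.data" has been downloaded from the UCI page; the column names follow the UCI documentation, and the Target / row_id columns match what the workflow's scripts expect):

import pandas as pd

# column names as documented on the UCI page
cols = ["age", "workclass", "fnlwgt", "education", "education-num",
        "marital-status", "occupation", "relationship", "race", "sex",
        "capital-gain", "capital-loss", "hours-per-week", "native-country",
        "income"]

# the raw file has no header row and spaces after the commas
df = pd.read_csv("adult.data", names=cols, skipinitialspace=True)

# build the 0/1 target and a row identifier, as used by the scripts in this workflow
df["Target"] = (df["income"] == ">50K").astype(int)
df = df.drop(columns=["income"])
df["row_id"] = df.index.astype(str)

# store as Parquet so the KNIME Parquet Reader can pick the file up
df.to_parquet("train.parquet", index=False)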

For the discussion about saving XGBoost models to PMML see: https://forum.knime.com/t/saving-xgboost-model-to-pmml-possible-now/45057/5?u=mlauber71

--------

Python Script node: build the XGBoost model and store it in /data/ as a JSON file.

import knime.scripting.io as knio
import xgboost as xgb
import pandas as pd
from sklearn.model_selection import train_test_split

data = knio.input_tables[0].to_pandas()
var_json_model_file = knio.flow_variables['v_model_json_file']

# exclude columns like IDs, customer numbers etc.
excluded_features = ['row_id']

# define the target variable
label = ['Target']

# features are the columns that go into the model
features = [feat for feat in data.columns if feat not in excluded_features and feat not in label]

num_cols = data[features].select_dtypes(include='number').columns.tolist()
cat_cols = data[features].select_dtypes(exclude='number').columns.tolist()

# XGBoost wants the string columns as categories
data[cat_cols] = data[cat_cols].astype('category')

# the target/label has to be converted to integer and should be 0/1
data[label] = data[label].astype(int)

X = data[features]
y = data[label]

# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

# the data has to be converted into matrices that XGBoost can read
# enable_categorical=True is an experimental feature
D_train = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
D_test = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)

# with binary:logistic the output probability seems to have a problem,
# therefore multi-class with only two targets is used
# https://xgboost.readthedocs.io/en/stable/parameter.html

# define the parameters
param = {
    'eta': 0.3,                     # learning rate
    'lambda': 1,                    # L2 regularization term on weights
    'alpha': 1,                     # L1 regularization term on weights
    'gamma': 0,                     # minimum loss reduction required to make a further partition on a leaf node
    'max_delta_step': 0,            # maximum delta step we allow each tree's weight estimation to be
    'booster': 'gbtree',            # which booster to use: gbtree, gblinear or dart
    'max_depth': 7,                 # maximum depth of a tree
    'min_child_weight': 1,          # minimum sum of instance weight (hessian) needed in a child
    'tree_method': 'auto',          # tree construction algorithm used in XGBoost
    'sketch_eps': 0.03,             # used for sketching the data, particularly in approximate algorithms
    'scale_pos_weight': 1,          # control the balance of positive and negative weights
    'grow_policy': 'depthwise',     # controls the way new nodes are added to the tree
    'max_leaves': 0,                # maximum number of leaves, 0 indicates no limit
    'max_bin': 256,                 # number of bins for histogram construction
    'sample_type': 'uniform',       # type of sampling algorithm
    'normalize_type': 'tree',       # type of normalization algorithm
    'rate_drop': 0,                 # dropout rate
    'one_drop': False,              # whether to drop at least one tree in each boosting round
    'skip_drop': 0,                 # probability of skipping the dropout procedure during a boosting iteration
    'subsample': 1,                 # subsample ratio of the training instances
    'colsample_bytree': 1,          # subsample ratio of columns when constructing each tree
    'colsample_bylevel': 1,         # subsample ratio of columns for each level
    'colsample_bynode': 1,          # subsample ratio of columns for each split
    'objective': 'multi:softprob',  # the learning task and the corresponding learning objective
    'num_class': 2,                 # number of classes
    'base_score': 0.5               # the initial prediction score of all instances, global bias
}

steps = 1000  # the number of training iterations

# train the model
model = xgb.train(param, D_train, steps)

# save the model as a JSON file
model.save_model(var_json_model_file)

# KNIME has some problems with various data types, so the node does not
# return the predictions here; they would look like this:
# pred_apply = model.predict(D_test)
# df_pred_apply = pd.DataFrame(pred_apply, columns=['P0', 'P1'])
# the index would have to be reset for this one
# output_table = pd.concat([data, df_pred_apply], axis=1)
# knio.output_tables[0] = knio.Table.from_pandas(output_table)
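Before saving, a quick sanity check on the hold-out split can be added. A minimal sketch using the model, D_test and y_test objects from the script above (roc_auc_score comes from scikit-learn, which the script already depends on; the added lines are not part of the original node):

from sklearn.metrics import roc_auc_score

# multi:softprob returns one probability column per class;
# column 1 holds the probability of Target == 1
pred_test = model.predict(D_test)
auc = roc_auc_score(y_test.values.ravel(), pred_test[:, 1])
print(f"AUC on the hold-out split: {auc:.4f}")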
This is what deployment would look like on new data:

import knime.scripting.io as knio
import xgboost as xgb
import pandas as pd
import numpy as np

# the index has to be reset to an integer so the results can later be concatenated
data = knio.input_tables[0].to_pandas()
data = data.reset_index(drop=True)

# the path of the XGBoost model in JSON format
var_json_model_file = knio.flow_variables['v_model_json_file']

excluded_features = ['row_id']
label = ['Target']
features = [feat for feat in data.columns if feat not in excluded_features and feat not in label]

num_cols = data[features].select_dtypes(include='number').columns.tolist()
cat_cols = data[features].select_dtypes(exclude='number').columns.tolist()

data[cat_cols] = data[cat_cols].astype('category')
data[label] = data[label].astype(int)

# import the XGBoost model from the /data/ folder
model_xgb_2 = xgb.Booster()
model_xgb_2.load_model(var_json_model_file)

X2 = data[features].copy()
X2[cat_cols] = X2[cat_cols].astype('category')
D_apply = xgb.DMatrix(X2, enable_categorical=True)

# make the prediction and convert the results to a data frame with two columns P0 and P1
pred_apply = model_xgb_2.predict(D_apply)
df_pred_apply = pd.DataFrame(pred_apply, columns=['P0', 'P1'])

data_export = data[['Target', 'row_id']].copy()
data_export[label] = data_export[label].astype(str)

output_table_2 = pd.concat([data_export, df_pred_apply], axis=1)
number_cols = output_table_2.select_dtypes(include='number').columns.tolist()
output_table_2[number_cols] = output_table_2[number_cols].astype(np.float64)

knio.output_tables[0] = knio.Table.from_pandas(output_table_2)
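If a hard 0/1 prediction is needed in addition to the probabilities, it can be derived from the P1 column before the table is handed back to KNIME. A minimal sketch (the 0.5 cutoff and the "Prediction" column name are assumptions, not part of the original workflow):

# hypothetical post-processing step: turn the class-1 probability
# into a hard 0/1 prediction with a fixed cutoff
output_table_2['Prediction'] = (output_table_2['P1'] >= 0.5).astype(int)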
--------

Workflow overview

- Propagate the Python environment for KNIME on macOS (Apple Silicon) or Windows with Miniforge / Miniconda (the default configuration just checks the environment name), locate and create the /data/ folder with absolute paths, and determine the installed package versions.
- Read train.parquet and build the XGBoost model in a Python Script node, storing it in /data/ as "knime_xgboost_model.json" (flow variable v_model_json_file; edit the number of steps as needed).
- Read test.parquet and apply the model twice: once keeping the original "Target" from the test file to evaluate the model, and once with the Target dropped, as if this were completely new data.
- For comparison, train KNIME's native XGBoost Tree Ensemble Learner (stored as knime_xgboost_model.zip) and a Gradient Boosted Trees model converted to PMML (knime_model_gbm.pmml).
- Score all models with the Binary Classification Inspector and collect the results, sorted by AUC in descending order.
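
The "determine package versions" step could be a small Python Script node along these lines (a sketch, not the exact node configuration; the column names are assumptions):

import knime.scripting.io as knio
import pandas as pd
import numpy
import sklearn
import xgboost

# collect the versions of the relevant packages into a small KNIME table
versions = pd.DataFrame({
    "package": ["xgboost", "scikit-learn", "numpy", "pandas"],
    "version": [xgboost.__version__, sklearn.__version__,
                numpy.__version__, pd.__version__],
})
knio.output_tables[0] = knio.Table.from_pandas(versions)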
