AutoML

This Component automatically trains supervised machine learning models for binary and multiclass classification. It automates the whole ML cycle: data preparation, parameter optimization with cross validation, scoring, evaluation and model selection. It also captures the entire end-to-end process and outputs the deployment workflow using the KNIME Integrated Deployment Extension.

STEP-BY-STEP GUIDE:
- Drag and drop the Component from KNIME Hub to KNIME Analytics Platform.
- Connect with your data table of features and target column. Consider using a subsample first.
- IMPORTANT! Execute all up-stream nodes.
- Double-click the Component to open its dialog and adjust the settings.
- Save your settings with “OK” and execute the Component.
- Wait for the models to be trained, tuned and validated, and for the best one to be selected and exported.
- Connect a Workflow Executor or Workflow Writer node to the Component output to reuse the model.
- (OPTIONAL) Right-click the Component: “Component” > “Open” to inspect our implementation and customize it.
- (IF PREVIOUSLY ENABLED) Right-click the Component: “Open Interactive View: AutoML” to inspect all trained models. Selecting one manually (with “Apply&Close” in the controls at the bottom right corner of the local View) unfortunately requires training all models again.

DATA PREPARATION:
Before the models are trained, the data is cleaned: missing values are replaced with the most frequent value in categorical columns and with the mean in numerical columns. Optionally, categorical data can be one-hot encoded, and categorical columns with too many unique values are removed based on a user-defined threshold. All numerical features are converted to double and normalized using min-max normalization. The data is automatically split into train and test partitions using stratified sampling on the target class, with 80% of the rows used for training unless a different partition size is set in the dialog. The data preparation models are stored for deployment, so that the same pre-processing and post-processing can be applied around the model predictor.
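The cleaning and normalization steps described above can be illustrated outside KNIME with a minimal Python sketch. This is not the Component's implementation: the function names `impute_column` and `min_max_normalize` are hypothetical, missing values are represented as `None`, and each column is assumed to contain at least one non-missing value.

```python
from collections import Counter
from statistics import mean


def impute_column(values):
    """Fill None entries: mean for numeric columns, most frequent value otherwise."""
    # Assumption: the column holds at least one non-missing value.
    present = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Convert to float and rescale to [0, 1]; a constant column maps to 0.0."""
    doubles = [float(v) for v in values]
    low, high = min(doubles), max(doubles)
    if high == low:
        return [0.0 for _ in doubles]
    return [(v - low) / (high - low) for v in doubles]


ages = impute_column([20, None, 40])                    # numeric: None -> mean
colors = impute_column(["red", None, "red", "blue"])    # categorical: None -> mode
scaled = min_max_normalize(ages)
```

In deployment the fill values and the min/max bounds would have to come from the training data and be reapplied to new rows, which is why the Component stores the data preparation models.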

MODEL TRAINING:
Each model has a number of parameters that are tuned using cross validation and the user-defined evaluation metric on the training data. The extent of the parameter optimization, the optimization strategy and other model settings can be changed directly in the Component.

- Naive Bayes: trained with optimized parameter “Default probability”.
- Logistic Regression: trained with optimized parameter “Step size”.
- Neural Network: an Rprop Multi-layer Perceptron (MLP) trained with optimized parameters “Number of hidden layers” and “Number of hidden neurons per layer”.
- Gradient Boosted Trees: trained with optimized parameter “Number of trees”.
- Decision Tree: trained with optimized parameter “Min number records per node”.
- Random Forest: trained with optimized parameters “Tree Depth”, “Number of models” and “Minimum child node size”.
- XGBoost Trees: trained with optimized parameters “eta” and “max depth”.
- Generalized Linear Model (H2O): trained with the KNIME H2O Machine Learning Integration with optimized parameters “lambda” and “alpha”.
- Deep Learning (Keras): trained with KNIME Deep Learning - Keras Integration with no parameter optimization and two simple architectures for binary and multiclass classification.
A few simple heuristics are in place to shape the network architecture and Keras training process based on the size of the input data.
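The idea of tuning a parameter with k-fold cross validation can be sketched in plain Python. Everything here is an illustrative assumption: a toy one-dimensional nearest-neighbour classifier stands in for the models listed above, accuracy stands in for the user-defined metric, the search is a brute-force loop over candidates, and the folds are assigned round-robin without shuffling. The Component's actual optimization strategy may differ.

```python
from collections import Counter


def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k folds (round-robin, no shuffling)."""
    return [list(range(fold, n_rows, k)) for fold in range(k)]


def cross_validate(fit, predict, param, X, y, k):
    """Mean accuracy of one parameter value over k train/validation splits."""
    scores = []
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train], param)
        hits = sum(predict(model, X[i]) == y[i] for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / k


def tune(fit, predict, candidates, X, y, k):
    """Brute-force search: return the candidate with the best CV accuracy."""
    return max(candidates, key=lambda p: cross_validate(fit, predict, p, X, y, k))


# Toy stand-in model (hypothetical): 1-D nearest neighbours with majority vote.
def fit_knn(X, y, n_neighbors):
    return (list(zip(X, y)), n_neighbors)


def predict_knn(model, x):
    points, n_neighbors = model
    nearest = sorted(points, key=lambda p: abs(p[0] - x))[:n_neighbors]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


X = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]
y = ["a", "a", "b", "a", "a", "b", "b", "b", "b", "b"]  # x=3 is a noisy label
best_n_neighbors = tune(fit_knn, predict_knn, [1, 3], X, y, k=5)
```

The tuned value is then used to train the final model on the whole training partition; the test partition stays untouched until scoring.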

MODEL SCORING AND SELECTION:
After all specified models have been trained and stored in a single table, each model is applied to the test set. The predictions of all models are scored against the ground truth and several performance metrics are computed. The best model is selected using the performance metric specified by the user.
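As a rough illustration of this scoring and selection step, the sketch below scores the test-set predictions of several models against the ground truth and ranks them. The function names, the example predictions and the use of accuracy as the metric are assumptions made for the example, not values taken from the Component.

```python
def accuracy(truth, predicted):
    """Fraction of rows where the prediction matches the ground truth."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def rank_models(truth, predictions_by_model, metric=accuracy):
    """Score every model and return (name, score) pairs, best first."""
    scores = {name: metric(truth, preds)
              for name, preds in predictions_by_model.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical test-set labels and predictions.
truth = ["yes", "no", "yes", "no"]
predictions = {
    "Decision Tree": ["yes", "no", "no", "no"],
    "Naive Bayes": ["yes", "yes", "no", "no"],
    "Random Forest": ["yes", "no", "yes", "no"],
}

ranking = rank_models(truth, predictions)
best_model, best_score = ranking[0]
```

Passing a different `metric` function changes the ranking criterion without touching the rest of the logic, which mirrors how the user-selected metric drives the selection.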

DEPLOYMENT WORKFLOW:
The data pre-processing, the best model and the data post-processing are captured via the KNIME Integrated Deployment Extension. The end-to-end encapsulated workflow is provided at the output of the Component and can be used to score raw new data in deployment. Connect a Workflow Writer node or Workflow Executor node to reuse the trained model wherever needed.

AUTOML OUTPUT METADATA:
The Component additionally outputs flow variables for advanced users.
- "metric_auto" (String) : the name of the user-defined performance metric.
- "target_column" (String) : the name of the user-defined target column.
- "positive" (String) : the positive class used in binary classification.
- "exported_model" (String) : the best model that was selected.
- "exported_model_params" (String Array) : list of the optimized parameter names and values for the exported model.
- "trained_models" (String Array) : list of all the selected models that were successfully trained, ranked by the "metric_auto" metric.
- "trained_metrics" (Double Array) : list of the "metric_auto" metric values for all "trained_models".
- "failed_models" (String Array) : list of all selected models that failed during training or testing.
- "static_prediction_models" (String Array) : models always predicting the majority class are discarded and listed here.
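To make the "static_prediction_models" variable more concrete, here is a small sketch of how such models could be detected: a model is treated as static when every one of its predictions equals the majority class of the ground truth. The function name and the example data are hypothetical, and the Component's own check may be defined differently.

```python
from collections import Counter


def split_static_models(truth, predictions_by_model):
    """Separate models that only ever predict the majority class."""
    majority = Counter(truth).most_common(1)[0][0]
    kept, static = {}, []
    for name, preds in predictions_by_model.items():
        if all(p == majority for p in preds):
            static.append(name)   # discarded, but reported to the user
        else:
            kept[name] = preds
    return kept, static


# Hypothetical imbalanced test set and predictions.
truth = ["no", "no", "no", "yes"]
predictions = {
    "Logistic Regression": ["no", "no", "no", "no"],
    "XGBoost Trees": ["no", "no", "yes", "yes"],
}

kept, static_prediction_models = split_static_models(truth, predictions)
```

Such a check matters mostly on imbalanced data, where a model that always predicts the majority class can still reach a high accuracy.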

Options

Enable One Hot Encoding of String Columns:
By checking this box all columns of Domain "String", that is categorical features, are one hot encoded. The resulting Double columns are going to replace all String columns during training. DISCLAIMER: For Neural Network and Deep Learning (Keras) models this setting is necessary if you are providing only String columns.
Activate Interactive View:
If selected the Component creates an interactive view to browse the models ranked by the selected metric.
Feature Column Selection:
Select the columns which the model should use as input features during training. Excluded columns are discarded and won't be used at all in the workflow. Domains accepted: Number (integer), Number (double), Number (long) and String.
Target Column:
Select which String column you want to predict.
Number of Folds in Cross Validation:
A k-fold cross validation takes place in the various parameter optimization phases. Insert the number of folds here.
Size of Training Set Partition (%):
Enter the size of the train set in percentage (%) to define the number of rows that will be used to train the models. The Test set partition is defined by the remaining rows (100% - defined value). Stratified sampling on the target class is performed.
Maximum Amount of Unique Values in a Categorical Column:
Categorical columns with more than this number of unique values will be removed. This setting ensures you are not starting an endless training process because you forgot to remove columns such as RowIDs.
Models to Train:
Select which machine learning algorithms should be used in the AutoML process.
Metric for Auto Selection:
Select the performance metric that should be used to automatically select the best model and tune the hyperparameters.
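Two of the options above, the maximum number of unique values and the training set partition size, can be illustrated with a short Python sketch. The function names, the `seed` parameter and the column-as-list data layout are assumptions for the example; columns whose values are all strings are treated as categorical.

```python
import random
from collections import defaultdict


def drop_high_cardinality(columns, max_unique):
    """Remove categorical (all-string) columns with more than max_unique values."""
    kept = {}
    for name, values in columns.items():
        is_categorical = all(isinstance(v, str) for v in values)
        if is_categorical and len(set(values)) > max_unique:
            continue
        kept[name] = values
    return kept


def stratified_split(labels, train_percent, seed=0):
    """Return (train, test) row indices, sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_percent / 100)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


table = {
    "row_id": ["r1", "r2", "r3", "r4"],     # unique per row, should be dropped
    "color": ["red", "blue", "red", "blue"],
    "age": [31, 45, 27, 52],
}
features = drop_high_cardinality(table, max_unique=3)

labels = ["a"] * 10 + ["b"] * 5
train_rows, test_rows = stratified_split(labels, train_percent=80)
```

Because each class is split on its own, the class proportions in the train and test partitions stay close to those of the full table.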

Input Ports

A KNIME Table with data rows with input features and ground truth.

Output Ports

The best trained model stored in a model port with KNIME Integrated Deployment. Connect this output port to either the Workflow Writer or Workflow Executor node.
