AutoML (Regression)

This Component automatically trains supervised machine learning models for regression. The component is able to automate the whole ML cycle by performing some data preparation, parameter optimization with cross validation, scoring, evaluation and selection. The component also captures the entire end-to-end process and outputs the deployment workflow using the KNIME Integrated Deployment Extension.

For solving an ML classification task, check instead the “AutoML” component (kni.me/c/33fQGaQzuZByy6hE).

STEP-BY-STEP GUIDE:
- Drag&drop the Component from KNIME Hub to KNIME Analytics Platform.
- Connect it to your data table of features and target column. Consider using a subsample first.
- IMPORTANT! Execute all up-stream nodes.
- Double-click the Component to open its dialogue.
- Save your settings with “OK” and execute the Component.
- Wait for models to train, tune, validate, etc. and the best one to be selected and exported.
- Connect the Workflow Executor/Writer node to the Component output to reuse the model.
- (OPTIONAL) Right-click the Component: “Component” > “Open” to inspect our implementation and customize it.
- (IF PREVIOUSLY ENABLED) Right-click the Component: “Open Interactive View: AutoML” to inspect all trained models. Selecting one manually (with “Apply&Close” in the controls at the bottom right corner of the local View) unfortunately requires training all models again.

DATA PREPARATION:
Before training the models, the data is cleaned by replacing missing values with the most frequent value for categorical columns and with the mean for numerical columns. Optionally, the categorical data can be one-hot encoded, and columns with too many unique values are removed based on a user-defined parameter. Numerical features and the target are all converted to double and normalized using Z-score normalization. The data is automatically split into train and test partitions using random sampling, with the size of the train partition set in the Component dialogue (see “Size of Training Set Partition (%)”). The data preparation models are stored for deployment, both for pre-processing and post-processing the data around the model predictor.
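The imputation and normalization steps described above can be sketched in plain Python. This is only an illustration, not the Component's implementation: the function names `impute` and `z_score` are hypothetical, and the use of the population standard deviation is an assumption.

```python
from collections import Counter
from statistics import mean, pstdev

def impute(column):
    """Replace None with the mean (numeric) or the most frequent value (categorical)."""
    present = [v for v in column if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in column]

def z_score(column):
    """Convert to float and rescale to zero mean and unit standard deviation.

    Assumption: population standard deviation; the Component may differ.
    The mean and standard deviation are returned so the same transformation
    can be reapplied (or inverted for the target) at deployment time.
    """
    values = [float(v) for v in column]
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0 for _ in values], mu, sd
    return [(v - mu) / sd for v in values], mu, sd

ages = impute([20, None, 40])                   # numeric: filled with the mean
colors = impute(["red", None, "red", "blue"])   # categorical: filled with the mode
scaled, mu, sd = z_score(ages)
```

Returning `mu` and `sd` alongside the scaled values mirrors the idea of storing the data preparation models, so that predictions on the normalized target can be converted back to the original scale after scoring.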

MODEL TRAINING:
Each model has a number of parameters to be tuned using cross validation and the user-defined evaluation metric on train data. The extent of the parameter optimization, the optimization strategy as well as other settings of the model can be changed directly in the Component.
- Regression Tree: trained with optimized parameter “Min number records per node”
- Linear Regression: trained with default parameters
- Polynomial Regression: trained with optimized parameter “Polynomial degree”
- H2O Generalized Linear Model: trained with the KNIME H2O Machine Learning Integration with optimized parameters “alpha” and “lambda”
- XGBoost Linear Ensemble: trained with optimized parameters “alpha” and “lambda”
- XGBoost Tree Ensemble: trained with optimized parameters “eta” and “max depth”
- Gradient Boosted Trees: trained with optimized parameter “Number of trees”
- Random Forest: trained with optimized parameters “Tree Depth”, “Number of models” and “Minimum child node size”
- Deep Learning (Keras): trained with KNIME Deep Learning - Keras Integration with no parameter optimization and a simple architecture for regression determined with a few simple heuristics.
- H2O AutoML: trained with the KNIME H2O Machine Learning Integration, which uses H2O AutoML to train a group of models and select the best one
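As a rough illustration of tuning one parameter with cross validation, the sketch below runs a grid search over a penalty value for a toy one-feature model. Everything here is an assumption made for the example: the toy model `fit_ridge_1d`, the candidate values, the mean squared error metric and the grid search itself (the Component lets you choose the optimization strategy and the metric).

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, valid_idx) pairs; each row is validated exactly once."""
    for fold in range(n_folds):
        valid = [i for i in range(n_rows) if i % n_folds == fold]
        train = [i for i in range(n_rows) if i % n_folds != fold]
        yield train, valid

def fit_ridge_1d(xs, ys, lam):
    """Toy model (hypothetical stand-in): slope of y = w * x with L2 penalty lam."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(xs, ys, lam, n_folds):
    """Mean squared error averaged over the validation folds."""
    fold_errors = []
    for train, valid in k_fold_indices(len(xs), n_folds):
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        fold_errors.append(sum((ys[i] - w * xs[i]) ** 2 for i in valid) / len(valid))
    return sum(fold_errors) / len(fold_errors)

def tune(xs, ys, candidates, n_folds=5):
    """Grid search: return the candidate with the lowest cross-validated error."""
    return min(candidates, key=lambda lam: cv_error(xs, ys, lam, n_folds))

xs = [float(i) for i in range(1, 11)]
ys = [2.0 * x for x in xs]          # noiseless data, so no penalty is needed
best_lam = tune(xs, ys, candidates=[0.0, 1.0, 10.0])
```

Only the train partition would be passed to `tune`; the test partition stays untouched until the final scoring step.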

MODEL SCORING AND SELECTION:
After the training of the specified models is completed and all models are stored in a single table, the system applies each model to the test set. The predictions of all models are scored against the ground truth and several performance metrics are computed. The best model is selected using the performance metric specified by the user.
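The scoring and selection step can be pictured with the following sketch. The metric names in `METRICS`, the helper `rank_models` and the sample predictions are hypothetical; the point is only that some metrics are minimized and others maximized, so the ranking direction depends on the chosen metric.

```python
from math import sqrt

def rmse(y_true, y_pred):
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Each entry: (scoring function, True if higher values are better).
METRICS = {"RMSE": (rmse, False), "MAE": (mae, False), "R2": (r2, True)}

def rank_models(y_true, predictions, metric):
    """Score every model on the same test set and sort from best to worst."""
    score, higher_is_better = METRICS[metric]
    scored = [(name, score(y_true, preds)) for name, preds in predictions.items()]
    return sorted(scored, key=lambda item: item[1], reverse=higher_is_better)

y_test = [1.0, 2.0, 3.0, 4.0]
predictions = {
    "Linear Regression": [1.1, 2.1, 2.9, 3.9],
    "Regression Tree": [1.5, 1.5, 3.5, 3.5],
}
ranking = rank_models(y_test, predictions, "RMSE")
best_model = ranking[0][0]   # first entry of the ranking is the selected model
```

The ranked names and scores correspond conceptually to the "trained_models" and "trained_metrics" flow variables, and the first entry to "exported_model".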

DEPLOYMENT WORKFLOW:
The data pre-processing, the best model and the data post-processing are captured via the KNIME Integrated Deployment Extension. The end-to-end encapsulated workflow is provided at the output of the Component and it can be used to score raw new data in deployment. Connect to the Workflow Writer node or the Workflow Executor node to reuse the trained model wherever needed.

AUTOML OUTPUT METADATA:
The Component additionally outputs flow variables for advanced users.
- "metric_auto" (String) : the name of the user-defined performance metric.
- "target_column" (String) : the name of the user-defined target column.
- "exported_model" (String) : the best model that was selected.
- "exported_model_params" (String Array) : list of the optimized parameters names and values for the exported model.
- "trained_models" (String Array) : list of all the selected models that were successfully trained and ranked by "metric_auto" metric.
- "trained_metrics" (Double Array) : list of the "metric_auto" metrics for all “trained_models”.
- "failed_models" (String Array) : list of all selected models that failed during training or testing.
- "extreme_preds_models" (String Array) : models that had at least one prediction out of range are additionally listed here when the “Remove Extreme Predictions” setting is on.

Options

Enable One Hot Encoding of String Columns:
By checking this box all columns of Domain "String", that is categorical features, are one hot encoded. The resulting Double columns are going to replace all String columns during training. DISCLAIMER: For Deep Learning (Keras) and Polynomial Regression models this setting is necessary if you are providing only String columns.
Activate Interactive View:
If selected the Component creates an interactive view to browse the models ranked by the selected metric.
Remove Extreme Predictions:
Models performing a regression task can output values that are simply unrealistic given the domain of the target value: either too great or too small. Keeping even a handful of 'extreme predictions' is going to impact the measured performance of the model. We enabled a system to produce a model which automatically removes the extreme predictions based on the input target distribution. To evaluate all models on the same test set, predictions that are extreme for at least one model are removed for all models before computing performance. Please note that the component does not automatically remove outliers in the input data. When the output workflow object is applied to new data, extreme predictions are replaced with missing values. See “Extreme Predictions Range” to understand how extreme predictions are detected.
Feature Column Selection:
Select the columns which the model should use as input features during training. Excluded columns are discarded and won't be used at all in the workflow. Domain accepted: Number (Integer), Number (Double), Number (Long) and String.
Target Column:
Select which column of Numeric type you want to predict.
Extreme Predictions Range:
The non-negative constant k that will be used as a parameter for detection and removal of extreme values among the predictions. Extreme predictions will be detected using the target column distribution of the train partition and will be replaced by missing values. The range of normal predictions is the following: mean +- k * sd, where mean and sd are the mean and standard deviation of the target column in the train partition. Setting k to 0 will replace all the predictions. Setting k to 1.5 will remove many predictions and not only the extreme ones. Setting k >= 3 should remove only the most extreme cases. Deactivate the removal of extreme predictions by using the “Remove Extreme Predictions” setting.
Number of Folds in Cross Validation:
A k-fold cross validation takes place in the various parameter optimization phases. Insert the number of folds here.
Size of Training Set Partition (%):
Enter the size of the train set in percentage (%) to define the number of rows that will be used to train the models. The Test set partition is defined by the remaining rows (100% - defined value). Random sampling is performed.
Maximum Amount of Unique Values in a Categorical Column:
Categorical columns with more than this amount of unique values will be removed. This setting ensures you are not starting an endless training process because you forgot to remove columns such as RowIDs.
Models to Train:
Select which machine learning algorithms should be used in the AutoML process. The H2O AutoML is going to train even more model types and ensembles: if selected, your machine might become slow for a maximum of 2 minutes.
Metric for Auto Selection:
Select the performance metric that should be used to automatically select the best model and tune the hyperparameters.
Output Settings:
Select the output format of the captured workflow created by the Component. By "features" we mean the columns selected by the user in the component configuration under "Feature Column Selection". By “prepared” we mean features processed from raw format to the format required by the model or the user. Any extra and unexpected column not recognized as a feature, such as an additional label or identifier, can still be provided to the captured workflow and it will be kept at its output no matter what you select here.
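The rule given under “Extreme Predictions Range” (mean +- k * sd on the train target) can be sketched as follows. The function names are hypothetical, and two details are assumptions rather than documented behavior: the sample standard deviation is used, and the range boundaries are treated as exclusive so that k = 0 replaces every prediction.

```python
from statistics import mean, stdev

def normal_range(train_target, k):
    """Range of normal predictions: mean +- k * sd of the train target."""
    mu, sd = mean(train_target), stdev(train_target)  # assumption: sample sd
    return mu - k * sd, mu + k * sd

def replace_extremes(preds, low, high):
    """Replace predictions outside the open range (low, high) with None (missing)."""
    return [p if low < p < high else None for p in preds]

def rows_kept_for_all_models(cleaned_by_model):
    """Indices of rows where no model produced an extreme prediction.

    Scoring only these rows evaluates every model on the same test rows.
    """
    columns = list(cleaned_by_model.values())
    n_rows = len(columns[0])
    return [i for i in range(n_rows) if all(col[i] is not None for col in columns)]

train_target = [10.0, 20.0, 30.0, 40.0, 50.0]
low, high = normal_range(train_target, k=3)
cleaned = {
    "model_a": replace_extremes([25.0, 500.0, 35.0], low, high),
    "model_b": replace_extremes([27.0, 31.0, -100.0], low, high),
}
shared_rows = rows_kept_for_all_models(cleaned)
```

Here the second row is extreme for `model_a` and the third for `model_b`, so only the first row would be used when comparing the two models.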

Input Ports

A KNIME Table with data rows with input features and ground truth.

Output Ports

The best trained model stored in a Workflow Object port of KNIME Integrated Deployment Extension. Connect this output port to either the Workflow Writer or Workflow Executor node.
