This Component automatically trains supervised machine learning models for both binary and multiclass classification. It automates the whole ML cycle: data preparation, parameter optimization with cross validation, scoring, evaluation and selection. The Component also captures the entire end-to-end process and outputs the deployment workflow using the KNIME Integrated Deployment Extension.
STEP-BY-STEP GUIDE:
- Drag and drop the Component from KNIME Hub to KNIME Analytics Platform.
- Connect it to your data table of features and target column. Consider using a subsample first.
- IMPORTANT! Execute all upstream nodes.
- Double-click the Component to open its dialog.
- Save your settings with “OK” and execute the Component.
- Wait for the models to be trained, tuned, validated, etc. and for the best one to be selected and exported.
- Connect a Workflow Executor or Workflow Writer node to the Component output to reuse the model.
- (OPTIONAL) Right-click the Component: “Component” > “Open” to inspect our implementation and customize it.
- (IF PREVIOUSLY ENABLED) Right-click the Component: “Open Interactive View: AutoML” to inspect all trained models. Selecting one manually (with “Apply&Close” in the controls at the bottom right corner of the local View) unfortunately requires training all models again.
DATA PREPARATION:
Before the models are trained, the data is cleaned: missing values are replaced with the most frequent value in categorical columns and with the mean in numerical columns. Optionally, the categorical data can be one-hot encoded, and columns with too many unique values are removed based on a user-defined parameter. Numerical features are all converted to double and normalized using min-max normalization. The data is automatically split into train and test partitions using stratified sampling on the target class and an 80% split. The data preparation models are stored for deployment, both for pre-processing and post-processing the data around the model predictor.
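The cleaning steps described above can be illustrated outside KNIME with a minimal Python sketch. It is not the Component's implementation: the helper names (impute, min_max, stratified_split), the sample values and the assumption that the 80% share goes to the training partition are all made up for illustration.

```python
import random
from collections import Counter, defaultdict

def impute(values, numeric):
    """Replace None with the mean (numeric) or the most frequent value (categorical)."""
    present = [v for v in values if v is not None]
    if numeric:
        fill = sum(present) / len(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

def min_max(values):
    """Scale numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_idx, test_idx), keeping class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

# Hypothetical columns: one numerical, one categorical, plus a target.
age = min_max([float(v) for v in impute([20, None, 40, 30, 30], numeric=True)])
color = impute(["red", None, "blue", "red", "red"], numeric=False)
labels = ["yes"] * 10 + ["no"] * 5
train_idx, test_idx = stratified_split(labels)
```

In deployment the fill values and the min/max bounds learned on the training data would have to be stored and reapplied to new rows, which is the role of the stored data preparation models mentioned above.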
MODEL TRAINING:
Each model has a number of parameters that are tuned using cross validation and the user-defined evaluation metric on the training data. The extent of the parameter optimization, the optimization strategy and other settings of the model can be changed directly in the Component.
- Naive Bayes: trained with optimized parameter “Default probability”.
- Logistic Regression: trained with optimized parameter “Step size”.
- Neural Network: an Rprop Multi-layer Perceptron (MLP) trained with optimized parameters “Number of hidden layers” and “Number of hidden neurons per layer”.
- Gradient Boosted Trees: trained with optimized parameter “Number of trees”.
- Decision Tree: trained with optimized parameter “Min number records per node”.
- Random Forest: trained with optimized parameters “Tree Depth”, “Number of models” and “Minimum child node size”.
- XGBoost Trees: trained with optimized parameters “eta” and “max depth”.
- Generalized Linear Model (H2O): trained with the KNIME H2O Machine Learning Integration with optimized parameters “lambda” and “alpha”.
- Deep Learning (Keras): trained with KNIME Deep Learning - Keras Integration with no parameter optimization and two simple architectures for binary and multiclass classification.
A few simple heuristics are in place to shape the network architecture and Keras training process based on the size of the input data.
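The idea of tuning a parameter with cross validation on the training data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy 1-D nearest-neighbour classifier and its "neighbors" parameter stand in for the models and parameters listed above, accuracy stands in for the user-defined metric, and an exhaustive grid is just one possible optimization strategy.

```python
def k_fold_indices(n, k):
    """Split range(n) into k folds by striding (rows assumed already shuffled)."""
    return [list(range(i, n, k)) for i in range(k)]

def knn_predict(train_x, train_y, x, neighbors):
    """Toy 1-D nearest-neighbour classifier with a majority vote."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    votes = [train_y[i] for i in order[:neighbors]]
    return max(set(votes), key=votes.count)

def cv_accuracy(xs, ys, neighbors, k=3):
    """Mean accuracy over k folds for one candidate parameter value."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        tr_x = [x for i, x in enumerate(xs) if i not in held_out]
        tr_y = [y for i, y in enumerate(ys) if i not in held_out]
        hits = sum(knn_predict(tr_x, tr_y, xs[i], neighbors) == ys[i] for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / len(scores)

def grid_search(xs, ys, candidates):
    """Score every candidate value and return the best one with all scores."""
    scored = {c: cv_accuracy(xs, ys, c) for c in candidates}
    best = max(scored, key=scored.get)
    return best, scored

# Hypothetical training data: two well-separated classes on one feature.
xs = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
ys = ["a"] * 6 + ["b"] * 6
best, scored = grid_search(xs, ys, candidates=(1, 3, 5))
```

Only the training partition is used here; the test partition stays untouched until the models are compared.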
MODEL SCORING AND SELECTION:
After the training of the specified models is completed and all models are stored in a single table, the system applies each model to the test set. The predictions of all models are scored against the ground truth and several performance metrics are computed. The best model is selected using the performance metric specified by the user.
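As a rough illustration of this step, the following Python sketch ranks models by a metric computed on test-set predictions. The function names, the predictions and the use of accuracy are assumptions; treating a model that outputs a single class as a static predictor only approximates the "static_prediction_models" behaviour listed under AUTOML OUTPUT METADATA.

```python
def accuracy(predictions, truth):
    """Share of test rows where the prediction matches the ground truth."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)

def select_best(predictions_by_model, truth, metric=accuracy):
    """Rank models by the metric; set aside models that predict one class only."""
    static = [name for name, preds in predictions_by_model.items()
              if len(set(preds)) == 1]
    ranked = sorted(
        ((name, metric(preds, truth))
         for name, preds in predictions_by_model.items() if name not in static),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked, static

# Hypothetical test-set predictions for three models.
truth = ["a", "b", "a", "b", "a"]
predictions = {
    "Decision Tree": ["a", "b", "a", "a", "a"],
    "Naive Bayes": ["a", "b", "b", "a", "a"],
    "Logistic Regression": ["a", "a", "a", "a", "a"],
}
ranked, static = select_best(predictions, truth)
best_model = ranked[0][0]
```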
DEPLOYMENT WORKFLOW:
The data pre-processing, the best model and the data post-processing are captured via the KNIME Integrated Deployment Extension. The end-to-end encapsulated workflow is provided at the output of the Component and can be used to score raw new data in deployment. Connect it to a Workflow Writer node or a Workflow Executor node to reuse the trained model wherever needed.
AUTOML OUTPUT METADATA:
The Component additionally outputs flow variables for advanced users.
- "metric_auto" (String) : the name of the user-defined performance metric.
- "target_column" (String) : the name of the user-defined target column.
- "positive" (String) : the positive class used in binary classification.
- "exported_model" (String) : the best model that was selected.
- "exported_model_params" (String Array) : list of the optimized parameter names and values for the exported model.
- "trained_models" (String Array) : list of all the selected models that were successfully trained and ranked by "metric_auto" metric.
- "trained_metrics" (Double Array) : list of the "metric_auto" metrics for all "trained_models".
- "failed_models" (String Array) : list of all selected models that failed during training or testing.
- "static_prediction_models" (String Array) : models always predicting the majority class are discarded and listed here.
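A small Python sketch of how these variables might be consumed downstream, assuming they have been made available as a plain dictionary. The function name and all values are hypothetical, and the one-to-one ordering between "trained_models" and "trained_metrics" is an assumption based on the descriptions above.

```python
def summarize_automl(flow_variables):
    """Pair each trained model with its score and report the exported model."""
    ranking = list(zip(flow_variables["trained_models"],
                       flow_variables["trained_metrics"]))
    lines = [f"metric: {flow_variables['metric_auto']}",
             f"exported: {flow_variables['exported_model']}"]
    lines += [f"  {name}: {score:.3f}" for name, score in ranking]
    skipped = (flow_variables["failed_models"]
               + flow_variables["static_prediction_models"])
    if skipped:
        lines.append("skipped: " + ", ".join(skipped))
    return ranking, lines

# Hypothetical values, for illustration only.
example = {
    "metric_auto": "Accuracy",
    "exported_model": "Random Forest",
    "trained_models": ["Random Forest", "Decision Tree"],
    "trained_metrics": [0.91, 0.87],
    "failed_models": [],
    "static_prediction_models": ["Naive Bayes"],
}
ranking, report = summarize_automl(example)
print("\n".join(report))
```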