Global Feature Importance

This component computes Global Feature Importance for classification models using up to four different techniques.

The component additionally offers an optional interactive view to explore the results (Right Click > Open Interactive View).

The model to be explained needs to be captured within a Workflow Object via Integrated Deployment.

The data provided should contain instances that the model can process to compute predictions. Ideally, provide a sample similar to a test or validation set: representative of the entire distribution and never used during training.

Note that using a surrogate model to explain a GLM, a Logistic Regression, a Decision Tree or a Random Forest is possible but not recommended.

Available Global Feature Importance methods/techniques:

A) GLOBAL SURROGATE MODELS:

Surrogate models are interpretable models that are trained to mimic the behaviour of the original model by overfitting its predictions. The intuition is that if the interpretable surrogate model can reproduce the predictions of the original model, then it can be used to understand how the input features are connected to those predictions. The quality of the surrogate models is estimated with the user-defined performance metric.
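
The surrogate idea can be illustrated with a minimal Python sketch. It is not the component's implementation: `black_box`, `fit_stump` and the random data are hypothetical stand-ins, and plain accuracy is assumed as the performance metric. The point it shows is that the surrogate is fitted to the original model's predictions rather than to the ground truth.

```python
import random

def black_box(row):
    """Hypothetical stand-in for the original model: returns class 0 or 1."""
    x1, x2 = row
    return 1 if 0.8 * x1 + 0.2 * x2 > 0.5 else 0

def fit_stump(rows, labels):
    """Fit a one-split decision stump that best reproduces the given labels.

    Returns (accuracy, feature index, threshold)."""
    best = (0.0, 0, 0.0)
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows}):
            preds = [1 if r[j] > threshold else 0 for r in rows]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            if acc > best[0]:
                best = (acc, j, threshold)
    return best

random.seed(0)
data = [(random.random(), random.random()) for _ in range(200)]
# The surrogate learns the model's predictions, not the ground truth.
targets = [black_box(r) for r in data]
fidelity, feature, threshold = fit_stump(data, targets)
print(f"surrogate splits on feature {feature} with fidelity {fidelity:.2f}")
```

A high fidelity suggests the surrogate's structure (here, the single feature it splits on) is a reasonable summary of what drives the original model's predictions.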

Before training the surrogate models:
- missing values are replaced with the most frequent value for categorical columns or with the mean for numerical columns;
- optionally the categorical columns with too many unique values can be removed based on a user-defined parameter;
- numerical features are converted to double and normalized using min-max normalization.
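
The pre-processing steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names (`impute`, `unique_ratio`, `min_max`), the use of `None` for missing values, and the handling of constant columns are all hypothetical and not taken from the component.

```python
from collections import Counter
from statistics import mean

def impute(column):
    """Replace None with the mean (numerical) or most frequent value (categorical)."""
    present = [v for v in column if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in column]

def unique_ratio(column):
    """Percentage of distinct values in a column, for the optional column filter."""
    return 100.0 * len(set(column)) / len(column)

def min_max(column):
    """Convert values to float and rescale them to [0, 1]."""
    values = [float(v) for v in column]
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to 0.0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

age = impute([20, None, 40, 60])
city = impute(["Berlin", None, "Berlin", "Zurich"])
age_scaled = min_max(age)
```

A categorical column would then be dropped when `unique_ratio(column)` exceeds the user-defined maximum percentage.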

Three interpretable models are available:

A1) Surrogate Generalized Linear Model (GLM):
The GLM is trained with the KNIME H2O Machine Learning Integration with optimized parameters “lambda” and “alpha”. The family (model type) is binomial for binary classification and multinomial for multinomial classification. The GLM coefficients measure feature importance. If there are categorical features, the surrogate GLM is not trained, because they would reduce its interpretability.

A2) Surrogate Decision Tree Model:
The Decision Tree is trained with the optimized parameter “Min number records per node”. The Decision Tree structure indicates the importance of the top-level features, since they separate the data into classes in the best way.

A3) Surrogate Random Forest Model:
The Random Forest is trained with optimized parameters “Tree Depth”, “Number of models” and “Minimum child node size”. The importance of a feature is calculated by counting how many times it has been selected for a split, and at which rank (level), among all available features (candidates) in the trees of the random forest.
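
A split-counting importance of this kind could look like the sketch below. The exact weighting used by the component is not stated here, so the formula is an assumption: for each of the top levels, the number of times a feature was chosen is divided by the number of times it was offered as a candidate, and the ratios are summed. The function name, the `max_level` parameter and the example splits are hypothetical.

```python
from collections import defaultdict

def split_based_importance(splits, max_level=2):
    """splits: iterable of (level, chosen_feature, candidate_features) over all trees."""
    chosen = defaultdict(int)
    offered = defaultdict(int)
    for level, feature, candidates in splits:
        if level > max_level:
            continue  # only the top levels contribute in this sketch
        chosen[(feature, level)] += 1
        for candidate in candidates:
            offered[(candidate, level)] += 1
    importance = defaultdict(float)
    for (feature, level), n_offered in offered.items():
        # Assumed weighting: ratio of splits to candidacies, summed over levels.
        importance[feature] += chosen[(feature, level)] / n_offered
    return dict(importance)

# Hypothetical splits collected from a two-tree forest.
splits = [
    (0, "age", ["age", "income"]),
    (1, "income", ["income", "city"]),
    (0, "age", ["age", "city"]),
    (1, "city", ["income", "city"]),
]
scores = split_based_importance(splits)
```

In this toy forest, "age" is chosen every time it is offered at the root level, so it receives the highest score.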

B) PERMUTATION FEATURE IMPORTANCE:

Permutation feature importance measures the difference between the model performance score estimated on predictions using all the original features and the score estimated on predictions in which one feature has been randomly permuted. If a feature is permuted several times, the average difference is calculated. The process is repeated for each feature. The standard deviation of the score difference across permutations is provided as an additional output.
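
The procedure can be sketched in a few lines of Python. This is a minimal illustration, not the component's implementation: `toy_model`, the random data, the choice of accuracy as the performance metric and the use of the population standard deviation are all assumptions made for the example.

```python
import random
from statistics import mean, pstdev

def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_importance(model, rows, labels, n_permutations=5, seed=0):
    """Return {feature index: (mean score drop, std of score drop)}."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    result = {}
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_permutations):
            shuffled = [r[j] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild each row with only feature j replaced by a shuffled value.
            permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        result[j] = (mean(drops), pstdev(drops))
    return result

def toy_model(row):
    """Hypothetical classifier that only looks at feature 0."""
    return 1 if row[0] > 0.5 else 0

data_rng = random.Random(1)
rows = [(data_rng.random(), data_rng.random()) for _ in range(200)]
labels = [toy_model(r) for r in rows]
importance = permutation_importance(toy_model, rows, labels)
```

Because `toy_model` ignores feature 1, permuting it leaves the score unchanged and its importance is zero, while permuting feature 0 produces a clear drop in accuracy.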

More information at:
Molnar, Christoph. "Interpretable machine learning", 2019.
christophm.github.io/interpretable-ml-book

Options

Activate interactive view:
If selected, the component creates an interactive view.
Feature columns selection:
Select the columns that the original model used as input features during training. Accepted column types: Number (Integer), Number (double), Number (long) and String.
Surrogate models data pre-processing: maximum percentage of unique values in a categorical column:
Categorical columns whose percentage of unique values is higher than the defined value will be removed. By default, no columns are filtered.
For the method "Permutation Feature Importance": the number of permutations:
Select how many times each feature should be permuted. Permuting a feature several times and taking the average score difference can provide more stable results but will increase the execution time.
Show top n features:
Select the number of most important features to visualize. For the remaining features, their average importance is shown. This setting only affects the visualization: the importance of all features is returned at the data output port.
Importance methods:
Select the available method(s) to compute feature importance.
Performance metric:
Select the performance metric that should be used to:
- optimize surrogate models,
- evaluate surrogate models performance,
- evaluate a score difference after a feature permutation.
Target column and focus class:
Select which String column your original model predicts. Make sure that the column containing predictions in the input Workflow Object is called "Prediction ()".
- For binary classification, select a positive class.
- For multinomial classification, select one class of interest.

Input Ports

Production Workflow containing input model, stored as a Workflow Object via Integrated Deployment nodes
Data from Test Set Partition with available Target (Ground Truth) column

Output Ports

Table with Global Feature Importance measured by different techniques.
