K-means LIME

This is an implementation of K-LIME, the model explanation technique developed by H2O.ai, built with the KNIME H2O Machine Learning Integration. For more information about the K-LIME machine learning interpretability technique, please refer to the H2O.ai documentation:

h2o.ai/wp-content/uploads/2017/09/driverlessai/interpreting.html#k-lime

The component clusters the input data and builds local linear models to explain the predictions of a complex black-box model. The number of clusters is chosen to optimize the R^2 value summed over the models in all clusters. The number of linear models built is much lower than in the LIME algorithm, where a model is built in the neighbourhood of each explanation instance. Thus, the algorithm is expected to be faster for large data samples.
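The clustering and local model fitting described above can be sketched in a few lines of Python. This is a minimal illustration and not the component's actual workflow: it assumes NumPy, a plain Lloyd's k-means, and ordinary least squares, and the function names (`kmeans`, `fit_linear`, `k_lime`) are hypothetical. It also assumes that candidate cluster counts are compared by a single R^2 computed over all rows from the combined per-cluster predictions, which is only one possible reading of the selection criterion.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every row to every center.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def k_lime(X, preds, max_k=5):
    """Try k = 1..max_k and keep the clustering whose local models fit best."""
    preds = np.asarray(preds, dtype=float)
    ss_tot = ((preds - preds.mean()) ** 2).sum()  # assumes preds are not constant
    best = None
    for k in range(1, max_k + 1):
        labels = kmeans(X, k)
        surrogate = np.empty(len(preds))
        models = {}
        for j in np.unique(labels):
            mask = labels == j
            coef = fit_linear(X[mask], preds[mask])
            models[int(j)] = coef
            # Surrogate prediction = intercept + features times local coefficients.
            surrogate[mask] = coef[0] + X[mask] @ coef[1:]
        # Assumed criterion: overall R^2 of the combined surrogate predictions.
        r2 = 1.0 - ((preds - surrogate) ** 2).sum() / ss_tot
        if best is None or r2 > best["score"]:
            best = {"k": k, "labels": labels, "models": models, "score": r2}
    return best
```

In this sketch, the local coefficients stored in `models` play the role of the explanation: for a given row, each coefficient multiplied by the feature value gives that feature's contribution to the surrogate prediction.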

Note that this implementation of K-LIME cannot handle missing values: all rows with missing values will be dropped, so no explanations will be available for them. If you want explanations for all input rows, please fill in missing values before feeding the data into the component.
Also note that all columns except the selected prediction column will be used to interpret the prediction. Therefore, you have to filter out all irrelevant columns, as well as the original target column, prior to feeding a table to the input.
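As a rough illustration of the preparation steps above, the following Python sketch removes irrelevant columns and reports which rows would be dropped because of missing values. The table, the column names (`customer_id`, `churn`, `prediction`), and the helper `prepare_table` are all hypothetical; inside KNIME, the same result would normally be achieved with column filter and missing value nodes.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def prepare_table(rows, irrelevant_columns):
    """Drop irrelevant columns, then separate out rows that still have gaps."""
    prepared, dropped_indices = [], []
    for index, row in enumerate(rows):
        slim = {name: value for name, value in row.items()
                if name not in irrelevant_columns}
        if any(is_missing(value) for value in slim.values()):
            dropped_indices.append(index)  # no explanation for this row
        else:
            prepared.append(slim)
    return prepared, dropped_indices

# Hypothetical input: an ID column, two features, the original target, predictions.
rows = [
    {"customer_id": 1, "age": 34.0, "plan": "basic", "churn": 0, "prediction": 0.12},
    {"customer_id": 2, "age": None, "plan": "premium", "churn": 1, "prediction": 0.81},
    {"customer_id": 3, "age": 51.0, "plan": "basic", "churn": 1, "prediction": 0.64},
]
prepared, dropped = prepare_table(rows, irrelevant_columns={"customer_id", "churn"})
```

Here the second row would receive no explanation unless its missing `age` value is filled in first.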

Categorical features will be converted into a numerical representation using One-Hot Encoding (OHE), also called "dummy" or "binary" encoding. Numerical features will be scaled as required by linear models. The only pre-processing steps that might be required from the user are removing outliers and filling in missing values, as those might bias the surrogate models used by the algorithm.
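The two transformations can be illustrated with a small standard-library Python sketch. The helper names are hypothetical, and the use of z-score standardization is an assumption: the description only states that numerical features are scaled, not which scaling is applied.

```python
from statistics import mean, pstdev

def one_hot(values):
    """Return one 0/1 indicator column per distinct category."""
    categories = sorted(set(values))
    return {category: [1.0 if value == category else 0.0 for value in values]
            for category in categories}

def standardize(values):
    """Z-score scaling (assumed here): zero mean and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant column carries no signal
    return [(value - mu) / sigma for value in values]

plan_columns = one_hot(["basic", "premium", "basic"])
age_scaled = standardize([34.0, 45.0, 51.0])
```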

Options

Do lambda search in GLM fit?
Choose whether to use lambda search in the GLM fit. Ridge regression is used. If this option is disabled, the regularisation strength is fixed at zero; otherwise, a search for the optimal lambda is performed. Past experiments show that disabling this option often yields a surrogate model that better describes the black-box model.
Select the column containing predictions
Predictions of the model. Predictions can be numeric values only: either the predicted target in the case of regression or the probabilities in the case of classification. Multi-class classification is not supported at this stage.
Maximum number of clusters
A search for the optimal number of clusters in k-means will be performed in the range from 1 up to this value. This is a very important parameter. Consider trying a different value if the output surrogate model does not do a good job of describing the input predictions.
Minimum cluster size
If a cluster is smaller than this, the global prediction will be used instead of the cluster model.
Max n features in k-means
The maximum number of best features (as defined by a surrogate Random Forest model) to be used in the clustering.
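To make the effect of the "Minimum cluster size" option concrete, here is a minimal Python sketch of the fallback rule. The function name `choose_model_source` and the string labels are hypothetical; the sketch only mirrors the stated rule that rows in clusters smaller than the threshold receive the global prediction instead of the cluster model's prediction.

```python
from collections import Counter

def choose_model_source(cluster_labels, min_cluster_size):
    """For each row, decide whether the cluster model or the global model applies."""
    sizes = Counter(cluster_labels)
    return [
        "cluster" if sizes[label] >= min_cluster_size else "global"
        for label in cluster_labels
    ]

# Cluster 1 has a single row, so it falls back to the global model.
sources = choose_model_source([0, 0, 0, 1], min_cluster_size=2)
```

A larger threshold pushes more rows toward the global model, which is more stable but less local.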

Input Ports

Table with features and predictions.

Output Ports

Original and surrogate predictions as well as individual feature contributions.
Cluster and global prediction per input row.