SHAP Loop Start

SHAP is an acronym for SHapley Additive exPlanations and represents a unified approach to explaining the predictions of any machine learning model. For a single output (e.g. the probability of the positive class in a binary classification), it assigns each feature a so-called Shapley Value that quantifies how much this particular feature changed the output. If the model has multiple outputs, one such set of Shapley Values is calculated per output. The Shapley Values for a single output sum up to the deviation from the mean prediction (also called the null prediction), i.e. the prediction the model would have made if no feature had been available. KNIME Analytics Platform also offers a second means to calculate Shapley Values via the Shapley Values loop nodes. In contrast to these, SHAP also allows you to find sparse explanations via regularization with the LASSO. The advantage is that you can pick the maximum number of features you want to have in your explanation, which makes explanations far more understandable in cases with hundreds or thousands of features. If a maximum number of features is specified, SHAP finds, for each explainable row, those features that have the most impact on its prediction and then considers only those when calculating the Shapley Values.
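As a hypothetical illustration of this additivity, consider a binary classifier with three features whose mean prediction over the sampling data is 0.40 (all numbers and feature names are made up):

    mean prediction (null prediction)     0.40
    Shapley Value of feature "age"       +0.30
    Shapley Value of feature "income"    +0.20
    Shapley Value of feature "city"      -0.05
    prediction for the row                0.40 + 0.30 + 0.20 - 0.05 = 0.85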

Usage

The first input table of this node contains the rows of interest (ROIs), for which an explanation is required. The SHAP algorithm replaces certain subsets of features of a ROI and observes how the model output changes. These replacement features are taken from the second input table. Note that, in contrast to the Shapley Values and LIME loops, this sampling table should not be much larger than 100 rows in order to keep the runtime reasonable (SHAP is usually still on a par with the other methods).

The output of the SHAP Loop Start node contains only those columns specified as feature columns in the dialog. This table has to be predicted by the model whose predictions you want to understand better, and the result has to be fed into the SHAP Loop End node to calculate the explanations. Note that the SHAP loop has n + 1 iterations, where n is the number of ROIs (rows of the first input table). The first iteration is special: it does not explain a ROI like the other iterations but is used to estimate the mean prediction by letting the model predict the sampling table (second input of the SHAP Loop Start).

In the loop body, use your model to predict the data produced by the SHAP Loop Start node and feed the table containing the appended predictions into the SHAP Loop End node. Note that SHAP can only explain numerical predictions, so in case of a classification task you have to configure your predictor to output probabilities.
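For reference, the same Kernel SHAP procedure can be sketched outside of KNIME with the Python shap library. The snippet below is only a minimal sketch, not a description of this node's internals; the dataset, model, and all variable names are hypothetical placeholders:

    # Minimal Kernel SHAP sketch with the Python "shap" library.
    # Dataset, model, and names are hypothetical placeholders.
    import pandas as pd
    import shap
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=5, random_state=0)
    X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(5)])
    model = RandomForestClassifier(random_state=0).fit(X, y)

    background = X.sample(100, random_state=0)  # "sampling table": keep it around 100 rows
    rows_of_interest = X.iloc[:5]               # the ROIs to be explained

    # SHAP explains numerical outputs, so we pass the positive-class probability.
    explainer = shap.KernelExplainer(lambda d: model.predict_proba(d)[:, 1], background)
    shap_values = explainer.shap_values(rows_of_interest, nsamples=1000)

    # Additivity check: mean prediction + Shapley Values == prediction per ROI.
    print(explainer.expected_value + shap_values.sum(axis=1))

    # Sparse explanations via the LASSO: keep only the 3 most impactful features per row.
    sparse = explainer.shap_values(rows_of_interest, nsamples=1000, l1_reg="num_features(3)")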

Options

Feature columns
The columns you want to evaluate as features. Typically, these are the columns your model needs to perform a prediction. If you only want to evaluate a subset of the columns your model needs, check the "Retain non-feature columns" option. Note that if the retain option is not set, only the feature columns are contained in the output of this node. IMPORTANT: The number of features has a strong impact on the number of rows produced per row in the first input table and hence on the runtime. Make sure that only columns actually used by your model are selected as feature columns.
Every collection column represents a single feature
Typically, collection columns and vectors hold a large number of features (e.g. a bit vector where each position indicates the presence or absence of a word in a document), but it is also possible that a collection/vector represents only a single feature. If this box is checked, SHAP treats each collection column as a single feature.
Retain non-feature columns
If this option is set, all non-feature columns of the current ROI are appended to the rows in the output table. This is useful if you want to evaluate only a subset of the actual features your model uses.
Explanation set size
The maximum number of samples SHAP is allowed to use for its estimations. Ideally, SHAP would evaluate all possible feature subsets (excluding the empty set and the full set), but there are 2^f - 2 such subsets, where f is the number of features your model uses. Hence, if the explanation set size m is smaller than 2^f - 2, SHAP enumerates as many subsets as possible and samples from the remaining subsets until the maximum explanation set size is reached. Note that the size of this node's output table is m * n, where n is the size of the sampling table.
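A quick back-of-the-envelope example with made-up numbers:

    f = 10 features          ->  2^10 - 2 = 1022 possible subsets
    m = 1000 (explanation set size)
    n = 100  (sampling rows) ->  m * n = 100,000 rows per loop iteration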
Sampling weight
Since the size of the output table directly depends on the number of rows in the sampling table, it is recommended to use at most about 100 rows to keep the runtime and table size reasonable. However, 100 rows might not be enough to capture a complex dataset. SHAP proposes to overcome this issue by calculating a k-Means clustering and using the cluster centroids as representatives. Because different clusters might have different sizes, each centroid receives a weight proportional to the number of rows it represents. This weight is then used by SHAP when estimating the SHAP values. The weight column may only contain values larger than zero (e.g. the number of rows in a cluster). The weights need not sum up to 1, as SHAP performs this normalization internally. If no weight column is specified, SHAP assigns the same weight to each row in the sampling table.
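A minimal sketch of this k-Means preprocessing in Python, assuming scikit-learn; the data frame, column names, k = 50, and the weight column name are illustrative only:

    # Summarize a large dataset into 50 weighted representatives for the sampling table.
    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.normal(size=(5000, 5)),
                     columns=[f"feat_{i}" for i in range(5)])

    kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X)
    sampling_table = pd.DataFrame(kmeans.cluster_centers_, columns=X.columns)

    # Weight each centroid by the number of rows in its cluster (all values > 0;
    # SHAP normalizes the weights internally, so they need not sum to 1).
    sampling_table["weight"] = np.bincount(kmeans.labels_, minlength=50)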
Use seed
Using a seed allows you to reproduce the results of the loop. If this box is checked, the seed displayed in the text box is used; otherwise a new seed is generated for each execution.

Input Ports

Table containing the rows to be explained.
Table containing rows for sampling.

Output Ports

This table contains rows that have to be predicted by the predictor node corresponding to the model whose predictions you want to explain.

Views

This node has no views
