Shapley Values Loop Start

Shapley Values originated in game theory and have recently become a popular tool for explaining the predictions of machine learning models. The Shapley Value of a feature for a given row and prediction indicates how much that feature contributed to the deviation of the prediction from the base prediction (i.e. the mean prediction over the full sampling data). In theory, the Shapley Values of all features add up to the difference between the base prediction and the actual prediction, but this loop only produces approximations because calculating the exact Shapley Values is typically infeasible.
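
As an illustration of this additivity property (the notation below is ours and does not appear in the node documentation): writing f for the model, x for the row of interest with M features, phi_i(x) for the Shapley Value of feature i and phi_0 for the base prediction, the exact values satisfy

    f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i(x),
    \qquad
    \phi_0 = \frac{1}{|D|} \sum_{z \in D} f(z),

where D is the sampling data; the values produced by this loop satisfy this equation only approximately.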

Usage

A typical Shapley Values loop consists of only three nodes: the Shapley Values Loop Start node, the predictor node for the model you want to explain (e.g. a Random Forest Predictor node), and the Shapley Values Loop End node.

For each row in the ROI (Row of Interest) table, the Shapley Values Loop Start node creates a number of perturbed rows, i.e. rows in which some of the features are randomly exchanged with the features of rows from the sampling table (for the exact details see Algorithm 1 in the paper Explaining prediction models and individual predictions with feature contributions by Strumbelj and Kononenko). Your task is to obtain predictions for these perturbed rows (usually via the Predictor node corresponding to your model). The Shapley Values Loop End node collects these predictions and calculates an approximation of the Shapley Values for each feature-target combination.
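
The following is a rough, hedged sketch of that sampling scheme in Python. It is not the node's actual implementation: the names (predict, sampling_data, ...) are made up, the node additionally uses Sobol-based row sampling, and within the KNIME loop the predictions are supplied by the predictor node rather than by an inlined function call.

    # Rough sketch of the sampling approximation (Algorithm 1 in Strumbelj &
    # Kononenko). All names are illustrative and not part of the node's API.
    import numpy as np

    def approximate_shapley(predict, x, sampling_data, iterations, seed=None):
        """Approximate the Shapley Value of every feature of the ROI `x`."""
        rng = np.random.default_rng(seed)
        n_features = x.shape[0]
        phi = np.zeros(n_features)
        for i in range(n_features):
            for _ in range(iterations):
                # Pick a row from the sampling table and a random feature order.
                z = sampling_data[rng.integers(len(sampling_data))]
                order = rng.permutation(n_features)
                pos = int(np.where(order == i)[0][0])
                keep_from_x = np.isin(np.arange(n_features), order[:pos + 1])
                # x_plus takes feature i (and everything preceding it in the
                # permutation) from x and the rest from z; x_minus additionally
                # takes feature i from z.
                x_plus = np.where(keep_from_x, x, z)
                x_minus = x_plus.copy()
                x_minus[i] = z[i]
                phi[i] += predict(x_plus) - predict(x_minus)
        return phi / iterations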

A note on collections and vectors

These nodes support collection and vector columns such as List, Bit Vector and Byte Vector columns, in which case each element of the collection/vector can be treated as an individual feature. Note that this requires all collections/vectors within a single column to have the same length, i.e. contain the same number of elements. It is also possible to treat collections and vectors as single features, in which case the respective option has to be set in the dialog.
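
As a purely illustrative sketch of the difference between the two modes (pandas and the column names are our own choice and have nothing to do with the node's internals):

    # A fixed-length List column can either be expanded so that every position
    # becomes its own feature, or be treated as one single feature.
    import pandas as pd

    df = pd.DataFrame({"id": [1, 2], "vec": [[0.1, 0.5, 0.9], [0.3, 0.2, 0.7]]})

    # Every element as an individual feature: vec -> vec[0], vec[1], vec[2].
    # This only works because both lists have the same length (3).
    elements = pd.DataFrame(df["vec"].tolist(),
                            columns=[f"vec[{i}]" for i in range(3)],
                            index=df.index)
    per_element = pd.concat([df.drop(columns="vec"), elements], axis=1)

    # "Every collection column represents a single feature": the column stays
    # as it is and the whole list is kept or exchanged during perturbation.
    per_collection = df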

Options

Feature columns
The columns you want to evaluate as features. Typically, these are the columns your model needs to perform a prediction. If you only want to evaluate a subset of the columns your model needs, set the "Retain non-feature columns" option. Note that if the retain option is not set, only the feature columns will be contained in the output of this node. IMPORTANT: The number of features has a strong impact on the number of rows produced per row in the first input table and hence on the runtime. Make sure that only columns used by your model are selected as feature columns.
Every collection column represents a single feature
If checked, collection and vector columns are treated as a whole, i.e. the entire collection is either kept or exchanged, as opposed to changing individual positions within the collection.
Retain non-feature columns
If this option is set, all non-feature columns of the current ROI are appended to the rows in the output table. This is useful if you want to evaluate only a subset of the actual features your model uses.
Iterations per feature
How often the Shapley Value is sampled for each individual feature. This directly affects the runtime of the loop, since for each row in the first input table 1 + 2 * number of features * number of iterations per feature rows are created and have to be predicted (see the worked example after the Advanced Options below). The higher this number, the better the approximation, but the longer the runtime.
Use seed
Using a seed allows you to reproduce the results of the loop. If this box is checked, the seed displayed in the text box is used; otherwise a new seed is generated for each execution. The "New" button allows you to generate a new random seed. Note that the row sampling process (i.e. how rows are selected from the sampling table) is based on the quasi-random Sobol sequence and is therefore deterministic even if no seed is used. The random seed only applies to the permutation of features in Algorithm 1 from Explaining prediction models and individual predictions with feature contributions by Strumbelj and Kononenko. A sketch of this split between deterministic row sampling and seeded permutations follows below.
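
A minimal sketch of that split, using NumPy and SciPy purely for illustration (assumed behaviour; the node's internal code may differ):

    import numpy as np
    from scipy.stats import qmc

    n_sampling_rows = 1024

    # Row selection: an unscrambled Sobol sequence is quasi-random but fully
    # deterministic, so the same indices are drawn on every execution.
    sobol = qmc.Sobol(d=1, scramble=False)
    row_indices = (sobol.random(8).ravel() * n_sampling_rows).astype(int)

    # Feature permutation: this is the only part controlled by the seed option.
    rng = np.random.default_rng(seed=42)
    feature_order = rng.permutation(10)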

Advanced Options

Chunk size
Since every row in the input may result in a very large number of rows in the output of the loop start node, this option allows you to specify how many input rows are handled per loop iteration. The output of the loop start will contain chunk size * (1 + 2 * number of features * number of iterations per feature) rows, unless fewer than chunk size rows are left to process (see the worked example below).
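
A worked example of the two formulas above (the numbers are made up): with 10 feature columns, 100 iterations per feature and a chunk size of 5,

    rows per ROI row   = 1 + 2 * 10 * 100 = 2001
    rows per iteration = 5 * 2001         = 10005

so every loop iteration produces 10005 rows that have to be predicted.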

Input Ports

Table containing the rows to be explained (the ROI table).
Table containing rows used to perturb the rows in the first table (the sampling table).

Output Ports

Perturbed rows.

Views

This node has no views
