LIME Loop Start

LIME stands for Local Interpretable Model-agnostic Explanations. It explains individual predictions of a black box model by training a local surrogate model that is easier to understand (e.g. a linear model). The intuition behind this approach is that a globally nonlinear model might actually be linear within a small local region of the feature space. To learn such a local surrogate model, LIME creates a dataset of perturbed rows for a single row of interest, obtains predictions for these rows from the black box model, and then learns a local surrogate that approximates those predictions. For more details on the algorithm, see the paper "Why Should I Trust You?": Explaining the Predictions of Any Classifier by Ribeiro et al.

Usage

The top input of this node contains the rows of interest for which the predictions of your model should be explained. Each row in the top table corresponds to one loop iteration, so its size will directly affect the runtime of the loop. The bottom input table is used for sampling, which, in this case, means that column statistics are calculated for all of the feature columns. These statistics are later used to sample new values for the feature columns.
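
The sampling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the node's implementation: the function name `sample_numeric` is hypothetical, and the use of the population standard deviation as the spread of the normal distribution is an assumption. The `around_instance` flag mirrors the "Sample around instances" option described under Options.

```python
import random
import statistics


def sample_numeric(sampling_column, roi_value, n, around_instance, seed=None):
    """Draw n normally distributed samples for one numeric feature column.

    sampling_column: values of the feature in the sampling table
    roi_value:       value of the feature in the row of interest
    around_instance: center on roi_value if True, else on the column mean
    """
    rng = random.Random(seed)  # a fixed seed makes the samples reproducible
    mean = statistics.fmean(sampling_column)
    std = statistics.pstdev(sampling_column)  # assumption: population std
    center = roi_value if around_instance else mean
    return [rng.gauss(center, std) for _ in range(n)]


# Same seed, same samples
first = sample_numeric([1.0, 2.0, 3.0], 2.5, 5, around_instance=True, seed=42)
second = sample_numeric([1.0, 2.0, 3.0], 2.5, 5, around_instance=True, seed=42)
print(first == second)
```

Only the center of the distribution changes with the flag; the spread comes from the sampling table in both cases.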

In each iteration of the loop, one row of interest is explained. This node produces two tables used for these explanations. The top table contains rows created by sampling according to the statistics of the feature columns in the sampling table. Note that numeric columns (including bit and byte vectors) are assumed to be normally distributed. This table has to be predicted with the Predictor node that matches your model. The bottom table is intended for training a local surrogate model (e.g. a linear model). It differs from the top table as follows:

  1. Nominal feature columns are replaced by Double columns where a 1.0 indicates that the sampled value matches that of the row of interest.
  2. Bit and byte vector columns are split up into multiple columns, one for each element.
  3. A weight column is appended, which indicates how similar the sampled row is to the row of interest. A higher value indicates greater similarity.
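
As a rough illustration of the first two differences, the following Python sketch converts one sampled row into the surrogate representation. The function `encode_row`, the dictionary-based row layout, and the `name_index` naming of split vector columns are all hypothetical and chosen for this example only; the weight column is left out here.

```python
def encode_row(sampled, roi, nominal_cols, vector_cols):
    """Encode one sampled row for surrogate training.

    sampled, roi: dicts mapping column name to value
    nominal_cols: names of nominal feature columns
    vector_cols:  names of bit/byte vector feature columns
    """
    encoded = {}
    for name, value in sampled.items():
        if name in nominal_cols:
            # 1.0 if the sampled value matches the row of interest, else 0.0
            encoded[name] = 1.0 if value == roi[name] else 0.0
        elif name in vector_cols:
            # one column per vector element (naming scheme is an assumption)
            for i, element in enumerate(value):
                encoded[f"{name}_{i}"] = float(element)
        else:
            # numeric columns are kept as they are
            encoded[name] = float(value)
    return encoded


roi = {"color": "red", "age": 25, "bits": [0, 0]}
sampled = {"color": "red", "age": 30, "bits": [1, 0]}
print(encode_row(sampled, roi, {"color"}, {"bits"}))
```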

The loop body should do the following:
  1. Predict the top table with the black box model (predictions must be numerical, i.e. for a classification model, the class probabilities).
  2. Append the prediction column(s) to the bottom table.
  3. Train a local surrogate model that uses the features from the bottom table, weights each row according to the weight column, and approximates the predictions of the black box model. The currently recommended Learner for this task is the H2O Generalized Linear Model Learner (Regression).
  4. Extract and collect the local explanations from the local surrogate model (e.g. the linear coefficients) in one of our Loop End nodes.
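
To make step 3 concrete, here is a minimal sketch of a weighted linear surrogate in plain Python. It is a stand-in for illustration only: it solves ordinary weighted least squares through the normal equations and is not the H2O Generalized Linear Model Learner, which may apply regularization and other options. The function name `fit_weighted_linear` is hypothetical.

```python
def fit_weighted_linear(X, y, weights):
    """Weighted least squares; returns [intercept, coef_1, ..., coef_k]."""
    rows = [[1.0] + [float(v) for v in x] for x in X]  # prepend intercept term
    k = len(rows[0])
    # Normal equations: (A^T W A) beta = A^T W y
    A = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(k)]
         for i in range(k)]
    b = [sum(w * r[i] * t for r, t, w in zip(rows, y, weights)) for i in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular system; features may be collinear")
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution
    beta = [0.0] * k
    for i in reversed(range(k)):
        s = sum(A[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (b[i] - s) / A[i][i]
    return beta


# Features from the bottom table, black box predictions, and row weights
X = [[0, 0], [1, 0], [0, 1], [1, 1]]
y = [1.0, 3.0, 0.0, 2.0]          # generated by y = 1 + 2*x1 - 1*x2
weights = [1.0, 0.5, 0.5, 0.25]
print(fit_weighted_linear(X, y, weights))
```

The returned coefficients play the role of the local explanation in step 4: each one indicates how strongly the corresponding feature moves the black box prediction near the row of interest.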

Note on vector columns

Since the number of elements in a vector column is not known during configuration, the spec for the second table can't be generated if vectors are among the feature columns. In this case downstream nodes can only be configured once this node has been executed.

Options

Feature columns
The feature columns used by your model. These columns will be contained in the top table that has to be predicted by your model. For non-vector columns, the bottom table will also contain one column per feature, where nominal columns are replaced by numeric columns.
Retain non-feature columns
If this option is set, all non-feature columns of the current row of interest are appended to the rows in the first output table of this node. This is useful if you want to evaluate only a subset of the actual features your model uses. Note that the second output table is not affected by this option.
Explanation set size
The number of rows to use for learning the local surrogate model for a single incoming row of interest.
Sample around instances
If checked, samples for numerical columns are drawn around the value of the current row of interest. Otherwise samples are drawn around the mean of the feature (which is calculated from the sampling table).
Use seed
Using a seed allows you to reproduce the results of the loop. If this box is checked the seed displayed in the text box is used, otherwise a new seed is generated for each execution.
Use element names for vector features
Vector columns such as bit and byte vectors can contain names for their individual elements. If this option is set, these names are used where possible, i.e. if the number of element names matches the element count. If this option is not set, or the number of names doesn't match the number of elements, new names based on the vector name are created.
Manual kernel width
LIME uses an exponential kernel to calculate the similarity of a sampled row to the row being explained. The exponential kernel is defined as sqrt(exp(-(d^2) / w^2)), where d is the Euclidean distance between the two data points and w is the kernel width. Intuitively, the kernel width controls how local the surrogate model is. A larger kernel width means that a larger region around the row being explained is considered. By default, the kernel width is sqrt(number of features) * 0.75; checking this box lets you provide a custom kernel width instead.
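
The kernel formula and the default width can be written directly in Python. This is a small sketch of the formula as stated above; the function names are hypothetical and not part of the node.

```python
import math


def default_kernel_width(num_features: int) -> float:
    """Default width: sqrt(number of features) * 0.75."""
    return math.sqrt(num_features) * 0.75


def kernel_weight(distance: float, width: float) -> float:
    """Exponential kernel: sqrt(exp(-(d^2) / w^2))."""
    return math.sqrt(math.exp(-(distance ** 2) / width ** 2))


width = default_kernel_width(4)   # 4 features give a width of 1.5
print(kernel_weight(0.0, width))  # identical rows get the maximum weight
print(kernel_weight(1.0, width))  # weight decreases with distance
```

For a fixed distance, a larger width yields a larger weight, so distant samples influence the surrogate more and the explanation becomes less local.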

Input Ports

Table containing the rows to be explained.
Table containing rows used to perturb rows in the first table.

Output Ports

This table contains samples that have to be predicted by the Predictor node corresponding to your particular model.
This table contains the data used to learn a local surrogate model, including a weight column. The name of the column holding the weights is output as a flow variable named weightColumnName.

Views

This node has no views
