Binary Classification Inspector

This node produces a complex view made of four different charts in order to compare, optimize and select predictions of different binary classifiers:

  • Compare a number of binary classifier machine learning models predicting the same target on the same test data using performance metrics and ROC curves
  • Optimize a model by finding the best threshold given a performance metric of your choice
  • Interactively select a given type of predictions (e.g. true positives) of one of the models and export them at the output of the node

The user journey when using this view follows these steps:

  1. Compare model AUC to select the best model via the Model's Statistics bar chart and the Model's ROC Curves chart in the top panel.
  2. Change the threshold from its initial value either manually, via the Threshold Slider, or automatically, by maximizing one of the available functions (e.g. F-Measure) via the Model Tab Dropdown.
  3. Look back at the top panel to see how the new threshold impacts the model when compared with the other models.
  4. Inspect the Confusion Matrix in the bottom panel to assess the gravity of the misclassifications, given the associated probability confidence of the model on the Classification Distribution chart.
  5. Combine this view with other KNIME views in a Component to interactively visualize different types of predictions (e.g. false positives) with interactive selection events.
  6. After interaction, use the "Apply" features to export the new threshold and selected model from the node as flow variables. This will also export: A) the selected model predictions, B) the selection of the confusion matrix cells, and C) the selected model performance statistics.
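
Step 1 above compares models by AUC. As a rough illustration of what that comparison amounts to (not the node's actual implementation), the following Python sketch computes AUC with the pairwise rank formulation and picks the model with the highest value. The function name `auc`, the model names, and the sample data are hypothetical.

```python
def auc(y_true, scores, positive):
    """Area under the ROC curve via the pairwise rank formulation."""
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative row")
    # Count positive/negative pairs that are ranked correctly; ties count half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical test data: one target column, one probability column per model.
truth = ["yes", "yes", "no", "no", "yes", "no"]
models = {
    "model_a": [0.9, 0.8, 0.3, 0.4, 0.6, 0.1],
    "model_b": [0.7, 0.4, 0.5, 0.2, 0.6, 0.3],
}

aucs = {name: auc(truth, probs, "yes") for name, probs in models.items()}
best_model = max(aucs, key=aucs.get)
print(aucs, best_model)
```

The pairwise form is quadratic in the number of rows, so it is only suitable for small examples like this one.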

The node supports custom CSS styling. You can simply put CSS rules into a single string and set it as a flow variable 'customCSS' in the node configuration dialog. You will find the list of available classes and their description on our documentation page.

Options

Main

Execution critical options affecting the success and output of the node.

Maximum number of rows
This option allows only a subset of the input data to be processed for the statistics and the view. This limit will be the maximum number of rows used to calculate the statistics. However, if using the thresholds to apply new labels to columns, ALL entries will be labeled using this threshold, not just the maximum number chosen with this option.
Target column
Choose the target column containing the ground truth or actual values for each row. This column should be a compatible Nominal value column (containing Strings or Boolean) and it must have only two possible values in order to properly configure and execute this node.
Positive class value
Choose the class value of the positive or target class. This value will be used during relabeling, and the alternate class (the non-selected class) will be used as the negative class value.
Model prediction columns
Choose the model prediction columns containing the positive class probabilities output by the models. These columns should contain values from 0 to 1 representing the model's confidence that the row is a member of the positive class.
Retain non-prediction columns in the output table
Check this option if you want to keep all input columns in the output table along with any new columns. Un-check this option if you want to keep ONLY the selected model prediction columns, the ground truth column and any new columns appended during execution.
Append new predictions for all models
Select this option if you want to have a column appended for each of the selected model prediction columns. Each column will be named in the format "new_classification_" + the name of the model prediction column and will contain a new prediction based on the chosen threshold for that model. If the threshold for that model is modified in the view and then these settings are applied, the labels will be applied based on this new threshold. If the threshold is not modified in the view, then the threshold calculated using the chosen "Initial threshold methods" option in the dialog will be used.
The thresholds used for the labeling of each appended column can be found in the "Thresholds" column for the corresponding model in the 2nd output table.
Append and label prediction column for the selected model
Selecting this option will append one column to the output table containing the newly labeled data based on the selected model and threshold in the view. The column will be named "Selected Model Predictions" by default but can be changed with the "Label" option below.
If no model is selected in this view, then the column will contain only missing values.
The chosen model column name will also be output in a flow variable named "chosen_model". This variable will be equal to "null" if no model was selected.
The threshold for the chosen model will also be output as a flow variable named "chosen_threshold". This variable will be equal to "NaN" if no model was selected.
This column is appended independently of the prediction columns appended when the "Append new predictions for all models" option is enabled.
Label
This option defines the prefix of the column appended when the "Append and label prediction column for the selected model" option is enabled. The default value is "Selected Model Predictions", which is combined with "_" and the name of the selected column (if one is chosen in the view). If no column is chosen in the view, the appended name will be "missing".
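
The naming rule and the two flow variables described above can be sketched as follows. This is one possible reading, not the node's code: it assumes that "missing" replaces the model name after the prefix when no model is selected, and it models the "NaN" threshold as a floating-point NaN. The function name `selected_model_outputs` is hypothetical.

```python
import math


def selected_model_outputs(label_prefix, selected_model, threshold):
    """Build the appended column name and the exported flow variables."""
    has_model = selected_model is not None
    # Assumption: "missing" stands in for the model name when none is selected.
    column_name = label_prefix + "_" + (selected_model if has_model else "missing")
    flow_variables = {
        "chosen_model": selected_model if has_model else "null",
        "chosen_threshold": threshold if has_model else math.nan,
    }
    return column_name, flow_variables


name, variables = selected_model_outputs("Selected Model Predictions", "P (model_a)", 0.42)
print(name, variables)

empty_name, empty_variables = selected_model_outputs("Selected Model Predictions", None, None)
print(empty_name, empty_variables)
```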
Initial threshold methods
This section allows the user to choose the starting threshold value for all of the models chosen. The "Max" options allow for the user to use unique thresholds for each of the models, set to the value that maximizes their respective statistic measure. The options include "Max Youden's Index", "Max Sensitivity", "Max Specificity", "Max F-Measure", "Max Accuracy", and "Max Precision". The option to define a "Custom" initial threshold value is also available and will be applied uniformly to the predictions of all models. Any changes made to thresholds in the view (and subsequently saved) will override these options. The thresholds used in calculating statistics for each model can be found in the "Threshold" column of the 2nd output table.
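
As an illustration of what a "Max" method does, the following Python sketch scans a grid of candidate thresholds and keeps the one that maximizes a chosen statistic. It is a sketch under stated assumptions rather than the node's algorithm: the candidate grid of 0.01 steps, the rule that a probability equal to the threshold counts as positive, and the tie-breaking (lowest threshold wins) are all assumptions, and the function names are hypothetical.

```python
def confusion(y_true, probs, positive, threshold):
    """Return (tp, fp, tn, fn) for one threshold."""
    tp = fp = tn = fn = 0
    for actual, p in zip(y_true, probs):
        predicted_positive = p >= threshold  # assumed boundary convention
        if actual == positive:
            tp, fn = tp + predicted_positive, fn + (not predicted_positive)
        else:
            fp, tn = fp + predicted_positive, tn + (not predicted_positive)
    return tp, fp, tn, fn


def youden_index(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity + specificity - 1.0


def f_measure(tp, fp, tn, fn):
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(y_true, probs, positive, statistic, steps=100):
    """Pick the grid threshold with the highest statistic (first one on ties)."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda t: statistic(*confusion(y_true, probs, positive, t)))


# Hypothetical data for a single model.
truth = ["yes", "yes", "no", "no", "yes", "no"]
probs = [0.9, 0.8, 0.3, 0.4, 0.6, 0.1]

t_youden = best_threshold(truth, probs, "yes", youden_index)
t_f = best_threshold(truth, probs, "yes", f_measure)
print(t_youden, t_f)
```

Running the search once per model gives each model its own threshold, which matches the per-model behavior of the "Max" options; a "Custom" value would simply be reused for every model.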
Ignore missing values
Enabling this setting will skip rows with missing values and output a warning in the console as well as in the view. De-selecting this option will cause the execution of the node to fail if missing values are present.
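
The two behaviors of this setting can be sketched in Python as below. The row representation (a list of dicts with `None` for a missing value), the function name, and the wording of the messages are hypothetical.

```python
def filter_missing(rows, ignore_missing=True):
    """Skip rows with missing values, or fail if skipping is not allowed."""
    kept, skipped = [], 0
    for row in rows:
        if any(value is None for value in row.values()):
            if not ignore_missing:
                raise ValueError("missing values found in the input data")
            skipped += 1
        else:
            kept.append(row)
    # One warning summarizing the skipped rows, empty when nothing was skipped.
    warnings = [f"Skipped {skipped} row(s) with missing values."] if skipped else []
    return kept, warnings


rows = [
    {"target": "yes", "P (model_a)": 0.9},
    {"target": "no", "P (model_a)": None},
    {"target": "no", "P (model_a)": 0.2},
]
kept, warnings = filter_missing(rows)
print(len(kept), warnings)
```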
Show warnings in view
If this option is enabled, any warning messages from the execution of the node or generated in the view will be visible by selecting the yellow "!" control in the upper right-hand corner of the view (it will not be visible if no warnings are present). De-selecting this option will not display any warnings.

Style

The options in this tab will modify general view settings.

Title
Sets the topmost title of the chart. If left blank, nothing will be displayed.
Resize to window
If enabled, this setting will allow the view to adjust its size dynamically when the window is resized. Disabling this option will use the dimensions set in the subsequent fields.
Image width (px)
If "Resize to window" is disabled, the option sets the overall width of the view.
Image height (px)
If "Resize to window" is disabled, the option sets the overall height of the view.
True positive color
Sets the color of "True positive" classifications in the confusion matrix and the bars in the Dist Plot.
False negative color
Sets the color of "False negative" classifications in the confusion matrix and the bars in the Dist Plot.
False positive color
Sets the color of "False positive" classifications in the confusion matrix and the bars in the Dist Plot.
True negative color
Sets the color of "True negative" classifications in the confusion matrix and the bars in the Dist Plot.
Positive class color
Sets the color of positive class distribution in the Dist Plot.
Negative class color
Sets the color of negative class distribution in the Dist Plot.
Threshold line color
Sets the color of the moving threshold line in the Dist Plot. Also sets the color of the handle of the threshold manipulation slider (if shown).
Text color
Sets the color of the primary text in the chart.
Hover text color
Sets the color of the secondary text in the chart and hover information.
Background color
Sets the color of the overall background of the chart.
Hover background color
Sets the color of the secondary elements in the chart such as hover information backgrounds and buttons.
Grid color
Sets the color of the grid lines (if enabled) in the Bar Chart, the ROC, and the Dist plot.
Slider background color
Sets the color of the threshold slider background.

Model Stats Chart

This tab holds options for the Bar chart in the view.

Bar chart title
Sets the title at the top of the Bar chart.
Bar chart x-axis label
Sets the label on the x-axis of the Bar chart.
Bar chart y-axis label
Sets the label on the y-axis of the Bar chart.
Max # tick-marks
This option sets the maximum number of ticks to be displayed on the y-axis for the Bar chart. (Actual tick numbers may be lower due to data or interactions such as zooming)
Group bars by statistics
Enabling this option will group the bars in the Bar chart by the statistics measures (e.g. "Specificity" bars for all models together, then "Sensitivity" bars for all models together, etc.). De-selecting this option will group the bars by model (e.g. all statistics from Model #1 together, then all statistics from Model #2 together, etc.).
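
The two groupings differ only in which loop is the outer one. A small Python sketch with hypothetical model and statistic names:

```python
def bar_order(models, statistics, group_by_statistics):
    """Return (statistic, model) pairs in the order the bars would appear."""
    if group_by_statistics:
        # All models for the first statistic, then all models for the next one.
        return [(s, m) for s in statistics for m in models]
    # All statistics for the first model, then all statistics for the next one.
    return [(s, m) for m in models for s in statistics]


models = ["Model #1", "Model #2"]
statistics = ["Specificity", "Sensitivity"]

by_statistic = bar_order(models, statistics, group_by_statistics=True)
by_model = bar_order(models, statistics, group_by_statistics=False)
print(by_statistic)
print(by_model)
```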
Show bar chart grid
Enabling this option will show grid lines for the Bar Chart. Disabling will prevent grid lines from being shown.
Enable bar chart panning controls
Enabling this option will enable panning behavior and controls in the upper left hand corner of the chart. Disabling will prevent panning and these controls from being shown.
Enable bar chart zoom controls
Enabling this option will enable zooming behavior, double-click to reset the zoom perspective and controls in the upper left hand corner of the chart. Disabling will prevent zooming, double-click to reset zoom perspective and these controls from being shown.
Enable bar chart tooltip controls
Enabling this option will enable controls for tooltip in the upper left hand corner of the chart. Disabling will prevent these controls from being shown.
Bar chart bars (per model)
This option allows the selection of which statistics to represent as individual bars for each of the models in the Bar chart. These options can be changed in the view (if the controls are enabled in the "Interactivity/Controls" tab).

ROC Curve

This tab holds options for the ROC in the view.

ROC title
Sets the title at the top of the ROC.
ROC x-axis label
Sets the label on the x-axis of the ROC.
ROC y-axis label
Sets the label on the y-axis of the ROC.
X-Axis max # tick-marks
This option sets the maximum number of ticks to be displayed on the x-axis for the ROC. (Actual tick numbers may be lower due to data or interactions such as zooming)
Y-Axis max # tick-marks
This option sets the maximum number of ticks to be displayed on the y-axis for the ROC. (Actual tick numbers may be lower due to data or interactions such as zooming)
Show ROC grid
Enabling this option will show grid lines for the ROC. Disabling will prevent grid lines from being shown.
Enable ROC panning controls
Enabling this option will enable panning behavior and controls in the upper left hand corner of the chart. Disabling will prevent panning and these controls from being shown.
Enable ROC zoom controls
Enabling this option will enable zooming behavior, double-click to reset the zoom perspective and controls in the upper left hand corner of the chart. Disabling will prevent zooming, double-click to reset zoom perspective and these controls from being shown.
Enable ROC tooltip controls
Enabling this option will enable controls for tooltip in the upper left hand corner of the chart. Disabling will prevent these controls from being shown.
ROC hover text options
This option allows the selection of which information is shown when hovering over a given line. Selecting "X" will display the x-value of the nearest point in the tooltip, "Y" will display the y-value of the nearest point in the tooltip and "X+Y" will display both.

Confusion Matrix

This tab contains options for the Confusion Matrix in the view.

Display number of predictions
If this option is enabled, the total number of predictions being used for processing will be displayed as text next to the title. If disabled, no totals will be shown.
Display additional statistics confusion matrix rates
Enabling this option will display additional statistics around the right and bottom sides of the Confusion Matrix. Disabling this option will not display these additional statistics.
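
For readers who want to reproduce the matrix outside the view, here is a minimal Python sketch of the four cells and some common rates derived from them. Which rates the view actually displays is not stated above, so the four computed here (sensitivity, specificity, precision, negative predictive value) are an assumption, as is the rule that a probability equal to the threshold counts as positive. Function names and data are hypothetical.

```python
def confusion_matrix(y_true, probs, positive, threshold):
    """Count true/false positives and negatives for one threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, p in zip(y_true, probs):
        predicted_positive = p >= threshold  # assumed boundary convention
        if actual == positive:
            counts["tp" if predicted_positive else "fn"] += 1
        else:
            counts["fp" if predicted_positive else "tn"] += 1
    return counts


def rates(c):
    """Row and column rates of the matrix; NaN when a denominator is zero."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(c["tp"], c["tp"] + c["fn"]),
        "specificity": ratio(c["tn"], c["tn"] + c["fp"]),
        "precision": ratio(c["tp"], c["tp"] + c["fp"]),
        "negative_predictive_value": ratio(c["tn"], c["tn"] + c["fn"]),
    }


truth = ["yes", "yes", "no", "no", "yes", "no"]
probs = [0.7, 0.4, 0.5, 0.2, 0.6, 0.3]

cm = confusion_matrix(truth, probs, "yes", 0.5)
cm_rates = rates(cm)
print(cm, cm_rates)
```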
Confusion matrix title
Sets the title directly above the Confusion Matrix.

Prediction Distributions

This tab holds options for the Dist plot in the view.

Dist plot title
Sets the title at the top of the Dist plot.
Dist plot x-axis label
Sets the label on the x-axis of the Dist plot.
Dist plot y-axis label
Sets the label on the y-axis of the Dist plot.
X-Axis max # tick-marks
This option sets the maximum number of ticks to be displayed on the x-axis for the Dist plot. (Actual tick numbers may be lower due to data or interactions such as zooming)
Y-Axis max # tick-marks
This option sets the maximum number of ticks to be displayed on the y-axis for the Dist plot. (Actual tick numbers may be lower due to data or interactions such as zooming)
Number of bins
This option sets the number of bins used in the Dist plot. Because the range of values is between 0 and 1 along the x-axis, the number of bins determines the size of each bin used to group the predictions. For example: using "100" will result in bins of size 0.01 being used in the plot. Note: too few bins can distort the perspective of the chart, while too many bins can cause the chart to be slow, jagged and otherwise difficult to use.
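
The relation between the number of bins and the bin size can be sketched in Python as follows. Placing a probability of exactly 1.0 into the last bin is an assumption, and the function name is hypothetical.

```python
def histogram(probabilities, n_bins):
    """Group probabilities in [0, 1] into n_bins equally wide bins."""
    width = 1.0 / n_bins
    counts = [0] * n_bins
    for p in probabilities:
        # Assumption: p == 1.0 falls into the last bin instead of a new one.
        index = min(int(p * n_bins), n_bins - 1)
        counts[index] += 1
    return width, counts


width, counts = histogram([0.0, 0.05, 0.5, 0.99, 1.0], 10)
print(width, counts)

fine_width, _ = histogram([], 100)
print(fine_width)
```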
Show dist plot grid
Enabling this option will show grid lines for the Dist plot. Disabling will prevent grid lines from being shown.
Enable dist plot panning controls
Enabling this option will enable panning behavior and controls in the upper left hand corner of the chart. Disabling will prevent panning and these controls from being shown.
Enable dist plot zoom controls
Enabling this option will enable zooming behavior, double-click to reset the zoom perspective and controls in the upper left hand corner of the chart. Disabling will prevent zooming, double-click to reset zoom perspective and these controls from being shown.
Enable dist plot tooltip controls
Enabling this option will enable controls for tooltip in the upper left hand corner of the chart. Disabling will prevent these controls from being shown.
Plot element opacity
Sets the opacity of some elements in the Dist plot. Depending on the colors chosen for the "tp", "fp", "tn", "fn" classifications, this setting can be used to enhance the contrast between different classifications in the Dist plot.
Visual elements
This option enables or disables the various visual elements in the Dist plot. These options can be changed in the view (if the corresponding controls in the "Interactivity/Controls" tab are enabled).
Dist plot hover text options
This option allows the selection of which information is shown when hovering over a given line. Selecting "X" will display the x-value of the nearest point in the tooltip, "Y" will display the y-value of the nearest point in the tooltip and "X+Y" will display both.

Interactivity/Controls

Settings for view interactivity and controls.

Publish selection
Selecting this option will publish KNIME selection events to other views in a composite/component view which are "Subscribed to selection". These selections can be made in the Confusion Matrix.
Change threshold via ROC interaction
Enabling this option will allow for click events to be registered on the ROC when a model has been selected. These events can change the current threshold of the selected model and update the views. NOTE: click selection on the ROC is currently not without its performance concerns. In the future, this may improve, but the most accurate way to modify the threshold of the selected model is with the slider (if enabled), click events on the Dist plot or with the optimization methods in the controls (if enabled); not with ROC selection.
Enable confusion matrix selection
Enabling this option will allow "classifications" to be selected on the confusion matrix. Multiple selections can be made by holding down the CTRL key while making selections. The rows associated with each of the classifications will be published to the other views in a composite/component view (if "Publish Selection" has been enabled) and will also be used to update the "Selected" (boolean) column in the output data. Points selected via the Confusion Matrix will be labeled as "true" in the "Selected" column, while those not selected will receive a label of "false". Points that are not present because of "Maximum number of rows" option limitations will be marked as missing in the "Selected" column. Disabling this option will prevent clicks on the Confusion Matrix from updating the selected points in the view.
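
The three possible values of the "Selected" column can be sketched in Python as below, with `None` standing in for a missing value. The sketch assumes that the rows used by the view are the first "Maximum number of rows" rows of the input, which the description does not state, and it follows the output port description in marking every row as missing when nothing was selected. The function name is hypothetical.

```python
def selected_column(n_rows, max_rows, selected_rows):
    """Return True/False/None per input row for the "Selected" column."""
    if not selected_rows:
        # No selection made (or view never opened): every value is missing.
        return [None] * n_rows
    chosen = set(selected_rows)
    return [
        (index in chosen) if index < max_rows else None  # beyond the row limit
        for index in range(n_rows)
    ]


column = selected_column(n_rows=5, max_rows=3, selected_rows=[0, 2])
print(column)
```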
Display fullscreen button
Enabling this option will enable a fullscreen button in the upper right corner of the view when the view is in a composite/component view. Disabling this option will not display this button.
Enable dist plot controls
Enabling this option will present in-view controls to toggle the visual elements of the Dist plot. Disabling this option will prevent these controls from being rendered.
Enable clear selection button
Enabling this option will display a button to clear the current selections in the Confusion Matrix if "Enable confusion matrix selection" is enabled.
Enable toggle 'Publish Selection' controls
Enabling this option will allow the toggling of the "Publish Selection" option in the view via a button in the upper right corner. Disabling this option will not allow the "Publish Selection" option to be changed from the view.
Display slider
Enabling this option will render a range slider below the Dist plot with which the threshold of the selected model can be directly manipulated. Disabling this option will prevent this slider from appearing in the view.
Step size
This option controls the size of the "steps" of the threshold range slider. This value must be between 0 and 1. Smaller values give finer control over the threshold but may also impact performance.

Input Ports

Data table with a column for the ground truth (binary/nominal/class/0-1/etc.) and a column for each model containing the positive prediction probability of that model.
A table with one COLORED row for each model column (NOT ground truth) with a single column containing the name of the column.

Output Ports

The data table at this out-port is derived from the data input at in-port #0 and can have a number of options. It will always contain:
  • The ground truth column selected in the dialog.
  • The model-prediction columns chosen in the dialog.
  • A boolean "selected" column with a value of "true" or "false" depending on whether or not the row was selected in the confusion matrix in the view. The value will be "missing" if no selections were made or the view was not opened.

Optionally, the data table may also contain:
  • A String-value column for EACH of the model-prediction columns chosen in the dialog. Each column will contain the label (either positive class or negative class) depending on the threshold for that model. The thresholds will either be taken from the dialog "Initial threshold methods" option OR be taken from the most recent value saved from the view for that model. To see which threshold was used to label each column, reference the row of statistics relating to each model in the 2nd out-port. To enable this option select the "Append new predictions for all models" option in the dialog.
  • A single String-value column for the selected model in the view. This column will contain the label (either positive class or negative class) depending on the threshold for the selected model in the view. This column will contain "missing values" if no model was selected in the view. The name of this column can also be set in the dialog to assist in downstream workflow processes. The selected model column name and threshold will also be output as flow variables from this node.
    To enable this option select the "Append and label prediction column for the selected model" option in the dialog.

The final option for this table is to exclude columns that are not the ground truth column or the selected model-prediction column(s). This will filter out any unwanted columns which may have been left over from upstream predictors. This option is NOT enabled by default. To take advantage of it, DESELECT the "Retain non-prediction columns in the output table" option in the dialog.
Table containing a single row of statistics for each of the models chosen. This table also contains a column corresponding to the threshold value used in the calculation of the statistics in this table and the (potential) labeling of the output data.

Views

Interactive View: Binary Classification Inspector
Binary Classification Inspector
