Binner (Dictionary)

Categorizes the values in a column according to a dictionary table with min/max values. The table at the first input contains the column with the values to be categorized. The second table contains a column with lower bound values, a column with upper bound values, and a column with label values. A label is used as the outcome when a given value lies between the corresponding lower and upper bound. Each row in the second table represents a rule. Rules are evaluated top-down, i.e. rules with a lower row index take priority over rules in subsequent rows.

Either the lower or the upper bound test can be disabled by unchecking the corresponding checkbox in the dialog. A missing value in a bound column always makes the corresponding bound check evaluate to true: a missing lower bound is treated as smaller than any value, and a missing upper bound as larger than any value. Missing values in the value column (1st input) result in a missing output cell (no categorization).
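
The evaluation order and the handling of missing values described above can be illustrated with a minimal Python sketch. This is not the node's actual implementation. The function name categorize, the tuple layout of a rule, the use of None for a missing cell, and the default of both bounds being inclusive are all assumptions made for illustration. A value that matches no rule yields None here, which corresponds to the "Insert Missing" behavior.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical rule layout: (lower bound, upper bound, label).
# None stands in for a missing cell in a bound column.
Rule = Tuple[Optional[float], Optional[float], str]


def categorize(
    value: Optional[float],
    rules: Sequence[Rule],
    lower_inclusive: bool = True,   # assumed default, not taken from the node
    upper_inclusive: bool = True,   # assumed default, not taken from the node
) -> Optional[str]:
    """Return the label of the first matching rule, or None if nothing matches."""
    if value is None:
        # A missing input value is not categorized.
        return None
    for lower, upper, label in rules:  # top-down: earlier rows win
        # A missing bound always passes its check.
        lower_ok = lower is None or (
            lower <= value if lower_inclusive else lower < value
        )
        upper_ok = upper is None or (
            value <= upper if upper_inclusive else value < upper
        )
        if lower_ok and upper_ok:
            return label
    return None


rules = [
    (None, 0, "negative"),  # missing lower bound: open towards minus infinity
    (0, 10, "low"),
    (5, None, "high"),      # missing upper bound: open towards plus infinity
]

print(categorize(-3, rules))    # negative
print(categorize(7, rules))     # low (rule 3 also matches, but rule 2 comes first)
print(categorize(50, rules))    # high
print(categorize(None, rules))  # None
```

The value 7 matches both the second and the third rule. The second rule wins because it appears earlier in the rule table.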

Note: The table containing bound and label information (2nd input) will be read into memory during execution; it must be a relatively small table!

Options

Value Column to bin (1st port)
Select the column in the first input table containing the values to be categorized.
Lower Bound Column (2nd port)
If enabled, select the column containing the lower bound values. Use the "Inclusive" checkbox to choose whether the lower bound must be strictly smaller than the value (unchecked) or may be smaller than or equal to it (checked).
Upper Bound Column (2nd port)
If enabled, select the column containing the upper bound values. Use the "Inclusive" checkbox to choose whether the upper bound must be strictly larger than the value (unchecked) or may be larger than or equal to it (checked).
Label Column (2nd port)
Select the label column from the 2nd input that will be appended as category column to the output table.
If no rule matches
Choose the behavior if none of the rules in the 2nd input fires for a given value: Select "Fail" to make this node fail during execution (reasonable when the rule table is assumed to cover the entire domain) or "Insert Missing" to insert missing values as result.
Search Pattern
Linear search scans all rules sequentially in the order in which they are defined in the rule table and returns the label of the first rule that matches. Binary search only works if both limits are specified (and not missing); it sorts the rules by their lower and upper limits and performs a binary search to find the matching label. Binary search might not be deterministic if rules overlap, but it is much faster for large rule sets (tens of thousands or millions of rules). If in doubt, use linear search.
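
The difference between the two search patterns, and the two "If no rule matches" behaviors, can be sketched in Python as follows. This is an illustration under stated assumptions, not the node's actual algorithm: the function names, the on_no_match parameter, and the choice to treat both bounds as inclusive are hypothetical, and the binary variant simply picks the rule with the largest lower bound that does not exceed the value.

```python
from bisect import bisect_right

# Hypothetical rule layout: (lower bound, upper bound, label), no missing bounds.
rules = [(0, 10, "low"), (10, 20, "mid"), (20, 30, "high")]


def linear_lookup(value, rules):
    """Scan the rules in table order and return the first matching label."""
    for lower, upper, label in rules:
        if lower <= value <= upper:
            return label
    return None


def build_index(rules):
    """Sort the rules by (lower, upper) and extract the lower bounds for bisecting."""
    ordered = sorted(rules, key=lambda r: (r[0], r[1]))
    lowers = [r[0] for r in ordered]
    return ordered, lowers


def binary_lookup(value, ordered, lowers, on_no_match="missing"):
    """Locate a candidate rule by binary search on the sorted lower bounds."""
    i = bisect_right(lowers, value) - 1  # last rule whose lower bound <= value
    if i >= 0:
        lower, upper, label = ordered[i]
        if value <= upper:
            return label
    if on_no_match == "fail":
        # Mirrors the "Fail" option: abort instead of returning a missing value.
        raise ValueError(f"no rule matches value {value!r}")
    return None  # mirrors the "Insert Missing" option


ordered, lowers = build_index(rules)

print(binary_lookup(15, ordered, lowers))  # mid
print(binary_lookup(35, ordered, lowers))  # None
print(linear_lookup(10, rules))            # low
print(binary_lookup(10, ordered, lowers))  # mid
```

The value 10 lies on a boundary shared by two rules. In this sketch the linear scan returns the first rule in table order, while the binary search returns a different one, which shows why overlapping rules are only safe with linear search.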

Input Ports

Arbitrary input data with column to be binned.
Table containing categorization rules with lower and upper bound and the label column.

Output Ports

Input table amended by column with categorization values.

Views

This node has no views.
