Numeric Outliers

This node detects and treats outliers in each of the selected columns individually by means of the interquartile range (IQR).

To detect the outliers for a given column, the first and third quartiles (Q1, Q3) are computed. An observation is flagged as an outlier if it lies outside the range R = [Q1 - k(IQR), Q3 + k(IQR)], with IQR = Q3 - Q1 and k >= 0. With k = 1.5, the smallest value in R typically corresponds to the lower end of a boxplot's whisker and the largest value to its upper end.
Providing grouping information allows outliers to be detected only within their respective groups.
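The detection rule above can be sketched in a few lines of Python. This is only an illustration, not the node's implementation: the function names iqr_bounds and flag_outliers are hypothetical, and the quartiles come from Python's statistics.quantiles with method="inclusive", which may differ from the estimation type the node uses.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) = (Q1 - k*IQR, Q3 + k*IQR) for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return one boolean per value: True if it lies outside [lower, upper]."""
    lower, upper = iqr_bounds(values, k)
    return [v < lower or v > upper for v in values]


data = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0]
lower, upper = iqr_bounds(data)   # Q1 = 4.0, Q3 = 8.0, so R = [-2.0, 14.0]
flags = flag_outliers(data)       # only the last value (50.0) is flagged
```

Calling flag_outliers(data, k=12) flags nothing for this sample, since the upper bound then moves to 56.0.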

If an observation is flagged as an outlier, it can either be replaced by another value, or the corresponding row can be removed or retained.

Missing values in the data are ignored, i.e., they are neither used for the outlier computation nor flagged as outliers.

Options

Outlier Selection

Outlier selection
Allows the selection of columns for which outliers have to be detected and treated. If "Compute outlier statistics on groups" (see tab "Group Settings") is selected, the outliers for each of the columns are computed solely with respect to the different groups.

General Settings

Interquartile range multiplier (k)
Scales the interquartile range (IQR). The default is k = 1.5. Larger values cause fewer values to be considered outliers.
Quartile calculation
Specifies how the quartiles are computed.
  • Use heuristic (memory friendly): The quartiles are calculated using a heuristic approach. This choice is recommended for large data sets due to its low memory requirements. However, for small data sets the results of this approach can deviate considerably from the exact results.
  • Full data estimate using: This option typically produces more accurate results than its counterpart, but also requires far more memory. Therefore, we recommend this option for smaller data sets.
    Since the value of a quartile often lies between two observations, this option additionally allows specifying how the actual value is computed, which is encoded by the various estimation types (LEGACY, R_1, ..., R_9). A detailed explanation of the different types can be found here.
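To illustrate why the estimation type matters, the following sketch computes the quartiles of the same small sample with two different interpolation rules. It uses the "exclusive" and "inclusive" methods of Python's statistics.quantiles, which are commonly described as corresponding to the R-6 and R-7 definitions; whether they match the node's R_6 and R_7 options exactly is an assumption.

```python
from statistics import quantiles

data = [1.0, 2.0, 3.0, 4.0, 10.0]
k = 1.5

# Two interpolation rules from Python's statistics module. They only stand in
# for the node's estimation types and are not taken from the node itself.
q1_exc, _, q3_exc = quantiles(data, n=4, method="exclusive")  # Q1 = 1.5, Q3 = 7.0
q1_inc, _, q3_inc = quantiles(data, n=4, method="inclusive")  # Q1 = 2.0, Q3 = 4.0

# Upper bound Q3 + k * IQR under each rule.
upper_exc = q3_exc + k * (q3_exc - q1_exc)  # 15.25
upper_inc = q3_inc + k * (q3_inc - q1_inc)  # 7.0

# The value 10.0 is an outlier under one rule but not under the other.
flagged_exc = 10.0 > upper_exc  # False
flagged_inc = 10.0 > upper_inc  # True
```

With only five observations the two rules disagree on whether 10.0 is an outlier; for larger samples the differences between estimation types tend to shrink.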
Update domain
If checked, the domain of the selected outlier columns is updated.

Outlier Treatment

Apply to
Applies the selected treatment strategy to:
  • All outliers
  • Outliers below lower bound
  • Outliers above upper bound
Treatment option
Defines three different strategies to treat outliers:
  • Replace outlier values: Allows to replace outliers based on the selected "Replacement strategy"
  • Remove outlier rows: Removes all rows from the input data that contain at least one outlier in any of the selected columns
  • Remove non-outlier rows: Retains only those rows of the input data that contain at least one outlier in any of the selected columns
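The two row-based treatments, combined with the "Apply to" restriction, can be sketched as follows. This is a minimal illustration under assumptions: the permitted bounds per column are taken as given, rows are plain dictionaries, and the names is_outlier, filter_rows, keep_outlier_rows and apply_to are hypothetical and not part of the node.

```python
def is_outlier(value, bounds, apply_to="all"):
    """Check one value against (lower, upper); None (missing) is never an outlier."""
    if value is None:
        return False
    lower, upper = bounds
    below, above = value < lower, value > upper
    if apply_to == "below":
        return below
    if apply_to == "above":
        return above
    return below or above


def filter_rows(rows, bounds_by_column, keep_outlier_rows=False, apply_to="all"):
    """Keep rows without outliers, or only rows with at least one outlier."""
    kept = []
    for row in rows:
        has_outlier = any(
            is_outlier(row[col], bounds, apply_to)
            for col, bounds in bounds_by_column.items()
        )
        if has_outlier == keep_outlier_rows:
            kept.append(row)
    return kept


rows = [
    {"id": 1, "a": 5.0, "b": 10.0},
    {"id": 2, "a": 99.0, "b": 11.0},   # outlier in "a" (above upper bound)
    {"id": 3, "a": 6.0, "b": -40.0},   # outlier in "b" (below lower bound)
    {"id": 4, "a": None, "b": 12.0},   # missing value, never flagged
]
bounds = {"a": (0.0, 20.0), "b": (0.0, 20.0)}

without_outliers = filter_rows(rows, bounds)                        # ids 1, 4
only_outliers = filter_rows(rows, bounds, keep_outlier_rows=True)   # ids 2, 3
only_above = filter_rows(rows, bounds, keep_outlier_rows=True, apply_to="above")  # id 2
```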
Replacement strategy
Defines two different strategies to replace outliers:
  • Missing values: Replaces every outlier by a missing value
  • Closest permitted value: Replaces the value of each outlier by the closest value within the permitted interval R. If the column is of integer type, the replacement value is the closest integer within the permitted interval.
Note that this option is only enabled if the treatment option "Replace outlier values" is selected.
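Both replacement strategies can be sketched for a single cell as shown below. The function replace_outlier and its parameters are hypothetical names chosen for illustration, and the sketch assumes the permitted interval contains at least one integer when integer_column is set.

```python
import math


def replace_outlier(value, lower, upper, strategy="closest", integer_column=False):
    """Return the treated value for one cell; non-outliers and None pass through."""
    if value is None or lower <= value <= upper:
        return value
    if strategy == "missing":
        return None  # None stands in for a missing value
    # Closest permitted value: move the outlier to the bound it violated.
    if value < lower:
        # For integer columns, use the smallest integer inside the interval.
        return math.ceil(lower) if integer_column else lower
    # For integer columns, use the largest integer inside the interval.
    return math.floor(upper) if integer_column else upper


lower, upper = -2.5, 14.5
clamped_float = replace_outlier(50.0, lower, upper)                    # 14.5
clamped_int_high = replace_outlier(50, lower, upper, integer_column=True)  # 14
clamped_int_low = replace_outlier(-7, lower, upper, integer_column=True)   # -2
as_missing = replace_outlier(50.0, lower, upper, strategy="missing")   # None
untouched = replace_outlier(3.0, lower, upper)                         # 3.0
```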

Group Selection

Compute outlier statistics on groups
If selected, allows the selection of columns to identify groups. A group comprises all rows of the input that have the same values in each of the selected group columns (similar to the GroupBy node). The outliers are then computed separately for each individual group.
Column Filter
Move the columns defining the groups into the Include list. The group definition will take priority, i.e. if a column is selected for both group definition and outlier handling, it will be used to define groups (no outlier handling done for that column).
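Group-wise detection can be sketched as follows: rows are bucketed by the values of the group columns, and the permitted interval is computed separately per bucket. The function grouped_bounds and the layout of the returned summary are hypothetical; the quartile method is again Python's "inclusive" rule, which is an assumption, and each group needs at least two non-missing values for statistics.quantiles to work.

```python
from collections import defaultdict
from statistics import quantiles


def grouped_bounds(rows, group_cols, value_col, k=1.5):
    """Return {group_key: summary dict} with bounds computed per group."""
    groups = defaultdict(list)
    for row in rows:
        value = row[value_col]
        if value is not None:  # missing values are left out entirely
            key = tuple(row[c] for c in group_cols)
            groups[key].append(value)

    summary = {}
    for key, values in groups.items():
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        summary[key] = {
            "members": len(values),  # number of non-missing values
            "outliers": sum(v < lower or v > upper for v in values),
            "lower": lower,
            "upper": upper,
        }
    return summary


rows = (
    [{"site": "A", "x": v} for v in [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]]
    + [{"site": "B", "x": v} for v in [100.0, 101.0, 102.0, 103.0, 104.0, None]]
)
summary = grouped_bounds(rows, group_cols=["site"], value_col="x")
```

In this sample the value 100.0 is an outlier within group A (permitted interval [-1.5, 8.5]) but an ordinary member of group B (permitted interval [98.0, 106.0]).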

Memory Policy

Process groups in memory
Processes the groups in memory. This option comes with higher memory requirements, but is faster since the table does not need any additional treatment.

Input Ports

Numeric input data to evaluate + optional group information

Output Ports

Data table where outliers were either replaced or rows containing outliers/non-outliers were removed
Data table holding the number of members (i.e., non-missing values) and outliers, as well as the lower and upper bounds, for each outlier group
Model holding the permitted interval bounds for each outlier group and the outlier treatment specifications

Views

This node has no views