Numeric Outliers

This node detects and treats the outliers for each of the selected columns individually by means of the interquartile range (IQR).

To detect the outliers for a given column, the first and third quartiles (Q1, Q3) are computed. An observation is flagged as an outlier if it lies outside the range R = [Q1 - k(IQR), Q3 + k(IQR)] with IQR = Q3 - Q1 and k >= 0. With k = 1.5, the smallest value in R typically corresponds to the lower end of a boxplot's whisker and the largest value to its upper end.
Providing grouping information allows outliers to be detected only within their respective groups.
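
The detection rule can be sketched in a few lines of Python. This is only an illustration, not the node's implementation: the function names are hypothetical, and the quartiles come from Python's statistics.quantiles with the "inclusive" method, which is an assumption and may differ slightly from the estimation type configured in the node.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) of R = [Q1 - k*IQR, Q3 + k*IQR]."""
    # Assumption: "inclusive" quartiles; other estimation types may
    # give slightly different Q1 and Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return one boolean per value: True if the value lies outside R."""
    lower, upper = iqr_bounds(values, k)
    return [v < lower or v > upper for v in values]


data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print(iqr_bounds(data))     # (9.375, 16.375)
print(flag_outliers(data))  # only the last value (102) is flagged
```

For this sample, Q1 = 12 and Q3 = 13.75, so IQR = 1.75 and R = [9.375, 16.375].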

If an observation is flagged as an outlier, one can either replace it by some other value or remove/retain the corresponding row.

Missing values contained in the data will be ignored, i.e., they will neither be used for the outlier computation nor will they be flagged as an outlier.
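
A minimal sketch of this behaviour, assuming missing values are represented as None and using a hypothetical helper name: missing entries are dropped before the quartiles are computed and are never flagged themselves.

```python
from statistics import quantiles


def flag_with_missing(values, k=1.5):
    """Flag outliers while ignoring missing values (None)."""
    # Missing values do not take part in the quartile computation.
    present = [v for v in values if v is not None]
    # Assumption: "inclusive" quartiles stand in for the node's estimate.
    q1, _, q3 = quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # A missing value is never reported as an outlier.
    return [v is not None and (v < lower or v > upper) for v in values]


data = [10, None, 12, 12, 13, 12, 11, None, 14, 13, 15, 102]
flags = flag_with_missing(data)
```

The two None entries stay unflagged, and the bounds are the same as if they had not been in the column at all.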

Options

Outlier selection
Allows the selection of columns for which outliers are to be detected and treated. If "Compute outlier statistics on groups" is selected, the outliers for each of the columns are computed solely with respect to the different groups.
Interquartile range multiplier (k)
Scales the interquartile range (IQR). The default is k = 1.5. Larger values cause fewer values to be considered outliers.
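
The effect of k can be illustrated with a small sketch. The helper name and the sample data are made up for this example, and the quartile method is an assumption.

```python
from statistics import quantiles


def count_outliers(values, k):
    """Count the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Assumption: "inclusive" quartiles stand in for the node's estimate.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return sum(1 for v in values if v < lower or v > upper)


data = [1, 8, 9, 10, 10, 11, 11, 12, 13, 16, 30]
counts = {k: count_outliers(data, k) for k in (0.5, 1.5, 3.0)}
print(counts)  # {0.5: 3, 1.5: 2, 3.0: 1}
```

Here Q1 = 9.5 and Q3 = 12.5, so the permitted interval widens from [8, 14] to [5, 17] to [0.5, 21.5] as k grows.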
Quartile calculation
Specifies how the quartiles are computed.
  • Full data estimate: This option typically produces more accurate results than the heuristic, but requires far more memory. We therefore recommend it for smaller data sets.
    Since the value of a quartile often lies between two observations, this option additionally lets you specify how the actual value is computed, which is encoded by the various estimation types (LEGACY, R_1, ..., R_9). A detailed explanation of the different types can be found here.
  • Use heuristic: This option calculates the quartiles using a heuristic approach. This choice is recommended for large data sets due to its low memory requirements. However, for small data sets the results of this approach can deviate considerably from the exact quartiles.
Estimation type
Specifies how the actual quartile value is computed when using full data estimate. A detailed explanation of the different types can be found here.
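
To see why the estimation type matters, the following sketch compares two interpolation rules on the same small data set. It uses the "inclusive" and "exclusive" methods of Python's statistics.quantiles purely as stand-ins. To our understanding they behave like the R_7 and R_6 definitions, but that mapping is an assumption and is not taken from the node.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8]

# Two different rules for placing a quartile between neighbouring observations.
q1_inc, _, q3_inc = quantiles(data, n=4, method="inclusive")
q1_exc, _, q3_exc = quantiles(data, n=4, method="exclusive")

print(q1_inc, q3_inc)  # 2.75 6.25
print(q1_exc, q3_exc)  # 2.25 6.75
```

The IQR is 3.5 under the first rule and 4.5 under the second, so with k = 1.5 the permitted interval R differs noticeably on such a small sample. On larger data sets the differences tend to shrink.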
Update domain
If checked, the domain of the selected outlier columns is updated.
Apply to
Applies the selected treatment strategy to:
  • All outliers: Do not restrict outlier detection.
  • Outliers below lower bound: Restrict outlier detection to values below the lower bound.
  • Outliers above upper bound: Restrict outlier detection to values above the upper bound.
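
The three choices can be sketched as a single predicate. The function name and the string labels "all", "below" and "above" are hypothetical, and the bounds are assumed to have been computed beforehand.

```python
def is_treated(value, lower, upper, apply_to="all"):
    """Decide whether a value counts as an outlier under the 'Apply to' choice."""
    below = value < lower
    above = value > upper
    if apply_to == "below":   # Outliers below lower bound
        return below
    if apply_to == "above":   # Outliers above upper bound
        return above
    return below or above     # All outliers


lower, upper = 9.375, 16.375  # example bounds, assumed to be precomputed
values = [-50, 12, 102]
for mode in ("all", "below", "above"):
    print(mode, [is_treated(v, lower, upper, mode) for v in values])
```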
Treatment option
Defines three different strategies to treat outliers:
  • Replace outlier values: Replaces outliers according to the selected "Replacement strategy".
  • Remove outlier rows: Removes all rows from the input data that contain at least one outlier in any of the selected columns.
  • Remove non-outlier rows: Retains only those rows of the input data that contain at least one outlier in any of the selected columns.
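
The two row-based strategies are complementary, as the following sketch shows. Rows are modelled as dictionaries, and the column names, bounds and helper name are all hypothetical.

```python
def split_rows(rows, bounds):
    """Split rows into (rows without outliers, rows with at least one outlier).

    bounds maps each selected column to its permitted interval (lower, upper).
    """
    def has_outlier(row):
        for col, (lower, upper) in bounds.items():
            v = row.get(col)
            # Missing values (None) are never treated as outliers.
            if v is not None and (v < lower or v > upper):
                return True
        return False

    without = [r for r in rows if not has_outlier(r)]  # "Remove outlier rows"
    with_out = [r for r in rows if has_outlier(r)]     # "Remove non-outlier rows"
    return without, with_out


rows = [
    {"id": 1, "x": 12, "y": 5.0},
    {"id": 2, "x": 102, "y": 5.5},
    {"id": 3, "x": 13, "y": None},
    {"id": 4, "x": 11, "y": -3.0},
]
bounds = {"x": (9.375, 16.375), "y": (0.0, 10.0)}  # assumed, precomputed
kept, outlier_rows = split_rows(rows, bounds)
```

Rows 2 and 4 each contain an outlier in one of the selected columns, so "Remove outlier rows" keeps rows 1 and 3, while "Remove non-outlier rows" keeps rows 2 and 4.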
Replacement strategy
Defines two different strategies to replace outliers:
  • Missing values: Replaces every outlier by a missing value.
  • Closest permitted value: Replaces each outlier by the closest value within the permitted interval R. If the column is of integer type, the replacement value is the closest integer within the permitted interval.
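
Both replacement strategies can be sketched as follows. The function names and the integer_column flag are hypothetical, and the bounds of R are assumed to be given. For integer columns the sketch rounds the lower bound up and the upper bound down so that the replacement stays inside R.

```python
import math


def closest_permitted(value, lower, upper, integer_column=False):
    """Return the value in [lower, upper] that is closest to value."""
    if lower <= value <= upper:
        return value
    if integer_column:
        # Closest integer that still lies inside the interval.
        return math.ceil(lower) if value < lower else math.floor(upper)
    return lower if value < lower else upper


def replace_outliers(values, lower, upper, strategy="closest", integer_column=False):
    """Apply the 'missing' or 'closest' replacement strategy to a column."""
    result = []
    for v in values:
        if v is None or lower <= v <= upper:
            result.append(v)        # not an outlier: keep as is
        elif strategy == "missing":
            result.append(None)     # Missing values
        else:
            result.append(closest_permitted(v, lower, upper, integer_column))
    return result


lower, upper = 9.375, 16.375  # example bounds, assumed to be precomputed
print(replace_outliers([-50, 12, 102], lower, upper, "missing"))
print(replace_outliers([-50, 12, 102], lower, upper, "closest", integer_column=True))
print(replace_outliers([-50.0, 12.0, 102.0], lower, upper, "closest"))
```

With these bounds, the integer column becomes [10, 12, 16] and the floating-point column becomes [9.375, 12.0, 16.375].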
Compute outlier statistics on groups
If selected, allows the selection of columns to identify groups. A group comprises all rows of the input that share the same values in every group column. The outliers are then computed separately for each of the individual groups.
Group columns
Move the columns defining the groups into the Include list. The group definition takes priority, i.e., if a column is selected for both group definition and outlier handling, it is used to define groups and no outlier handling is done for that column.
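
Group-wise detection can be sketched as follows. The column names "sensor" and "reading" and the helper name are made up, and the quartile method is an assumption. Note that statistics.quantiles needs at least two values per group on older Python versions.

```python
from collections import defaultdict
from statistics import quantiles


def group_bounds(rows, group_cols, value_col, k=1.5):
    """Compute the permitted interval of value_col separately for each group."""
    groups = defaultdict(list)
    for row in rows:
        # A group is identified by the values in all group columns.
        key = tuple(row[c] for c in group_cols)
        if row[value_col] is not None:  # missing values are ignored
            groups[key].append(row[value_col])
    bounds = {}
    for key, vals in groups.items():
        # Assumption: "inclusive" quartiles stand in for the node's estimate.
        q1, _, q3 = quantiles(vals, n=4, method="inclusive")
        iqr = q3 - q1
        bounds[key] = (q1 - k * iqr, q3 + k * iqr)
    return bounds


rows = [{"sensor": "a", "reading": v} for v in (1, 2, 3, 4, 5)]
rows += [{"sensor": "b", "reading": v} for v in (100, 110, 120, 130, 5)]

bounds = group_bounds(rows, ["sensor"], "reading")
print(bounds)  # {('a',): (-1.0, 7.0), ('b',): (70.0, 150.0)}
```

The reading 5 is unremarkable in group "a" but lies below the lower bound of group "b", so it is an outlier only with respect to its own group.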
Process groups in memory
Processes the groups in memory. This option comes with higher memory requirements, but is faster since the table does not need any additional treatment.

Input Ports

Numeric input data to evaluate + optional group information.

Output Ports

Data table where outliers were either replaced or rows containing outliers/non-outliers were removed.
Data table holding the number of members (i.e., non-missing values) and outliers, as well as the lower and upper bounds, for each outlier group.
Model holding the permitted interval bounds for each outlier group and the outlier treatment specifications.

Views

This node has no views
