Duplicate Row Filter

This node identifies duplicate rows. Rows are duplicates of each other when they have identical values in the selected columns. For each set of duplicates, the node chooses a single row ("chosen"). You can either remove all duplicate rows from the input table, keeping only unique and chosen rows, or keep every row and mark it with additional information about its duplication status.

Options

Choose columns for duplicates detection
Allows the selection of columns identifying the duplicates. Columns not selected are handled under "Row selection" in the "Advanced" tab.

Advanced - Duplicate Rows

Remove duplicates
Removes duplicate rows and keeps only unique and chosen rows.
Keep duplicate rows
Appends columns with additional information to the input table:
  • Add classification column: Appends a column that describes the type of row:
    unique: There is no other row with the same values in the selected columns.
    chosen: This row was chosen from a set of duplicate rows.
    duplicate: This row is a duplicate and represented by another row.
  • Add ROWID column: Appends a column with the ROWID of the chosen row for duplicate rows. Unique and chosen rows will not have a ROWID assigned.
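
The classification and ROWID columns described above can be illustrated with a minimal Python sketch. It is not the node's implementation: the function name classify_rows, the (row_id, values) input layout, and the use of the "First" strategy for picking the chosen row are assumptions made for illustration only.

```python
from collections import defaultdict

def classify_rows(rows, key_columns):
    """Label each row as unique, chosen, or duplicate.

    rows: list of (row_id, values) pairs, where values is a dict (hypothetical layout).
    key_columns: the columns compared to detect duplicates.
    Assumption: the first row of each duplicate set is the chosen one.
    Returns (row_id, status, chosen_row_id) tuples in input order.
    """
    # Group row IDs by the values in the selected columns.
    groups = defaultdict(list)
    for row_id, values in rows:
        key = tuple(values[c] for c in key_columns)
        groups[key].append(row_id)

    labels = {}
    for ids in groups.values():
        if len(ids) == 1:
            labels[ids[0]] = ("unique", None)
        else:
            chosen = ids[0]
            # Chosen rows get no ROWID reference, only their duplicates do.
            labels[chosen] = ("chosen", None)
            for other in ids[1:]:
                labels[other] = ("duplicate", chosen)
    return [(row_id, *labels[row_id]) for row_id, _ in rows]

rows = [
    ("Row0", {"name": "a", "score": 3}),
    ("Row1", {"name": "b", "score": 5}),
    ("Row2", {"name": "a", "score": 1}),
]
labels = classify_rows(rows, ["name"])
```

Here Row0 and Row2 share the same value in "name", so Row0 is labeled chosen, Row2 is labeled duplicate and refers to Row0, and Row1 is unique.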

Advanced - Row selection

Select row:
Defines which row for each set of duplicates is selected:
  • First: The first row in sequence is chosen.
  • Last: The last row in sequence is chosen.
  • Minimum of: The first row with the minimum value in the selected column is chosen. In case of strings, the row will be chosen following lexicographical order. Missing values are sorted after the maximum value.
  • Maximum of: The first row with the maximum value in the selected column is chosen. In case of strings, the row will be chosen following lexicographical order. Missing values are sorted before the minimum value.
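
The four strategies above can be sketched in Python as follows. This is an illustration under stated assumptions, not the node's code: select_row is a hypothetical helper, None stands in for a missing value, and falling back to the first row when every value is missing is an assumption the description does not confirm.

```python
def select_row(group, strategy, column=None):
    """Pick one row (a dict) from a non-empty list of duplicate rows.

    strategy: "first", "last", "minimum", or "maximum".
    column: the column compared for "minimum" and "maximum".
    """
    if strategy == "first":
        return group[0]
    if strategy == "last":
        return group[-1]
    if strategy not in ("minimum", "maximum"):
        raise ValueError(f"unknown strategy: {strategy}")

    # Missing values sort after the maximum (for "minimum") or before the
    # minimum (for "maximum"), so they lose to any present value.
    present = [row for row in group if row[column] is not None]
    if not present:
        return group[0]  # assumption: all values missing, fall back to first row

    # min() and max() return the first occurrence on ties, which matches
    # "the first row with the minimum/maximum value".
    pick = min if strategy == "minimum" else max
    return pick(present, key=lambda row: row[column])

group = [
    {"id": "Row0", "v": None},
    {"id": "Row1", "v": 2},
    {"id": "Row2", "v": 7},
    {"id": "Row3", "v": 2},
    {"id": "Row4", "v": 7},
]
picked = {s: select_row(group, s, "v")["id"]
          for s in ("first", "last", "minimum", "maximum")}
```

With string values, Python compares lexicographically by code point, which may differ in detail from the ordering the node applies.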

Advanced - Additional options

In-memory computation
If selected, computation is sped up by using working memory (RAM). This requires more memory than a regular computation, and the amount grows with the size of the input data.
Retain row order
If selected, rows in the output table appear in the same order as in the input table.
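
Why row order can change at all is easiest to see with a sort-based approach. The sketch below assumes duplicates are found by sorting on the selected columns, which is a common technique but not a confirmed detail of this node; remove_duplicates_sorted and its parameters are hypothetical names.

```python
def remove_duplicates_sorted(rows, key, retain_row_order):
    """Drop duplicate rows, keeping the first row of each duplicate set.

    rows: list of dicts; key: function mapping a row to a sortable key.
    Assumption: duplicates are grouped by sorting on the key.
    """
    # Remember each row's input position, then sort by key. Python's sort is
    # stable, so rows with equal keys keep their relative input order.
    indexed = sorted(enumerate(rows), key=lambda pair: key(pair[1]))

    kept = []
    previous = object()  # sentinel that equals no real key
    for position, row in indexed:
        current = key(row)
        if current != previous:
            kept.append((position, row))
            previous = current

    if retain_row_order:
        # Restore the input order using the remembered positions.
        kept.sort(key=lambda pair: pair[0])
    return [row for _, row in kept]

rows = [
    {"id": "Row0", "name": "b"},
    {"id": "Row1", "name": "a"},
    {"id": "Row2", "name": "b"},
]
by_name = lambda row: row["name"]
sorted_ids = [r["id"] for r in remove_duplicates_sorted(rows, by_name, False)]
retained_ids = [r["id"] for r in remove_duplicates_sorted(rows, by_name, True)]
```

In this sketch, restoring the input order costs one extra sort over the kept rows, which suggests why such an option might be left off when the order does not matter.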

Input Ports

The data table containing potential duplicates.

Output Ports

Either the input data without duplicates or the input data with additional columns identifying duplicates.

Views

This node has no views
