Duplicate Row Filter

This node identifies duplicate rows. Rows are duplicates of each other if they have identical values in the selected columns. For each set of duplicates, the node chooses a single row ("chosen"). You can either remove all duplicate rows from the input table, keeping only unique and chosen rows, or keep all rows and mark each one with additional information about its duplication status.

Options

Duplicate detection

Choose columns for duplicates detection
Selects the columns used to identify duplicates. Columns not selected are handled under "Row selection" in the "Advanced" tab.

Duplicate handling

Duplicate rows
  • Remove duplicate rows: Removes duplicate rows and keeps only unique and chosen rows.
  • Keep duplicate rows: Keeps all rows and appends columns with additional information about their duplication status to the input table.
Row chosen in case of duplicate
  • First: The first row in sequence is chosen.
  • Last: The last row in sequence is chosen.
  • Minimum of: The first row with the minimum value in the selected column is chosen. Strings are compared in lexicographical order. Missing values are sorted after the maximum value.
  • Maximum of: The first row with the maximum value in the selected column is chosen. Strings are compared in lexicographical order. Missing values are sorted before the minimum value.
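
The handling options above can be illustrated with a small sketch in plain Python. This is not the node's implementation; it only mirrors the described behavior under some assumptions: rows are dictionaries, a missing value is represented by None, and the function names (pick_row, classify_rows, remove_duplicates) and the status labels "unique", "chosen", and "duplicate" are hypothetical names chosen for this example.

```python
from collections import defaultdict


def pick_row(rows, indices, strategy, column=None):
    """Return the index of the chosen row within one set of duplicates."""
    if strategy == "first":
        return indices[0]
    if strategy == "last":
        return indices[-1]
    if strategy == "minimum":
        # Missing values (None) sort after the maximum value.
        # min() returns the first row when several share the minimum.
        return min(indices, key=lambda i: (rows[i][column] is None, rows[i][column]))
    if strategy == "maximum":
        # Missing values (None) sort before the minimum value.
        # max() returns the first row when several share the maximum.
        return max(indices, key=lambda i: (rows[i][column] is not None, rows[i][column]))
    raise ValueError(f"unknown strategy: {strategy}")


def classify_rows(rows, key_columns, strategy="first", column=None):
    """Return one status per row: 'unique', 'chosen', or 'duplicate'."""
    # Group row indices by their values in the duplicate-detection columns.
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[tuple(row[c] for c in key_columns)].append(index)

    status = [None] * len(rows)
    for indices in groups.values():
        if len(indices) == 1:
            status[indices[0]] = "unique"
            continue
        chosen = pick_row(rows, indices, strategy, column)
        for i in indices:
            status[i] = "chosen" if i == chosen else "duplicate"
    return status


def remove_duplicates(rows, key_columns, strategy="first", column=None):
    """Keep only unique and chosen rows, preserving the input order."""
    status = classify_rows(rows, key_columns, strategy, column)
    return [row for row, s in zip(rows, status) if s != "duplicate"]
```

Calling classify_rows corresponds roughly to "Keep duplicate rows" (the returned statuses play the role of an appended column), while remove_duplicates corresponds to "Remove duplicate rows". With "minimum" or "maximum", a row with a missing value is chosen only if every row in its set of duplicates has a missing value in the selected column.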

Performance

Compute in memory
Advanced setting. If selected, the computation is sped up by using working memory (RAM). This requires more memory than the regular computation, and the amount required also depends on the size of the input data.
Retain row order
Advanced setting. If selected, the rows in the output table are guaranteed to have the same order as in the input table.
Update domains of all columns
Advanced setting. If selected, the domains of all columns in the output table are recomputed so that the domains' bounds exactly match the bounds of the data in the output table.

Input Ports

The data table containing potential duplicates.

Output Ports

Either the input data without duplicates or the input data with additional columns identifying duplicates.

Views

This node has no views