Duplicate Row Filter

KNIME Base Nodes version 4.1.3.v202005112252 by KNIME AG, Zurich, Switzerland

This node identifies duplicate rows. Duplicate rows have identical values in certain columns. The node chooses a single row for each set of duplicates ("chosen"). You can either remove all duplicate rows from the input table and keep only unique and chosen rows or mark the rows with additional information about their duplication status.

Options

Choose columns for duplicates detection
Allows the selection of columns identifying the duplicates. Columns not selected are handled under "Row selection" in the "Advanced" tab.

Advanced - Duplicate Rows

Remove duplicates
Removes duplicate rows and keeps only unique and chosen rows.
Keep duplicate rows
Appends columns with additional information to the input table:
  • Add classification column: Appends a column that describes the type of row:
    unique: There is no other row with the same values in the selected columns.
    chosen: This row was chosen from a set of duplicate rows.
    duplicate: This row is a duplicate and represented by another row.
  • Add ROWID column: Appends a column with the ROWID of the chosen row for duplicate rows. Unique and chosen rows will not have a ROWID assigned.
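
The classification described above can be sketched in plain Python. This is only an illustration of the unique / chosen / duplicate logic, not the node's actual implementation. The function name `classify_rows`, the `(row_id, values)` input shape, and the use of the "First" row as the chosen one are assumptions made for this example.

```python
from collections import defaultdict

def classify_rows(rows, key_columns):
    """Classify rows as unique, chosen, or duplicate.

    rows: list of (row_id, values) pairs, where values is a dict (hypothetical shape).
    Returns a list of (row_id, status, chosen_row_id) in input order.
    Assumption: the first row of each duplicate set is the chosen one.
    """
    # Group ROWIDs by the values in the selected columns.
    groups = defaultdict(list)
    for row_id, values in rows:
        key = tuple(values[c] for c in key_columns)
        groups[key].append(row_id)

    result = []
    for row_id, values in rows:
        members = groups[tuple(values[c] for c in key_columns)]
        if len(members) == 1:
            # No other row shares these values.
            result.append((row_id, "unique", None))
        elif members[0] == row_id:
            # Chosen rows get no ROWID reference, as described above.
            result.append((row_id, "chosen", None))
        else:
            # Duplicates point to the ROWID of their chosen row.
            result.append((row_id, "duplicate", members[0]))
    return result

rows = [
    ("Row0", {"city": "Zurich", "year": 2019}),
    ("Row1", {"city": "Zurich", "year": 2020}),
    ("Row2", {"city": "Berlin", "year": 2020}),
]
statuses = classify_rows(rows, ["city"])
```

Here `Row0` and `Row1` share the same `city`, so one is chosen and the other is marked as its duplicate, while `Row2` is unique.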

Advanced - Row selection

Select row:
Defines which row for each set of duplicates is selected:
  • First: The first row in sequence is chosen.
  • Last: The last row in sequence is chosen.
  • Minimum of: The first row with the minimum value in the selected column is chosen. In case of strings, the row will be chosen following lexicographical order. Missing values are sorted after the maximum value.
  • Maximum of: The first row with the maximum value in the selected column is chosen. In case of strings, the row will be chosen following lexicographical order. Missing values are sorted before the minimum value.
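
The four selection strategies can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the node's code: `choose_row` and its parameters are hypothetical names, `None` stands in for a missing value, and falling back to the first row when every value is missing is an assumption of this sketch.

```python
def choose_row(group, strategy, column=None):
    """Pick the chosen ROWID from one set of duplicates.

    group: list of (row_id, values) pairs in input order (hypothetical shape).
    strategy: "first", "last", "minimum", or "maximum".
    column: the column compared for "minimum" and "maximum".
    """
    if strategy == "first":
        return group[0][0]
    if strategy == "last":
        return group[-1][0]

    # Missing values sort after the maximum (for "minimum") or before the
    # minimum (for "maximum"), so they can only win if nothing else exists.
    present = [(row_id, values[column]) for row_id, values in group
               if values[column] is not None]
    if not present:
        return group[0][0]  # assumption: all values missing, take the first row

    # Python's min() and max() both return the first item among ties,
    # which matches "the first row with the minimum/maximum value".
    if strategy == "minimum":
        return min(present, key=lambda pair: pair[1])[0]
    if strategy == "maximum":
        return max(present, key=lambda pair: pair[1])[0]
    raise ValueError(f"unknown strategy: {strategy}")

group = [
    ("Row0", {"score": 5}),
    ("Row1", {"score": None}),
    ("Row2", {"score": 2}),
    ("Row3", {"score": 2}),
    ("Row4", {"score": 5}),
]
picked = {s: choose_row(group, s, "score")
          for s in ("first", "last", "minimum", "maximum")}
```

String columns work the same way, because Python compares strings lexicographically by code point.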

Advanced - Additional options

In-memory computation
If selected, computation is sped up by utilizing working memory (RAM). The amount of required memory is higher than for a regular computation and also depends on the amount of input data.
Retain row order
If selected, rows in the output table are sorted in the same order as in the input table.
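
To illustrate what retaining row order means when duplicates are removed, here is a small Python sketch. It is an assumption-laden illustration only: `remove_duplicates` and its parameters are hypothetical, and the order produced when `retain_order` is off is simply an artifact of this sketch, since the description above does not specify the node's output order in that case.

```python
def remove_duplicates(rows, key_columns, keep="first", retain_order=True):
    """Return the ROWIDs of unique and chosen rows.

    rows: list of (row_id, values) pairs in input order (hypothetical shape).
    keep: "first" or "last" row of each duplicate set.
    retain_order: if True, output follows the input order of the kept rows.
    """
    chosen = {}  # key -> (input position, row_id)
    for position, (row_id, values) in enumerate(rows):
        key = tuple(values[c] for c in key_columns)
        if keep == "last" or key not in chosen:
            chosen[key] = (position, row_id)

    picked = list(chosen.values())
    if retain_order:
        # Sort the kept rows by their original position in the input.
        picked.sort()
    return [row_id for _, row_id in picked]

rows = [
    ("Row0", {"city": "Zurich"}),
    ("Row1", {"city": "Berlin"}),
    ("Row2", {"city": "Zurich"}),
]
ordered = remove_duplicates(rows, ["city"], keep="last", retain_order=True)
unordered = remove_duplicates(rows, ["city"], keep="last", retain_order=False)
```

With `keep="last"`, the kept Zurich row is `Row2`. Retaining the order places it after `Row1`, as in the input, whereas the unsorted variant of this sketch lists it first.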

Input Ports

The data table containing potential duplicates.

Output Ports

Either the input data without duplicates or the input data with additional columns identifying duplicates.

Installation

To use this node in KNIME, install KNIME Core from the following update site:

KNIME 4.1
