Icon

kn_​example_​duplicates

Remove rows with duplicate values

I set up a workflow to demonstrate how this could be done

- use group by to calculate how many duplicates there are (note: KNIME should introduce a generic COUNT(*) function - I had to use a variable)
- if the count is larger then 1 it is a duplicate
- left join it back to the original data
- sort the data by ID and other variables if you want to keep one of the duplicates
- use the LAG column to identify which line is a 2nd, 3rd occurrence of a duplicate
- make a rule to keep just a single line of each ID
- alternative: just remove all duplicates

enter a rule which doubleshould stay I set up a workflow to demonstrate how this could be done- use group by to calculate how many duplicates there are (note: KNIME shouldintroduce a generic COUNT(*) function - I had to use a variable)- if the count is larger then 1 it is a duplicate- left join it back to the original data- sort the data by ID and other variables if you want to keep one of the duplicates- use the LAG column to identify which line is a 2nd, 3rd occurrence of a duplicate- make a rule to keep just a single line of each ID- alternative: just remove all duplicates Remove rows with duplicate values original datacount no of IDsno_idsis_doubleLEFT join togetherID with youngest year on topLAG the is_doubleto_keepremove duplicate rowsbut keep one of themremove ALL doubles Table Creator GroupBy Column Rename Rule Engine Joiner Sorter Lag Column Rule Engine Rule-basedRow Filter Column Filter Rule-basedRow Filter enter a rule which doubleshould stay I set up a workflow to demonstrate how this could be done- use group by to calculate how many duplicates there are (note: KNIME shouldintroduce a generic COUNT(*) function - I had to use a variable)- if the count is larger then 1 it is a duplicate- left join it back to the original data- sort the data by ID and other variables if you want to keep one of the duplicates- use the LAG column to identify which line is a 2nd, 3rd occurrence of a duplicate- make a rule to keep just a single line of each ID- alternative: just remove all duplicates Remove rows with duplicate values original datacount no of IDsno_idsis_doubleLEFT join togetherID with youngest year on topLAG the is_doubleto_keepremove duplicate rowsbut keep one of themremove ALL doubles Table Creator GroupBy Column Rename Rule Engine Joiner Sorter Lag Column Rule Engine Rule-basedRow Filter Column Filter Rule-basedRow Filter

Nodes

Extensions

Links