
kn_example_duplicates_school_of

School of duplicates - and how to deal with them

Dealing with duplicates is a constant theme for data scientists, and a lot of things can go wrong. The easiest way to deal with them is GROUP BY or DISTINCT: just get rid of them and be done. But as these examples demonstrate, that might not always be the best option. Even if your data provider swears your combined IDs are unique, especially in Big Data scenarios there might still be some muddy duplicates lurking, and you should still be able to deal with them.

And you should be able to bring a messy dataset into a meaningful table with a nice unique ID without losing too much information. This workflow would like to encourage you to think about what to do with your duplicates: don't get caught off guard, take control :-)


SQL's row_number() is your friend. The first query keeps the row with the latest entry per ID:

SELECT `t2`.*
-- mark the top row per ID as the chosen one
, CASE WHEN `t2`.`rank_id` = 1 THEN 'chosen' ELSE 'duplicate' END AS `duplicate-type-classifier`
FROM
-- start t2
(
  SELECT *
  -- you could also derive a FIRST_VALUE or LAST_VALUE within your group and add that to the table
  , FIRST_VALUE(`t1`.`last_contact`) OVER (PARTITION BY `t1`.`ID` ORDER BY `t1`.`last_contact` DESC) AS very_last_contact
  , ROW_NUMBER() OVER (
      -- set which unique ID should be kept - all others will be deleted
      -- this makes sure that even if there is a *complete* duplicate, only one row will remain (and be intact)
      PARTITION BY `t1`.`ID`
      -- set rules for which row should remain on top, in this case the one with the latest entry and then the last contact
      ORDER BY `t1`.`entry_date` DESC, `t1`.`last_contact` DESC
    ) AS `rank_id`
  FROM #table# AS `t1`
) `t2`
-- end t2
ORDER BY `t2`.`ID`, `t2`.`rank_id`

The same trick keeps the row with the longest address string per ID:

SELECT `t2`.*
, CONCAT(TRIM(`t2`.`street`), " - ", TRIM(`t2`.`town`)) AS longest_adress_string
-- mark the top row per ID as the chosen one
, CASE WHEN `t2`.`rank_id` = 1 THEN 'chosen' ELSE 'duplicate' END AS `duplicate-type-classifier`
FROM
-- start t2
(
  SELECT *
  , ROW_NUMBER() OVER (
      -- set which unique ID should be kept - all others will be deleted
      -- this makes sure that even if there is a *complete* duplicate, only one row will remain (and be intact)
      PARTITION BY `t1`.`ID`
      -- set rules for which row should remain on top, in this case the street and town address with the longest entry
      ORDER BY (LENGTH(`t1`.`street`) + LENGTH(`t1`.`town`)) DESC
    ) AS `rank_id`
  FROM #table# AS `t1`
) `t2`
-- end t2
ORDER BY `t2`.`ID`, `t2`.`rank_id`
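The ROW_NUMBER() trick from the queries above can be tried outside of Hive as well. Here is a minimal sketch using Python's built-in sqlite3 module (SQLite 3.25+ supports window functions); the table name t1 and the columns follow the query, but the sample rows are made up for illustration:

```python
import sqlite3

# Minimal sketch of the ROW_NUMBER()/FIRST_VALUE() deduplication from the
# query above, run on SQLite instead of Hive; the sample rows are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t1 (ID TEXT, entry_date TEXT, last_contact TEXT)")
con.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?)",
    [
        ("A", "2021-01-01", "2021-03-01"),
        ("A", "2021-02-01", "2021-02-15"),  # latest entry_date -> 'chosen'
        ("B", "2021-01-10", "2021-01-10"),
    ],
)

rows = con.execute("""
    SELECT t2.*,
           CASE WHEN t2.rank_id = 1 THEN 'chosen' ELSE 'duplicate' END AS duplicate_type
    FROM (
        SELECT *,
               -- the very last contact within the group, added to every row
               FIRST_VALUE(last_contact) OVER
                   (PARTITION BY ID ORDER BY last_contact DESC) AS very_last_contact,
               -- rank rows so that even complete duplicates get distinct ranks
               ROW_NUMBER() OVER
                   (PARTITION BY ID ORDER BY entry_date DESC, last_contact DESC) AS rank_id
        FROM t1
    ) t2
    ORDER BY t2.ID, t2.rank_id
""").fetchall()

for r in rows:
    print(r)
# The rank 1 row per ID is 'chosen'; the rest are marked 'duplicate'
# and can be filtered out in a second step.
```

The second step would simply filter on duplicate_type = 'chosen', which is exactly what the Rule-based Row Filter does in the workflow.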
The workflow demonstrates several approaches (the annotations from the workflow canvas, cleaned up):

- The simplest form of duplicate removal: GROUP BY a single ID with functions like sum, min and max - what's not to like?
- Take it up a notch: aggregate meaningful information into one single line per ID while keeping the information from several sub-rows (for example, keep the highest purchase).
- Use KNIME's Duplicate Row Filter (much like the row_number() trick described above); in the advanced settings you can choose which line to keep based on another variable.
- After the "magic query" has marked the rows, keep only the chosen ones with a Rule-based Row Filter: $duplicate-type-classifier$ = "chosen" => TRUE
- One magic query finds the latest entry per ID, another keeps the longest address string per ID.
- COLLECT_SET on Second_key makes sure each item will appear once. Aggregate all purchases to ID, and to ID and Second_key; delete empty Second_keys and bring the string together as list_sum_of_purchases.
- Join the partial results back together per ID: purchase statistics with the latest entry, the highest purchase ever, the longest address entry, all Second_keys ever recorded, and the list_sum_of_purchases of all 2nd keys.
- When creating the Hive table data_all you can set a partitioning command on the "Additional Options" tab: PARTITIONED BY (education STRING). Alternative: an SQL Executor with DROP TABLE IF EXISTS default.data_all;
- The workflow loads data/data.xlsx into a Hive table in a local big data context and writes the result to data/result.xlsx.
- If you encounter any problems, close KNIME and delete all data from the folder /big_data/ (this removes the whole local big data context).
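The aggregation route (one line per ID with the highest purchase, plus a COLLECT_SET of the second keys) can be sketched the same way. SQLite has no COLLECT_SET, so GROUP_CONCAT(DISTINCT ...) stands in for it here; the table data_all matches the workflow's table name, but the columns and sample values are illustrative assumptions:

```python
import sqlite3

# Sketch of "aggregate meaningful information into one single line per ID":
# MAX() keeps the highest purchase, SUM() the total, and
# GROUP_CONCAT(DISTINCT ...) stands in for Hive's COLLECT_SET so that
# each Second_key appears only once. Sample data is made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data_all (ID TEXT, Second_key TEXT, purchase REAL)")
con.executemany(
    "INSERT INTO data_all VALUES (?, ?, ?)",
    [
        ("A", "x", 10.0),
        ("A", "y", 25.0),
        ("A", "x", 5.0),
        ("B", "z", 7.5),
    ],
)

rows = con.execute("""
    SELECT ID,
           MAX(purchase)                     AS highest_purchase,
           SUM(purchase)                     AS sum_of_purchases,
           GROUP_CONCAT(DISTINCT Second_key) AS all_second_keys
    FROM data_all
    GROUP BY ID
    ORDER BY ID
""").fetchall()

print(rows)
# One row per ID; 'A' collapses from three rows into one while the
# highest purchase, the total and the set of second keys survive.
```

Unlike a plain DISTINCT, this collapse decides explicitly which information from the sub-rows is worth keeping, which is the point the workflow makes.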

Nodes

Extensions

Links