Icon

kn_​example_​duplicates_​school_​of_​db_​h2

School of duplicates - and how to deal with them - H2 version

Dealing with duplicates is a constant theme with data scientist. And a lot of things can go wrong. The easienst ways to deal with them is GROUP BY or DISTINCT. Just get rid of them and be done. But as this examples might demonstrate this might not always be the best option. Even if your data provider swears your combined IDs are unique especially in Big Data scenarios there might still be lurking some muddy duplicates and you shoudl still be able to deal with them.

And you should be able to bring a messy dataset into a meaningful table with a nice unique ID without loosing too much information. And this workflow would like to encourage you to think about what to do with your duplicates and not to get caught off guard but to take control :-)



SQL's row_number() ist your friend School of duplicates - and how to deal with them - H2 versionDealing with duplicates is a constant theme with data scientist. And a lot of things can go wrong. The easienst ways to deal with them is GROUP BY or DISTINCT. Just get rid of them and be done. But as this examples might demonstrate this might not always be the best option. Even if your data provider swears your combined IDs are unique especially in Big Datascenarios there might still be lurking some muddy duplicates and you shoudl still be able to deal with them.And you should be able to bring a messy dataset into a meaningful table with a nice unique ID without loosing too much information. And this workflow would like to encourage you to think about what to do with your duplicates and not to get caught off guard but to take control :-) SELECT t2.* , CONCAT(TRIM(t2."street"), ' - ', TRIM(t2."town")) AS longest_adress_string-- mark the Top row per ID as the chosen one , CASE WHEN t2."rank_id" =1 THEN 'chosen' ELSE 'duplicate' END AS "duplicate-type-classifier"FROM -- start t2( SELECT * , ROW_NUMBER() OVER (-- set which unique ID should be kept - all others will be deleted -- this makes sure even if there is a *complete* duplicate only one row will remain (and be intact) PARTITION BY t1."ID" -- set rules which ID should remain on Top in this case the street and town adress with the longest entry ORDER BY (LENGTH(t1."street") + LENGTH(t1."town")) DESC ) AS "rank_id"FROM #table# AS t1) t2-- end t2ORDER BY t2."ID" , t2."rank_id" SELECT t2.* /* mark the Top row per ID as the chosen one */, CASE WHEN t2."rank_id" =1 THEN 'chosen' ELSE 'duplicate' END AS "duplicate_type_classifier"FROM /* start t2 */( SELECT * , ROW_NUMBER() OVER (/* set which unique ID should be kept - all others will be deleted *//* this makes sure even if there is a *complete* duplicate only one row will remain (and be intact) */ PARTITION BY t1."ID" /* set rules which ID should remain on Top in this case the one with the latest enry and then the last contact */ ORDER BY t1."entry_date" DESC NULLS LAST , t1."last_contact" DESC NULLS LAST , t1."purchase" DESC NULLS LAST ) AS "rank_id"FROM #table# AS t1) t2/* end t2 */ORDER BY t2."ID" , t2."rank_id" alternative: SQL Executor withDROP TABLE IF EXISTS default.data_all;the Magic Queryto find the latest entry$duplicate-type-classifier$ = "chosen" => TRUEthe Magic Querykeep the longest adress stringresult of row_id Magic$duplicate-type-classifier$ = "chosen" => TRUEthe longestadressdata_alldata_allkeep the latest entrydata.xlsxsql_01extract the SQL codewith which to create an emptytableSELECT * FROM #table# AS t1WHERE 1=2create empty tablevia SQL stringdata_all_deduplicate_01knime://knime.workflow/data/database.h2=> make sure to use the latest driverversion >=1.4.200data_all_deduplicate_01data_all_deduplicate_01APPENDdata_all_deduplicate_01data_all_deduplicate_01h2_version1.4.1961.4.2002.1.212data_allcreate empty table DB Table Remover DB Query Rule-basedRow Filter DB Query DB Reader Rule-basedRow Filter Column Filter DB Loader DB Table Selector Column Filter Excel Reader String to Date&Time DB Query Extractor DB Query DB SQL Executor DB Reader H2 Connector DB Table Selector DB Reader DB ConnectionTable Writer DB Reader DB Query Reader DB Table Creator SQL's row_number() ist your friend School of duplicates - and how to deal with them - H2 versionDealing with duplicates is a constant theme with data scientist. And a lot of things can go wrong. The easienst ways to deal with them is GROUP BY or DISTINCT. Just get rid of them and be done. But as this examples might demonstrate this might not always be the best option. Even if your data provider swears your combined IDs are unique especially in Big Datascenarios there might still be lurking some muddy duplicates and you shoudl still be able to deal with them.And you should be able to bring a messy dataset into a meaningful table with a nice unique ID without loosing too much information. And this workflow would like to encourage you to think about what to do with your duplicates and not to get caught off guard but to take control :-) SELECT t2.* , CONCAT(TRIM(t2."street"), ' - ', TRIM(t2."town")) AS longest_adress_string-- mark the Top row per ID as the chosen one , CASE WHEN t2."rank_id" =1 THEN 'chosen' ELSE 'duplicate' END AS "duplicate-type-classifier"FROM -- start t2( SELECT * , ROW_NUMBER() OVER (-- set which unique ID should be kept - all others will be deleted -- this makes sure even if there is a *complete* duplicate only one row will remain (and be intact) PARTITION BY t1."ID" -- set rules which ID should remain on Top in this case the street and town adress with the longest entry ORDER BY (LENGTH(t1."street") + LENGTH(t1."town")) DESC ) AS "rank_id"FROM #table# AS t1) t2-- end t2ORDER BY t2."ID" , t2."rank_id" SELECT t2.* /* mark the Top row per ID as the chosen one */, CASE WHEN t2."rank_id" =1 THEN 'chosen' ELSE 'duplicate' END AS "duplicate_type_classifier"FROM /* start t2 */( SELECT * , ROW_NUMBER() OVER (/* set which unique ID should be kept - all others will be deleted *//* this makes sure even if there is a *complete* duplicate only one row will remain (and be intact) */ PARTITION BY t1."ID" /* set rules which ID should remain on Top in this case the one with the latest enry and then the last contact */ ORDER BY t1."entry_date" DESC NULLS LAST , t1."last_contact" DESC NULLS LAST , t1."purchase" DESC NULLS LAST ) AS "rank_id"FROM #table# AS t1) t2/* end t2 */ORDER BY t2."ID" , t2."rank_id" alternative: SQL Executor withDROP TABLE IF EXISTS default.data_all;the Magic Queryto find the latest entry$duplicate-type-classifier$ = "chosen" => TRUEthe Magic Querykeep the longest adress stringresult of row_id Magic$duplicate-type-classifier$ = "chosen" => TRUEthe longestadressdata_alldata_allkeep the latest entrydata.xlsxsql_01extract the SQL codewith which to create an emptytableSELECT * FROM #table# AS t1WHERE 1=2create empty tablevia SQL stringdata_all_deduplicate_01knime://knime.workflow/data/database.h2=> make sure to use the latest driverversion >=1.4.200data_all_deduplicate_01data_all_deduplicate_01APPENDdata_all_deduplicate_01data_all_deduplicate_01h2_version1.4.1961.4.2002.1.212data_allcreate empty tableDB Table Remover DB Query Rule-basedRow Filter DB Query DB Reader Rule-basedRow Filter Column Filter DB Loader DB Table Selector Column Filter Excel Reader String to Date&Time DB Query Extractor DB Query DB SQL Executor DB Reader H2 Connector DB Table Selector DB Reader DB ConnectionTable Writer DB Reader DB Query Reader DB Table Creator

Nodes

Extensions

Links