Icon

DB SQL - H2 School of duplicates - and how to deal with them

<p>School of duplicates - and how to deal with them - H2 version</p><p><br>Dealing with duplicates is a constant theme for data scientists, and a lot of things can go wrong. The easiest way to deal with them is GROUP BY or DISTINCT: just get rid of them and be done. But as these examples might demonstrate, this is not always the best option. Even if your data provider swears your combined IDs are unique, especially in Big Data scenarios there might still be some muddy duplicates lurking, and you should be able to deal with them. You should also be able to bring a messy dataset into a meaningful table with a nice unique ID without losing too much information. This workflow would like to encourage you to think about what to do with your duplicates - do not get caught off guard, take control :-)</p>

URL: long forum debate about duplicates https://forum.knime.com/t/remove-rows-with-duplicate-values/11105/15?u=mlauber71
URL: School of duplicates - and how to deal with them (corresponding article) https://forum.knime.com/t/school-of-duplicates-and-how-to-deal-with-them/24164?u=mlauber71
URL: Window functions with new DB drivers https://forum.knime.com/t/sqlite-and-window-functions/31608/4?u=mlauber71
URL: A meta collection of KNIME and databases (SQL, Big Data/Hive/Impala and Spark/PySpark) https://hub.knime.com/mlauber71/spaces/Public/latest/_db_sql_bigdata_hive_spark_meta_collection?u=mlauber71
URL: Example how to use H2 database to create table with upload and from scratch https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_db_h2_create_table_from_scratch?u=mlauber71
URL: Medium: KNIME, Databases and SQL https://medium.com/low-code-for-advanced-data-science/knime-databases-and-sql-273e27c9702a
URL: Create a Rank variable with leading zeros - KNIME Forum (77507) https://forum.knime.com/t/creating-unique-item-codes-on-a-running-file/77507/5?u=mlauber71
URL: A pragmatic approach to multiple duplicates by takbb https://forum.knime.com/t/duplicate-filter-with-more-conditions/90237/3?u=mlauber71

SQL's row_number() is your friend
DB SQL - H2 School of duplicates - and how to deal with them


Dealing with duplicates is a constant theme for data scientists, and a lot of things can go wrong. The easiest way to deal with them is GROUP BY or DISTINCT: just get rid of them and be done. But as these examples might demonstrate, this is not always the best option. Even if your data provider swears your combined IDs are unique, especially in Big Data scenarios there might still be some muddy duplicates lurking, and you should be able to deal with them.

You should also be able to bring a messy dataset into a meaningful table with a nice unique ID without losing too much information. This workflow would like to encourage you to think about what to do with your duplicates - do not get caught off guard, take control :-)

SELECT t2.*
, CONCAT(TRIM(t2."street"), ' - ', TRIM(t2."town")) AS longest_address_string
-- mark the Top row per ID as the chosen one
, CASE WHEN t2."rank_id" = 1 THEN 'chosen'
ELSE 'duplicate'
END AS "duplicate-type-classifier"
FROM

-- start t2
( SELECT *

, ROW_NUMBER() OVER (
-- set which unique ID should be kept - all others will be deleted
-- this makes sure even if there is a *complete* duplicate only one row will remain (and be intact)
PARTITION BY t1."ID"
-- set rules for which row should remain on top - in this case the one with the longest combined street and town address
ORDER BY (LENGTH(t1."street") + LENGTH(t1."town")) DESC
) AS "rank_id"

FROM #table# AS t1
) t2
-- end t2

ORDER BY t2."ID"
, t2."rank_id"
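To see the ROW_NUMBER() pattern in action outside of KNIME, here is a minimal sketch using SQLite through Python's sqlite3 module instead of H2 (SQLite supports window functions from version 3.25 on). The table name and sample rows are made up for illustration; the query mirrors the one above, keeping the row with the longest combined street and town address per ID.

```python
import sqlite3

# In-memory SQLite table standing in for the H2 table; sample data is invented
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data_all (ID TEXT, street TEXT, town TEXT)")
con.executemany("INSERT INTO data_all VALUES (?, ?, ?)", [
    ("A1", "Main St",        "Springfield"),
    ("A1", "Main Street 42", "Springfield"),   # longest address -> should be 'chosen'
    ("B2", "Oak Ave",        "Shelbyville"),
])

rows = con.execute("""
    SELECT t2.*,
           -- mark the top row per ID as the chosen one
           CASE WHEN t2.rank_id = 1 THEN 'chosen' ELSE 'duplicate' END AS duplicate_type
    FROM ( SELECT *,
                  ROW_NUMBER() OVER (
                      -- one rank sequence per ID - even a *complete* duplicate gets rank 2
                      PARTITION BY ID
                      -- longest street + town combination wins
                      ORDER BY (LENGTH(street) + LENGTH(town)) DESC
                  ) AS rank_id
           FROM data_all ) t2
    ORDER BY t2.ID, t2.rank_id
""").fetchall()

for r in rows:
    print(r)
```

Filtering afterwards on `duplicate_type = 'chosen'` (as the Rule-based Row Filter does in the workflow) leaves exactly one row per ID.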

SELECT t2.*

/* mark the Top row per ID as the chosen one */
, CASE WHEN t2."rank_id" = 1 THEN 'chosen'
ELSE 'duplicate'
END AS "duplicate_type_classifier"
FROM

/* start t2 */
( SELECT *

, ROW_NUMBER() OVER (
/* set which unique ID should be kept - all others will be deleted */
/* this makes sure even if there is a *complete* duplicate only one row will remain (and be intact) */
PARTITION BY t1."ID"
/* set rules for which row should remain on top - in this case the one with the latest entry and then the last contact */
ORDER BY t1."entry_date" DESC NULLS LAST
, t1."last_contact" DESC NULLS LAST
, t1."purchase" DESC NULLS LAST
) AS "rank_id"

FROM #table# AS t1
) t2
/* end t2 */

ORDER BY t2."ID"
, t2."rank_id"
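The same sketch works for the "keep the latest entry" variant. SQLite only understands `NULLS LAST` from version 3.30 on, so the snippet below emulates `DESC NULLS LAST` with the portable `column IS NULL, column DESC` trick; table name and sample rows are again invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data_all (ID TEXT, entry_date TEXT, last_contact TEXT)")
con.executemany("INSERT INTO data_all VALUES (?, ?, ?)", [
    ("A1", "2021-03-01", "2021-04-01"),
    ("A1", "2021-05-15", None),          # latest entry_date -> should be 'chosen'
    ("A1", None,         "2021-06-01"),  # NULL entry_date sorts last
])

rows = con.execute("""
    SELECT t2.*,
           CASE WHEN t2.rank_id = 1 THEN 'chosen' ELSE 'duplicate' END AS duplicate_type
    FROM ( SELECT *,
                  ROW_NUMBER() OVER (
                      PARTITION BY ID
                      -- "x IS NULL, x DESC" emulates DESC NULLS LAST in SQLite
                      ORDER BY entry_date IS NULL, entry_date DESC,
                               last_contact IS NULL, last_contact DESC
                  ) AS rank_id
           FROM data_all ) t2
    ORDER BY t2.ID, t2.rank_id
""").fetchall()
```

The row with the most recent entry_date gets rank 1; rows with NULL dates fall to the bottom instead of floating to the top, which is exactly what `NULLS LAST` achieves in the H2 query above.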

DB Reader
sql_01 extract the SQL code with which to create an empty table
DB Query Extractor
the Magic Query to find the latest entry
DB Query
data_all_deduplicate_01
DB Table Selector
SELECT * FROM #table# AS t1 WHERE 1=2
DB Query
create empty table via SQL string data_all_deduplicate_01
DB SQL Executor
the Magic Query to keep the longest address string
DB Query
result of row_id Magic
DB Reader
knime://knime.workflow/data/database.h2 => make sure to use the latest driver version >= 1.4.200
H2 Connector
$duplicate-type-classifier$ = "chosen" => TRUE
Rule-based Row Filter
APPEND data_all_deduplicate_01
DB Writer (DB Data)
data_all_deduplicate_01
DB Reader
$duplicate-type-classifier$ = "chosen" => TRUE
Rule-based Row Filter
data_all_deduplicate_01
DB Reader
create a rank per group
Rank
data.xlsx
Excel Reader
length_street_name
Expression
h2_version 1.4.196 / 1.4.200 / 2.1.212 / 2.2.224
DB Query Reader
data_all create empty table
DB Table Creator
keep the latest entry
Column Filter
data_all
DB Table Selector
String to Date&Time
alternative: SQL Executor with DROP TABLE IF EXISTS default.data_all;
DB Table Deleter
Sorter
the longest address
Column Filter
data_all
DB Loader

Nodes

Extensions

Links