01_Deduplication_of_Address_Data

Address Deduplication

The workflow looks for duplicate restaurant records by searching a reference table for similar names and addresses. Similarity is measured as the mean of two 2-gram Dice distances: one computed on the restaurant name and one on the address.
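To make the distance concrete, here is a minimal Python sketch of a mean 2-gram Dice distance over name and address. It is an illustration under stated assumptions, not the KNIME node implementation: the function names are hypothetical, the 2-grams are treated as a set (repeats ignored), and strings are lowercased before comparison. The actual String Distances node may handle these details differently.

```python
def bigrams(text: str) -> set[str]:
    """Return the set of character 2-grams of a lowercased string (assumption: set, not multiset)."""
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_distance(a: str, b: str) -> float:
    """2-gram Dice distance: 1 - 2*|A & B| / (|A| + |B|), in the range [0, 1]."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Both strings are too short to have any 2-gram; treat them as identical.
        return 0.0
    return 1.0 - 2.0 * len(x & y) / (len(x) + len(y))


def mean_distance(name_a: str, address_a: str, name_b: str, address_b: str) -> float:
    """Mean of the Dice distances on 'name' and on 'address'."""
    return (dice_distance(name_a, name_b) + dice_distance(address_a, address_b)) / 2.0
```

A distance of 0 means the two records share all their 2-grams in both fields; a distance of 1 means they share none.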

Workflow annotations:

This workflow looks for duplicates in restaurant records by similarity search.
The reference address data set is created by grouping by the class column. Usually reference addresses are taken from an internal database.
The distance used is the mean of the 2-gram Dice distances of 'name' and 'address'.

Workflow nodes and their labels:

ARFF Reader: address data set with duplicates
GroupBy: create unique reference address data set
RowID: append original row ID as column
Column Rename (Regex): prepend "ref" to each column
String Distances: 2-gram Dice distance on 'name'
String Distances: 2-gram Dice distance on 'address'
Aggregated Distance: aggregated distance
Similarity Search: find the most similar data row of the reference address data set for each data row of the address data set with duplicates
Joiner: join reference address columns to columns of the address data set with duplicates
Rule Engine
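The similarity search step can be sketched in Python as follows. This is a rough illustration, not a port of the workflow: the `Record` type, the helper names, and the sample restaurants are all hypothetical, and the distance is the set-based 2-gram Dice variant assumed for this sketch.

```python
from typing import NamedTuple


class Record(NamedTuple):
    name: str
    address: str


def _bigrams(text: str) -> set[str]:
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def _dice(a: str, b: str) -> float:
    x, y = _bigrams(a), _bigrams(b)
    if not x and not y:
        return 0.0
    return 1.0 - 2.0 * len(x & y) / (len(x) + len(y))


def distance(r: Record, q: Record) -> float:
    """Mean of the 2-gram Dice distances on name and address."""
    return (_dice(r.name, q.name) + _dice(r.address, q.address)) / 2.0


def most_similar(record: Record, references: list[Record]) -> tuple[int, float]:
    """Return (index, distance) of the closest reference row.

    Assumes `references` is non-empty; min() raises ValueError otherwise.
    """
    best = min(range(len(references)), key=lambda i: distance(record, references[i]))
    return best, distance(record, references[best])


# Hypothetical sample data, made up for illustration only.
references = [
    Record("Golden Dragon", "12 Main Street"),
    Record("Cafe Luna", "48 Oak Avenue"),
]
records = [
    Record("Golden Dragon Restaurant", "12 Main St."),
    Record("Caffe Luna", "48 Oak Ave"),
]

# For each possibly duplicated record, look up its nearest reference row.
matches = [most_similar(r, references) for r in records]
```

Each entry in `matches` pairs a record with the index of its nearest reference row and the distance to it, which corresponds to the columns the workflow joins back onto the original data. Deciding which distance counts as a duplicate is a separate choice that this sketch leaves open.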