
Dubletten Adressen_MZ

Address Deduplication

The workflow looks for duplicate restaurant records by searching a reference table for similar names and addresses. Similarity is measured as the mean of the 2-gram Dice distances of the restaurant name and the address, so a smaller mean distance indicates a closer match.
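The distance measure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of the workflow's String Distances node: lowercasing, character bigrams counted as a multiset, and a distance of 0.0 for two strings without any bigrams are choices made here, and the function names are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Return the character 2-grams of a string (lowercased by assumption)."""
    s = text.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice_distance(a, b):
    """2-gram Dice distance: 1 - 2 * |shared bigrams| / (|bigrams a| + |bigrams b|)."""
    ga, gb = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        # Assumption: two strings too short to have bigrams count as identical.
        return 0.0
    shared = sum((ga & gb).values())  # multiset intersection
    return 1.0 - 2.0 * shared / total


def mean_distance(record, reference):
    """Mean of the Dice distances on the 'name' and 'address' fields."""
    d_name = dice_distance(record["name"], reference["name"])
    d_address = dice_distance(record["address"], reference["address"])
    return (d_name + d_address) / 2.0
```

Identical strings yield a distance of 0 and strings with no shared bigram yield 1, so the mean over both fields also stays between 0 and 1.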

Workflow annotations

This workflow looks for duplicates in restaurant records by similarity search. The reference address data set is created by grouping by the class column. Usually reference addresses are taken from an internal database. The distance used is the mean of the 2-gram Dice distances of 'name' and 'address'.

Duplicates on the address mask: as a minimum requirement, company name (Firmenname), street (Straßen), postal code (PLZ), city (Stadt) and house number (Hausnummer) should be identical.

Node labels, restaurant branch

Address data set with duplicates. Append original row ID as column. Create unique reference address data set. Prepend "ref" to each column name. Join reference address columns to columns of address data set with duplicates. 2-gram Dice distance on 'name'. 2-gram Dice distance on 'address'. Aggregated distance. Find the most similar data row of the reference address data set for each data row of the address data set with duplicates.

Node labels, company address branch

Grouped by company name (Firmenname), aggregating street (Straßen), city (Stadt), house number (HausNr.) and postal code (PLZ), as reference address data set. Append original row ID as column. Prepend "ref" to each column name. Join reference address columns to columns of address data set with duplicates. 2-gram Dice distance on PLZ. 2-gram Dice distance on Straßen. 2-gram Dice distance on Hausnummer. Aggregated distance. Find the most similar data row of the reference address data set for each data row of the address data set with duplicates. Three nodes keep their default labels: Node 56, Node 74 and Node 75.

Node types used

ARFF Reader, Excel Reader, RowID (2), GroupBy (2), Column Rename (Regex) (2), Joiner (deprecated) (2), Rule Engine (2), String Distances (5), Aggregated Distance (2), Similarity Search (2), Missing Value (2).
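The two steps that both branches share, building a unique reference set by grouping and then finding the most similar reference row for each record, can be sketched as follows. This is a rough stand-in written under assumptions, not the behaviour of the GroupBy or Similarity Search nodes: keeping the first row per group, the brute-force search, the helper names and the sample rows are all invented for illustration.

```python
from collections import Counter


def dice_distance(a, b):
    """2-gram Dice distance on lowercased character bigrams (assumed variant)."""
    ga = Counter(a.lower()[i:i + 2] for i in range(len(a) - 1))
    gb = Counter(b.lower()[i:i + 2] for i in range(len(b) - 1))
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 0.0
    return 1.0 - 2.0 * sum((ga & gb).values()) / total


def build_reference(rows, key):
    """Keep the first row seen for each value of `key` (stand-in for grouping)."""
    reference = {}
    for row in rows:
        reference.setdefault(row[key], row)
    return list(reference.values())


def most_similar(row, reference, fields):
    """Return the reference row with the smallest mean distance, and that distance."""
    def mean_dist(ref):
        return sum(dice_distance(row[f], ref[f]) for f in fields) / len(fields)

    best = min(reference, key=mean_dist)
    return best, mean_dist(best)


# Invented sample rows; 'class' marks records that describe the same place.
rows = [
    {"class": "1", "name": "Cafe Luna", "address": "1 Main Street"},
    {"class": "1", "name": "Caffe Luna", "address": "1 Main St."},
    {"class": "2", "name": "Blue Door Diner", "address": "77 Harbor Road"},
]
reference = build_reference(rows, "class")
for row in rows:
    match, distance = most_similar(row, reference, ["name", "address"])
    print(row["name"], "->", match["name"], round(distance, 3))
```

For the company address branch, the same functions would be called with the postal code, street and house number columns as `fields`. A threshold on the returned distance, which is not specified in the source, would then decide whether a pair counts as a duplicate.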
