
freshest Workflow_Deduplication_of_Address_Data

Address Deduplication

The workflow finds duplicate restaurant records by searching a reference table for similar names and addresses. Similarity is measured as the mean of two 2-gram Dice distances: one on the restaurant name and one on the address.
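The distance measure described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the String Distances node: it assumes character 2-grams collected as sets, lower-cased input, and no padding, and the function names are made up for the example. The node's own handling of these details may differ.

```python
def bigrams(text):
    """Return the set of character 2-grams of a lower-cased string (assumption)."""
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_distance(a, b):
    """1 minus the Dice coefficient of the two 2-gram sets (0 = identical sets)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0  # two strings without any 2-gram are treated as identical
    return 1.0 - 2.0 * len(x & y) / (len(x) + len(y))


def mean_distance(record, reference):
    """Mean of the name distance and the address distance.

    Both arguments are (name, address) tuples.
    """
    name_d = dice_distance(record[0], reference[0])
    address_d = dice_distance(record[1], reference[1])
    return (name_d + address_d) / 2.0
```

For example, "night" and "nacht" share one of their eight 2-grams ("ht"), so their distance is 1 - 2/8 = 0.75. A pair of records with identical names and those two strings as addresses would get a mean distance of 0.375.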

Workflow annotations:

Mean distance of the 2-gram Dice distances of 'name' and 'address'.

This reference address data set is created by grouping by the class column. Usually, reference addresses are taken from an internal database.

This workflow looks for duplicates in restaurant records by similarity search.

Duplicates on the address mask. Minimum requirement: company name (Firmenname), street (Straßen), postal code (PLZ), city (Stadt) and house number (Hausnummer) should be identical.

Attempt 1: Attempt 2:

Victor Palacios: Is it sufficient to use 2 GroupBys for this task?

Node annotations (in the order they appear on the page):

Join reference address columns to columns of address data set with duplicates
2-gram Dice distance on 'name'
2-gram Dice distance on 'address'
Aggregated distance
Find most similar data row of reference address data set for each data row of address data set with duplicates
Prepend to each column: "ref"
Append original row id as column
Create unique reference address data set
Address data set with duplicates
Append original row id as column
Grouped by company name; aggregation: street, city, house number, postal code; used as reference address data set
Prepend to each column: "ref"
Aggregated distance
Join reference address columns to columns of address data set with duplicates
2-gram Dice distance on postal code
2-gram Dice distance on street
2-gram Dice distance on house number
Find most similar data row of reference address data set for each data row of address data set with duplicates
2-gram Dice distance on street name
2-gram Dice distance on house number
Find most similar data row of reference address data set for each data row of address data set with duplicates
Join reference address columns to columns of address data set with duplicates
Append original row id as column
Aggregated distance
Look for company names that match
Prepend to each column: "ref"
This generates a list of unique rows with 4 matching, reducing my list by 11 rows

Node types used: Aggregated Distance, ARFF Reader, Column Rename (Regex), Excel Reader, GroupBy, Joiner (deprecated), Missing Value, RowID, Rule Engine, Similarity Search, String Distances.
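The node annotations above describe a recurring pattern: build a unique reference set by grouping, prefix its columns with "ref", and attach the most similar reference row to every record. The sketch below mimics that pattern in plain Python under several assumptions: records are dictionaries, the first row of each group serves as the reference row, the distance is a set-based 2-gram Dice distance, and all function and key names (`build_reference`, `most_similar`, `row_id`, `distance`) are hypothetical. It is not an export of the KNIME workflow.

```python
def bigrams(text):
    """Set of character 2-grams of a lower-cased string (assumption)."""
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_distance(a, b):
    """1 minus the Dice coefficient of the two 2-gram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0
    return 1.0 - 2.0 * len(x & y) / (len(x) + len(y))


def build_reference(rows, key):
    """Keep the first row per key value, a stand-in for the GroupBy step."""
    reference = {}
    for row in rows:
        reference.setdefault(row[key], row)
    return list(reference.values())


def aggregated_distance(row, ref, fields):
    """Mean of the per-field Dice distances."""
    return sum(dice_distance(row[f], ref[f]) for f in fields) / len(fields)


def most_similar(rows, reference, fields=("name", "address")):
    """Attach the closest reference row to each record.

    Reference values are stored under "ref" + field name, the original
    position under "row_id", and the mean distance under "distance".
    """
    results = []
    for row_id, row in enumerate(rows):
        best = min(reference,
                   key=lambda ref: aggregated_distance(row, ref, fields))
        match = dict(row)
        match["row_id"] = row_id
        match["distance"] = aggregated_distance(row, best, fields)
        for f in fields:
            match["ref" + f] = best[f]
        results.append(match)
    return results
```

Calling `most_similar(rows, build_reference(rows, "class"))` on a list of records with `name`, `address` and `class` keys returns one enriched record per input row. A record with a small but non-zero `distance` is a candidate duplicate of its reference row, and the cut-off to apply (the job a Rule Engine step could do) is left to the reader.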
