
Frequency-Aware Anomaly Detection


URL: exorbyte GmbH https://www.exorbyte.com/en

This use case demonstrates how the Approximate String Matcher node can be used to detect potential errors or rare entries by matching the least frequent values against the most frequent ones in the same dataset.

Using approximate string matching (e.g., Levenshtein distance), we can distinguish:

  • Likely typos: low-frequency entries that closely resemble a high-frequency one

  • Rare but valid values: low-frequency entries that do not resemble any high-frequency one

  • Correct entries: high-frequency values, which are assumed to be correct
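
The three categories above can be sketched in a few lines of Python. This is only an illustration of the idea, not the node's actual implementation: the function names, the length-normalized similarity, and the 0.8 threshold are all assumptions chosen for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def label(value: str, reference: list[str], typo_threshold: float = 0.8) -> str:
    """Label a value against the high-frequency reference values.

    typo_threshold is a hypothetical cut-off; tune it for real data.
    """
    if value in reference:
        return "Correct"
    best = max((similarity(value.lower(), r.lower()) for r in reference),
               default=0.0)
    return "Likely Typo" if best >= typo_threshold else "Rare"
```

With a reference list of `["Berlin", "Hamburg", "München"]`, an entry such as `"Berlln"` is one edit away from `"Berlin"` and would be labeled a likely typo, while `"Ulm"` is far from every reference value and would be labeled rare.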

This makes it ideal for:

  • Detecting entry errors in location, product, or customer data

  • Auto-flagging suspicious or rare strings for review

  • Improving data quality in human-entered datasets

🧠 Frequency-Aware Anomaly Detection

  • CSV Reader: German cities sales data (with typos in city names)
  • Constant Value Column (deprecated): Append a constant value column (= 1)
  • GroupBy: Count the frequency of each unique city name
  • Sorter: Sort city names by frequency (most frequent first)
  • Row Filter: Keep the top N most frequent city names as the reference table (likely correct city names)
  • Term Matcher: Match the least frequent city entries against the most frequent ones (using Levenshtein)
  • Rule Engine: Label rows as “Correct”, “Likely Typo”, or “Rare” based on similarity score
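
The frequency steps of this workflow (constant column, GroupBy, Sorter, Row Filter) can be approximated outside KNIME as follows. This is a rough sketch under stated assumptions: `split_by_frequency` and `top_n` are hypothetical names, the sample city list is made up for illustration, and treating every value outside the top N as a matching candidate is an assumption about how the workflow selects its least frequent entries.

```python
from collections import Counter


def split_by_frequency(values, top_n=3):
    """Split distinct values into (reference, candidates) by frequency."""
    # Constant Value Column + GroupBy: add 1 per row, summed per city name.
    counts = Counter()
    for value in values:
        counts[value] += 1

    # Sorter: most frequent first; ties are broken by name for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

    # Row Filter: the top N names form the reference table.
    reference = [name for name, _ in ranked[:top_n]]
    # Assumption: everything else is a candidate for approximate matching.
    candidates = [name for name, _ in ranked[top_n:]]
    return reference, candidates


# Made-up sample data, not taken from the workflow's CSV file.
cities = (["Berlin"] * 5 + ["Hamburg"] * 4 + ["München"] * 3
          + ["Berlln", "Munchen", "Ulm"])
reference, candidates = split_by_frequency(cities, top_n=3)
```

Each candidate would then be compared against the reference names with an approximate string measure, and the resulting similarity score drives the final labeling step.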
