
Frequency-Aware Anomaly Detection with Single-Field Indexer

<p>This use case shows how <strong>M|Box Indexing Nodes</strong> can be used to efficiently detect potential errors and uncommon entries by comparing <strong>low-frequency values</strong> against the <strong>most frequent values</strong> within the same dataset.</p><p>By building an index with the <strong>Single-Field Indexer</strong> and comparing all entries using the <strong>Approximate Index Matcher</strong>, the workflow automatically distinguishes between:</p><ul><li><p><strong>Likely typos:</strong> low-frequency entries that closely resemble common ones</p></li><li><p><strong>Rare but valid values:</strong> low-frequency entries that do not closely resemble any common value</p></li><li><p><strong>Correct entries:</strong> high-frequency values that are assumed to be correct</p></li></ul><p>Index-based matching with the <strong>Single-Field Indexer</strong> and <strong>Approximate Index Matcher</strong> keeps processing fast and scalable for large datasets. This makes it well suited for:</p><ul><li><p>Detecting entry errors in location, product, or customer data</p></li><li><p>Auto-flagging suspicious or rare strings for review</p></li><li><p>Improving data quality in human-entered datasets</p></li></ul>

URL: exorbyte GmbH https://exorbyte.ai/

Import Data

Import a table containing location names (cities, states), some of which are misspelled.

  • e.g. "Nw York", "Californa"

Request/Activate Exorbyte License

Request and register your exorbyte license before running any M|Box nodes.

If you do not have an active license, use the License Requester:

  1. Choose Demo (30 days) or Production.

  2. Enter your email (and Customer Token if production).

  3. Execute the node – it sends a secure request to the exorbyte team.

  4. When you receive the .lic file, reopen the node, select "Use available license file", and run the node again.

Afterwards, or if you already have an active license, run the License Activator.

⚠️ Each KNIME installation or Hub environment needs its own license

👉 See full exorbyte License Activation Guide

Preparation

  • Count name occurrences

  • Reset Row IDs

  • Sort by name frequency
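
The preparation steps above can be sketched outside KNIME as follows. This is only an illustration in plain Python, not the workflow's actual node configuration; the function name `count_and_sort` and the sample values are assumptions made for the example.

```python
from collections import Counter


def count_and_sort(names):
    """Count how often each name occurs and sort by descending frequency.

    Hypothetical helper mirroring the "count", "reset Row IDs", and "sort"
    steps. Ties are broken alphabetically so the order is deterministic.
    """
    counts = Counter(names)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample data; a real run would read the imported table instead.
sample = ["New York", "New York", "California", "Nw York", "New York", "California"]
ranked = count_and_sort(sample)

# The list position plays the role of the reset Row ID.
for row_id, (name, count) in enumerate(ranked):
    print(row_id, name, count)
```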

Obtain Frequency Threshold

We need a data-driven way to find the cutoff that decides which names are common enough to serve as "correct" reference values and should therefore be included in the index.

In real-world data, correct location names usually appear much more often than misspellings or unusual entries. We can leverage this by:

  • Calculating the changes in name frequency from high to low

  • Finding the steepest drop-off in frequency

  • Filtering for all names before this point
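
A minimal sketch of this thresholding idea, assuming that "steepest drop-off" means the largest absolute difference between consecutive counts (the workflow may measure the drop differently, for example relatively). The function name `frequent_names` and the counts are hypothetical.

```python
def frequent_names(ranked):
    """Return the names that come before the steepest drop in frequency.

    `ranked` is a list of (name, count) pairs sorted by descending count.
    Assumption: the drop is the absolute difference between neighbours.
    """
    if len(ranked) < 2:
        return [name for name, _ in ranked]
    counts = [count for _, count in ranked]
    # Change in frequency from each name to the next less frequent one.
    drops = [counts[i] - counts[i + 1] for i in range(len(counts) - 1)]
    # Index of the largest drop; on ties, the earliest one wins.
    cut = max(range(len(drops)), key=drops.__getitem__)
    # Keep everything up to and including the name just before the drop.
    return [name for name, _ in ranked[: cut + 1]]


# Made-up frequencies for illustration only.
ranked = [
    ("New York", 120),
    ("California", 95),
    ("Texas", 90),
    ("Nw York", 3),
    ("Californa", 2),
    ("Zzyzx", 1),
]
reference = frequent_names(ranked)
print(reference)
```

In this example the largest drop lies between "Texas" (90) and "Nw York" (3), so only the three frequent names become reference values.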

Indexing & Matching

Compare all location names against the most frequent names to identify valid entries, typos, and rare values.

An index is created from the frequent location names using M|Box's Single-Field Indexer node. Each location name is matched against this index using the Approximate Index Matcher and classified as:

  • Valid: Exact match in the index

  • Typo: Close match above a similarity threshold

  • Rare: No sufficiently similar match in the index. These names are uncommon and do not resemble any reference value, so they may be valid but unusual entries.
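
The three-way classification can be illustrated with a small stand-in. This sketch does not use the M|Box nodes: it substitutes Python's standard `difflib.SequenceMatcher` for the Approximate Index Matcher and a plain list for the index. The function name `classify` and the threshold of 0.8 are assumptions chosen for the example, not values taken from the workflow.

```python
from difflib import SequenceMatcher


def classify(name, reference, threshold=0.8):
    """Label a name as "Valid", "Typo", or "Rare" against reference names.

    Stand-in logic only: similarity is difflib's ratio on lowercased
    strings, and `threshold` is a hypothetical cutoff.
    """
    if name in reference:
        return "Valid"  # exact match in the reference list
    best = max(
        (SequenceMatcher(None, name.lower(), ref.lower()).ratio() for ref in reference),
        default=0.0,
    )
    # Close to a frequent name: probably a misspelling of it.
    return "Typo" if best >= threshold else "Rare"


reference = ["New York", "California", "Texas"]
candidates = ["New York", "Nw York", "Californa", "Zzyzx"]
labels = {name: classify(name, reference) for name in candidates}
print(labels)
```

A linear scan over the reference list is fine for a handful of names; the point of an index-based matcher is to avoid exactly this pairwise comparison on large datasets.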

Results

Visualize the results of the frequency-aware anomaly detection:

  • A bar chart showing how many location names were sorted into each of the three categories

  • A table view that displays location names of a category selected by the user

