
Term Matcher Overview Examples

<p>This workflow demonstrates the power of the <strong>Term Matcher</strong> node in handling messy, inconsistent, or typo-prone data across common data processing tasks. It showcases <strong>fault-tolerant matching</strong> for different scenarios such as search, joins, and aggregation. The node enables flexible string comparison using algorithms like <strong>Levenshtein</strong>, <strong>Positional</strong>, and <strong>LCS</strong>.</p><p>🔍 Key Use Cases Highlighted:</p><ul><li><p>🧠 <strong>Algorithm Comparison</strong> – Understand how each matching method behaves</p></li><li><p>🔐 <strong>Safe Aggregation</strong> – Clean your data before aggregation using fuzzy matching</p></li><li><p>🤗 <strong>Safe Joining</strong> – Join datasets on fuzzy-matched fields</p></li><li><p>🔍 <strong>Fault-Tolerant Search</strong> – Perform typo-tolerant searches with user queries</p></li><li><p>🧠 <strong>Frequency-Aware Anomaly Detection</strong> – Detect potential data errors or outliers by comparing the least frequent values against the most frequent ones in the same dataset</p></li><li><p>✅ <strong>Variation Lookup Using Approximate Matching</strong> – Retrieve all similar strings to a given input (e.g., “Munich”) to identify spelling variations or typos</p></li></ul>

URL: exorbyte GmbH https://exorbyte.com/en



🧠 Matching Algorithm Comparison

This section demonstrates how different approximate matching algorithms behave under various input conditions.

  • Levenshtein handles typical typos and character edits.

  • Positional is strict and best for fixed-format codes (e.g., IDs like "AB-1234").

  • LCS (Longest Common Subsequence) tolerates gaps and partial overlaps.

Use this to choose the most appropriate algorithm for your use case.
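Outside KNIME, the behavior of the three methods can be sketched with minimal plain-Python implementations. These are illustrative assumptions, not the Term Matcher node's actual internals; the node's scoring may differ in detail.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insert/delete/substitute edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (tolerates gaps)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def positional_matches(a: str, b: str) -> int:
    """Strict position-by-position comparison, e.g. for fixed-format codes."""
    return sum(ca == cb for ca, cb in zip(a, b))

print(levenshtein("Munich", "Munchen"))          # small edit distance for typos
print(lcs_length("AB-1234", "AB1234"))           # gaps are tolerated
print(positional_matches("AB-1234", "AB-1243"))  # transposed digits both mismatch
```

Note how the transposition "AB-1243" loses two positional matches at once, which is why Positional suits fixed-format codes but punishes shifted characters.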


🔐 Safe Aggregation

You can use this node to clean your data before aggregation.
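The idea can be sketched outside KNIME as well. The snippet below (a sketch using only the standard library's difflib; the sales records and reference list are made up for illustration) maps each messy key to its closest clean spelling before aggregating, so "Munchen" and "Munich" land in the same group.

```python
import difflib

# Hypothetical messy sales records: "Munchen" and "Berln" are typos.
sales = [("Munich", 10), ("Munchen", 5), ("Berln", 2), ("Berlin", 7)]
canonical = ["Munich", "Berlin"]  # assumed reference list of clean keys

totals: dict[str, int] = {}
for city, amount in sales:
    # Map each key to its closest canonical spelling before aggregating;
    # cutoff controls how tolerant the fuzzy match is.
    match = difflib.get_close_matches(city, canonical, n=1, cutoff=0.6)
    key = match[0] if match else city  # keep unmatched keys as-is
    totals[key] = totals.get(key, 0) + amount

print(totals)
```

Without the fuzzy step, a plain GroupBy would report four cities instead of two.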


🤗 Safe Joining

You can use this node to join uncleaned data safely.
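As a sketch of the same idea in plain Python (stdlib difflib; both tables are made up for illustration), each join key from the dirty table is fuzzy-matched against the clean table's keys, so rows survive the join despite typos:

```python
import difflib

# Two hypothetical tables keyed by city name; the sales table has typos.
sales = {"Munchen": 5, "Berlin": 7, "Hamurg": 3}
managers = {"Munich": "Anna", "Berlin": "Ben", "Hamburg": "Clara"}

joined = []
for city, amount in sales.items():
    # Fuzzy-match the join key against the clean table's keys.
    match = difflib.get_close_matches(city, managers.keys(), n=1, cutoff=0.6)
    if match:  # drop rows with no sufficiently similar key (inner-join style)
        joined.append((match[0], amount, managers[match[0]]))

print(joined)
```

An exact join would only keep the "Berlin" row here; the fuzzy variant recovers all three.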


🔍 Fault-Tolerant Search

One key use case of the Term Matcher node is enabling fault-tolerant search.
It allows users to retrieve relevant matches even when the data contains typos, inconsistent casing, or partial strings, which are common issues in real-world fields such as customer names, cities, or product IDs.

This makes it ideal for:

  • Dirty or user-entered text fields

  • Matching records from different systems

  • Building robust search features in data pipelines

Use this node to avoid missing matches due to small differences in spelling or formatting.
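A minimal sketch of such a typo-tolerant search in plain Python (stdlib difflib; the value list, cutoff, and normalization are illustrative assumptions): values are normalized for casing and whitespace before fuzzy comparison, so a misspelled query still finds them.

```python
import difflib

# Hypothetical searchable values with inconsistent casing and a stray space.
cities = ["Munich", "MUNICH ", "Muenchen", "Berlin", "Stuttgart"]

def fuzzy_search(query: str, values: list[str], cutoff: float = 0.7) -> list[str]:
    """Return original values whose normalized form is close to the query."""
    normalized = {v: v.strip().lower() for v in values}  # tolerate casing/whitespace
    hits = difflib.get_close_matches(query.strip().lower(),
                                     list(normalized.values()),
                                     n=5, cutoff=cutoff)
    return [v for v, norm in normalized.items() if norm in hits]

print(fuzzy_search("munnich", cities))
```

The misspelled query "munnich" matches both spellings of Munich despite the doubled letter, the uppercase variant, and the trailing space.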


🧠 Frequency-Aware Anomaly Detection

This use case demonstrates how the Term Matcher node can be used to detect potential errors or rare entries by matching the least frequent values against the most frequent ones in the same dataset.

Using approximate string matching (e.g., Levenshtein distance), we can distinguish:

  • Likely typos — low-frequency entries that closely resemble high-frequency ones

  • Rare but valid values — dissimilar entries that are truly unique

  • Correct entries — high-frequency values, often assumed correct

This makes it ideal for:

  • Detecting entry errors in location, product, or customer data

  • Auto-flagging suspicious or rare strings for review

  • Improving data quality in human-entered datasets
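The labeling logic can be sketched in plain Python. In this sketch the data, the top-N reference size, and the 0.8 similarity threshold are made-up assumptions, and difflib's similarity ratio stands in for the node's Levenshtein score:

```python
from collections import Counter
import difflib

# Hypothetical city column: frequency suggests which spellings to trust.
cities = ["Munich"] * 40 + ["Berlin"] * 35 + ["Munnich", "Passau"]

counts = Counter(cities)
# Assume the top-N most frequent values form the "likely correct" reference set.
reference = [city for city, _ in counts.most_common(2)]

def label(city: str) -> str:
    if city in reference:
        return "Correct"
    # Similarity to the closest reference value (SequenceMatcher ratio is a
    # stand-in here for the node's Levenshtein-based score).
    best = max(difflib.SequenceMatcher(None, city, ref).ratio()
               for ref in reference)
    return "Likely Typo" if best >= 0.8 else "Rare"

print({city: label(city) for city in counts})
```

"Munnich" is flagged as a likely typo because it occurs rarely yet closely resembles the frequent "Munich", while "Passau" is rare but dissimilar to everything frequent, so it is kept as a rare-but-valid value.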

✅ Variation Lookup Using Approximate Matching

This example demonstrates how to use the Term Matcher node to return all values similar to a given input string (e.g., "Munich"), allowing users to find spelling variations or typos in the dataset.

The node is configured in Row Filter Mode, where:

  • The search phrase is passed from a Table Creator

  • The dataset contains values with potential spelling errors

  • A Levenshtein edit distance threshold controls how tolerant the match is

🔍 The Term Matcher node performs approximate matching between the search phrase and the city names in the dataset:

  • Set to Row Filter Mode: outputs only matching rows

  • Match method: Levenshtein

  • Threshold: e.g., max 5 edit operations
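The Row Filter Mode behavior described above can be sketched in plain Python (the dataset and the search phrase are made up for illustration; the Levenshtein implementation is a standard textbook version, not the node's internal one):

```python
# Row-filter-style variation lookup: keep every dataset value within a
# maximum Levenshtein edit distance of the search phrase.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insert/delete/substitute edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

dataset = ["Munich", "Munchen", "Muenchen", "Munih", "Berlin", "Hamburg"]
phrase, threshold = "Munich", 5  # threshold: max allowed edit operations

matches = [city for city in dataset if levenshtein(phrase, city) <= threshold]
print(matches)
```

All spelling variations of Munich pass the filter, while unrelated cities such as Berlin and Hamburg exceed the edit-distance threshold and are dropped.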

Workflow canvas summary: the examples above are built from Term Matcher nodes (configured for Levenshtein, Positional, and LCS matching) together with supporting KNIME nodes: CSV Reader, Table Creator, Table Reader, Row Filter, Sorter, GroupBy, Joiner, Rule Engine, Constant Value Column, Table Manipulator, and Table Row to Variable.
