Approximate Index Matcher (Labs)

The Approximate Index Matcher node queries a prebuilt index object, such as one generated by the Single-Field Indexer, to find values that are similar to a given set of query terms.
The node accepts two inputs:

  • An Index Object containing the indexed reference values.
  • A Comparison Table with the query strings to be matched.
For each query value, the node searches the index using a configurable string similarity algorithm (Levenshtein, Positional, or Longest Common Subsequence) and returns the best match. Returning the top-k matches or all matches above a given threshold is not supported yet.
The output can be returned as a standalone results table or by appending match-related columns to the original comparison table. Match information may include the best matching index value, match quality score, and optionally the edit script or full list of top-k candidates.
Using the index structure, this node provides efficient, scalable approximate matching even for large datasets. Typical use cases include entity resolution, deduplication, and data cleaning, where inconsistent or noisy text values need to be reconciled with a trusted reference set.
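To illustrate what "best match" means for a single query value, here is a minimal brute-force sketch in Python. It is not the node's implementation: it scans every reference value instead of using an index, and it uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure rather than one of the node's algorithms. The function name `best_match` and the sample values are hypothetical.

```python
from difflib import SequenceMatcher


def best_match(query, reference_values):
    """Return (reference_value, score) for the most similar reference value.

    Assumption: a plain linear scan with a stand-in similarity measure.
    The node itself queries a prebuilt index instead.
    """
    scored = (
        (ref, SequenceMatcher(None, query, ref).ratio())
        for ref in reference_values
    )
    # Pick the pair with the highest similarity score.
    return max(scored, key=lambda pair: pair[1])


reference = ["Berlin", "Munich", "Hamburg"]
for query in ["Berlln", "Munchen"]:
    value, score = best_match(query, reference)
    print(query, "->", value, round(score, 3))
```

A scan like this compares every query with every reference value, which is the cost an index structure is meant to avoid on large reference sets.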

Options

Select settings group
Select Columns in Comparison Input
Select the columns in the comparison table whose values are compared against the reference terms.
Row Filter Condition
Controls which rows are included in the output.
Options:
  • Output matching rows - Only rows that meet the similarity criteria are included in the output. A value matches if its similarity is equal to or higher than the Match Quality Threshold.
    A row counts as a match as soon as at least one selected column matches, which corresponds to “OR” behavior.
  • Output non-matching rows - Only rows that do not meet the matching criteria are forwarded to the output. A row is dropped as soon as any selected column matches.
  • No Filtering - All rows are forwarded to the output. Use this to add only the match information from the output settings and evaluate it later in the workflow.
This setting lets you filter the data by matching success or keep all comparisons for analysis.
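The three filter options can be sketched as a small decision function. This is a minimal sketch, not the node's code: the names `RowFilter` and `keep_row` are hypothetical, and the match quality values are assumed to lie between 0 and 1, which the description does not state.

```python
from enum import Enum


class RowFilter(Enum):
    MATCHING = "Output matching rows"
    NON_MATCHING = "Output non-matching rows"
    NO_FILTERING = "No Filtering"


def keep_row(qualities, threshold, mode):
    """Decide whether a row is forwarded to the output.

    qualities: match quality of each selected column in this row
               (assumed scale 0..1, hypothetical).
    """
    if mode is RowFilter.NO_FILTERING:
        return True
    # "OR" behavior: one matching column makes the whole row a match.
    row_matches = any(q >= threshold for q in qualities)
    return row_matches if mode is RowFilter.MATCHING else not row_matches
```

For example, a row with qualities `[0.4, 0.9]` and a threshold of `0.8` is kept by "Output matching rows" and dropped by "Output non-matching rows".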
Match Quality Threshold
Sets the minimum Match Quality a value must reach to count as a match.
This setting is only shown when filtering is enabled in Row Filter Condition.
Add Match Quality Column
Depending on search settings
Add Match Sequence Column
Add Best Match Column
Displays the match word that best fits the input. This is only useful when multiple match words are given.
Matching Algorithm Selector
Select the algorithm used to calculate string similarity between reference and comparison inputs.
Options:
  • Levenshtein - Calculation of the Edit Distance, which is the minimum number of edit operations needed to transform the comparison term into the reference term. Allowed edit operations are:
    • Insertion of a character
    • Deletion of a character
    • Substitution of a character
    • Transposition of two adjacent characters (also referred to as Damerau-Levenshtein extension)
    By default, the Levenshtein algorithm compares the whole strings from beginning to end. In some situations, you may want to ignore prefixes or suffixes and find the match at any position within the other term.
    For this, there are options that let certain parts of the comparison term or the reference term be ignored at no “cost”, meaning they do not count as errors. The length of the ignored portion is chosen to produce the best possible match.
    You can enable at most two of the four options at the same time. Three or more are not meaningful, since one term could then be ignored completely, which would not produce a relevant match quality.
    The options are the following (each described in more detail in its own section):
    • Ignore Leading Characters in Reference Term
    • Ignore Leading Characters in Comparison Term
    • Ignore Trailing Characters in Reference Term
    • Ignore Trailing Characters in Comparison Term
  • Positional Matching - Compares characters at fixed positions (first with first, second with second, and so on). This is useful for comparing values with a fixed format, such as IDs.
  • Longest Common Subsequence - Finds the longest sequence of characters that appears in both terms in the same order, not necessarily contiguously. The sequence itself is not necessarily unique; only its length matters.
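The three measures listed above can be sketched in plain Python as follows. These are textbook versions under stated assumptions, not the node's implementation: all function names are hypothetical, the functions return raw counts because the description does not say how the node converts them into a match quality score, and treating extra length as mismatches in the positional comparison is an assumption.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with insertion, deletion, substitution and
    transposition of two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def positional_mismatches(a: str, b: str) -> int:
    """Count positions where the terms differ.

    Assumption: positions beyond the shorter term count as mismatches.
    """
    same = sum(x == y for x, y in zip(a, b))
    return max(len(a), len(b)) - same


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

With these definitions, `osa_distance("abcd", "acbd")` counts the swapped pair as a single edit, while `positional_mismatches` would report two differing positions for the same pair of terms.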
Ignore Leading Characters in Reference Term
Skips leading characters in the reference string during matching until the comparison term starts matching.
Ignore Leading Characters in Comparison Term
Skips leading characters in the comparison string during matching until the reference term starts matching.
Ignore Trailing Characters in Reference Term
Skips trailing characters in the reference string during matching.
Ignore Trailing Characters in Comparison Term
Skips trailing characters in the comparison string during matching.
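One common way to make leading or trailing characters free of cost is to change how the edit-distance table is initialized and where the result is read from. The sketch below follows that approach; it is an assumption about how such options can work, not the node's implementation. The function and parameter names are hypothetical, and transpositions are left out to keep the example short.

```python
def edit_distance_free_ends(comparison, reference,
                            skip_lead_ref=False, skip_trail_ref=False,
                            skip_lead_cmp=False, skip_trail_cmp=False):
    """Levenshtein distance where selected prefixes or suffixes cost nothing."""
    n, m = len(comparison), len(reference)
    # d[i][j]: cost of aligning comparison[:i] with reference[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        # A skipped reference prefix is free; otherwise it costs j insertions.
        d[0][j] = 0 if skip_lead_ref else j
    for i in range(1, n + 1):
        # A skipped comparison prefix is free; otherwise it costs i deletions.
        d[i][0] = 0 if skip_lead_cmp else i
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if comparison[i - 1] == reference[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    candidates = [d[n][m]]
    if skip_trail_ref:
        # Comparison fully consumed; the rest of the reference is free.
        candidates += d[n]
    if skip_trail_cmp:
        # Reference fully consumed; the rest of the comparison is free.
        candidates += [row[m] for row in d]
    return min(candidates)
```

For example, comparing "smyth" with the reference "john smith jr" while ignoring both leading and trailing characters in the reference term leaves a single substitution as the only error.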

Input Ports

Provides the prebuilt index of reference values that will be queried.
Table containing the values to be compared with the indexed reference.

Output Ports

Contains the matching results. Depending on configuration, this may be the original comparison table with additional match-related columns (e.g., match sequence, match number, top-k matches).

Views

This node has no views