The Approximate Index Matcher node queries a prebuilt index object, such as one
generated by the Single-Field Indexer, to find values that are similar to a given
set of query terms.
The node accepts two inputs:
An Index Object containing the indexed reference values.
A Comparison Table with the query strings to be matched.
For each query value, the node searches the index using configurable string
similarity algorithms (e.g., Levenshtein, Positional, or Longest Common Subsequence)
and returns either the best match, the top-k matches (not supported yet),
or all matches above a given threshold (not supported yet).
The output can be returned as a standalone results table or by appending match-related
columns to the original comparison table. Match information may include the best matching
index value, match quality score, and optionally the edit script or full list of top-k
candidates.
Using the index structure, this node provides efficient, scalable approximate matching
even for large datasets. Typical use cases include entity resolution, deduplication,
and data cleaning, where inconsistent or noisy text values need to be reconciled with a
trusted reference set.
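The best-match behavior described above can be pictured with a minimal sketch. It assumes a plain linear scan over the reference values and uses `difflib.SequenceMatcher.ratio()` from the Python standard library as a stand-in similarity score. The node itself queries a prebuilt index and uses its own algorithms, so the function name `best_match` and the scoring here are purely illustrative.

```python
from difflib import SequenceMatcher


def best_match(query, reference_values):
    """Return (line_number, value, score) of the most similar reference value.

    Hypothetical helper: a linear scan with a stand-in score, not the
    node's index lookup or its match quality definition.
    """
    best = None
    for line_number, ref in enumerate(reference_values):
        # ratio() is 1.0 for identical strings and lower for dissimilar ones
        score = SequenceMatcher(None, query, ref).ratio()
        if best is None or score > best[2]:
            best = (line_number, ref, score)
    return best


reference = ["John Smith", "Jane Doe", "Smithson"]
line_number, value, score = best_match("Jon Smith", reference)
print(line_number, value, round(score, 3))
```

A linear scan like this grows with the size of the reference set for every query, which is the cost a prebuilt index is meant to avoid.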
Options
Select settings group
Select Columns in Comparison Input
Select the columns whose values are compared against the Reference Terms.
Row Filter Condition
Controls which rows are included in the output.
Options:
Output matching rows - Only rows that meet the similarity criteria are included in the output.
A value matches if its similarity is equal to or higher than the Match Quality Threshold.
A row in which at least one column matches counts as a match for the whole row, which
resembles an “OR” behavior.
Output non-matching rows - Only rows that do not meet the matching criteria are forwarded
to the output. A row is removed as soon as any of its columns matches.
No Filtering - All rows are forwarded to the output. Use this to add the matching
information selected in the output settings without removing any rows, so that it can be
evaluated later.
This setting lets you either filter the data based on matching success or keep all comparisons for analysis.
Match Quality Threshold
Sets the minimum Match Quality a value must reach to count as a match for filtering.
This setting only appears if filtering is enabled in the Row Filter Condition setting.
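The three filter modes and the “OR” behavior across columns can be sketched as follows. This is an illustration under assumptions: the function `filter_rows`, the mode names, and the representation of a row as a mapping from column name to a match quality between 0 and 1 are all hypothetical and not part of the node.

```python
def filter_rows(rows, threshold, mode):
    """Filter rows by match quality.

    rows: list of dicts mapping column name -> match quality (assumed 0..1).
    mode: "matching", "non_matching", or "none" (hypothetical names).
    """
    kept = []
    for row in rows:
        # "OR" behavior: one matching column makes the whole row a match
        is_match = any(quality >= threshold for quality in row.values())
        if mode == "matching":
            keep = is_match
        elif mode == "non_matching":
            keep = not is_match
        elif mode == "none":
            keep = True
        else:
            raise ValueError(f"unknown mode: {mode}")
        if keep:
            kept.append(row)
    return kept


rows = [
    {"name": 0.90, "city": 0.20},  # one column reaches the threshold
    {"name": 0.30, "city": 0.40},  # no column reaches the threshold
    {"name": 0.80, "city": 0.85},  # both columns reach the threshold
]
print(len(filter_rows(rows, 0.8, "matching")))      # rows 1 and 3
print(len(filter_rows(rows, 0.8, "non_matching")))  # row 2 only
```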
Add Match Quality Column
Depending on search settings
Add Match Sequence Column
Add Best Match Column
Displays the match word that fits the input best.
This is only useful if multiple Match Words are given.
Add Best Match Line Numbers Column
Displays the line numbers of the match words (in the index input) that fit the input best.
This is only useful if multiple Match Words are given.
Matching Algorithm Selector
Select the algorithm used to calculate string similarity between reference and comparison inputs.
Options:
Levenshtein - Calculation of the Edit Distance, which is the minimum number of edit operations
needed to transform the comparison term into the reference term. Allowed edit operations are:
Insertion of a character
Deletion of a character
Substitution of a character
Transposition of two adjacent characters (also referred to as Damerau-Levenshtein extension)
By default, the Levenshtein algorithm compares the whole strings from beginning to end.
In some situations, you may want to ignore prefixes or suffixes and find the match at any
position within the other term.
For this purpose, there are options that let parts of the comparison term or the reference term
be ignored at no “cost”, meaning they are not counted as errors. The length of the ignored
portion is chosen so that it produces the best possible match.
You can enable up to 2 of the 4 options at the same time. Enabling 3 or more is not meaningful,
because one term could then be ignored completely, which would not yield a relevant match quality.
The options are the following (each is described in more detail in its own section):
Ignore Leading Characters in Reference Term
Ignore Leading Characters in Comparison Term
Ignore Trailing Characters in Reference Term
Ignore Trailing Characters in Comparison Term
Positional Matching - Compares characters at fixed character positions
(first with first, second with second, and so on). This is useful, for example, for comparing IDs with a fixed format.
Longest Common Subsequence - Finds the longest sequence of characters that appears in both
terms in the same order, though not necessarily contiguously. The sequence itself is not
necessarily unique. Only its length is relevant.
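The three algorithms listed above can be illustrated with short textbook implementations. These are sketches, not the node's code: the function names are hypothetical, the edit distance uses the "optimal string alignment" variant of the Damerau-Levenshtein extension (an assumption, since the description does not say which variant the node uses), and how the node converts these raw numbers into a match quality is not covered.

```python
def osa_distance(a, b):
    """Edit distance allowing insertion, deletion, substitution and
    transposition of two adjacent characters (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def positional_matches(a, b):
    """Count positions at which both strings hold the same character."""
    return sum(1 for x, y in zip(a, b) if x == y)


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


print(osa_distance("kitten", "sitting"))         # 3
print(osa_distance("abcd", "acbd"))              # 1 (one transposition)
print(positional_matches("AB-1234", "AB-1243"))  # 5
print(lcs_length("AGGTAB", "GXTXAYB"))           # 4 ("GTAB")
```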
Ignore Leading Characters in Reference Term
Skips leading characters in the reference string during matching until
the comparison term starts matching.
Ignore Leading Characters in Comparison Term
Skips leading characters in the comparison string during matching until
the reference term starts matching.
Ignore Trailing Characters in Reference Term
Skips trailing characters in the reference string during matching.
Ignore Trailing Characters in Comparison Term
Skips trailing characters in the comparison string during matching.
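One common way to implement such "free" leading and trailing characters is to relax the boundary conditions of the edit distance table. The sketch below follows that idea under stated assumptions: the function `edit_distance_free_ends` and its flag names are hypothetical, transpositions are left out for brevity, and the node's actual implementation may differ.

```python
def edit_distance_free_ends(comparison, reference,
                            skip_ref_lead=False, skip_ref_trail=False,
                            skip_cmp_lead=False, skip_cmp_trail=False):
    """Levenshtein distance in which selected prefixes or suffixes
    can be skipped at no cost (hypothetical illustration)."""
    n, m = len(comparison), len(reference)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        # Skipping leading comparison characters is free if enabled
        d[i][0] = 0 if skip_cmp_lead else i
    for j in range(1, m + 1):
        # Skipping leading reference characters is free if enabled
        d[0][j] = 0 if skip_ref_lead else j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if comparison[i - 1] == reference[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    candidates = [d[n][m]]
    if skip_ref_trail:
        # Comparison fully consumed, remainder of the reference is ignored
        candidates.extend(d[n])
    if skip_cmp_trail:
        # Reference fully consumed, remainder of the comparison is ignored
        candidates.extend(d[i][m] for i in range(n + 1))
    return min(candidates)


print(edit_distance_free_ends("smith", "dr. john smith"))                      # 9
print(edit_distance_free_ends("smith", "dr. john smith", skip_ref_lead=True))  # 0
print(edit_distance_free_ends("smyth", "dr. john smith jr",
                              skip_ref_lead=True, skip_ref_trail=True))        # 1
```

Taking the minimum over the last row or column is what lets the ignored length adjust itself to the best possible match.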
Input Ports
Provides the prebuilt index of reference values that will be queried.
Table containing the values to be compared with the indexed reference.
Output Ports
Contains the matching results. Depending on configuration, this may be the original comparison
table with additional match-related columns (e.g., match sequence, match number, top-k matches).