Approximate Phrase Matcher (Labs)

This node calculates the string similarity score between values from one column in a reference table and one or more columns from the comparison table. Additionally, it can filter rows based on whether a sufficiently similar match exists.
The node computes the similarity between strings using configurable algorithms such as Levenshtein Edit Distance, Longest Common Subsequence, or Positional Matching, and applies a user-defined threshold to determine which rows are considered a match.
Each field value in the comparison input is matched against each reference input value. The best match among all of these single comparisons decides the filter outcome. In effect, the columns of the comparison table and the rows of the reference table are combined in a logical OR fashion.
The comparison algorithms are described in detail in the options section.
As output, the user can choose the rows that match according to the filter value, the rows that do not match, or all rows.
In either case additional columns can be generated that contain the computed criteria, namely the numeric match value, the best reference match and a string showing the modifications of the comparison string to align with the reference string. These extra fields can support downstream processing and decision-making beyond simple filtering.
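The row-level decision described above can be sketched as follows. This is a minimal illustration, not the node's implementation: the function names are hypothetical, and `difflib.SequenceMatcher` from the Python standard library is used only as a stand-in similarity measure in place of the node's own algorithms.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in percent (0-100); NOT the node's algorithm."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def best_score_for_row(row_values, reference_terms):
    """Best similarity over every (comparison field, reference term) pair."""
    return max(similarity(v, r) for v in row_values for r in reference_terms)


def row_matches(row_values, reference_terms, threshold):
    """A row matches if any field matches any reference term (logical OR)."""
    return best_score_for_row(row_values, reference_terms) >= threshold


references = ["Berlin", "Munich"]
row = ["Brelin", "Hamburg"]  # values of two comparison columns in one row
print(row_matches(row, references, 80.0))
```

Taking the maximum over all pairs is what makes the behavior an OR: a single good pairing is enough for the whole row to count as a match.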
This node may only be used for private and non-commercial purposes. Commercial use requires a valid license from exorbyte GmbH. All rights reserved.
For more information contact consulting@exorbyte.com.

Options

Select Settings Group
Lets the user navigate between the sections of the configuration options:
  • Input
  • Search
  • Output
Select Column in Reference Input
Select a column of the Reference Input Table to be used as the list of Reference Terms.
Select Columns in Comparison Input
Select the columns to be compared against the Reference Terms.
Add Column with Numeric Matching Value
Appends a column showing the calculated similarity between strings. This may be the algorithm-specific matching value or percentage similarity, depending on the selected numeric metric.
The column is named “<input column name> - Match Number” and is useful for scoring and applying thresholds.
Add Column with Character Match Sequence
Appends a column showing a symbolic alignment of characters in the comparison and reference strings.
'=' -> Match
'o' -> Substitution
'x' -> Deletion
'+' -> Insertion
'><' -> Transposition
The column name is “<input column name> - Match Sequence”.
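To make the symbols concrete, the sketch below derives such a sequence by backtracking through an edit-distance table. It rests on assumptions: the function name is hypothetical, 'x' is taken to mean a character dropped from the comparison string and '+' a character added to reach the reference string, and the node may break ties between equally good alignments differently.

```python
def match_sequence(comparison: str, reference: str) -> str:
    """Edit script turning `comparison` into `reference` (assumed semantics)."""
    n, m = len(comparison), len(reference)
    # dist[i][j] = edits needed to turn comparison[:i] into reference[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j

    def swapped(i, j):
        # True if the last two characters of both prefixes are transposed.
        return (i > 1 and j > 1
                and comparison[i - 1] == reference[j - 2]
                and comparison[i - 2] == reference[j - 1])

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if comparison[i - 1] == reference[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + cost)   # match or substitution
            if swapped(i, j):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)

    # Walk back from the bottom-right corner, emitting one symbol per step.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and comparison[i - 1] == reference[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            ops.append("=")
            i, j = i - 1, j - 1
        elif swapped(i, j) and dist[i][j] == dist[i - 2][j - 2] + 1:
            ops.append("><")
            i, j = i - 2, j - 2
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append("o")
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append("x")
            i -= 1
        else:
            ops.append("+")
            j -= 1
    return "".join(reversed(ops))


print(match_sequence("abc", "abd"))  # ==o
print(match_sequence("ab", "ba"))    # ><
```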
Add Column with Hit Characters Sequence
Appends a column showing which characters in the Comparison String were matched.
Add Column with Best Reference Match
Appends the most similar reference string for each row in the comparison table. If several reference strings share the same Numeric Matching Value, the last entry of the Reference Column with the best value is taken.
Ideal for deduplication, linking, and labeling workflows where a single best match is sufficient.
The column name is “<input column name> - Best Match”.
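The tie-breaking rule can be illustrated with a short sketch. The names are hypothetical, and the example assumes a metric where higher values mean greater similarity; a simple count of equal character positions serves as the score.

```python
def positional_score(a: str, b: str) -> int:
    """Count positions at which both strings hold the same character."""
    return sum(x == y for x, y in zip(a, b))


def best_reference(value, reference_terms, score=positional_score):
    """Return (best term, best score); on ties the later reference wins."""
    best_term, best_score = None, float("-inf")
    for term in reference_terms:
        s = score(value, term)
        if s >= best_score:  # ">=" lets a later entry replace an equal one
            best_term, best_score = term, s
    return best_term, best_score


# Both references agree with "AB-1" in three positions; the last one is kept.
print(best_reference("AB-1", ["AB-2", "AB-3"]))  # ('AB-3', 3)
```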
Matching Algorithm Selector
Select the algorithm used to calculate string similarity between reference and comparison inputs.
Options:
  • Levenshtein - Calculation of the Edit Distance, which is the minimum number of edit operations needed to transform the comparison term into the reference term. Allowed edit operations are:
    • Insertion of a character
    • Deletion of a character
    • Substitution of a character
    • Transposition of two adjacent characters (also referred to as Damerau-Levenshtein extension)
    The Levenshtein Algorithm compares, by default, the whole strings from beginning to end. In special situations, one wants to ignore prefixes or suffixes and find the match at any position within the other term.
    For this, there are options that allow certain parts of the comparison term or the reference term to be ignored at no “cost”, meaning they do not count as errors. The length of the ignored portion is chosen so as to produce the best possible match.
    You can select up to 2 of the 4 options at the same time. Selecting 3 or more is not meaningful, since one term could then be ignored completely, which would not produce a relevant matching number.
    Options are the following (described in more detail in their own section):
    • Ignore Leading Characters in Reference Term
    • Ignore Leading Characters in Comparison Term
    • Ignore Trailing Characters in Reference Term
    • Ignore Trailing Characters in Comparison Term
  • Positional Matching - Simply compares characters at fixed character positions (first with first, second with second…). This is useful, for example, for comparing IDs with a fixed format.
  • Longest Common Subsequence - Finds the longest subsequence of characters that appears in both terms. The subsequence itself is not necessarily unique; only its length matters.
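The three algorithms can be sketched in a few lines each. These are textbook formulations under stated assumptions, not the node's code: all function names are hypothetical, the transposition handling follows the common "optimal string alignment" variant, and `levenshtein_in_reference` illustrates only one combination of the ignore options (leading and trailing characters of the reference term), without transpositions.

```python
def levenshtein(comparison: str, reference: str) -> int:
    """Edit distance with insertion, deletion, substitution and
    transposition of adjacent characters."""
    n, m = len(comparison), len(reference)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if comparison[i - 1] == reference[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if (i > 1 and j > 1 and comparison[i - 1] == reference[j - 2]
                    and comparison[i - 2] == reference[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]


def levenshtein_in_reference(comparison: str, reference: str) -> int:
    """Edit distance where leading and trailing characters of the
    reference term are ignored at no cost."""
    prev = [0] * (len(reference) + 1)  # all zeros: free reference prefix
    for i, c in enumerate(comparison, start=1):
        cur = [i]
        for j, r in enumerate(reference, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (c != r)))
        prev = cur
    return min(prev)  # best end position: free reference suffix


def positional_matches(a: str, b: str) -> int:
    """Number of positions at which both terms hold the same character."""
    return sum(x == y for x, y in zip(a, b))


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of both terms."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


print(levenshtein("kitten", "sitting"))            # 3
print(levenshtein_in_reference("bxd", "abcde"))    # 1
print(positional_matches("AB-123", "AB-173"))      # 5
print(lcs_length("ABCBDAB", "BDCABA"))             # 4
```

Note the difference in direction: the Levenshtein value counts errors, so lower is better, while the positional and LCS values count agreeing characters, so higher is better.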
Case Sensitivity
Determines whether the matching algorithm is case-sensitive or case-insensitive.
Numeric Matching Value
The Positional Matching algorithm computes a specific matching value from which a similarity can be deduced.
You can choose whether the filtering uses the algorithm-specific matching number or the normalized similarity score ranging from 0 to 100.
  • The Positional Matching specific matching value is the number of matching character positions. This is a “positive” number, as for LCS: higher values mean greater similarity. The highest similarity is reached when the matching number equals the length of the longer of the two terms.
  • The Similarity in Percent is a match score as a percentage. This option is algorithm independent. It is a normalized value always ranging from 0-100 for all algorithms, 100 being a full match (according to the algorithm).
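One plausible way to map raw values onto the 0-100 scale is sketched below. The normalization by the length of the longer term follows from the statement above for Positional Matching; applying the same denominator to error counts is an assumption, as the exact formula used by the node is not documented here. Function names are hypothetical.

```python
def percent_from_hits(hits: int, a: str, b: str) -> float:
    """Percent similarity for a "positive" value (matching positions, LCS length)."""
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else 100.0 * hits / longest


def percent_from_errors(errors: int, a: str, b: str) -> float:
    """Percent similarity for a "negative" value (number of edit errors).
    Assumed formula: one minus the error share of the longer term."""
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else 100.0 * (1 - errors / longest)


# 5 of 6 positions agree between "AB-123" and "AB-173".
print(round(percent_from_hits(5, "AB-123", "AB-173"), 1))
# 3 edit errors between "kitten" and "sitting" (longer term has 7 characters).
print(round(percent_from_errors(3, "kitten", "sitting"), 1))
```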
Row Filter Condition
Controls which rows are included in the output.
Options:
  • Output matching rows - Only rows that meet the similarity criteria are included in the output. A match is defined as a similarity equal to or higher than the similarity defined by the Matching Value Threshold. Be aware that when the raw numeric matching value is used, a match can mean a value lower(!) than the threshold for a “negative” matching number such as the number of errors.
    A row where at least one column matches is considered a match for the whole row, resembling an “OR” behavior.
  • Output non-matching rows - Only rows that do not meet the matching criteria are forwarded into the output. A row is removed as soon as any of its columns matches.
  • No Filtering - All rows are forwarded into the output. This is used to just add the matching information from the output section and use that information later.
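The three filter modes, including the reversed comparison for error-count values, can be sketched as follows. The mode names (`"matching"`, `"non_matching"`, `"none"`) and the `higher_is_better` flag are hypothetical labels chosen for this example, not settings of the node.

```python
def is_match(value, threshold, higher_is_better=True):
    """Compare a matching value with the threshold in the right direction."""
    return value >= threshold if higher_is_better else value <= threshold


def filter_rows(rows, scores, threshold, mode="matching", higher_is_better=True):
    """rows: list of rows; scores: for each row, one best value per
    comparison column. Returns the rows selected by the given mode."""
    if mode == "none":
        return list(rows)
    keep_matching = (mode == "matching")
    selected = []
    for row, row_scores in zip(rows, scores):
        # OR behavior: one matching column makes the whole row a match.
        matched = any(is_match(s, threshold, higher_is_better) for s in row_scores)
        if matched == keep_matching:
            selected.append(row)
    return selected


rows = ["row1", "row2", "row3"]
percent_scores = [[90, 10], [40, 50], [20, 85]]
print(filter_rows(rows, percent_scores, 80, mode="matching"))
print(filter_rows(rows, percent_scores, 80, mode="non_matching"))
```

With an error count as the matching value, passing `higher_is_better=False` treats values at or below the threshold as matches.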
Matching Value Threshold - Minimal Number of Matches
Sets the filter threshold, interpreted according to the selected Numeric Matching Value.
This setting only appears if filtering is switched on by the Row Filter Condition setting.
If the algorithm-specific matching value was chosen, the threshold applies to that number. If similarity in percent was chosen, the threshold is a similarity percentage.
Matching Value Threshold - Minimal Matching Percentage
Sets the filter threshold, interpreted according to the selected Numeric Matching Value.
This setting only appears if filtering is switched on by the Row Filter Condition setting.
If the algorithm-specific matching value was chosen, the threshold applies to that number. If similarity in percent was chosen, the threshold is a similarity percentage.

Input Ports

Mapping
Table containing the canonical string values to match against.
Table containing the values to be compared with the reference.

Output Ports

Table containing the output rows with appended columns and results.

Views

This node has no views
