String Distances

Defines a distance measure on a string column, for example the Levenshtein distance. Additional parameters can be set depending on the selected distance function. The available distance measures are:

Levenshtein Distance: The Levenshtein distance is one of the best-known string metrics for measuring the difference between two strings. It is the minimum number of single-character operations (deletions, insertions, or substitutions) needed to transform one string into the other.
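As an illustration of the definition, the following Python sketch computes a weighted edit distance with the standard dynamic-programming recurrence. It is not the node's implementation. The function name and the parameters `deletion`, `insertion`, `exchange`, and `case_sensitive` are hypothetical and only mirror the option names on this page. With all weights set to 1 the result is the classic Levenshtein distance.

```python
def levenshtein(a, b, deletion=1.0, insertion=1.0, exchange=1.0,
                case_sensitive=True):
    """Weighted edit distance for transforming a into b (illustrative sketch)."""
    if not case_sensitive:
        # Assumption: case-insensitive comparison is done by uppercasing both inputs.
        a, b = a.upper(), b.upper()

    # prev[j] holds the cost of turning the processed prefix of a into b[:j].
    # For the empty prefix of a, that is j insertions.
    prev = [j * insertion for j in range(len(b) + 1)]

    for i, ca in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        curr = [i * deletion]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + deletion,                             # delete ca
                curr[j - 1] + insertion,                        # insert cb
                prev[j - 1] + (0.0 if ca == cb else exchange),  # keep or exchange
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3.0
```

The two-row layout keeps memory proportional to the length of the second string, which is a common choice when only the distance, and not the edit script, is needed.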

Jaro-Winkler Distance: The Jaro-Winkler distance is a common measure of the difference between two strings. Note that it is not a distance metric in the strict sense, because it does not obey the triangle inequality.
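The following Python sketch shows one common textbook formulation of the Jaro similarity and Winkler's prefix adjustment. It is not the node's implementation. The parameter names `threshold`, `prefix_weight`, and `max_prefix` are hypothetical stand-ins for the Winkler threshold, Prefix weight, and Max prefix length options. The prefix weight of 0.1 is the value usually quoted for Winkler's formulation and is an assumption here, as is the strict `>` comparison against the threshold.

```python
def jaro_similarity(a, b):
    """Jaro similarity in [0, 1] (textbook formulation, illustrative sketch)."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0

    # Characters count as matching only if they are at most `window` positions apart.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ca:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: half the number of matched characters that are out of order.
    a_seq = [c for c, m in zip(a, a_matched) if m]
    b_seq = [c for c, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) / 2

    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler_distance(a, b, threshold=0.7, prefix_weight=0.1, max_prefix=4):
    """1 - Jaro-Winkler similarity (illustrative sketch, hypothetical parameters)."""
    sim = jaro_similarity(a, b)
    if sim > threshold:  # assumption: the boost applies strictly above the threshold
        prefix = 0
        for ca, cb in zip(a[:max_prefix], b[:max_prefix]):
            if ca != cb:
                break
            prefix += 1
        # Move the similarity towards 1 in proportion to the common prefix length.
        sim += prefix * prefix_weight * (1.0 - sim)
    return 1.0 - sim


print(round(jaro_winkler_distance("MARTHA", "MARHTA"), 4))  # 0.0389
```

In this formulation the product of prefix weight and maximum prefix length should not exceed 1, otherwise the adjusted similarity can leave the range [0, 1].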

Tversky Distance: In contrast to the edit distances, the n-gram Tversky index does not rely on counting single-character operations. Instead, it models relations between neighboring letters by comparing the n-grams of the two strings.
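For two n-gram sets X and Y, the Tversky index is commonly defined as |X ∩ Y| / (|X ∩ Y| + alpha * |X \ Y| + beta * |Y \ X|). The Python sketch below applies that definition to character n-grams. It is not the node's implementation, and several points are assumptions: n-grams are collected as sets rather than multisets, the distance is taken as 1 minus the index, and strings too short to yield any n-gram are treated as identical only if they are equal. The function and parameter names are hypothetical.

```python
def ngrams(s, n):
    """Set of all contiguous character n-grams of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def tversky_distance(a, b, n=2, alpha=0.5, beta=0.5):
    """1 - Tversky index over character n-gram sets (illustrative sketch)."""
    x, y = ngrams(a, n), ngrams(b, n)
    common = len(x & y)
    denom = common + alpha * len(x - y) + beta * len(y - x)
    if denom == 0:
        # No usable n-grams: convention chosen for this sketch only.
        return 0.0 if a == b else 1.0
    return 1.0 - common / denom


# "night" -> {ni, ig, gh, ht}, "nacht" -> {na, ac, ch, ht}: one shared bigram.
print(tversky_distance("night", "nacht"))  # Dice (alpha = beta = 0.5): 0.75
```

Setting alpha and beta to 1 in this sketch gives the Tanimoto variant; for the example above the index is then 1/7.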

Options

Column selection
Choose the column for which the string distance is defined.
Distance selection
Select the distance measure to use for comparing strings.
Winkler threshold
The modification proposed by Winkler reduces the distance between strings that have a common prefix. This threshold controls at which Jaro similarity the Winkler modification is applied. A threshold of 0 always applies the modification, while a threshold of 1 never applies it, which results in the plain Jaro distance. The default of 0.7 is chosen to match the threshold in the Python package textdistance.
Prefix weight
The weight with which Winkler's prefix modification changes the Jaro similarity.
Max prefix length
The maximum length of a common prefix to consider (4 by default). A common prefix longer than this value affects the Jaro similarity only up to this length.
Case sensitive
If checked, the distance treats uppercase and lowercase characters as different, i.e. "A" and "a" are not considered to be the same.
Deletion weight
The deletions are weighted according to the given value.
Insertion weight
The insertions are weighted according to the given value.
Exchange weight
The exchanges are weighted according to the given value.
Normalize distance
If checked, the resulting distance is normalized to the range [0, 1].
Uppercase input
Transform all characters to uppercase before computing the distance. For performance reasons it is preferable to convert the input to uppercase in a preprocessing step instead of using this option.
Gram size
Determines the number of contiguous characters per n-gram.
Tversky index
Select a predefined Tversky index with specific alpha and beta parameters. Selecting a predefined index will override any custom alpha and beta values.
  • Dice: The alpha and beta parameter values are equal to 0.5.
  • Tanimoto: The alpha and beta parameter values are equal to 1.
  • Custom: Define custom alpha and beta parameter values.
Alpha
The Tversky alpha parameter. Must be non-negative.
Beta
The Tversky beta parameter. Must be non-negative.
Uppercase input
Transform all characters to uppercase before computing the distance. For performance reasons it is preferable to convert the input to uppercase in a preprocessing step instead of using this option.

Input Ports

Input data.

Output Ports

The configured distance.

Views

This node has no views
