
Explainable Fuzzy Matching - Typo Error Statistics

<p><strong>Explainable Fuzzy Matching on Payee Data</strong></p><p>This workflow demonstrates how to use <strong>Approximate String Matching</strong> to reconcile noisy, user-entered payee names with a clean reference list of canonical entities. Beyond generating similarity scores, the workflow provides <strong>explainable error statistics</strong> to highlight where and how mismatches occur.</p><p>🔹 Steps in the Workflow</p><ol><li><p><strong>📂 Load Data</strong></p><ul><li><p>Reference Data: clean list of canonical payee names.</p></li><li><p>Payee Data with Typos: noisy, real-world names entered by users.</p></li></ul></li><li><p><strong>🔍 Approximate String Matching (Levenshtein)</strong></p><ul><li><p>Matches each entered payee name against the reference list.</p></li><li><p>Produces a <strong>Match Sequence</strong> (e.g., oooo=ooo=ox=+) that explains differences character by character:</p><ul><li><p>o → match</p></li><li><p>= → substitution (wrong character)</p></li><li><p>+ → insertion (extra character)</p></li><li><p>x → deletion (missing character)</p></li></ul></li></ul></li><li><p><strong>🧮 Error Type Analysis</strong></p><ul><li><p>Counts substitutions, insertions, deletions, and matches.</p></li><li><p>Calculates error ratios, edit distance, and match accuracy.</p></li><li><p>Provides <strong>explainable quality metrics</strong> for each match.</p></li></ul></li><li><p><strong>📊 Aggregation &amp; Statistics</strong></p><ul><li><p>Groups results by reference payee.</p></li><li><p>Computes the <strong>average error profile per entity</strong> (e.g., “Deutsche Bank AG entries often miss characters”).</p></li><li><p>Rounds and formats values for readability.</p></li></ul></li><li><p><strong>📈 Interactive Dashboard</strong></p><ul><li><p>Table of canonical payees with their <strong>average match accuracy</strong>.</p></li><li><p>Bar chart showing the <strong>distribution of error types</strong> (substitution, insertion, deletion).</p></li><li><p>Clear 
insights into where manual review may be needed and which vendors/customers are most error-prone.</p></li></ul></li></ol><p>🔹 Business Value</p><ul><li><p><strong>Data Quality Monitoring</strong> → Understand how user-entered names deviate from reference data.</p></li><li><p><strong>Explainable Matching</strong> → Not just similarity scores, but insights into <em>why</em> mismatches occur.</p></li><li><p><strong>Operational Efficiency</strong> → Identify entities requiring frequent manual corrections.</p></li><li><p><strong>Compliance Support</strong> → Improve accuracy for KYC, AML, and financial reconciliation tasks.</p></li></ul>

URL: exorbyte GmbH https://www.exorbyte.com/en

📂 Load Data

We start by loading two datasets:

  • Reference Data → clean payee names

  • Payee Names with Typos → noisy, user-entered names

🔍 Approximate String Matching

We apply the Approximate String Matcher (Levenshtein) to align entered names with the reference list.
This gives us a Match Sequence that explains where characters match, differ, or are missing.

Example:
Entered Name → JP Morgan Chse Ltd
Reference Name → JPMorgan Chase & Co

Match Sequence (illustrative format) → "oooo=ooo=ox=+"

Legend:

  • o → correct character match

  • = → substitution (wrong character)

  • x → deletion (character missing from the entered name)

  • + → insertion (extra character in the entered name)

This way, users see not only the similarity score but also why and where the names differ.
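To make the idea concrete, here is a minimal Python sketch of how such a sequence could be derived from a plain Levenshtein alignment. It is not the matcher node's implementation: the function name match_sequence is hypothetical, the direction of + and x (relative to the entered name) is an assumption, and when several optimal alignments exist the node may pick a different one.

```python
def match_sequence(entered: str, reference: str) -> str:
    """Align `entered` against `reference` and label each step o, =, + or x."""
    n, m = len(entered), len(reference)
    # dist[i][j] = edit distance between entered[:i] and reference[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if entered[i - 1] == reference[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # extra character in the entered name
                dist[i][j - 1] + 1,         # character missing from the entered name
            )

    # Walk back from the bottom-right corner to recover one optimal alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if entered[i - 1] == reference[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                ops.append("o" if cost == 0 else "=")
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append("+")  # assumption: + marks an extra entered character
            i -= 1
        else:
            ops.append("x")  # assumption: x marks a missing entered character
            j -= 1
    return "".join(reversed(ops))


if __name__ == "__main__":
    print(match_sequence("Chse", "Chase"))  # → ooxoo
```

In this sketch the number of non-o symbols equals the Levenshtein distance, which is what lets the later steps derive the edit distance from the sequence alone.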

🧮 Calculate Error Types

Using the Expression node, we analyze the Match Sequence:

  • Count substitutions (=), insertions (+), deletions (x), and matches (o)

  • Compute error ratios and overall match accuracy
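The two steps above can be sketched in Python as follows. The exact formulas used in the workflow's Expression node are not shown here, so the definitions below are assumptions: each ratio is the symbol count divided by the sequence length, and match accuracy is the share of o symbols. The function name error_profile and the dictionary keys are hypothetical.

```python
from collections import Counter


def error_profile(sequence: str) -> dict:
    """Count the symbols of a match sequence and derive simple ratios."""
    counts = Counter(sequence)
    matches = counts["o"]
    substitutions = counts["="]
    insertions = counts["+"]
    deletions = counts["x"]
    length = len(sequence)

    def ratio(count: int) -> float:
        # Guard against an empty sequence to avoid division by zero.
        return count / length if length else 0.0

    return {
        "matches": matches,
        "substitutions": substitutions,
        "insertions": insertions,
        "deletions": deletions,
        # Every non-match step is one edit operation.
        "edit_distance": substitutions + insertions + deletions,
        "substitution_ratio": ratio(substitutions),
        "insertion_ratio": ratio(insertions),
        "deletion_ratio": ratio(deletions),
        # Assumed definition: share of aligned positions that match.
        "match_accuracy": ratio(matches),
    }


if __name__ == "__main__":
    print(error_profile("oooo=ooo=ox=+"))
```

For the sequence "oooo=ooo=ox=+" this gives 8 matches, 3 substitutions, 1 insertion, and 1 deletion, so an edit distance of 5 over 13 aligned positions.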

📊 Aggregate by Reference

With the GroupBy node, we group results by the best-matching reference payee.
This shows the average error profile per entity (e.g., Deutsche Bank entries often miss letters).

We then use the Number Rounder node to make the statistics easier to read and compare.
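Outside of KNIME, the grouping and rounding steps could look roughly like the sketch below. It uses only the Python standard library. The row layout, the key names, and the choice of two decimal places are assumptions for illustration and are not taken from the workflow configuration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical metric names, matching one possible per-row error profile.
METRICS = ("substitution_ratio", "insertion_ratio", "deletion_ratio", "match_accuracy")


def average_profiles(rows, digits=2):
    """Average each metric per reference payee and round the result."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["reference"]].append(row)
    return {
        reference: {
            metric: round(mean(r[metric] for r in group), digits)
            for metric in METRICS
        }
        for reference, group in grouped.items()
    }


if __name__ == "__main__":
    # Made-up sample rows, only to show the expected input shape.
    sample = [
        {"reference": "Deutsche Bank AG", "substitution_ratio": 0.0,
         "insertion_ratio": 0.0, "deletion_ratio": 0.5, "match_accuracy": 0.5},
        {"reference": "Deutsche Bank AG", "substitution_ratio": 0.0,
         "insertion_ratio": 0.0, "deletion_ratio": 0.0, "match_accuracy": 1.0},
    ]
    print(average_profiles(sample))
```

A high average deletion ratio for one reference payee would correspond to the "entries often miss letters" pattern described above.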

📈 Error Statistics Dashboard

Finally, the results are presented as a clear error statistics table and visualization, helping business users understand:

  • Which payees are most affected by typos

  • What kinds of errors occur most often

  • Where manual review may be needed

Reference Data
CSV Reader
Aggregate by Reference
GroupBy
Typo Error Statistics
Calculate Error Types
Expression
Number Rounder
Payee Names with typos
CSV Reader
Term Matcher
