Fuzzy Category Cleaner - Preparing Categorical Data for Machine Learning

<p><strong>🧹 Cleaning Noisy Categories for ML</strong></p><p><br>This workflow demonstrates how to <strong>clean categorical labels</strong> before training a machine learning model.</p><p>Real-world datasets often contain <strong>inconsistent or misspelled category values</strong> (e.g., Logiystics, Eduzcation, Healthcar). If used directly, these noisy labels fragment the data and reduce model accuracy.</p><p>🔑 <strong>Steps in this workflow:</strong></p><ol><li><p>📂 <strong>Load Product Sales Data</strong> – dataset with features: Units Sold, Purchase Probability, Sales Channel, and noisy Category.</p></li><li><p>🏷️ <strong>Reference Category Labels</strong> – define the valid set of canonical categories (Electronics, Logistics, Education, Healthcare, Finance).</p></li><li><p>🔍 <strong>Approximate String Matcher</strong> – apply Levenshtein distance to align noisy category values with their closest valid label.</p></li></ol><p>✅ <strong>Result:</strong> A cleaned dataset where all category labels are consistent and ML-ready.</p>

URL: exorbyte GmbH https://www.exorbyte.com/en

📂 Dataset Overview

We are working with a dataset that contains the following columns:

  • Units Sold – number of items sold (integer)

  • Purchase Probability – likelihood of purchase (float between 0 and 1)

  • Sales Channel – channel through which the sale was made (Online/Retail/Wholesale/Direct)

  • Category – product industry label (noisy, contains typos and inconsistent spellings)

The goal is to build a machine learning classifier to predict the Category column based on the other features.

However, since the Category column is noisy, we first need to:

  1. Clean the labels using the Approximate String Matcher to unify them into canonical categories (Electronics, Logistics, Education, Healthcare, Finance).

  2. Encode the cleaned labels into numeric form for model training.

This ensures the classifier learns from true patterns in the data rather than being confused by typos or formatting inconsistencies.
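A minimal sketch of the encoding step, assuming the labels have already been cleaned into the five canonical categories. The function name `encode_labels` and the sample values are hypothetical and only illustrate the idea; in the workflow itself this step would be handled by KNIME nodes rather than Python.

```python
# Canonical categories, in the order given in this workflow description.
CANONICAL = ["Electronics", "Logistics", "Education", "Healthcare", "Finance"]


def encode_labels(labels, classes=CANONICAL):
    """Map each cleaned label to the integer index of its canonical class.

    Raises ValueError if a label is not canonical, so leftover typos are
    reported instead of silently becoming extra classes.
    """
    index = {name: i for i, name in enumerate(classes)}
    unknown = sorted(set(labels) - set(index))
    if unknown:
        raise ValueError(f"labels still need cleaning: {unknown}")
    return [index[label] for label in labels]


# Hypothetical cleaned Category values, for illustration only.
cleaned = ["Logistics", "Education", "Healthcare", "Logistics"]
codes = encode_labels(cleaned)
print(codes)
```

Failing loudly on an unknown label is a deliberate choice here: a typo such as Logiystics that slips past cleaning would otherwise get its own class id and fragment the training data again.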

🏷️ Reference Category Labels

Define the list of valid labels:

  • Electronics

  • Logistics

  • Education

  • Healthcare

  • Finance

🔍 Term Matcher

In this step, we use the Levenshtein distance algorithm to compare the noisy Category labels in our dataset with the clean reference labels.

  • Each noisy label is matched to its closest valid category.

  • The best match is selected based on the lowest edit distance.

  • Example:

    • Logiystics → Logistics

    • Eduzcation → Education

    • Healthcar → Healthcare
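The matching idea can be sketched in plain Python. This is an illustrative implementation of Levenshtein distance with a nearest-label lookup, not the Term Matcher node's actual internals; the names `levenshtein` and `closest_label`, and the case-insensitive comparison, are assumptions made for the example.

```python
# Reference labels as listed in this workflow description.
VALID_LABELS = ["Electronics", "Logistics", "Education", "Healthcare", "Finance"]


def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def closest_label(noisy: str, valid=VALID_LABELS) -> str:
    """Pick the valid label with the lowest edit distance to the noisy value.
    On a tie, the label that appears first in `valid` is returned."""
    return min(valid, key=lambda label: levenshtein(noisy.lower(), label.lower()))


for noisy in ["Logiystics", "Eduzcation", "Healthcar"]:
    print(noisy, "->", closest_label(noisy))
```

Each of the three example typos is one edit away from its target label. In practice it is usually worth also setting a maximum allowed distance, so that a value unrelated to every reference label is flagged for review instead of being forced onto the nearest one.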

🔐 How to Get Your License

Use this node to request and register your exorbyte matchmaker license before running any toolbox nodes.

  1. Choose Demo (30 days) or Production.

  2. Enter your email (and Customer Token if production).

  3. Execute the node. It sends a secure request to the exorbyte team.

  4. If offline, manually email the request file to knime-node-license@exorbyte.com.

  5. When you receive the .lic file, reopen the node, select Use available license file, run the node, and then run License Activator.

⚠️ Each KNIME installation or Hub environment needs its own license.

👉 See full workflow guide: How to license exorbyte Extension

Product Sales
CSV Reader
Valid Labels
Table Creator
Normalizing the Category Names
Term Matcher
License Requester
License Activator
