📂 Dataset Overview
We are working with a dataset that contains the following columns:
Units Sold – number of items sold (integer)
Purchase Probability – likelihood of purchase (float between 0 and 1)
Sales Channel – where the sale was made (Online, Retail, Wholesale, or Direct)
Category – product industry label (noisy, contains typos and inconsistent spellings)
The goal is to build a machine learning classifier to predict the Category column based on the other features.
However, since the Category column is noisy, we first need to:
Clean the labels with the Approximate String Matcher, mapping each noisy value to one of five canonical categories (Electronics, Logistics, Education, Healthcare, Finance).
Encode the cleaned labels into numeric form for model training.
This helps the classifier learn from true patterns in the data rather than being confused by typos or formatting inconsistencies.
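The two preparation steps can be sketched roughly as follows. This is only an illustration: the Approximate String Matcher's internals are not described here, so the sketch uses Python's standard-library difflib as a stand-in. The helper names clean_label and encode_labels, the 0.6 cutoff, and the sample noisy values are all assumptions made up for this example, not part of the dataset or the original tool.

```python
import difflib

# Canonical categories named in the overview.
CANONICAL = ["Electronics", "Logistics", "Education", "Healthcare", "Finance"]

# Lowercase lookup so matching is case-insensitive.
_LOOKUP = {name.lower(): name for name in CANONICAL}


def clean_label(raw, cutoff=0.6):
    """Map a noisy label to its closest canonical category.

    Hypothetical helper: returns None when nothing is similar enough.
    The cutoff value is an assumption and would need tuning on real data.
    """
    normalized = raw.strip().lower()
    matches = difflib.get_close_matches(normalized, list(_LOOKUP), n=1, cutoff=cutoff)
    return _LOOKUP[matches[0]] if matches else None


def encode_labels(labels):
    """Encode canonical category names as integers (position in CANONICAL)."""
    mapping = {name: idx for idx, name in enumerate(CANONICAL)}
    return [mapping[label] for label in labels]


# Made-up noisy values, for illustration only.
noisy = ["electronics ", "Helthcare", "FINANCE", "Logistcs", "Educaton"]
cleaned = [clean_label(value) for value in noisy]
encoded = encode_labels(cleaned)

print(cleaned)  # ['Electronics', 'Healthcare', 'Finance', 'Logistics', 'Education']
print(encoded)  # [0, 3, 4, 1, 2]
```

Labels that fall below the cutoff come back as None, so in practice they would need to be reviewed or dropped before calling encode_labels, which expects only canonical names.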