Explanation of the KNN-based Imputation Method
This imputation pipeline uses K-Nearest Neighbors (KNN) to fill missing values in both numeric and categorical (string) variables.
1️⃣ Numeric variables
All numeric features are scaled to the 0–1 range using MinMaxScaler so that variables with large magnitudes (e.g., income vs. age) don’t dominate the distance calculation.
The KNNImputer then replaces each missing numeric value with the average of the corresponding values from its k nearest neighbors (default k = 5).
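“Nearest” means “most similar” rows, measured by Euclidean distance over the features that both rows have observed (KNNImputer's default nan_euclidean metric skips missing coordinates), covering both numeric and encoded categorical columns.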
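To make the distance concrete, here is a minimal sketch using scikit-learn's nan_euclidean_distances helper. The two rows are made-up values, and the sketch assumes the pipeline keeps KNNImputer's default metric. Coordinates missing in either row are skipped, and the squared distance is scaled up by (total features / features present in both rows).

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Hypothetical rows with three features; row_a is missing its second feature.
row_a = np.array([[0.0, np.nan, 1.0]])
row_b = np.array([[3.0, 4.0, 1.0]])

# Only features 0 and 2 are observed in both rows:
#   squared difference = (0 - 3)^2 + (1 - 1)^2 = 9
#   weight             = 3 total features / 2 shared features = 1.5
#   distance           = sqrt(1.5 * 9)
distance = nan_euclidean_distances(row_a, row_b)[0, 0]
manual = np.sqrt(1.5 * 9.0)
print(distance, manual)
```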
After imputation, the numeric values are rescaled back to their original scale.
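The numeric steps above (scale, impute, rescale) can be sketched as follows. This is an illustration under assumptions, not the pipeline's actual code: the toy age/income values are invented, and n_neighbors=2 is used only because the example has four rows.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: columns are [age, income]; np.nan marks a missing value.
X = np.array([
    [25.0, 30000.0],
    [30.0, np.nan],
    [35.0, 50000.0],
    [60.0, 90000.0],
])

# 1. Scale every column to 0-1. MinMaxScaler ignores NaNs when fitting
#    and leaves them as NaN in the transformed output.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# 2. Fill each missing value with the mean of its k nearest neighbors.
imputer = KNNImputer(n_neighbors=2)  # library default is 5; 2 suits this tiny example
X_filled_scaled = imputer.fit_transform(X_scaled)

# 3. Map the values back to their original units.
X_filled = scaler.inverse_transform(X_filled_scaled)

print(X_filled[1, 1])  # imputed income for the 30-year-old row
```

Here the two rows closest in age (25 and 35) donate their incomes, so the imputed value is their average.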
2️⃣ Categorical (string) variables
Categorical columns are first converted to numeric codes using OrdinalEncoder. Each unique category becomes a distinct integer code, and missing values remain as NaN.
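During KNN imputation, missing codes are replaced with the mean code of the nearest neighbors, which is then rounded to the nearest integer before converting back to the original category labels.
This approximates the most typical category among similar rows, but only roughly: ordinal codes impose an arbitrary order, so the rounded mean can land on a category that none of the neighbors holds (neighbors coded 0 and 2 average to 1).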
3️⃣ Datetime variables
✅ In summary:
KNN imputation fills gaps by “borrowing” information from similar rows rather than using a global statistic like mean or mode. It handles mixed data types (numeric + categorical) and tends to preserve relationships between features, producing more context-aware imputations than simple mean/mode filling.