ML Python 400 - Impute KNN-Based Mixed-Type

KNN-Based Mixed-Type Imputation Workflow

Imputes missing values in numeric and categorical data using a K-Nearest Neighbors model with scaling and reversible encoding.

URL: Inspired by: Handling “Missing Data” Like a Pro — Part 3: Model-Based & Multiple Imputation Methods https://towardsdatascience.com/handling-missing-data-like-a-pro-part-3-model-based-multiple-imputation-methods-bdfe85f93087

Explanation of the KNN-based Imputation Method

This imputation pipeline uses K-Nearest Neighbors (KNN) to fill missing values in both numeric and categorical (string) variables.

1️⃣ Numeric variables

  • All numeric features are scaled to the 0–1 range using MinMaxScaler so that variables with large magnitudes (e.g., income vs. age) don’t dominate the distance calculation.

  • The KNNImputer then replaces each missing numeric value with the average of the corresponding values from its k nearest neighbors (default k = 5).

  • “Nearest” means “most similar” rows, measured by a NaN-aware Euclidean distance (scikit-learn's nan_euclidean metric) that uses only the features observed in both rows, numeric and encoded categorical alike.

  • After imputation, the numeric values are rescaled back to their original scale.
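The numeric steps above can be sketched as follows. This is a minimal illustration, not the workflow's actual script: the toy age/income table and the choice of k = 2 are assumptions made so the result is easy to follow.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: columns are age and income; NaN marks a missing value.
X = np.array([
    [25.0, 30000.0],
    [30.0, 40000.0],
    [35.0, np.nan],
    [40.0, 60000.0],
    [45.0, 70000.0],
])

# Scale every column to 0-1. MinMaxScaler ignores NaN when fitting
# and leaves NaN in place when transforming.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Fill each NaN with the mean of that column over the k nearest rows.
# k=2 is used here only because the toy table is tiny (the default is 5).
imputer = KNNImputer(n_neighbors=2)
X_imputed_scaled = imputer.fit_transform(X_scaled)

# Map the imputed values back to the original units.
X_imputed = scaler.inverse_transform(X_imputed_scaled)
print(X_imputed)
```

In this toy table the two rows closest in age to the incomplete row have incomes 40000 and 60000, so the gap is filled with their mean, 50000.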

2️⃣ Categorical (string) variables

  • Categorical columns are first converted to numeric codes using OrdinalEncoder. Each unique category becomes a distinct integer code, and missing values remain as NaN.

  • During KNN imputation, missing codes are replaced with the mean code of the nearest neighbors, which is then rounded to the nearest integer before converting back to the original category labels.

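  • This approximates a typical category among similar rows. Because the integer codes carry an arbitrary order, the rounded mean can also land on a category that none of the neighbors actually has (for example, neighbors with codes 0 and 2 average to code 1).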

3️⃣ Datetime variables

  • Datetimes (if present) are temporarily converted to seconds since epoch (numeric form) so that KNN can treat them as numeric features.
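The datetime round trip could look roughly like the sketch below, which uses pandas. The sample dates are hypothetical, and the mean fill is only a stand-in for the KNN step so that the conversion in both directions is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical datetime column with one missing entry (NaT).
ts = pd.Series(pd.to_datetime(["2024-01-01", None, "2024-01-03"]))

# Convert to seconds since the Unix epoch; NaT becomes NaN.
epoch = pd.Timestamp("1970-01-01")
seconds = (ts - epoch).dt.total_seconds()

# Stand-in for the imputation step: any numeric imputer could fill the NaN here.
filled = seconds.fillna(seconds.mean())

# Convert the numeric values back to datetimes.
restored = pd.to_datetime(filled, unit="s")
print(restored)
```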


✅ In summary:
KNN imputation fills gaps by “borrowing” information from similar rows rather than using a global statistic like mean or mode. It works for mixed data types (numeric + categorical) and preserves relationships between features, producing more context-aware imputations than simple mean/mode filling.

Apply KNN Missing Value Imputation to new data

The ../model/ folder will contain:

py_400_knn_imputer.pkl

py_400_knn_scaler.pkl

py_400_knn_encoder.pkl
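One way to store fitted objects under these file names and reuse them on new data is sketched below. It assumes the standard-library `pickle` module and a temporary directory in place of the workflow's model folder. The training and new-data arrays are hypothetical, and the encoder is left out for brevity (it would be saved and loaded the same way).

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Temporary directory used as a stand-in for the workflow's model folder.
model_dir = Path(tempfile.mkdtemp())

# Fit the scaler and imputer on hypothetical training data.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
scaler = MinMaxScaler().fit(X_train)
imputer = KNNImputer(n_neighbors=2).fit(scaler.transform(X_train))

# Persist the fitted objects.
for name, obj in [("py_400_knn_scaler.pkl", scaler),
                  ("py_400_knn_imputer.pkl", imputer)]:
    with open(model_dir / name, "wb") as fh:
        pickle.dump(obj, fh)

# Later: load the fitted objects and apply them to new data without refitting.
with open(model_dir / "py_400_knn_scaler.pkl", "rb") as fh:
    scaler_loaded = pickle.load(fh)
with open(model_dir / "py_400_knn_imputer.pkl", "rb") as fh:
    imputer_loaded = pickle.load(fh)

X_new = np.array([[2.2, np.nan]])
X_new_imputed = scaler_loaded.inverse_transform(
    imputer_loaded.transform(scaler_loaded.transform(X_new))
)
print(X_new_imputed)
```

Reusing the fitted scaler and imputer means new rows are filled from neighbors in the training data, so the same transformation is applied at training and scoring time.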

[Workflow diagram. Recoverable node labels:
  • Table Reader: ../data/train_missings.table and ../data/test_missings.table
  • Python Script: "KNN" (fit) and "KNN Apply" (apply to new data)
  • Model Writer / Model Reader: ../model/kn_knn_imputer_dictionary.zip
  • Table Difference Finder: "there should be no differences"
  • Conda environment (selected by operating system, Windows or macOS): conda_environment_impute
  • String Configuration: var_excluded_features (row_id, customer_number) and var_labelTarget
  • Collect Local Metadata: determine paths, locate and create the /data/ and /model/ folders with absolute paths
  • H2O Local Context, Table to H2O, H2O Column Filter, H2O Gradient Boosting Machine Learner (Regression), H2O Model to MOJO, H2O MOJO Predictor (Regression), each run on "regular data" and on "imputed data"
  • Math Formula (Target_Double), Merge Variables, Numeric Scorer: "compare the performance of the two approaches"
  • Supporting nodes: Joiner, Column Filter, Reference Column Filter, Column Name Replacer, Java Snippet (simple) (new_id), RowID]
