ML Python 200 - Impute Categorical and Numeric with Mice LightGBM

Two-Step MICE Imputation Workflow in KNIME with miceforest

This process trains a multiple imputation model on an initial dataset using miceforest (Step 1) and then applies the trained model to new data (Step 2). The training step prepares the schema, fits the LightGBM-based MICE kernel, and saves both the model and metadata. The application step ensures new data exactly matches the training schema before imputing missing values, guaranteeing consistent and reliable imputations across datasets.

URL: Handling “Missing Data” Like a Pro — Part 3: Model-Based & Multiple Imputation Methods https://towardsdatascience.com/handling-missing-data-like-a-pro-part-3-model-based-multiple-imputation-methods-bdfe85f93087
URL: MEDIUM BLOG - Data preparation for Machine Learning with KNIME and the Python “vtreat” package https://medium.com/lp/efcaf58fa783

Handling “Missing Data” Like a Pro — Part 3: Model-Based & Multiple Imputation Methods

https://medium.com/p/bdfe85f93087

Training & Saving the MICE Imputer

Purpose:
This script runs inside a KNIME Python node to prepare data, train a miceforest multiple-imputation model, and export both the trained kernel and key metadata for later use.

Key steps:

  • Reads KNIME’s input table and identifies:

    • Features to impute

    • Categorical vs. numeric columns

    • Column category levels

  • Converts categoricals to pd.Categorical and stores their levels

  • Trains a miceforest.ImputationKernel with LightGBM backend

  • Optionally completes the training data’s missing values

  • Outputs:

    • Output Table 0: Imputed training data

    • Output Object 0: The trained MICE kernel

    • Output Object 1: Metadata dictionary (schema info, categories, etc.)
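The training steps above can be sketched roughly as follows. This is a minimal sketch, not the workflow's actual script: the helper names (build_metadata, prepare_features, train_kernel), the layout of the metadata dictionary, and the toy table are assumptions made for illustration. The miceforest calls are limited to ImputationKernel, mice() and complete_data(), and miceforest is imported lazily so the schema helpers can be tried without it.

```python
import pandas as pd


def build_metadata(df, excluded_features, label):
    """Record which columns to impute, their kind, and their category levels."""
    skip = set(excluded_features) | {label}
    features = [c for c in df.columns if c not in skip]
    categorical = [c for c in features
                   if not pd.api.types.is_numeric_dtype(df[c])]
    numeric = [c for c in features if c not in categorical]
    levels = {c: sorted(df[c].dropna().astype(str).unique().tolist())
              for c in categorical}
    return {"features": features, "categorical": categorical,
            "numeric": numeric, "levels": levels}


def prepare_features(df, meta):
    """Cast the feature columns to the dtypes the imputer is trained on."""
    out = df[meta["features"]].reset_index(drop=True)
    for col in meta["categorical"]:
        raw = out[col].astype("object")
        # Keep missing values missing; turn everything else into text.
        text = raw.where(raw.isna(), raw.astype(str))
        out[col] = pd.Categorical(text, categories=meta["levels"][col])
    for col in meta["numeric"]:
        out[col] = pd.to_numeric(out[col], errors="coerce").astype("float64")
    return out


def train_kernel(features_df, iterations=10, random_state=42):
    """Fit a miceforest kernel (LightGBM-based) and return it with completed data."""
    import miceforest as mf  # lazy import: only needed when actually training

    kernel = mf.ImputationKernel(features_df, random_state=random_state)
    kernel.mice(iterations)
    return kernel, kernel.complete_data(dataset=0)


# Hypothetical toy table standing in for KNIME's input table.
df = pd.DataFrame({
    "row_id": [1, 2, 3, 4],
    "color": ["red", None, "blue", "red"],
    "size": [1.0, None, 3.0, 4.0],
    "Target": [0.1, 0.2, 0.3, 0.4],
})
meta = build_metadata(df, excluded_features=["row_id"], label="Target")
prepared = prepare_features(df, meta)
```

Inside a KNIME Python Script node, the input would typically come from knio.input_tables[0].to_pandas() (with knime.scripting.io imported as knio), and the completed table, the kernel, and the metadata dictionary would be assigned to knio.output_tables[0], knio.output_objects[0], and knio.output_objects[1]. A realistic training table needs far more rows than the toy example before train_kernel gives meaningful results.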

Applying the Trained MICE Imputer to New Data

Purpose:
This script takes new data in KNIME and applies the trained MICE imputer from Step 1, first making sure the new data’s schema matches the training schema exactly.

Key steps:

  • Loads the trained kernel and metadata from KNIME object ports

  • Reads the new KNIME input table

  • Aligns columns to the exact names and order used during training

  • Coerces column dtypes (categorical/numeric/etc.) and category levels to match training exactly

  • Runs impute_new_data() and completes the imputation

  • Reattaches untouched “rest” columns

  • Outputs:

    • Output Table 0: The fully imputed new dataset
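The schema alignment and imputation steps above might look like the sketch below. It assumes a metadata dictionary with the hypothetical keys features, categorical, numeric, and levels; the real workflow's dictionary may be organised differently. Two choices here are assumptions rather than documented behaviour of the workflow: category values not seen during training are turned into missing values (so they get imputed), and feature columns absent from the new data are added as all-missing.

```python
import numpy as np
import pandas as pd


def align_to_training_schema(new_df, meta):
    """Return (aligned feature frame, untouched 'rest' columns)."""
    new_df = new_df.reset_index(drop=True)
    aligned = pd.DataFrame(index=new_df.index)
    for col in meta["features"]:  # training order decides the column order
        if col in new_df.columns:
            values = new_df[col].astype("object")
        else:
            # Assumption: a feature missing from the new data is all-missing.
            values = pd.Series(np.nan, index=new_df.index, dtype="object")
        if col in meta["categorical"]:
            text = values.where(values.isna(), values.astype(str))
            # Values outside the training levels become missing.
            aligned[col] = pd.Categorical(text, categories=meta["levels"][col])
        else:
            aligned[col] = pd.to_numeric(values, errors="coerce").astype("float64")
    rest = new_df[[c for c in new_df.columns if c not in meta["features"]]]
    return aligned, rest


def apply_kernel(kernel, aligned, rest):
    """Impute the aligned features with a trained kernel and reattach the rest."""
    imputed = kernel.impute_new_data(new_data=aligned)
    completed = imputed.complete_data(dataset=0)
    return pd.concat([rest.reset_index(drop=True),
                      completed.reset_index(drop=True)], axis=1)


# Hypothetical metadata and new data, for illustration only.
meta = {
    "features": ["color", "size"],
    "categorical": ["color"],
    "numeric": ["size"],
    "levels": {"color": ["blue", "red"]},
}
new_df = pd.DataFrame({
    "size": ["2.5", None, "x"],
    "row_id": [10, 11, 12],
    "color": ["red", "green", None],
})
aligned, rest = align_to_training_schema(new_df, meta)
```

In a KNIME Python Script node the kernel and metadata would typically be read from knio.input_objects[0] and knio.input_objects[1], and the result of apply_kernel written to knio.output_tables[0] via knio.Table.from_pandas(). Doing the coercion before impute_new_data() matters because the kernel's LightGBM models expect the same columns and category encodings they were fitted on.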

MEDIUM BLOG

Data preparation for Machine Learning with KNIME and the Python “vtreat” package

https://medium.com/lp/efcaf58fa783

../data/train_missings.table
Table Reader
../data/test_missings.table
Table Reader
excluded_features = ['row_id']
label = 'Target'
Make sure to edit these settings if you have different columns you want to exclude
Python Script
imputed variables TRAIN
Reference Column Resorter
regular data
Table to H2O
H2O Local Context
H2O Model to MOJO
regular data
H2O MOJO Predictor (Regression)
../model/kn_mice_lightgbm_kernel_imputer.zip
Model Reader
apply the imputation
Python Script
../model/kn_mice_lightgbm_dictionary_imputer.zip
Model Writer
regular data
H2O Gradient Boosting Machine Learner (Regression)
Column Name Replacer
../model/kn_mice_lightgbm_kernel_imputer.zip
Model Writer
imputed variables TEST
Reference Column Resorter
new_id
Java Snippet (simple)
Extract Table Spec
Joiner
Target_Double
Math Formula
compare the performance of the two approaches:
RowID
Target_Double
Math Formula
Table Difference Finder
var_no_mice_iterations default=10
Integer Configuration
imputed data
Table to H2O
Column Name Replacer
Merge Variables
Numeric Scorer
../data/kn_train_mice_lightgbm_imputed.table
Table Writer
H2O Column Filter
Numeric Scorer
Extract Table Spec
H2O Model to MOJO
../data/kn_test_mice_lightgbm_imputed.table
Table Writer
imputed data
H2O Gradient Boosting Machine Learner (Regression)
Table Validator (Reference)
H2O Column Filter
imputed data
H2O MOJO Predictor (Regression)
Table Validator (Reference)
../model/kn_mice_lightgbm_dictionary_imputer.zip
Model Reader
Activate Conda Environment based on Operating System (Windows or macOS)
conda_environment_impute
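The settings named in the node labels above (excluded_features, label, and the var_no_mice_iterations configuration with a default of 10) could be gathered in one place, as in this small sketch. The helper read_settings is hypothetical; it accepts any mapping so it can be tried outside KNIME, where the mapping would be knio.flow_variables.

```python
def read_settings(flow_variables, default_iterations=10):
    """Resolve the script settings from a mapping of KNIME flow variables."""
    iterations = int(
        flow_variables.get("var_no_mice_iterations", default_iterations)
    )
    if iterations < 1:
        raise ValueError("var_no_mice_iterations must be at least 1")
    return {
        # Edit these two entries if your table uses different column names.
        "excluded_features": ["row_id"],
        "label": "Target",
        "iterations": iterations,
    }


# With no flow variable set, the default of 10 iterations applies.
settings_default = read_settings({})
# A value supplied by the Integer Configuration node overrides the default.
settings_custom = read_settings({"var_no_mice_iterations": 25})
```

Validating the iteration count early gives a clearer error than a failure deep inside the imputation call.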
