
ML Python 300 - Impute Numeric with Multiple Methods


URL: Handling “Missing Data” Like a Pro — Part 3: Model-Based & Multiple Imputation Methods https://towardsdatascience.com/handling-missing-data-like-a-pro-part-3-model-based-multiple-imputation-methods-bdfe85f93087
URL: MEDIUM BLOG - Data preparation for Machine Learning with KNIME and the Python “vtreat” package https://medium.com/lp/efcaf58fa783

KNIME Python Script: Train and Save Multiple Iterative Imputation Models for Handling Missing Numeric Values

Short Summary

This script, designed for use inside a KNIME Python node, prepares tabular data for imputation by:

  1. Loading input data from KNIME and extracting workflow variables such as the model path, excluded columns (like IDs), and the target label.

  2. Separating features into numerical and categorical columns, while storing metadata about excluded, label, numeric, categorical, and remaining columns.

  3. Initializing several machine learning regressors (ARDRegression, AdaBoost, Decision Trees, Extra Trees, KNN) as estimators for the IterativeImputer from scikit-learn.

  4. Training an imputer model for each estimator on the numeric features, then saving the trained imputers as compressed .pkl files (with LZMA compression) in the given path.

  5. Returning a dictionary of column classifications (excluded, label, numeric, categorical, rest) as the KNIME output object for downstream use.

👉 In essence, it creates a library of imputation models to handle missing values using different algorithms and saves them for later application.
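The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual script: the KNIME input table and flow variables are replaced by plain Python variables, the sample data and the column names (age, income, city, signup) are invented, only two of the five estimators are used to keep it short, the file-name pattern is an assumption based on the node labels, and the use of joblib for the LZMA-compressed .pkl files is also an assumption.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
# IterativeImputer is experimental; this import must come before it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-ins for the KNIME flow variables.
excluded = ["row_id", "customer_number"]
label = "Target"
model_dir = tempfile.mkdtemp()  # stands in for the ../model/ folder

# Invented sample table with a few missing numeric values.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "row_id": [f"Row{i}" for i in range(50)],
    "customer_number": np.arange(50),
    "age": rng.normal(40, 10, 50),
    "income": rng.normal(50000, 8000, 50),
    "city": pd.Categorical(rng.choice(["A", "B"], 50)),
    "signup": pd.date_range("2024-01-01", periods=50),
    "Target": rng.integers(0, 2, 50),
})
df.loc[::7, "age"] = np.nan
df.loc[::5, "income"] = np.nan

# Step 2: classify the remaining columns by dtype.
features = df.drop(columns=excluded + [label])
num_cols = features.select_dtypes(include="number").columns.tolist()
cat_cols = features.select_dtypes(include=["object", "category"]).columns.tolist()
rest_cols = [c for c in features.columns if c not in num_cols + cat_cols]

# Step 3: two of the estimators named in the summary.
estimators = {
    "DTrees": DecisionTreeRegressor(max_features="sqrt", random_state=0),
    "KNreg": KNeighborsRegressor(n_neighbors=5),
}

# Step 4: fit one IterativeImputer per estimator and save it LZMA-compressed.
saved_paths = {}
for name, estimator in estimators.items():
    imputer = IterativeImputer(estimator=estimator, max_iter=5, random_state=0)
    imputer.fit(features[num_cols])
    path = os.path.join(model_dir, f"py_{name}_imputer.pkl")
    joblib.dump(imputer, path, compress=("lzma", 3))
    saved_paths[name] = path

# Step 5: column classification to hand on to downstream nodes.
column_info = {
    "excluded": excluded,
    "label": [label],
    "numeric": num_cols,
    "categorical": cat_cols,
    "rest": rest_cols,
}
```

Inside a KNIME Python node the dictionary would be passed on as the node's output object instead of staying in a local variable; the remaining estimators would plug into the same loop.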


Learn the imputation models (numeric values only)

The results will be stored in the ../model/ folder. You can apply them to new or unseen data in the lower section of the workflow.

Apply the numeric imputation models learned in the code above
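Applying a stored imputer to unseen data might look like the sketch below. It is an illustration under assumptions, not the workflow's own code: the tiny training step exists only so the example is self-contained, the data and column names are invented, and the file name py_KNreg_imputer.pkl and the use of joblib are assumed from the node labels and the summary.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
# IterativeImputer is experimental; this import must come before it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsRegressor

num_cols = ["age", "income"]  # hypothetical numeric columns

# Setup only: fit and save a small imputer so there is a file to load.
rng = np.random.default_rng(1)
train = pd.DataFrame({
    "age": rng.normal(40, 10, 40),
    "income": rng.normal(50000, 8000, 40),
})
train.loc[::6, "age"] = np.nan
train.loc[3::6, "income"] = np.nan

model_path = os.path.join(tempfile.mkdtemp(), "py_KNreg_imputer.pkl")
imputer = IterativeImputer(
    estimator=KNeighborsRegressor(n_neighbors=5), max_iter=5, random_state=0
)
imputer.fit(train[num_cols])
joblib.dump(imputer, model_path, compress=("lzma", 3))

# The actual "apply" step: load the model and fill gaps in new data.
new_data = pd.DataFrame({
    "row_id": ["Row0", "Row1", "Row2"],
    "age": [35.0, np.nan, 52.0],
    "income": [48000.0, 51000.0, np.nan],
})
loaded = joblib.load(model_path)
imputed = new_data.copy()
imputed[num_cols] = loaded.transform(new_data[num_cols])

# Boolean mask of the cells that were filled in by the imputer.
changed = new_data[num_cols].isna() & imputed[num_cols].notna()
```

Only the numeric columns are passed to transform, in the same order used for fitting; excluded columns such as row_id are carried along untouched. The changed mask plays a role similar to the Table Difference Finder nodes, which show the imputed values.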
Workflow nodes (labels recovered from the workflow canvas):

  - conda_environment_impute: Activate Conda Environment based on Operating System (Windows or macOS)
  - determine paths: /data/ /model/
  - Collect Local Metadata: locate and create /data/ folder with absolute paths
  - Table Reader: ../data/train_missings.table, ../data/test_missings.table
  - String Configuration: var_excluded_features (row_id customer_number), var_labelTarget, py_ARDRegression_imputer, py_AdaBoostRegressor_imputer, py_DTrees_imputer, py_ETrees_imputer, py_KNreg_imputer
  - Python Script: py_ARDRegression_imputer, py_AdaBoostRegressor_imputer, py_DTrees_imputer, py_ETrees_imputer, py_KNreg_imputer
  - Model Writer and Model Reader: ../model/kn_multi_imputer_dictionary.zip
  - Table Difference Finder: see the imputed values
  - Further nodes: Merge Variables, Column Filter, Table Validator (Reference)

Estimators defined in the training Python Script:

    estimators = {
        # "bayesianridge": BayesianRidge(n_iter=25),
        "ARDRegression": ARDRegression(max_iter=25),
        "AdaBoostRegressor": AdaBoostRegressor(random_state=0),
        "DTrees": DecisionTreeRegressor(max_features='sqrt', random_state=0),
        "ETrees": ExtraTreesRegressor(n_estimators=100, random_state=0),
        "KNreg": KNeighborsRegressor(n_neighbors=5)
    }
