ML Python 100 - Impute Categorical with Dictionary Vectorizer

URL: Medium: Data preparation for Machine Learning with KNIME and the Python “vtreat” package https://medium.com/low-code-for-advanced-data-science/data-preparation-for-machine-learning-with-knime-and-the-python-vtreat-package-efcaf58fa783

KNIME Python Script: Convert Categorical Features with DictVectorizer and Store Metadata

Note: this workflow only demonstrates how the encoding can be done. In a real-world scenario you might also have to add dimensionality reduction and deal with missing numeric values.

For a related approach to data preparation, take a look at the use of vtreat (https://medium.com/lp/efcaf58fa783).

Short Summary

This script prepares a KNIME input table for machine learning by encoding categorical variables into numerical features using scikit-learn’s DictVectorizer.

Steps it performs:

  1. Load data from KNIME and reset the index to preserve row order.

  2. Identify feature groups: excluded columns, label column(s), numeric columns, categorical columns, and the rest.

  3. Vectorize categorical features:

    • Convert categories to strings,

    • Apply DictVectorizer to one-hot encode them,

    • Combine the encoded categorical data with the rest of the dataset.

  4. Generate metadata: Store lists of excluded, label, numeric, categorical, and transformed column names.

  5. Save the vectorizer vocabulary and feature names as JSON files for later use.

  6. Output results to KNIME:

    • A table with metadata,

    • The transformed training dataset,

    • The trained vectorizer object,

    • The metadata dictionary.

👉 In essence, this script transforms categorical columns into a machine-learning–ready numeric format (one-hot encoding), saves the mapping, and outputs both data and metadata back into KNIME.

../data/train_missings.table
Table Reader
../data/test_missings.table
Table Reader
Python Script
exclude Target, row_id
Column Filter
../model/kn_dictvec_variable_dictionary.zip
Model Reader
row_id restored
RowID
Python Script
../model/kn_dictvec_vectorizer.zip
Model Reader
../data/kn_train_dictvect.table
Table Writer
../data/kn_test_dictvect.table
Table Writer
row_id restored
RowID
var_label "Target"
String Configuration
/data/ /model/
determine paths
var_excluded_features row_id customer_number
String Configuration
../model/kn_dictvec_variable_dictionary.zip
Model Writer
exclude Target, row_id
Column Filter
../model/kn_dictvec_vectorizer.zip
Model Writer
Merge Variables
locate and create /data/ folder with absolute paths
Collect Local Metadata
keep only Target, row_id and the numeric columns (the original ones and the one-hot encoded ones)
Reference Column Filter
../model/kn_dictvec_variable_dictionary.table
Table Writer
Reference Column Filter
Activate Conda Environment based on Operating System (Windows or macOS)
conda_environment_impute