Short Summary
This script prepares a KNIME input table for machine learning by encoding categorical variables into numerical features using scikit-learn’s DictVectorizer.
Steps it performs:
Load data from KNIME and reset the index so that rows keep a clean 0..n-1 position and stay aligned when columns are recombined later.
Identify feature groups: excluded columns, label column(s), numeric columns, categorical columns, and the rest.
Vectorize categorical features:
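Convert category values to strings, so that DictVectorizer one-hot encodes them instead of passing numeric codes through unchanged,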
Apply DictVectorizer to one-hot encode them,
Combine the encoded categorical data with the rest of the dataset.
Generate metadata: store the lists of excluded, label, numeric, categorical, and transformed (post-encoding) column names.
Save the vectorizer vocabulary and feature names as JSON files for later use.
Output results to KNIME:
A table with metadata,
The transformed training dataset,
The fitted vectorizer object,
The metadata dictionary.
👉 In essence, this script transforms categorical columns into a machine-learning–ready numeric format (one-hot encoding), saves the mapping, and outputs both data and metadata back into KNIME.