
ML Python 001 - Impute Missing Values - Prepare Data

KNIME and Python - Missing value imputation 001 - randomly delete some string and numeric values from the training dataset

URL: Kaggle House Prices: Advanced Regression Techniques https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
URL: MEDIUM BLOG - Data preparation for Machine Learning with KNIME and the Python “vtreat” package https://medium.com/lp/efcaf58fa783


Keep a row_id as an integer and also store the original ID for further use.
A row_id starting at 0 stays in sync with the default Pandas index.
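
The row_id idea can be sketched in Pandas as follows. This is a minimal illustration and not the workflow's own code: the small DataFrame and its values are made up, and only the column names Id and row_id follow the description above.

```python
import pandas as pd

# Hypothetical miniature of the training data; the values are invented.
df = pd.DataFrame({"Id": [1, 2, 3], "SalePrice": [100000, 150000, 200000]})

# Reset to the default 0-based index, then copy it into an integer row_id.
# The original Id column is kept untouched for later use.
df = df.reset_index(drop=True)
df["row_id"] = df.index.astype("int64")

print(df)
```

Because row_id equals the positional index, rows selected by row_id in KNIME can be matched to `df.iloc` positions in a notebook without an offset.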

Create the subfolders for the workflow group:
/data/
/model/
/script/ - contains Python Jupyter notebooks that also perform the imputations
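
For the notebooks in /script/, the same folder layout could be created from Python. This is a sketch under assumptions: the helper name `create_workflow_folders` is hypothetical, and a temporary directory stands in for the real workflow group location.

```python
from pathlib import Path
import tempfile


def create_workflow_folders(base: Path) -> list[Path]:
    """Create data/, model/ and script/ below base (hypothetical helper)."""
    folders = [base / name for name in ("data", "model", "script")]
    for folder in folders:
        # exist_ok=True makes repeated runs harmless.
        folder.mkdir(parents=True, exist_ok=True)
    return folders


# A temporary directory stands in for the workflow group folder.
base_dir = Path(tempfile.mkdtemp())
created = create_workflow_folders(base_dir)
print([p.name for p in created])
```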


START TEST
Column List Loop Start
../data/test_missings.table
Table Writer
restore the original order of columns
Reference Column Resorter
../data/impute_intermediate_test.table
Table Reader
split 70/30
Table Partitioner
../data/
Create Folder
../data/train.csv
CSV Reader
String to Number
row_id
RowID
construct file paths
Java Edit Variable
../data/train.table
Table Writer
../data/test.table
Table Writer
../model/
Create Folder
../script/
Create Folder
../data/impute_intermediate_train.table delete intermediate table
Delete Files/Folders
locate and create /data/ folder with absolute paths
Collect Local Metadata
type-based filter: keep the rest of the numeric variables (NUMERIC TRAIN)
Column Filter
Rest columns TRAIN
Reference Column Filter
create 12% missing values randomly per numeric column
Table Partitioner
remove numeric variables that should not receive missing values, such as Target and row_id
Column Filter
Catch Errors (Var Ports)
START TRAIN
Column List Loop Start
Try (Variable Ports)
bring the data back together
Concatenate
Merge Variables
row_id to integer
String to Number (PMML)
Merge Variables
Table Transposer
Merge Variables
Catch Errors (Var Ports)
remove the column from the 12% partition so its values become missing
Column Filter
replace all "NA" strings with real missing values (TRAIN 70)
String Manipulation (Multi Column)
Merge Variables
Try (Variable Ports)
Counter Generation
../data/train_missings.table
Table Writer
../data/impute_intermediate_test.table delete intermediate table
Delete Files/Folders
Rest columns TEST
Reference Column Filter
replace all "NA" strings with real missing values (TEST 30)
String Manipulation (Multi Column)
NUMERIC TEST
Reference Column Filter
../data/impute_intermediate_test.table store the rest of the data in an intermediate file
Table Writer
STOP TEST
Variable Loop End
../data/impute_intermediate_train.table store the rest of the data in an intermediate file
Table Writer
bring the row_id back but keep it
RowID
Sorter
join the manipulated column back with the rest
Joiner
row_id
RowID
../data/impute_intermediate_train.table keep the existing row_id
Table Reader
../data/impute_intermediate_train.table
Table Reader
../data/train.parquet
Parquet Writer
restore the original order of columns
Reference Column Resorter
../data/test.parquet
Parquet Writer
../data/impute_intermediate_train.table
Table Writer
../data/train_missings.parquet
Parquet Writer
STOP TRAIN
Variable Loop End
../data/test_missings.parquet
Parquet Writer
../data/impute_intermediate_test.table
Table Reader
remove the column from the 12% partition so its values become missing
Column Filter
Target SalePrice
Column Renamer
Column Name Extractor
bring the data back together
Concatenate
../data/impute_intermediate_test.table
Table Writer
create 12% missing values randomly per numeric column
Table Partitioner
join the manipulated column back with the rest
Joiner
Sorter
bring the row_id back but keep it
RowID
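
The node sequence above (replace "NA" with real missings, split 70/30, then blank out 12% of each numeric column except Target and row_id) can be approximated in Pandas. This is only a sketch under stated assumptions: the function `add_random_missings`, the toy columns LotArea and Alley, the seeds, and the use of an exact row count per column are illustrative choices and are not taken from the workflow.

```python
import numpy as np
import pandas as pd


def add_random_missings(df, exclude=("Target", "row_id"), fraction=0.12, seed=42):
    """Set a random share of each numeric column to NaN (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    numeric_cols = [
        c for c in out.select_dtypes(include="number").columns if c not in exclude
    ]
    n_missing = int(round(fraction * len(out)))
    for col in numeric_cols:
        # Draw a fresh set of rows for every column, without replacement.
        rows = rng.choice(out.index.to_numpy(), size=n_missing, replace=False)
        out[col] = out[col].astype("float64")  # integer columns cannot hold NaN
        out.loc[rows, col] = np.nan
    return out


# Toy data standing in for the training file; all values are invented.
toy_rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "row_id": np.arange(100),
    "LotArea": toy_rng.integers(1000, 20000, size=100),
    "Alley": ["NA", "Pave"] * 50,
    "Target": toy_rng.integers(50000, 500000, size=100),
})

# Replace the literal string "NA" with real missing values.
toy = toy.replace("NA", np.nan)

# 70/30 split, then create missings in both partitions.
train = toy.sample(frac=0.7, random_state=1)
test = toy.drop(train.index)
train_missings = add_random_missings(train)
test_missings = add_random_missings(test)

print(train_missings["LotArea"].isna().sum(), test_missings["LotArea"].isna().sum())
```

Excluded columns are left untouched, so Target and row_id stay complete and can be used to join the manipulated columns back to the rest of the data.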
