
1. Data Preparation

<p><strong>Data Preparation</strong></p><p>This workflow prepares the data for the next workflow ("My first Data Model") and uses some of the most common nodes for data preparation:</p><ul><li><p>Applying different strategies for missing values (<em>Missing Value</em> node)</p></li><li><p>Creating subsets of the data (<em>Row Sampler</em> and <em>Table Partitioner</em> nodes)</p></li><li><p>Shuffling (<em>Shuffle</em> node)</p></li><li><p>Concatenating data sets (<em>Concatenate</em> node)</p></li><li><p>Normalizing data (<em>Normalizer</em> and <em>Normalizer (Apply)</em> nodes)</p></li></ul><p>After preprocessing, the workflow writes the two subsets back to .csv files: one for the training set (top partition) and one for the test set (bottom partition).</p>

URL: KNIME Beginner's Luck (Book Homepage) https://www.knime.com/knimepress/beginners-luck

Workflow: Data Preparation


Reading data

Missing value handling

  • For all integer columns: replace with 0

  • For "age": replace with mean value

  • For "sex": remove row

  • For "workclass": remove row
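The four rules above can be sketched in plain Python. This is only an illustration of the strategies, not the Missing Value node's implementation. The sample rows and the "hours" column are hypothetical, and two points are assumptions: the column-specific rule for "age" takes precedence over the integer default, and the mean is computed over all observed values in the input.

```python
from statistics import mean

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"age": 39, "sex": "Male", "workclass": "Private", "hours": 40},
    {"age": None, "sex": "Female", "workclass": "Private", "hours": None},
    {"age": 50, "sex": None, "workclass": "State-gov", "hours": 13},
    {"age": 28, "sex": "Female", "workclass": None, "hours": 40},
]


def handle_missing(rows, int_columns, mean_column="age",
                   drop_columns=("sex", "workclass")):
    """Apply the listed missing-value strategies to a list of row dicts."""
    # Mean of the observed values in the input (assumption: computed
    # before any rows are removed).
    observed = [r[mean_column] for r in rows if r[mean_column] is not None]
    fill = mean(observed) if observed else None

    cleaned = []
    for row in rows:
        # "sex" / "workclass": remove the whole row if the value is missing.
        if any(row[c] is None for c in drop_columns):
            continue
        new_row = dict(row)  # copy so the input is left untouched
        # "age": replace with the mean value.
        if new_row[mean_column] is None:
            new_row[mean_column] = fill
        # All other integer columns: replace with 0.
        for c in int_columns:
            if c != mean_column and new_row[c] is None:
                new_row[c] = 0
        cleaned.append(new_row)
    return cleaned


cleaned = handle_missing(rows, int_columns=["age", "hours"])
```

Copying each row keeps the input table unchanged, which mirrors how a node produces a new output table rather than editing its input.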

Creating subsets

Shuffling

Concatenate

Writing data

Normalizing data
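The workflow applies z-score normalization to the training set with the Normalizer node and then reuses that normalization on the test set with Normalizer (Apply). A minimal sketch of this idea follows, with hypothetical values. Using the sample standard deviation is an assumption here, not a statement about how the node computes it.

```python
from statistics import mean, stdev


def fit_zscore(values):
    """Learn the z-score parameters (mean, standard deviation) from training values."""
    return mean(values), stdev(values)  # sample standard deviation (assumption)


def apply_zscore(values, mu, sigma):
    """Normalize values with parameters that were learned elsewhere."""
    if sigma == 0:
        raise ValueError("cannot normalize a constant column")
    return [(v - mu) / sigma for v in values]


# Hypothetical column values for the two partitions.
train_age = [20.0, 30.0, 40.0]
test_age = [30.0, 50.0]

mu, sigma = fit_zscore(train_age)                # parameters come from training data only
train_norm = apply_zscore(train_age, mu, sigma)
test_norm = apply_zscore(test_age, mu, sigma)    # same parameters reused on the test set
```

Reusing the training parameters on the test set keeps information about the test data out of the preprocessing step.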

Node annotations

  • Row Sampler: 20% subset, randomly drawn with seed

  • Table Partitioner: 50% split, drawn with linear sampling

  • CSV Writer: adult_training_set.csv

  • Shuffle: shuffle data randomly, no seed

  • CSV Writer: adult_test_set.csv

  • Concatenate: combine top + bottom subset

  • Missing Value: missing value handling

  • Normalizer: apply z-score normalization to training set (top partitioning)

  • CSV Reader: adult.csv with column headers

  • Normalizer (Apply): apply normalization to test set (bottom partitioning)
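The subset, shuffle, and concatenate steps can be sketched in plain Python as follows. The helper names are hypothetical, and the text does not state how the nodes are wired together, so each step is shown independently. Linear sampling is approximated here by taking every second row, which is an assumption about the 50% case. Keeping the sampled rows in their original order is also an assumption.

```python
import random


def sample_fraction(rows, fraction, seed):
    """Draw a random subset of the given fraction; the seed makes it reproducible."""
    rng = random.Random(seed)
    k = round(len(rows) * fraction)
    chosen = sorted(rng.sample(range(len(rows)), k))
    return [rows[i] for i in chosen]  # original row order kept (assumption)


def linear_split(rows):
    """Split 50/50 by alternating rows (approximation of linear sampling)."""
    return rows[0::2], rows[1::2]  # (top partition, bottom partition)


def shuffle_rows(rows, seed=None):
    """Return a shuffled copy; seed=None gives a different order on each run."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled


rows = list(range(20))                      # stand-in for table rows
subset = sample_fraction(rows, 0.2, seed=42)
top, bottom = linear_split(rows)
combined = top + bottom                     # concatenate top + bottom subset
shuffled = shuffle_rows(combined)           # no seed, as in the Shuffle node setting
```

With a fixed seed the 20% subset is identical on every run, while the unseeded shuffle is not.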
