
02 Cleaning and Standardization

Cleaning and Standardization - Exercise

This workflow is a hands-on exercise from the course L1-DS Introduction to KNIME Analytics Platform for Data Scientists - Basics.

Task 1: Row Filtering
1. Read the adult.csv file by executing the CSV Reader node
2. Filter out rows where the marital status is missing
3. Extract rows where
   - the marital status is divorced
   - the marital status is never married and age is between 20 and 40 (both included)
   - the workclass starts with "S"

Task 2: Column Filtering
1. Read the adult_education.table file by executing the Table Reader node
2. Exclude the "education-num" column
   - manually
   - by including only string type columns

Task 3: Data Transformation
1. Work with the adult.csv data again and create a new column "work-status" with the value "full-time" if the weekly working hours are >=40 and "part-time" otherwise
2. Replace the hyphen in "United-States" by a space character in the "native-country" column
3. Create a new column "year-of-birth" by subtracting the age number from 1994, which is the year when the data were collected
4. OPTIONAL: Replicate the tasks 3 & 4 with the Column Expressions node

Workflow nodes and their annotations:
- Row Filter: Filter out rows where the marital status is missing
- Row Filter: Divorced
- Row Filter: Workclass starts with "S"
- Row Filter: Never-married
- Row Filter: Never married and age is between 20 and 40
- Column Filter: Exclude education-num manually
- Column Filter: Exclude education-num by type selection
- CSV Reader: adult.csv
- Table Reader: Read adult_education.table
- Rule Engine: Create work status column
- String Manipulation: Replace "-" by " " in the native country column
- Math Formula: year-of-birth column
- Column Expressions: Complete the same tasks as above
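For readers who want to check their KNIME results against the intended logic, the row filters of Task 1 and the transformations of Task 3 can be sketched in plain Python. This is only an illustration, not part of the workflow: the sample rows are made up, the column names ("marital-status", "hours-per-week", and so on) are assumed rather than read from adult.csv, and applying the workclass filter to the rows that survive the missing-value filter is also an assumption.

```python
# Hypothetical sample rows; column names are assumptions, not read from adult.csv.
rows = [
    {"age": 35, "workclass": "Private", "marital-status": "Divorced",
     "hours-per-week": 40, "native-country": "United-States"},
    {"age": 25, "workclass": "State-gov", "marital-status": "Never-married",
     "hours-per-week": 20, "native-country": "United-States"},
    {"age": 52, "workclass": "Self-emp-inc", "marital-status": None,
     "hours-per-week": 60, "native-country": "Cuba"},
    {"age": 41, "workclass": "Private", "marital-status": "Never-married",
     "hours-per-week": 38, "native-country": "Mexico"},
]

COLLECTION_YEAR = 1994  # year the data were collected, per the exercise text

# Task 1, step 2: drop rows whose marital status is missing.
known = [r for r in rows if r["marital-status"] is not None]

# Task 1, step 3: three separate extractions, one per condition.
divorced = [r for r in known if r["marital-status"] == "Divorced"]
young_never_married = [
    r for r in known
    if r["marital-status"] == "Never-married" and 20 <= r["age"] <= 40
]
workclass_s = [r for r in known if (r["workclass"] or "").startswith("S")]


def transform(row):
    """Task 3, steps 1-3: add work-status, fix native-country, add year-of-birth."""
    out = dict(row)
    out["work-status"] = "full-time" if row["hours-per-week"] >= 40 else "part-time"
    out["native-country"] = row["native-country"].replace(
        "United-States", "United States"
    )
    out["year-of-birth"] = COLLECTION_YEAR - row["age"]
    return out


# Task 3 starts again from the unfiltered data.
transformed = [transform(r) for r in rows]
```

In the workflow itself these steps correspond to the Row Filter, Rule Engine, String Manipulation, and Math Formula nodes; the sketch only mirrors their conditions and does not reproduce any node configuration.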
