Icon

03_​Data_​Cleaning_​Exercise

Data Cleaning - Solution

Solution to exercise 3 for the KNIME Analytics Platform for Data Wranglers course
- Explore the properties of the data with the Data Explorer node
- Manage numeric outliers
- Correct spelling mistakes manually and automatically
- Handle missing values
- Remove duplicate values



Exercise 3: Data CleaningIn this exercise you explore data, check the data for outliers, and handle missing values. 1) Explore the properties of thedata in the interactive DataExplorer view 2) Replace numeric outliers inthe "Age" column, i.e the valuesoutside the whiskers of aboxplot, by missing values 4a) (optional) Create the lookup table to correct spelling mistakes manually via the following steps: 4.1 Calculate the number of occurrences by country (Value Counter node) 4.2 Extract the country name from the RowID into a new column (RowID node) 4.3 Split the table based on the number of occurrences of the country names (<=10 and >10) (Row Splitter node) 4.4 In the table with the misspelling candidates, generate a constant value (an empty string for now) column called"Correct Country" (Constant Value Column node) 4.5 Correct the names of the misspelled countries by typing the correct names in the "Correct Country" column inthe Table Editor view (Table Editor node) 4.6 Replace the misspelled country names in the customer information table (Cell Replacer node) 4b) (optional) Create the lookup table to correct spelling mistakes automatically via the following steps: 4.7 Generate a numeric ID for each country in the lookup table (Counter Generation) 4.8 Search for the most similar country for each misspelling candidate. Output the numeric ID. (Similarity Searchnode) 4.9 Join the similarity search output with the correct country names based on the numeric ID 4.10 Replace the misspelled country names in the customer information table 3a) Replace the birthday with a missing value if age ismissing (Rule Engine node)Hint: Use the expression NOT MISSING $Age$=>$Birthday$3b) Replace the missing values in the age column withthe column mean3c) Remove rows for duplicate customer IDs Customer information coming from two systems CustomerInformation Exercise 3: Data CleaningIn this exercise you explore data, check the data for outliers, and handle missing values. 1) Explore the properties of thedata in the interactive DataExplorer view 2) Replace numeric outliers inthe "Age" column, i.e the valuesoutside the whiskers of aboxplot, by missing values 4a) (optional) Create the lookup table to correct spelling mistakes manually via the following steps: 4.1 Calculate the number of occurrences by country (Value Counter node) 4.2 Extract the country name from the RowID into a new column (RowID node) 4.3 Split the table based on the number of occurrences of the country names (<=10 and >10) (Row Splitter node) 4.4 In the table with the misspelling candidates, generate a constant value (an empty string for now) column called"Correct Country" (Constant Value Column node) 4.5 Correct the names of the misspelled countries by typing the correct names in the "Correct Country" column inthe Table Editor view (Table Editor node) 4.6 Replace the misspelled country names in the customer information table (Cell Replacer node) 4b) (optional) Create the lookup table to correct spelling mistakes automatically via the following steps: 4.7 Generate a numeric ID for each country in the lookup table (Counter Generation) 4.8 Search for the most similar country for each misspelling candidate. Output the numeric ID. (Similarity Searchnode) 4.9 Join the similarity search output with the correct country names based on the numeric ID 4.10 Replace the misspelled country names in the customer information table 3a) Replace the birthday with a missing value if age ismissing (Rule Engine node)Hint: Use the expression NOT MISSING $Age$=>$Birthday$3b) Replace the missing values in the age column withthe column mean3c) Remove rows for duplicate customer IDs Customer information coming from two systems CustomerInformation

Nodes

Extensions

Links