Icon

03_​Data_​Cleaning_​Exercise

Data Cleaning - Solution

Solution to exercise 3 for the KNIME Analytics Platform for Data Wranglers course
- Explore the properties of the data with the Data Explorer node
- Manage numeric outliers
- Correct spelling mistakes manually and automatically
- Handle missing values
- Remove duplicate values



Exercise 3: Data CleaningIn this exercise you explore data, check the data for outliers, and handle missing values. 1) Explore the properties of thedata in the interactive DataExplorer view 2) Replace numeric outliers inthe "Age" column, i.e the valuesoutside the whiskers of aboxplot, by missing values 4a) (optional) Create the lookup table to correct spelling mistakes manually 4.1 Calculate the number of occurrences by country (Value Counter node) 4.2 Extract the country name information from the RowID into a new column (RowID node) 4.3 Split the table based on the number of occurrences of the country names (<=10 and >10) (Row Splitter node) 4.4 In the table with the misspelling candidates, generate a constant value (an empty string for now) column called"Correct Country" (Constant Value Column node) 4.5 Correct the names of the misspelled countries by typing the correct names in the "Correct Country" column inthe Table Editor view (Table Editor node) 4.6 Replace the misspelled country names in the customer information table (Cell Replacer node) 4b) (optional): Create the lookup table to correct spelling mistakes automatically 4.7 Generate a numeric ID for each country in the lookup table (Counter Generation). 4.8 Search for the most similar country for each misspelling candidate (Similarity Search). Output the numeric ID.(Similarity Search node) 4.9 Join the similarity search output with the correct country names based on the numeric ID. 4.10 Replace the misspelled country names in the customer information table 3a) Handle missing values 3.1 If the age of a customer is missing, replace thebirthday with a missing value (Rule Engine node)Hint: Use the expression NOT MISSING $Age$=>$Birthday$ 3.2 Impute the missing values in the age column with thecolumn mean3b) Remove duplicates in the customer data Customer information coming from two systems CustomerInformation Exercise 3: Data CleaningIn this exercise you explore data, check the data for outliers, and handle missing values. 1) Explore the properties of thedata in the interactive DataExplorer view 2) Replace numeric outliers inthe "Age" column, i.e the valuesoutside the whiskers of aboxplot, by missing values 4a) (optional) Create the lookup table to correct spelling mistakes manually 4.1 Calculate the number of occurrences by country (Value Counter node) 4.2 Extract the country name information from the RowID into a new column (RowID node) 4.3 Split the table based on the number of occurrences of the country names (<=10 and >10) (Row Splitter node) 4.4 In the table with the misspelling candidates, generate a constant value (an empty string for now) column called"Correct Country" (Constant Value Column node) 4.5 Correct the names of the misspelled countries by typing the correct names in the "Correct Country" column inthe Table Editor view (Table Editor node) 4.6 Replace the misspelled country names in the customer information table (Cell Replacer node) 4b) (optional): Create the lookup table to correct spelling mistakes automatically 4.7 Generate a numeric ID for each country in the lookup table (Counter Generation). 4.8 Search for the most similar country for each misspelling candidate (Similarity Search). Output the numeric ID.(Similarity Search node) 4.9 Join the similarity search output with the correct country names based on the numeric ID. 4.10 Replace the misspelled country names in the customer information table 3a) Handle missing values 3.1 If the age of a customer is missing, replace thebirthday with a missing value (Rule Engine node)Hint: Use the expression NOT MISSING $Age$=>$Birthday$ 3.2 Impute the missing values in the age column with thecolumn mean3b) Remove duplicates in the customer data Customer information coming from two systems CustomerInformation

Nodes

Extensions

Links