Icon

03_​Data_​Cleaning_​Solution

Data Cleaning - Solution

Solution to exercise 3 for the KNIME Analytics Platform for Data Wranglers course
- Explore the properties of the data with the Data Explorer node
- Manage numeric outliers
- Correct spelling mistakes manually and automatically
- Handle missing values
- Remove duplicate values



Exercise 3: Data CleaningIn this exercise you explore data, check the data for outliers, and handle missing values. 1) Explore the properties of the data in theinteractive Data Explorer view 2) Replace numeric outliers in the "Age"column, i.e the values outside the whiskersof a boxplot, by missing values 4a) Handle missing values 4.1 If the age of a customer is missing, replace the birthday with a missingvalue (Rule Engine node)Hint: Use the expression NOT MISSING $Age$=> $Birthday$ 4.2 Impute the missing values in the age column with the column mean4b) Remove duplicates in the customer data 3a) Create the lookup table to correct spelling mistakes manually 3.1 Calculate the number of occurrences by country (Value Counter node) 3.2 Extract the country name information from the RowID into a new column (RowID node) 3.3 Split the table based on the number of occurrences of the country names (<30 and >=30) (Row Splitter node) 3.4 In the table with the misspelling candidates, generate a constant value (an empty string for now) column called "Correct Country" (Constant ValueColumn node) 3.5 Correct the names of the misspelled countries by typing the correct names in the "Correct Country" column in the Table Editor view (Table Editornode) 3.6 Replace the misspelled country names in the customer information table (Cell Replacer node) 3b) (optional): Create the lookup table to correct spelling mistakes automatically 3.7 Generate a numeric ID for each country in the lookup table (Counter Generation). 3.8 Search for the most similar country for each misspelling candidate (Similarity Search). Output the numeric ID. (Similarity Search node) 3.9 Join the similarity search output with the correct country names based on the numeric ID. 3.10 Replace the misspelled country names in the customer information table Correct SpellingMistakes based on Lookup-TableMissing(Age)= meanGet an OverviewCount CountryCountryFrom RowID to ColumnRemove Duplicate CustomersUpper: With MistakesLower: Without MistakesGenerate numeric IDfor each countryCustomer information coming from two systems Table Editor ConstantValue Column Cell Replacer Similarity Search Joiner Numeric Outliers Missing Value Data Explorer Value Counter RowID DuplicateRow Filter Rule Engine Row Splitter Counter Generation CustomerInformation Exercise 3: Data CleaningIn this exercise you explore data, check the data for outliers, and handle missing values. 1) Explore the properties of the data in theinteractive Data Explorer view 2) Replace numeric outliers in the "Age"column, i.e the values outside the whiskersof a boxplot, by missing values 4a) Handle missing values 4.1 If the age of a customer is missing, replace the birthday with a missingvalue (Rule Engine node)Hint: Use the expression NOT MISSING $Age$=> $Birthday$ 4.2 Impute the missing values in the age column with the column mean4b) Remove duplicates in the customer data 3a) Create the lookup table to correct spelling mistakes manually 3.1 Calculate the number of occurrences by country (Value Counter node) 3.2 Extract the country name information from the RowID into a new column (RowID node) 3.3 Split the table based on the number of occurrences of the country names (<30 and >=30) (Row Splitter node) 3.4 In the table with the misspelling candidates, generate a constant value (an empty string for now) column called "Correct Country" (Constant ValueColumn node) 3.5 Correct the names of the misspelled countries by typing the correct names in the "Correct Country" column in the Table Editor view (Table Editornode) 3.6 Replace the misspelled country names in the customer information table (Cell Replacer node) 3b) (optional): Create the lookup table to correct spelling mistakes automatically 3.7 Generate a numeric ID for each country in the lookup table (Counter Generation). 3.8 Search for the most similar country for each misspelling candidate (Similarity Search). Output the numeric ID. (Similarity Search node) 3.9 Join the similarity search output with the correct country names based on the numeric ID. 3.10 Replace the misspelled country names in the customer information table Correct SpellingMistakes based on Lookup-TableMissing(Age)= meanGet an OverviewCount CountryCountryFrom RowID to ColumnRemove Duplicate CustomersUpper: With MistakesLower: Without MistakesGenerate numeric IDfor each countryCustomer information coming from two systems Table Editor ConstantValue Column Cell Replacer Similarity Search Joiner Numeric Outliers Missing Value Data Explorer Value Counter RowID DuplicateRow Filter Rule Engine Row Splitter Counter Generation CustomerInformation

Nodes

Extensions

Links