Icon

03_​Data_​Cleaning

Data Cleaning - Exercise
Exercise 3: Data CleaningIn this exercise you explore data, check the data for outliers, and handle missing values. 1) Explore the properties of the data in theinteractive Data Explorer view 2) Replace numeric outliers in the "Age"column, i.e the values outside the whiskersof a boxplot, by missing values 4a) Handle missing values 4.1 If the age of a customer is missing, replace the birthday with a missingvalue (Rule Engine node)Hint: Use the expression NOT MISSING $Age$=> $Birthday$ 4.2 Impute the missing values in the age column with the column mean4b) Remove duplicates in the customer data 3a) Create the lookup table to correct spelling mistakes manually 3.1 Calculate the number of occurrences by country (Value Counter node) 3.2 Extract the country name information from the RowID into a new column (RowID node) 3.3 Split the table based on the number of occurrences of the country names (<30 and >=30) (Row Splitter node) 3.4 In the table with the misspelling candidates, generate a constant value (an empty string for now) column called "Correct Country" (Constant ValueColumn node) 3.5 Correct the names of the misspelled countries by typing the correct names in the "Correct Country" column in the Table Editor view (Table Editornode) 3.6 Replace the misspelled country names in the customer information table (Cell Replacer node) 3b) (optional): Create the lookup table to correct spelling mistakes automatically 3.7 Generate a numeric ID for each country in the lookup table (Counter Generation). 3.8 Search for the most similar country for each misspelling candidate (Similarity Search). Output the numeric ID. (Similarity Search node) 3.9 Join the similarity search output with the correct country names based on the numeric ID. 3.10 Replace the misspelled country names in the customer information table Customer information coming from two systems CustomerInformation Exercise 3: Data CleaningIn this exercise you explore data, check the data for outliers, and handle missing values. 1) Explore the properties of the data in theinteractive Data Explorer view 2) Replace numeric outliers in the "Age"column, i.e the values outside the whiskersof a boxplot, by missing values 4a) Handle missing values 4.1 If the age of a customer is missing, replace the birthday with a missingvalue (Rule Engine node)Hint: Use the expression NOT MISSING $Age$=> $Birthday$ 4.2 Impute the missing values in the age column with the column mean4b) Remove duplicates in the customer data 3a) Create the lookup table to correct spelling mistakes manually 3.1 Calculate the number of occurrences by country (Value Counter node) 3.2 Extract the country name information from the RowID into a new column (RowID node) 3.3 Split the table based on the number of occurrences of the country names (<30 and >=30) (Row Splitter node) 3.4 In the table with the misspelling candidates, generate a constant value (an empty string for now) column called "Correct Country" (Constant ValueColumn node) 3.5 Correct the names of the misspelled countries by typing the correct names in the "Correct Country" column in the Table Editor view (Table Editornode) 3.6 Replace the misspelled country names in the customer information table (Cell Replacer node) 3b) (optional): Create the lookup table to correct spelling mistakes automatically 3.7 Generate a numeric ID for each country in the lookup table (Counter Generation). 3.8 Search for the most similar country for each misspelling candidate (Similarity Search). Output the numeric ID. (Similarity Search node) 3.9 Join the similarity search output with the correct country names based on the numeric ID. 3.10 Replace the misspelled country names in the customer information table Customer information coming from two systems CustomerInformation

Nodes

Extensions

Links