Icon

03_​Data_​Cleaning

Exercise 3: Data CleaningIn this exercise you explore data, check the data for outliers, and handle missing values. 1) Explore the properties of thedata in the interactive DataExplorer view 2) Replace numeric outliers inthe "Age" column, i.e the valuesoutside the whiskers of aboxplot, by missing values 4a) Handle missing values 4.1 If the age of a customer is missing, replace thebirthday with a missing value (Rule Engine node)Hint: Use the expression NOT MISSING $Age$=>$Birthday$ 4.2 Impute the missing values in the age column with thecolumn mean4b) Remove duplicates in the customer data 3a) Create the lookup table to correct spelling mistakes manually 3.1 Calculate the number of occurrences by country (Value Counter node) 3.2 Extract the country name information from the RowID into a new column (RowID node) 3.3 Split the table based on the number of occurrences of the country names (<30 and >=30) (Row Splitter node) 3.4 In the table with the misspelling candidates, generate a constant value (an empty string for now) column called"Correct Country" (Constant Value Column node) 3.5 Correct the names of the misspelled countries by typing the correct names in the "Correct Country" column inthe Table Editor view (Table Editor node) 3.6 Replace the misspelled country names in the customer information table (Cell Replacer node) 3b) (optional): Create the lookup table to correct spelling mistakes automatically 3.7 Generate a numeric ID for each country in the lookup table (Counter Generation). 3.8 Search for the most similar country for each misspelling candidate (Similarity Search). Output the numeric ID.(Similarity Search node) 3.9 Join the similarity search output with the correct country names based on the numeric ID. 3.10 Replace the misspelled country names in the customer information table Correct SpellingMistakes based on Lookup-TableMissing(Age)= meanGet an OverviewCount CountryCountryFrom RowID to ColumnRemove Duplicate CustomersUpper: With MistakesLower: Without MistakesCustomer information coming from two systemsGenerate numeric IDfor each country Table Editor ConstantValue Column Cell Replacer Similarity Search Joiner Numeric Outliers Missing Value Data Explorer Value Counter RowID DuplicateRow Filter Rule Engine Row Splitter CustomerInformation Counter Generation Exercise 3: Data CleaningIn this exercise you explore data, check the data for outliers, and handle missing values. 1) Explore the properties of thedata in the interactive DataExplorer view 2) Replace numeric outliers inthe "Age" column, i.e the valuesoutside the whiskers of aboxplot, by missing values 4a) Handle missing values 4.1 If the age of a customer is missing, replace thebirthday with a missing value (Rule Engine node)Hint: Use the expression NOT MISSING $Age$=>$Birthday$ 4.2 Impute the missing values in the age column with thecolumn mean4b) Remove duplicates in the customer data 3a) Create the lookup table to correct spelling mistakes manually 3.1 Calculate the number of occurrences by country (Value Counter node) 3.2 Extract the country name information from the RowID into a new column (RowID node) 3.3 Split the table based on the number of occurrences of the country names (<30 and >=30) (Row Splitter node) 3.4 In the table with the misspelling candidates, generate a constant value (an empty string for now) column called"Correct Country" (Constant Value Column node) 3.5 Correct the names of the misspelled countries by typing the correct names in the "Correct Country" column inthe Table Editor view (Table Editor node) 3.6 Replace the misspelled country names in the customer information table (Cell Replacer node) 3b) (optional): Create the lookup table to correct spelling mistakes automatically 3.7 Generate a numeric ID for each country in the lookup table (Counter Generation). 3.8 Search for the most similar country for each misspelling candidate (Similarity Search). Output the numeric ID.(Similarity Search node) 3.9 Join the similarity search output with the correct country names based on the numeric ID. 3.10 Replace the misspelled country names in the customer information table Correct SpellingMistakes based on Lookup-TableMissing(Age)= meanGet an OverviewCount CountryCountryFrom RowID to ColumnRemove Duplicate CustomersUpper: With MistakesLower: Without MistakesCustomer information coming from two systemsGenerate numeric IDfor each country Table Editor ConstantValue Column Cell Replacer Similarity Search Joiner Numeric Outliers Missing Value Data Explorer Value Counter RowID DuplicateRow Filter Rule Engine Row Splitter CustomerInformation Counter Generation

Nodes

Extensions

Links