02_Missing_Value_Handling_exercise

Missing Value Handling - exercise

Introduction to Machine Learning Algorithms course - Session 4
Exercise 2
Handle missing values in the data by:
- Setting them to a fixed value (zero)
- Generating a dummy column based on missing values in another column
- Replacing them with the column mean or the most frequent value in the column
- Looking for different missing value patterns in the data
- Filtering out columns that have many missing values
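The techniques listed above can be sketched outside KNIME in plain Python. This is a minimal illustration, not the workflow's implementation: the rows are made-up toy data, `None` stands for a missing value, and the "Garage Type" column and the helper names `observed` and `fill` are assumptions introduced only for this example.

```python
from collections import Counter
from statistics import mean

# Toy rows standing in for a few AmesHousing columns; the values are made up.
# None marks a missing value.
rows = [
    {"Lot Frontage": 65.0, "Garage Yr Blt": 1990.0, "Garage Type": "Attchd"},
    {"Lot Frontage": None, "Garage Yr Blt": None, "Garage Type": None},
    {"Lot Frontage": 80.0, "Garage Yr Blt": 2005.0, "Garage Type": "Attchd"},
    {"Lot Frontage": None, "Garage Yr Blt": 2001.0, "Garage Type": "Detchd"},
]


def observed(rows, column):
    """Return the non-missing values of a column (hypothetical helper)."""
    return [row[column] for row in rows if row[column] is not None]


def fill(rows, column, value):
    """Replace missing values in a column with a given value, in place."""
    for row in rows:
        if row[column] is None:
            row[column] = value


# Dummy column: 0 if "Garage Yr Blt" is missing, 1 otherwise.
for row in rows:
    row["Garage"] = 0 if row["Garage Yr Blt"] is None else 1

# Fixed value: set missing "Lot Frontage" to zero.
fill(rows, "Lot Frontage", 0.0)

# Column mean for a numeric column.
fill(rows, "Garage Yr Blt", mean(observed(rows, "Garage Yr Blt")))

# Most frequent value for a string column.
most_frequent = Counter(observed(rows, "Garage Type")).most_common(1)[0][0]
fill(rows, "Garage Type", most_frequent)
```

The dummy column is created before the imputation, because the missing values it is derived from no longer exist afterwards.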

Create Garage column (Rule Engine node)

Handle missing values (Missing Value node)
- Remove rows with string or integer missing values
- Set "Lot Frontage" and "Mas Vnr Area" missing values to 0
- Impute "Garage Yr Blt" missing values with the column mean

Optional:
- Handle the NAs as missing values (Column Auto Type Cast node)
- Remove columns with more than 30% missing values (Missing Value Column Filter node)

Handle remaining missing values (Missing Value node)
- Set the numeric missing values to the column mean and the string missing values to the most frequent value

Apply the above missing value handling to the test set (Rule Engine and Missing Value (Apply) nodes)

- Handle the NAs as missing values in the test set (Column Auto Type Cast node)
- Remove the discarded columns from the test set (Reference Column Filter node)

Apply the same missing value handling to the test set (Missing Value (Apply) node)

Exercise: Missing Value Handling

1) The "Garage Yr Blt" column (year when the garage was built) has many missing values. This probably means that these houses don't have a garage. Therefore:
- Create a new feature "Garage" that gets the value 0 if the "Garage Yr Blt" feature is missing and 1 otherwise (Rule Engine node)
2) The dataset contains at least two columns with many missing values: "Lot Frontage" (linear feet of street connected to the property) and "Mas Vnr Area" (masonry veneer area).
- Set the missing values in these columns to zero (Missing Value node)
- Moreover, replace the missing values in the "Garage Yr Blt" column with the column mean (Missing Value node)
- Other missing values are few. Remove the remaining rows containing missing values (Missing Value node)
3) Apply the same missing value handling to the test set (Rule Engine and Missing Value (Apply) nodes)

Optional: Take a look, for example, at the columns "Pool QC", "Fence", and "Misc Feature". They contain many "NA" values. This is probably because these houses don't have a pool, fence, or another miscellaneous feature for which to report the quality. So let's also handle these "NA" values as missing values.
1) Change "NA" values into missing values (Column Auto Type Cast node)
- Include all columns
- Set the missing value pattern to "NA"
2) Filter out columns that have more than 30% missing values (Missing Value Column Filter node)
3) Set the remaining missing values to the column mean (numeric columns) or the most frequent value (string columns)
4) Apply the same missing value handling to the test set (Reference Column Filter and Missing Value (Apply) nodes)

Read AmesHousing.csv (CSV Reader)

Preprocessing
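The optional steps and the test-set handling described above can be sketched in plain Python as a "fit on training data, apply to test data" pattern. This is only an illustration under stated assumptions, not the KNIME nodes themselves: the toy rows and their values are made up, the "Lot Area" column is added just for the example, and the function names `mark_missing`, `fit_imputation`, and `apply_imputation` are hypothetical.

```python
from collections import Counter
from statistics import mean

MISSING_PATTERN = "NA"   # string treated as a missing value
MAX_MISSING_SHARE = 0.3  # discard columns with more than 30% missing values


def mark_missing(rows):
    """Replace the "NA" pattern with None in every column."""
    return [{k: (None if v == MISSING_PATTERN else v) for k, v in row.items()}
            for row in rows]


def fit_imputation(train_rows):
    """Learn the kept columns and their replacement values from training data."""
    replacements = {}
    for column in train_rows[0]:
        values = [row[column] for row in train_rows]
        present = [v for v in values if v is not None]
        missing_share = (len(values) - len(present)) / len(values)
        if missing_share > MAX_MISSING_SHARE:
            continue  # too many missing values: the column is discarded
        if all(isinstance(v, (int, float)) for v in present):
            replacements[column] = mean(present)  # numeric: column mean
        else:
            # string: most frequent value
            replacements[column] = Counter(present).most_common(1)[0][0]
    return replacements


def apply_imputation(rows, replacements):
    """Keep only the fitted columns and fill missing values with fitted ones."""
    return [{column: (value if row[column] is None else row[column])
             for column, value in replacements.items()}
            for row in rows]


# Made-up toy data; real AmesHousing values and missing shares differ.
train = mark_missing([
    {"Pool QC": "NA", "Fence": "MnPrv", "Lot Area": 8450},
    {"Pool QC": "NA", "Fence": "NA", "Lot Area": 9600},
    {"Pool QC": "Gd", "Fence": "MnPrv", "Lot Area": 11250},
    {"Pool QC": "NA", "Fence": "GdPrv", "Lot Area": 9550},
])
test = mark_missing([{"Pool QC": "NA", "Fence": "NA", "Lot Area": 10000}])

replacements = fit_imputation(train)
train_clean = apply_imputation(train, replacements)
test_clean = apply_imputation(test, replacements)
```

In this sketch the replacement values and the list of kept columns are learned from the training set only and then reused unchanged on the test set, which mirrors the role of the Reference Column Filter and Missing Value (Apply) nodes in the exercise.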
