
05_Missing_Value_Handling

Missing Value Handling

Introduction to Machine Learning Algorithms course - Session 4
Exercise 2
Handle missing values in the data by
- Setting them to a fixed value (zero)
- Generating a dummy column based on missing values in another column
- Replacing them with the column mean or the most frequent value in the column
- Looking for different missing value patterns in the data
- Filtering out columns that have many missing values
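
For readers who prefer to think in code, the first three techniques can be sketched in pandas. This is only a rough analogue of the KNIME nodes, not what they do internally. The toy values, the "Electrical" column, and the helper name `handle_missing` are assumptions made for illustration; the other column names come from the exercise.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few AmesHousing columns (values are made up).
train = pd.DataFrame({
    "Garage Yr Blt": [1995.0, np.nan, 2005.0, 1980.0],
    "Lot Frontage": [80.0, np.nan, 60.0, 70.0],
    "Mas Vnr Area": [np.nan, 120.0, 0.0, 50.0],
    "Electrical": ["SBrkr", "SBrkr", None, "FuseA"],  # assumed extra column
})
test = pd.DataFrame({
    "Garage Yr Blt": [np.nan, 2010.0],
    "Lot Frontage": [np.nan, 55.0],
    "Mas Vnr Area": [30.0, np.nan],
    "Electrical": ["SBrkr", "SBrkr"],
})


def handle_missing(df, garage_mean):
    """Hypothetical helper mirroring the exercise steps."""
    out = df.copy()
    # Dummy column: 0 where the garage year is missing, 1 otherwise.
    out["Garage"] = out["Garage Yr Blt"].notna().astype(int)
    # Fixed-value imputation: set missing values to zero.
    zero_cols = ["Lot Frontage", "Mas Vnr Area"]
    out[zero_cols] = out[zero_cols].fillna(0)
    # Mean imputation, using a mean computed beforehand.
    out["Garage Yr Blt"] = out["Garage Yr Blt"].fillna(garage_mean)
    # Drop any rows that still contain missing values.
    return out.dropna()


# The mean is computed on the training set only and reused for the test set.
garage_mean = train["Garage Yr Blt"].mean()
train_clean = handle_missing(train, garage_mean)
test_clean = handle_missing(test, garage_mean)
```

The "Garage" flag is created before the mean is filled in; otherwise the information about which houses lack a garage would be lost. Reusing the training mean for the test set corresponds to applying the same missing value handling to both sets.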

Exercise: Missing Value Handling

1) The "Garage Yr Blt" column (year when the garage was built) has many missing values. This probably means that these houses don't have a garage. Therefore:
- Create a new feature "Garage" that gets the value 0 if the "Garage Yr Blt" feature is missing and 1 otherwise (Rule Engine node)

2) The dataset contains at least two columns with many missing values: "Lot Frontage" (linear feet of street connected to the property) and "Mas Vnr Area" (masonry veneer area).
- Set the missing values in these columns to zero (Missing Value node)
- Moreover, replace the missing values in the "Garage Yr Blt" column with the column mean
- Other missing values are few. Remove the remaining rows containing missing values.

3) Apply the same missing value handling to the test set (Missing Value (Apply) and Rule Engine nodes)

Optional: Take a look, for example, at the columns "Pool QC", "Fence", and "Misc Feature". They contain many "NA" values. This is probably because these houses don't have a pool, fence, or another miscellaneous feature for which to report the quality. So let's also handle these "NA" values as missing values.

1) Change the "NA" values into missing values (Column Auto Type Cast node)
- Include all columns
- Set the missing value pattern to "NA"

2) Filter out columns that have more than 30% missing values (Missing Value Column Filter node)

3) Set the remaining missing values to the column mean (numeric columns) or the most frequent value (String columns)

4) Apply the same missing value handling to the test set (Reference Column Filter and Missing Value (Apply) nodes)

Create Garage column (Rule Engine node)

Handle missing values (Missing Value node)
- Remove rows with string or integer missing values
- Set "Lot Frontage" and "Mas Vnr Area" missing values to 0
- Impute "Garage Yr Blt" missing values with the column mean

Optional:
- Handle the "NA" missing values (Column Auto Type Cast node)
- Remove columns with more than 30% missing values (Missing Value Column Filter node)

Handle remaining missing values (Missing Value node)
- Set the numeric missing values to the column mean and the string missing values to the most frequent value

Apply the above missing value handling to the test set (Missing Value (Apply) and Rule Engine nodes)

Remove the discarded columns from the test set (Reference Column Filter node)

Apply the same missing value handling to the test set (Missing Value (Apply) node)

Read AmesHousing.csv (CSV Reader node)

Preprocessing
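
The optional steps (treating "NA" as missing, dropping sparse columns, and imputing with the mean or the most frequent value) can also be sketched in pandas. Again, this is an illustrative analogue under stated assumptions, not the workflow's implementation: the toy values and the "Garage Type" column are made up, and `fill_values` is a hypothetical name for the statistics learned on the training set.

```python
import numpy as np
import pandas as pd

# Toy training and test sets; "NA" is a literal string, as in the raw file.
train = pd.DataFrame({
    "Pool QC": ["NA", "NA", "Gd", "NA"],
    "Fence": ["NA", "MnPrv", "NA", "NA"],
    "Garage Type": ["Attchd", "NA", "Attchd", "Detchd"],  # assumed column
    "Lot Frontage": [80.0, np.nan, 60.0, 70.0],
})
test = pd.DataFrame({
    "Pool QC": ["NA", "Ex"],
    "Fence": ["NA", "NA"],
    "Garage Type": ["NA", "Detchd"],
    "Lot Frontage": [np.nan, 90.0],
})

# 1) Treat the "NA" pattern as a missing value in all columns.
train = train.replace("NA", np.nan)
test = test.replace("NA", np.nan)

# 2) Keep only columns with at most 30% missing values in the training set.
missing_frac = train.isna().mean()
keep = list(missing_frac[missing_frac <= 0.30].index)

# 3) Learn the replacement values on the training set:
#    mean for numeric columns, most frequent value otherwise.
fill_values = {}
for col in keep:
    if pd.api.types.is_numeric_dtype(train[col]):
        fill_values[col] = train[col].mean()
    else:
        fill_values[col] = train[col].mode().iloc[0]

train_clean = train[keep].fillna(fill_values)

# 4) Apply the same column selection and replacement values to the test set.
test_clean = test[keep].fillna(fill_values)
```

Selecting `keep` from the training set and reusing it on the test set plays the role of the Reference Column Filter node, and reusing `fill_values` plays the role of the Missing Value (Apply) node.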
