02_Outlier_Detection

Outlier Detection - exercise

Introduction to Machine Learning Algorithms course - Session 4
Exercise 3
Detect and remove outliers in the data using the following techniques:
- Numeric outliers outside the upper/lower whiskers of a box plot
- Outliers in the distribution tails (z-score)
- Outliers remote from cluster centers (DBSCAN)
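
The box-plot whisker rule from the first technique can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `iqr_bounds` and `remove_numeric_outliers` are hypothetical, the sale prices are made up rather than taken from the Ames data, the multiplier 1.5 is the conventional whisker length, and the quartile method used here may differ from the one a KNIME node uses.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) whisker bounds: Q1 - k*IQR and Q3 + k*IQR."""
    # "inclusive" treats the data as the full population; other quartile
    # conventions exist and can give slightly different bounds.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_numeric_outliers(rows, column, k=1.5):
    """Keep only the rows whose value in `column` lies within the whiskers."""
    lower, upper = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if lower <= row[column] <= upper]

# Made-up sale prices, for illustration only (not real Ames values).
prices = (100_000, 120_000, 130_000, 150_000, 160_000, 180_000, 900_000)
houses = [{"SalePrice": p} for p in prices]

kept = remove_numeric_outliers(houses, "SalePrice")
print(len(houses), "rows in,", len(kept), "rows kept")
```

In this toy data only the 900,000 entry falls above the upper whisker, so it is the single row that gets dropped.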

URL: Ames Housing Dataset on Kaggle https://www.kaggle.com/prevek18/ames-housing-dataset
URL: Description of the Ames Iowa Housing Data https://rdrr.io/cran/AmesHousing/man/ames_raw.html
URL: Four Techniques for Outlier Detection https://www.knime.com/blog/four-techniques-for-outlier-detection
URL: Slides (Introduction to ML Algorithms course) https://www.knime.com/form/material-download-registration

Exercise 02 Outlier Detection

Learning objective: In this exercise you learn how to clean the data to build a better model, in particular by applying outlier detection techniques.

Workflow description: In this workflow, outlier detection is applied to both the training and test datasets using numeric outlier identification. Optionally, the data can also be normalized using the z-score. You'll find the instructions for the exercises in the yellow annotations.

Step 1. Numeric Outliers
Remove houses whose sales price lies outside the box-plot whiskers, that is, more than 1.5 times the interquartile range below the first quartile or above the third quartile of all sales prices (Numeric Outliers node)
- Select the SalePrice column
- Set the interquartile range multiplier to 1.5

Step 2. Outliers in Distribution Tails (Optional)
Remove the 5 % of the houses that are the most extreme in terms of size (Normalizer and Rule-based Row Filter nodes)
- Normalize the Lot Area column using z-score
- Filter out houses whose normalized lot size is outside the range [-1.96, 1.96]

Workflow annotations: Data Preparation, Read AmesHousing.csv, Missing Value Handling, Preprocessing
Workflow nodes: CSV Reader, Numeric Outliers, Normalizer, Rule-based Row Filter, Normalizer (Apply), Rule-based Row Filter
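
Step 2 can be sketched in plain Python as follows. The workflow lists a Normalizer followed by a Normalizer (Apply), which suggests that the normalization parameters are learned on one table and reused on the other; the sketch mirrors that idea. The function names `fit_zscore` and `filter_by_zscore` are hypothetical, the lot areas are made up, and the use of the sample standard deviation is an assumption that may not match what the KNIME node computes.

```python
import statistics

def fit_zscore(values):
    """Learn the mean and standard deviation from the training values."""
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return statistics.fmean(values), statistics.stdev(values)

def filter_by_zscore(rows, column, mean, std, limit=1.96):
    """Keep rows whose z-score in `column` lies within [-limit, limit]."""
    return [row for row in rows if abs((row[column] - mean) / std) <= limit]

# Made-up lot areas, for illustration only (not real Ames values).
train = [{"Lot Area": a} for a in
         (8000, 8500, 9000, 9500, 10000, 10500, 11000, 11500, 12000, 60000)]
test = [{"Lot Area": a} for a in (9000, 50000, 10000)]

# Fit on the training table only, then reuse the same parameters on the
# test table so both are filtered on the same scale.
mean, std = fit_zscore([row["Lot Area"] for row in train])
train_kept = filter_by_zscore(train, "Lot Area", mean, std)
test_kept = filter_by_zscore(test, "Lot Area", mean, std)

print(len(train_kept), "training rows kept,", len(test_kept), "test rows kept")
```

The range [-1.96, 1.96] covers about 95 % of a normal distribution, which is where the "5 %" in the step comes from. For a skewed column such as lot size, the share of rows actually removed can differ from 5 %.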
