03_Outlier_Detection_solution

Outlier Detection - solution

Introduction to Machine Learning Algorithms course - Session 4
Solution to exercise 3
Detect and remove outliers in the data using the following techniques:
- Numeric outliers outside the upper/lower whiskers of a box plot
- Outliers in the distribution tails (z-score)
- Outliers remote from cluster centers (DBSCAN)




Exercise: Outlier Detection

Some houses might be special cases in terms of size, price, and the year when they were sold or built. Let's clean the data of these houses in order to build a better model!

1) Remove houses whose sales price lies outside the box plot whiskers, i.e. more than 1.5 interquartile ranges beyond the quartiles of all sales prices (Numeric Outliers node)
- Select the "SalePrice" column
- Set the interquartile range multiplier to 1.5

2) Optional: Remove roughly the 5 % of the houses that are the most extreme in terms of size (Normalizer and Rule-based Row Filter nodes)
- Normalize the "Lot Area" column using z-score
- Filter out houses whose normalized lot size is outside the range [-1.96, 1.96]

Workflow sections: "Numeric Outliers" and "Optional: Outliers in Distribution Tails"

Workflow nodes and their labels:
- CSV Reader: Read AmesHousing.csv
- Preprocessing
- Missing Value Handling
- Numeric Outliers: Remove numeric outliers in SalePrice
- Numeric Outliers (Apply): apply to test set
- Normalizer: z-score
- Normalizer (Apply): apply to test set
- Rule-based Row Filter (two instances): +/- 1.96 as threshold
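For readers who want to check the logic outside KNIME, the two exercise steps can be sketched in plain Python. This is only an illustration, not the nodes' actual implementation: the function names and the toy rows are made up for the example, the quartiles use the "inclusive" method of Python's `statistics.quantiles`, and the z-score uses the sample standard deviation. Either choice may differ from the node defaults. As in the workflow, the thresholds are fitted on the training data and then applied unchanged to the test set.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Whisker bounds Q1 - k*IQR and Q3 + k*IQR (quartile method is an assumption)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def zscore_params(values):
    """Mean and sample standard deviation used for z-score normalization."""
    return statistics.mean(values), statistics.stdev(values)

def filter_by_bounds(rows, column, bounds):
    """Keep rows whose value in `column` lies inside [lower, upper]."""
    lower, upper = bounds
    return [r for r in rows if lower <= r[column] <= upper]

def filter_by_zscore(rows, column, params, threshold=1.96):
    """Keep rows whose z-score in `column` lies inside [-threshold, threshold]."""
    mean, std = params
    return [r for r in rows if abs((r[column] - mean) / std) <= threshold]

# Toy training data (invented for illustration, not taken from AmesHousing.csv).
prices = [100, 120, 130, 140, 150, 160, 170, 180, 900]
lots = [8000, 8500, 9000, 50000, 9500, 10000, 10500, 11000, 12000]
train = [{"SalePrice": p, "Lot Area": a} for p, a in zip(prices, lots)]
test = [
    {"SalePrice": 155, "Lot Area": 9800},
    {"SalePrice": 500, "Lot Area": 9000},
    {"SalePrice": 140, "Lot Area": 60000},
]

# Step 1: fit the whisker bounds on the training prices, then filter both sets.
bounds = iqr_bounds([r["SalePrice"] for r in train], k=1.5)
train_step1 = filter_by_bounds(train, "SalePrice", bounds)
test_step1 = filter_by_bounds(test, "SalePrice", bounds)

# Step 2 (optional): fit mean/std on the remaining training rows,
# then drop rows with |z| > 1.96 in both sets using the same parameters.
params = zscore_params([r["Lot Area"] for r in train_step1])
train_clean = filter_by_zscore(train_step1, "Lot Area", params)
test_clean = filter_by_zscore(test_step1, "Lot Area", params)

print(bounds)
print(len(train), "->", len(train_clean), "training rows")
print(len(test), "->", len(test_clean), "test rows")
```

Fitting on the training rows only and reusing `bounds` and `params` for the test rows mirrors the role of the Numeric Outliers (Apply) and Normalizer (Apply) nodes: the test set never influences the thresholds.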
