
01.01 Decision Tree exercise

[L4-ML] Machine Learning Algorithms - Specialization
01 Classification Models

- Partition the data into a training and a test set
- Train a Decision Tree model
- Apply the model to the test set
- Evaluate performance

URL: Description of the Ames Iowa Housing Data https://rdrr.io/cran/AmesHousing/man/ames_raw.html
URL: Ames Housing Dataset on kaggle https://www.kaggle.com/prevek18/ames-housing-dataset

01 - Classification Models

01.01 Decision Tree

Learning objective: In this exercise, you'll learn how to train a binary classification model that predicts whether a house's overall condition is high or low, and how to evaluate the model's performance with a Scorer node.


Workflow description: This workflow uses a dataset describing the sale of individual residential properties in Ames, Iowa from 2006 to 2010. One of its columns is the overall condition ranking, with values from 1 to 10.
The goal of this exercise is to train a binary classification model that predicts whether the overall condition is high or low. To that end, the workflow below reads the dataset and derives a class column, called rank, from the overall condition ranking: its value is low if the overall condition is smaller than or equal to 5, and high otherwise.
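Outside of KNIME, the same class column can be derived in a couple of lines. A minimal sketch in pandas, assuming the raw overall-condition column is named Overall Cond (the actual header in AmesHousing.csv may differ):

```python
import pandas as pd

# Toy stand-in for the Ames table; the real file would be read with
# pd.read_csv("AmesHousing.csv"), and the column name is an assumption.
df = pd.DataFrame({"Overall Cond": [3, 5, 6, 9, 2, 7]})

# rank = "low" if overall condition <= 5, otherwise "high"
df["rank"] = df["Overall Cond"].apply(lambda c: "low" if c <= 5 else "high")

print(df["rank"].tolist())  # ['low', 'low', 'high', 'high', 'low', 'high']
```

This mirrors what the class-derivation step of the workflow does before the data reaches the Partitioning node.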


You'll find the instructions to the exercises in the yellow annotations.

Step 1. Partitioning

Use the Partitioning node to divide the data into a training set (70%) and a test set (30%). Specifically, apply stratified sampling on the rank column to preserve the distribution of class values in both output tables.


Step 2. Decision Tree Learner

Train a Decision Tree model (using the Decision Tree Learner node) to predict the overall condition of a house as either high or low. Choose the rank column as the class column.


Step 3. Decision Tree Predictor

Use the trained model to predict the rank of houses in the test set using the Decision Tree Predictor node.
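Steps 2 and 3 together amount to fitting a classifier and applying it to held-out data. A sketch of the Learner/Predictor pair with scikit-learn's DecisionTreeClassifier on toy data (the max_depth value is an arbitrary illustration, not the node's default):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 4))
y_train = np.where(X_train[:, 0] > 0, "high", "low")  # toy rank column
X_test = rng.normal(size=(20, 4))

# Learner: fit the tree on the training partition
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# Predictor: apply the model to the test partition
pred = tree.predict(X_test)
# class probabilities, analogous to the normalized class distribution
# columns the Predictor node can append
proba = tree.predict_proba(X_test)
print(len(pred), proba.shape)
```

The probability columns are what the ROC Curve node consumes in the evaluation step.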


RANDOM FOREST

Data Preparation

Step 1. Partitioning

Use the Partitioning node to divide the data into a training set (70%) and a test set (30%). Specifically, apply stratified sampling on the rank column to preserve the distribution of class values in both output tables.


Step 2. Random Forest Learner

Train a Random Forest model to predict the overall condition of a house as either high or low. Choose the rank column as the class column.


Step 3. Random Forest Predictor

Use the trained model to predict the rank of houses in the test set using the Random Forest Predictor node.
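As with the decision tree, the Learner/Predictor pair has a direct scikit-learn analogue. A sketch with RandomForestClassifier on toy data (the n_estimators value is illustrative, not the node's configuration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 4))
y_train = np.where(X_train[:, 0] > 0, "high", "low")  # toy rank column
X_test = rng.normal(size=(25, 4))

# Learner: an ensemble of decision trees, each fit on a bootstrap sample
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Predictor: each tree votes; the majority class is returned
pred = forest.predict(X_test)
print(len(pred))
```

Averaging over many trees typically reduces the variance of a single decision tree, which is why the forest is often more robust on unseen data.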


Step 4. Model evaluation (Decision Tree)

  1. Evaluate the accuracy of the decision tree model using the Scorer node. Select rank as the actual column and Prediction (rank) as the predicted column. Determine and report the accuracy of the model.

  2. Visualize the ROC curve using the ROC Curve node. Ensure that the checkbox "append columns with normalized class distribution" in the Decision Tree Predictor node is activated. Select rank as the Class column, set high as the Positive class value, and include only the P (rank=high) column.

  3. Optional: Try different setting options for the decision tree algorithm. Can you improve the model performance?
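The two evaluation steps (accuracy from the Scorer, and the ROC curve computed from the appended class probabilities) can be sketched with scikit-learn metrics on toy predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

# Toy actual vs. predicted ranks, plus P(rank=high) scores, standing in
# for the columns the Predictor node appends to the test table.
actual = np.array(["high", "low", "high", "low", "high", "low"])
predicted = np.array(["high", "low", "low", "low", "high", "high"])
p_high = np.array([0.9, 0.2, 0.4, 0.1, 0.8, 0.6])

acc = accuracy_score(actual, predicted)        # the Scorer's accuracy
fpr, tpr, _ = roc_curve(actual, p_high, pos_label="high")
auc = roc_auc_score(actual == "high", p_high)  # area under the ROC curve

print(acc, auc)  # 0.666... and 0.888...
```

Accuracy depends on the 0.5 decision threshold, while the ROC curve summarizes the ranking quality of the probabilities across all thresholds, which is why both views are asked for here.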


Step 4. Model evaluation (Random Forest)

  1. Evaluate the accuracy of the Random Forest model using the Scorer node, as you did for the Decision Tree.


XGBOOST

Data Preparation

Step 1. Partitioning

Use the Partitioning node to divide the data into a training set (70%) and a test set (30%). Specifically, apply stratified sampling on the rank column to preserve the distribution of class values in both output tables.


Step 2. XGBoost Learner

Train an XGBoost model to predict the overall condition of a house as either high or low. Choose the rank column as the class column.


Step 3. XGBoost Predictor

Use the trained model to predict the rank of houses in the test set using the XGBoost Predictor node.



Step 4. Model evaluation (XGBoost)

  1. Evaluate the accuracy of the XGBoost model using the Scorer node, as you did for the Decision Tree.


Reference solution (workflow canvas). Each of the three branches reads AmesHousing.csv with a CSV Reader, derives the class column with Extract Class Information, and splits the data with a Table Partitioner. The branches then pair the matching Learner and Predictor nodes (Decision Tree Learner and Decision Tree Predictor; Random Forest Learner and Random Forest Predictor; XGBoost Tree Ensemble Learner and XGBoost Predictor) and end in a Scorer; one branch additionally saves the model with a PMML Writer. The Scorer nodes report accuracies of 82.4%, 81.8%, and 75.9%.
