
Session 4

Example I: Decision Tree Model to Predict Which Customers Receive Discount (Yes/No)

Example II: Linear Regression Model to Predict the Total Spending by the Number of Purchased Items

Example III: K-Means Clustering to Identify Clusters of Data with Similar Feature Values

Example I (Bonus): Random Forest Model to Predict Which Customers Receive Discount (Yes/No) with a Parameter Optimization Loop

L2-DS Demo Workflow for Session 4

Classification, Regression, and Clustering

Learning objectives:

  • Train supervised ML models

  • Evaluate supervised ML models

  • Train unsupervised ML models


Workflow description: A data set with customer transaction information is used in these examples.

In the first example, a decision tree model is trained to classify which customers have received a discount. The performance of the trained model is then evaluated.

In the second example, a regression model is trained to predict a customer's total spending from the number of unique items purchased. The performance of the trained model is then evaluated.

In the third example, data points with similar feature values are grouped into clusters by the k-Means algorithm.

Loading the customer data

The metanode Loading Customer Data reads the customer transaction data and the membership data from two sources. The output data set contains the following customer information:

  • n_items: number of unique items purchased

  • Total spending

  • City

  • Newsletter: newsletter subscription

  • Customer group: CC1, CC2, or CC3

  • Discount: whether the customer received a discount

  • Credit risk

  • Churn score



Removing unnecessary column

With a Column Filter node, remove the unnecessary column CustomerID from the data table.



Training & testing data

With a Partitioning node, the data set is split into the training & testing partitions, comprising 70% & 30% of the original data, respectively. The partitions are stratified by the target column Discount, so that the proportions of the target classes (Yes/No) are preserved in the training & testing partitions.
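The same stratified 70/30 split can be sketched in Python with scikit-learn's `train_test_split` (the toy data and column names here are invented stand-ins for the customer table):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the customer table; "Discount" is the target column.
df = pd.DataFrame({
    "n_items":  [3, 7, 2, 9, 4, 6, 1, 8, 5, 10],
    "Discount": ["Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"],
})

# stratify=df["Discount"] preserves the Yes/No proportions in both partitions,
# mirroring KNIME's "Stratified sampling" option.
train, test = train_test_split(df, train_size=0.7,
                               stratify=df["Discount"], random_state=1)
```

With 10 rows and a 70/30 split, the training partition receives 7 rows and the testing partition 3, each containing both target classes.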



Training a decision tree model

Train a decision tree model with the Decision Tree Learner node with the following settings:

  • Input data: Training data (top output of the Partitioning node)

  • Class column: Discount

  • Min number records per node: 5
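As a rough scikit-learn analogue of these settings (toy data; `min_samples_leaf` only approximates KNIME's "min number records per node"):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy training data: customers with few items got a discount, others did not.
X_train = [[1], [2], [3], [4], [5], [6], [10], [11], [12], [13], [14], [15]]
y_train = ["Yes"] * 6 + ["No"] * 6

# min_samples_leaf=5 plays a role similar to the
# "min number records per node: 5" setting above.
tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)
```

On this separable toy data the tree learns a single split on n_items, so small values predict "Yes" and large values predict "No".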



Prediction on the testing data

Apply the trained decision tree model to the testing data with the Decision Tree Predictor node with the following settings:

  • Input model: output of the Decision Tree Learner node

  • Input data: Testing data (bottom output of the Partitioning node)

  • Check Append columns with normalized class distribution
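In scikit-learn terms, applying the trained model and appending the normalized class distribution corresponds roughly to `predict` plus `predict_proba` (again with invented toy data):

```python
from sklearn.tree import DecisionTreeClassifier

# Train on toy data, then apply the model to unseen rows.
X_train = [[1], [2], [10], [11]]
y_train = ["Yes", "Yes", "No", "No"]
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

X_test = [[3], [9]]
pred = clf.predict(X_test)          # hard class predictions
proba = clf.predict_proba(X_test)   # class probabilities, columns ordered as clf.classes_
```

The probability columns are the analogue of the appended P(Discount=Yes) / P(Discount=No) columns in KNIME.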



Evaluating the model fit

Evaluate the performance of the trained model by comparing the true target and the predicted outcome in the Scorer node.

  • Input data: prediction from the Decision Tree Predictor node

  • First column: Discount

  • Second column: Prediction (Discount)

Open the view to see the confusion matrix and model performance metrics (accuracy, etc.).
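The confusion matrix and accuracy computed by the Scorer node correspond to the following scikit-learn metrics (toy labels for illustration):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Toy true vs. predicted Discount labels.
y_true = ["Yes", "No", "Yes", "No", "Yes"]
y_pred = ["Yes", "No", "No", "No", "Yes"]

# Rows = true class, columns = predicted class, in the given label order.
cm = confusion_matrix(y_true, y_pred, labels=["Yes", "No"])
acc = accuracy_score(y_true, y_pred)
```

Here one "Yes" customer is misclassified as "No", giving 4 correct predictions out of 5.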



Plotting an ROC curve

Plot an ROC curve to evaluate the performance of the model with the ROC Curve node:

  • Input data: prediction from the Decision Tree Predictor node

  • Target column: Discount

  • Positive class value: Yes

  • Include the column P (Discount=Yes) to be plotted

Open the view to see the ROC curve and the area under the curve
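The ROC curve and its area can be sketched with scikit-learn as well, using the positive-class probabilities (toy values; here the classifier separates the classes perfectly, so the AUC is 1):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# 1 = "Yes" (positive class); p_yes plays the role of P(Discount=Yes).
y_true = [1, 0, 1, 0, 1, 0]
p_yes = [0.9, 0.2, 0.8, 0.4, 0.7, 0.1]

# fpr/tpr trace the ROC curve; auc is the area under it.
fpr, tpr, thresholds = roc_curve(y_true, p_yes)
auc = roc_auc_score(y_true, p_yes)
```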



Partitioning

With a Partitioning node, the data set is split into the training & testing partitions, comprising 70% & 30% of the original data, respectively. We set Draw randomly to randomly assign data to either partition.



Training a linear regression model

Train a linear regression model with the Linear Regression Learner node with the following settings:

  • Input data: Training data (top output of the Partitioning node)

  • Target: Total spending

  • n_items is included as the predictor
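A minimal scikit-learn sketch of the same regression setup, on invented toy data where spending grows linearly with the number of items:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from spending = 5 * n_items + 10.
n_items = np.array([[1], [2], [3], [4], [5]])
spending = np.array([15, 20, 25, 30, 35])

# Fit a line: coef_ is the slope, intercept_ the offset.
reg = LinearRegression().fit(n_items, spending)
```

The fitted model recovers the slope 5 and intercept 10 exactly on this noise-free data.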



Prediction on the testing data

Apply the trained regression model to the testing data with the Regression Predictor node with the following settings:

  • Input model: output of the Linear Regression Learner node

  • Input data: Testing data (bottom output of the Partitioning node)



Evaluating the model fit

Evaluate the performance of the trained model by comparing the true target and the predicted outcome in the Numeric Scorer node.

  • Input data: prediction from the Regression Predictor node

  • Reference column: Total spending

  • Predicted column: Prediction (Total spending)

Open the view to see the model performance metrics (R^2, MSE, etc.).
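The R^2 and MSE reported by the Numeric Scorer node correspond to these scikit-learn metrics (toy true/predicted values for illustration):

```python
from sklearn.metrics import r2_score, mean_squared_error

# Toy true vs. predicted Total spending values.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]

r2 = r2_score(y_true, y_pred)               # 1 - SSE/SST
mse = mean_squared_error(y_true, y_pred)    # mean of squared errors
```

With squared errors 4, 4, 9, 1 the MSE is 18/4 = 4.5, and with SST = 500 the R^2 is 1 - 18/500 = 0.964.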



Normalizing numerical columns

With a Normalizer node, all the numerical columns are scaled to the [0, 1] range.
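This min-max scaling is the same transformation as scikit-learn's `MinMaxScaler` (toy two-column data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two toy numerical columns on very different scales.
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 600.0]])

# Each column is independently rescaled to [0, 1].
X_scaled = MinMaxScaler().fit_transform(X)
```

Scaling both columns to a common range matters for k-Means below, since the algorithm is distance-based and would otherwise be dominated by the column with the larger scale.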



k-Means clustering

Cluster data points into clusters with a k-Means node with the following settings:

  • Number of clusters: 3

  • Features to include: n_items, Total spending, Credit risk, Churn score

Cluster assignments are denoted in the new column Cluster in the output data set.
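A minimal sketch of the same clustering step with scikit-learn's `KMeans`, on toy 2-D data with three well-separated groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three obvious groups of toy points.
X = np.array([[0.0, 0.0], [0.1, 0.1],
              [5.0, 5.0], [5.1, 5.1],
              [10.0, 0.0], [10.1, 0.1]])

# fit_predict returns the cluster label for each row,
# analogous to KNIME's appended "Cluster" column.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```

On separated data like this, points in the same group receive the same label.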



Visualizing clusters

The clusters found in the previous step are visualized in a scatter plot.

  1. The Color Manager node assigns different colors to different classes in the column Cluster.

  2. In the Scatter Plot node, Credit risk (x-axis) is plotted against Churn score (y-axis), with Cluster as the color dimension.



Visualizing clusters 2

The clusters found in the previous step are visualized in a scatter plot.

  1. The Color Manager node assigns different colors to different classes in the column Customer group.

  2. In the Scatter Plot node, Credit risk (x-axis) is plotted against Churn score (y-axis), with Customer group as the color dimension.



Training a random forest model

Train a random forest with the Random Forest Learner node with the following settings:

  • Flow variable connection from the Parameter Optimization Loop Start node

  • The maximum tree depth (maxLevels): max_depth

  • The number of models (nrModels): n_trees



Parameters to be optimized

Start a loop with the Parameter Optimization Loop Start node, which sets the following parameters:

  • max_depth: the maximum tree depth, integer between 2 & 10 with step size 1

  • n_trees: number of trees, integer between 10 & 100 with step size 10

  • Search strategy: Brute Force
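The brute-force search over both parameters can be approximated in Python by nested loops over the same ranges (toy data from `make_classification` stands in for the customer table):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy classification data in place of the customer table.
X, y = make_classification(n_samples=120, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)

results = []
for max_depth in range(2, 11):           # 2..10, step size 1
    for n_trees in range(10, 101, 10):   # 10..100, step size 10
        rf = RandomForestClassifier(max_depth=max_depth,
                                    n_estimators=n_trees,  # number of models
                                    random_state=0)
        rf.fit(X_tr, y_tr)
        acc = accuracy_score(y_te, rf.predict(X_te))
        results.append((max_depth, n_trees, acc))

# Best parameter combination by accuracy, as the loop end node would report.
best = max(results, key=lambda r: r[2])
```

This examines all 9 × 10 = 90 parameter combinations, which is exactly what the Brute Force strategy does.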



Prediction on the testing data

Apply the trained random forest model to the testing data with the Random Forest Predictor node with the following settings:

  • Input model: output of the Random Forest Learner node

  • Input data: Testing data (bottom output of the Partitioning node)



Evaluating the model fit

Evaluate the performance of the trained model by comparing the true target and the predicted outcome in the Scorer node.

  • Input data: prediction from the Random Forest Predictor node

  • First column: Discount

  • Second column: Prediction (Discount)

Open the view to see the confusion matrix and model performance metrics (accuracy, etc.).



Recording the parameters

The Parameter Optimization Loop End node records the performance metrics for the various parameter values:

  • Flow variable connection from the Scorer node

  • We want to maximize the Accuracy

The top output port provides the best parameter combination and its accuracy as a table. The bottom output port provides a table listing all examined parameter combinations and their accuracies.



Workflow nodes (annotation — node):

  • Loading Customer Data (metanode)

  • Removing Customer ID — Column Filter

  • Training:Testing 70:30, stratified on discount — Table Partitioner

  • Training decision tree model — Decision Tree Learner

  • Applying the model to testing data — Decision Tree Predictor

  • Assessing the model fit — Scorer

  • Plotting ROC curve — ROC Curve

  • Training:Testing 70:30 — Table Partitioner

  • Training a linear regression model — Linear Regression Learner

  • Applying the model to testing data — Regression Predictor

  • Assessing the model fit — Numeric Scorer

  • Normalizing numerical columns — Normalizer

  • K-means clustering with 3 clusters — k-Means

  • Assigning different colors to clusters — Color Manager

  • Scatter plot of Credit risk vs Churn score — Scatter Plot

  • Assigning different colors to customer groups — Color Manager

  • Scatter plot of Credit risk vs Churn score — Scatter Plot

  • Setting parameters to be optimized as flow variables — Parameter Optimization Loop Start

  • Training random forest model — Random Forest Learner

  • Applying the trained random forest model — Random Forest Predictor

  • Assessing the model fit — Scorer

  • Recording parameter combinations and accuracy — Parameter Optimization Loop End
