
04_Machine Learning - Exercise

Session 4 - Machine Learning

Learning objective: In this exercise you will train and evaluate supervised and unsupervised machine learning models.


Workflow description:

This exercise uses the same customer transaction data set as the demo, with additional features describing the loyalty program membership.

  1. The Loading Customer Data metanode reads the customer information as well as the loyalty program membership information.

  2. Training a classification model to predict the membership status (Member or Gold). A decision tree model is trained, and its performance is evaluated. (Activity I)

  3. Training a linear regression model to predict the Total spending from the number of purchased items and the loyalty program points. The goodness-of-fit of the trained model is then evaluated. (Activity II)

  4. Grouping customers with similar characteristics into clusters by k-Means clustering. Visualizing the resulting clusters. (Activity III)

Loading the customer data

The metanode Loading Customer Data reads the customer transaction data and the membership data from two sources. The output data set contains the following customer information:

  • n_items: number of unique items purchased

  • Total spending

  • City

  • Newsletter: newsletter subscription

  • Customer group: CC1, CC2, or CC3

  • Status: membership status, either Member or Gold

  • Loyalty points



Activity I: Classification of the Membership Status

  1. From the output data of the Loading Customer Data metanode, remove the column CustomerID with a Column Filter node.

  2. Partition the data into two partitions: training (70%) and testing (30%) with a Partitioning node. Stratify the sampling according to the classes in the Status column.

  3. Train a decision tree model to classify the Status with a Decision Tree Learner node. Set the Min number records per node to 5.

  4. Apply the trained decision tree model to the testing data with a Decision Tree Predictor node. Check Append columns with normalized class distribution.

  5. Evaluate the classification performance with a Scorer node, with Status as the First Column and Prediction (Status) as the Second Column.

  6. Plot an ROC curve from the prediction results with a ROC Curve node. Set the Target column to Status and the Positive class value to Gold, and plot the column P (Status=Gold).
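
For readers who want to see what the partitioning and evaluation steps compute, the following is a minimal sketch in plain Python. It is not how the KNIME nodes are implemented, and it does not train a decision tree; it only illustrates a stratified 70/30 split, the accuracy a Scorer would report, and the points of an ROC curve built from a P (Status=Gold) column. All data values and the function names (stratified_split, accuracy, roc_points, auc) are made up for illustration.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, train_fraction=0.7, seed=42):
    """Split rows so that each class keeps about the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    train, test = [], []
    for members in by_class.values():
        members = members[:]          # copy so the caller's list is untouched
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


def accuracy(actual, predicted):
    """Share of rows where the predicted class equals the actual class."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)


def roc_points(actual, scores, positive):
    """(false positive rate, true positive rate) pairs for a falling threshold."""
    n_pos = sum(a == positive for a in actual)
    n_neg = len(actual) - n_pos
    ranked = sorted(zip(scores, actual), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all rows sharing this score so ties produce a single point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == positive:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical data: 10 Gold and 20 Member customers.
rows = [{"Status": "Gold"} for _ in range(10)] + \
       [{"Status": "Member"} for _ in range(20)]
train, test = stratified_split(rows, "Status")

# Hypothetical test-set results: actual Status and predicted P (Status=Gold).
actual = ["Gold", "Gold", "Member", "Gold", "Member", "Member"]
p_gold = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
predicted = ["Gold" if p >= 0.5 else "Member" for p in p_gold]

acc = accuracy(actual, predicted)
curve = roc_points(actual, p_gold, positive="Gold")
area = auc(curve)
print(len(train), len(test), acc, area)
```

Stratifying on Status keeps the Gold/Member ratio the same in both partitions, which matters when one class is much rarer than the other.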



Activity II: Regression Model to Predict Total Spending

  1. Use the output data of the Loading Customer Data metanode and partition into two partitions: training (70%) and testing (30%) with a Partitioning node. Select Draw randomly.

  2. Train a linear regression model with a Linear Regression Learner node. Predict Total spending from n_items and Loyalty points. Open the view and examine the parameter estimates from the trained model.

  3. Apply the trained regression model to the testing data with a Regression Predictor node.

  4. Evaluate the goodness-of-fit of the regression model with a Numeric Scorer node, with Total spending as the Reference column and Prediction (Total spending) as the Predicted column.
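
The learner, predictor, and scorer steps above can be sketched in plain Python as follows. This is an illustrative ordinary-least-squares fit via the normal equations, not the KNIME implementation, and the function names (fit_linear_regression, predict, numeric_scores) are hypothetical. The data is invented and deliberately noise-free (Total spending = 5 + 10 * n_items + 0.02 * Loyalty points), so the fit should recover those coefficients; real data will not fit this cleanly.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations; intercept comes first."""
    rows = [[1.0] + [float(v) for v in x] for x in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefs, X):
    """Apply intercept and slopes to each row of predictors."""
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], x)) for x in X]


def numeric_scores(actual, predicted):
    """R^2, mean absolute error, and root mean squared error."""
    n = len(actual)
    mean = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": sum(abs(a - p) for a, p in zip(actual, predicted)) / n,
        "rmse": (ss_res / n) ** 0.5,
    }


# Hypothetical training rows: (n_items, Loyalty points) -> Total spending.
X_train = [(1, 100), (2, 50), (3, 400), (4, 150), (5, 300), (6, 20)]
y_train = [5 + 10 * n + 0.02 * p for n, p in X_train]

# Hypothetical testing rows, scored against the same made-up relationship.
X_test = [(2, 200), (7, 100)]
y_test = [5 + 10 * n + 0.02 * p for n, p in X_test]

coefs = fit_linear_regression(X_train, y_train)
scores = numeric_scores(y_test, predict(coefs, X_test))
print(coefs, scores)
```

The coefficients play the role of the parameter estimates shown in the learner's view, and the score dictionary mirrors the kind of statistics a numeric scorer reports on the testing partition.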



Activity III: Grouping Customers with Similar Characteristics by Clustering

  1. In the output data of the Loading Customer Data metanode, normalize numerical columns with a Normalizer node. Use the Min-Max Normalization with Min=0.0 and Max=1.0.

  2. Perform k-means clustering with a k-Means node. Set the Number of clusters to 3. Use all the numerical columns (n_items, Total spending, and Loyalty points) for clustering.

  3. Visualize the clusters discovered in the previous step.

    • With a Color Manager node, assign different colors to different classes of the column Cluster.

    • Generate a scatter plot with a Scatter Plot node, with n_items on the x-axis and Loyalty points on the y-axis. Use Cluster as the Color dimension.
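
As a rough illustration of the normalization, clustering, and coloring steps above, here is a minimal sketch in plain Python. It uses Lloyd's algorithm with a deterministic farthest-point initialization, which is an assumption made so the example is reproducible; the k-Means node may initialize its centroids differently. The customer values, the palette, and the function names are all invented for illustration.

```python
import math


def min_max_normalize(rows):
    """Rescale every column to the range [0.0, 1.0]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def farthest_point_init(points, k):
    """Start from the first point, then repeatedly add the point that is
    farthest from all centroids chosen so far (assumed initialization)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids


def k_means(points, k, max_iterations=100):
    """Lloyd's algorithm: assign to nearest centroid, recompute means, repeat."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:      # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids


# Hypothetical customers: (n_items, Total spending, Loyalty points).
customers = [
    (1, 20, 10), (2, 30, 20), (1, 25, 15),
    (10, 200, 300), (11, 220, 320), (9, 190, 280),
    (20, 500, 900), (21, 520, 950), (19, 480, 880),
]

normalized = min_max_normalize(customers)
labels, centroids = k_means(normalized, k=3)

# One color per cluster, analogous to coloring by the Cluster column.
palette = {0: "red", 1: "green", 2: "blue"}
colors = [palette[lab] for lab in labels]
print(labels, colors)
```

Normalizing first matters because k-means relies on Euclidean distance: without it, Loyalty points and Total spending, which span hundreds of units, would dominate n_items.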


