
01_Fraud_Detection_by_Supervised_Learning

Fraud Detection by Supervised Learning

This workflow reads the creditcard.csv file, then trains and evaluates a Logistic Regression model and a Random Forest model to classify transactions as fraudulent or not. Note the final Rule Engine node, which classifies every transaction with a fraud probability greater than 0.3 as fraudulent. The classification threshold is optimized using a parameter optimization loop.
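Outside KNIME, the thresholding done by the Rule Engine node might be sketched roughly as below. This is only an illustration in plain Python, not the node's actual rule syntax: the function name `apply_threshold` and the sample probabilities are hypothetical, and the sketch assumes the predictor has already produced P(Class=1) for each transaction.

```python
def apply_threshold(fraud_probabilities, threshold=0.3):
    """Re-assign classes from predicted fraud probabilities.

    A transaction is labeled "1" (fraudulent) when its probability is
    strictly greater than the threshold, otherwise "0" (normal).
    """
    return ["1" if p > threshold else "0" for p in fraud_probabilities]


# Hypothetical P(Class=1) values, made up for illustration only.
sample_probabilities = [0.05, 0.30, 0.31, 0.72]
print(apply_threshold(sample_probabilities))
```

Lowering the threshold from the default 0.5 to 0.3 flags more transactions as fraudulent, which typically trades some precision for higher recall on the rare fraud class.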





Exercise 1: Training a Supervised Learning Algorithm for Fraud Detection

In this exercise we assume that labeled data are available to train a classification model. The steps below cover data access, data preparation, model training and evaluation. Optionally, you can also train different algorithms, compare their performances, and optimize the model parameters.

Data

The creditcard.csv dataset contains credit card transactions that were performed by European cardholders in September 2013. Of the 284 807 transactions in the dataset, 492 (0.2 %) are fraudulent and the rest are normal.

Data Access

1) Read the "creditcard.csv" file available in the "Data" folder in the KNIME Explorer (CSV Reader node)
Hint! Drag and drop the data file into the workflow editor
2) Convert the "Class" column to string in the Transformation tab

Data Preprocessing

1) Handle missing values (Missing Value node):
- Set missing numeric values to 0
- Set missing string values to "unknown"
- Remove rows that have missing values in the "Class" column
2) Standardize numeric columns (Normalizer node)

Partitioning

1) Separate the first 70 % of the data to the training set and the remaining 30 % to the test set (Partitioning node)

Model Training

1) Train a logistic regression model to predict the "Class" column (Logistic Regression Learner node)
- Solver: Stochastic average gradient
- Epochs: 200, Epsilon: 1.0E-4
- Learning Rate Strategy: LineSearch
- Regularization: Gauss
2) Predict the classes and class probabilities by the model (Logistic Regression Predictor node)
- Check "Append columns with predicted probabilities"
3) Change the classification threshold to 0.3 and re-assign the classes (Rule Engine node)

Model Evaluation

1) Display the performance metrics of the model with the classification threshold 0.5 (Scorer (JavaScript) node)
2) Display the performance metrics of the model with the classification threshold 0.3 (Scorer (JavaScript) node)
3) Encapsulate the Scorer (JavaScript) nodes into a component and compare the performances in the component's interactive view

Model Training (Optional)

1) Train a random forest model to predict the "Class" column (Random Forest Learner node)
2) Predict the classes and class probabilities by the model (Random Forest Predictor node) using
- the default threshold 0.5
- the threshold 0.3
3) Evaluate the performance of the Random Forest model with the component from the previous step. Which model performs better?

Classification Threshold Optimization (Optional)

1) Evaluate the performance of the model with the custom classification threshold (Scorer node)
Hint! The Scorer node produces the evaluation metrics as flow variables and is therefore used instead of the Scorer (JavaScript) node
2) Optimize the classification threshold using an optimization loop that maximizes the Cohen's kappa (Parameter Optimization Loop Start and Parameter Optimization Loop End nodes)
- Change the custom classification threshold for each iteration, starting with 0 and increasing the value by 0.1 until 1
- Check the performances obtained using the different classification thresholds. What would be the optimal classification threshold?

Workflow nodes: CSV Reader, Missing Value, Normalizer, Partitioning, Logistic Regression Learner, Logistic Regression Predictor, Random Forest Learner, Random Forest Predictor, Rule Engine, Evaluation, Scorer, Parameter Optimization Loop Start, Parameter Optimization Loop End

Node labels: Read creditcard.csv; Z-Score; 70 % Training, 30 % Testing; P(Class=1) > 0.3 => 1; Varying classification threshold
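The optional threshold optimization step can be approximated outside KNIME with a short loop. The sketch below is a plain-Python illustration, not the Parameter Optimization Loop nodes themselves: the function names `cohens_kappa` and `sweep_thresholds` are hypothetical, and the labels and probabilities are made-up toy values rather than results from creditcard.csv. It tries the thresholds 0, 0.1, ..., 1 and keeps the one with the highest Cohen's kappa.

```python
def cohens_kappa(actual, predicted):
    """Cohen's kappa: agreement between two label lists, corrected for chance."""
    n = len(actual)
    labels = set(actual) | set(predicted)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    expected = sum(
        (actual.count(label) / n) * (predicted.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        # Both lists contain one and the same label; kappa is undefined,
        # so return 0.0 by convention in this sketch.
        return 0.0
    return (observed - expected) / (1.0 - expected)


def sweep_thresholds(actual, fraud_probabilities, steps=10):
    """Evaluate thresholds 0, 1/steps, ..., 1 and return (best, all_results).

    Each result is a (threshold, kappa) pair; `best` is the pair with the
    highest kappa (the lowest such threshold wins ties).
    """
    results = []
    for i in range(steps + 1):
        threshold = i / steps
        predicted = [1 if p > threshold else 0 for p in fraud_probabilities]
        results.append((threshold, cohens_kappa(actual, predicted)))
    best = max(results, key=lambda pair: pair[1])
    return best, results


# Toy data, invented for illustration: 6 normal and 2 fraudulent transactions.
toy_actual = [0, 0, 0, 0, 0, 0, 1, 1]
toy_probabilities = [0.02, 0.05, 0.10, 0.15, 0.22, 0.28, 0.35, 0.90]

best, results = sweep_thresholds(toy_actual, toy_probabilities)
for threshold, kappa in results:
    print(f"threshold={threshold:.1f}  kappa={kappa:.3f}")
print("best:", best)
```

Cohen's kappa is a reasonable target here because plain accuracy is dominated by the majority class: with only about 0.2 % fraudulent transactions, a model that never predicts fraud is almost always "correct" yet has a kappa of 0.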
