
02_Fraud_Detection_by_Unsupervised_Learning

Fraud Detection by Unsupervised Learning

This workflow reads the creditcard.csv file, then trains and evaluates an Isolation Forest model that detects fraudulent transactions as outliers. The H2O Isolation Forest Predictor node produces two columns that can be used to identify outliers: the outlier score and the mean length. Here we identify outliers based on the mean length, which is the average number of random splits required to isolate a data point from the other data points. The threshold for the mean length is optimized using a parameter optimization loop.
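The workflow itself is built from KNIME nodes, but the core idea can be sketched in Python with scikit-learn's IsolationForest (an illustrative assumption, not part of the workflow; scikit-learn reports an anomaly score derived from the average path length rather than exposing the mean length directly):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Toy data: many "normal" points plus a few far-away outliers.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outliers = rng.normal(loc=8.0, scale=1.0, size=(5, 2))
X = np.vstack([normal, outliers])

# Train only on points assumed to be normal, as in the exercise.
model = IsolationForest(n_estimators=100, random_state=42).fit(normal)

# score_samples is derived from the average path length: points that are
# isolated after few random splits get LOWER scores (more anomalous).
scores = model.score_samples(X)
print(scores[:500].mean() > scores[500:].mean())  # normal points score higher
```

The same intuition carries over to the H2O nodes: a small mean length means a point was easy to isolate and is therefore a likely outlier.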







Exercise 2: Training an Unsupervised Learning Algorithm for Fraud Detection

In this exercise we assume that no labeled data are available to train a model for fraud detection. Therefore, we apply an unsupervised method called Isolation Forest. Unlike other unsupervised learning algorithms for outlier detection, an isolation forest identifies the outliers directly instead of profiling the normal data points. The steps below cover data access, data preparation, model training, and evaluation. Optionally, you can also optimize the threshold that determines how sensitively the model detects outliers; the optimal threshold yields the best performance for our fraud detection case.

Data

The creditcard.csv dataset contains credit card transactions performed by European cardholders in September 2013. Of the 284,807 transactions in the dataset, 492 (0.2%) are fraudulent and the rest are legitimate.

Data Access

1) Read the "creditcard.csv" file available in the "Data" folder in the KNIME Explorer (CSV Reader node). Hint: you can also drag and drop the data file into the workflow editor.
2) Convert the "Class" column to string by double-clicking the column header in the configuration dialog.

Data Preprocessing

1) Separate fraudulent and legitimate transactions into two datasets (Nominal Value Row Splitter node).
2) Draw 66% of the legitimate transactions randomly as the training set.
(Partitioning node)
3) Build the test set by concatenating the remaining legitimate transactions with the fraudulent transactions (Concatenate node).

Importing Data to H2O

1) Create an H2O instance (H2O Local Context node).
2) Convert the training and test sets into H2O data frames (Table to H2O nodes).

Model Training

1) Train an isolation forest model (H2O Isolation Forest Learner node). Important: exclude the class column.
2) Use the model to assign the mean length to each data point in the test set (H2O Isolation Forest Predictor node). Think of the mean length as the average number of random splits required to isolate a data point from the other data points: the smaller the value, the more likely the data point is an outlier.
3) Convert the data from an H2O data frame back to a KNIME table (H2O to Table node).
4) Flag transactions as fraudulent using the threshold value 6 for the mean length (Rule Engine node).

Model Evaluation

1) Evaluate the performance of the model with the scoring metrics for a classification model (Scorer node).

Threshold Optimization (Optional)

1) Optimize the threshold for the mean length using an optimization loop that maximizes Cohen's kappa (Parameter Optimization Loop Start and Parameter Optimization Loop End nodes).
- Change the classification threshold in each iteration, starting at 4.5 and increasing the value by 0.1 up to 6.5.
- Check the performance obtained with the different thresholds. What would be the optimal threshold?
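The preprocessing steps above can be sketched in Python with pandas (a hypothetical stand-in for the Nominal Value Row Splitter, Partitioning, and Concatenate nodes; the toy frame below is not the real dataset, only the "Class" column name follows it):

```python
import pandas as pd

# Toy stand-in for creditcard.csv: "Class" is "1" for fraud, "0" otherwise.
df = pd.DataFrame({
    "Amount": [10.0, 250.0, 3.5, 99.0, 1200.0, 42.0],
    "Class":  ["0", "0", "0", "0", "1", "0"],
})

# Nominal Value Row Splitter: split the rows by the Class column.
legit = df[df["Class"] == "0"]
fraud = df[df["Class"] == "1"]

# Partitioning: draw 66% of the legitimate transactions at random for training.
train = legit.sample(frac=0.66, random_state=1)

# Concatenate: the remaining legitimate rows plus all fraud form the test set.
test = pd.concat([legit.drop(train.index), fraud])

print(len(train), len(test))  # training rows, test rows
```

Training only on (presumed) legitimate transactions is what makes the setup unsupervised: the model never sees the fraud labels, which are reserved for evaluating the predictions.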
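The optional optimization loop can likewise be mimicked in Python (a minimal sketch assuming scikit-learn's cohen_kappa_score; the mean-length values here are synthetic, not output of the real model):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
# Synthetic mean lengths: legitimate points take more splits to isolate.
mean_length = np.concatenate([rng.normal(7.0, 0.5, 980),   # legitimate
                              rng.normal(5.0, 0.5, 20)])   # fraudulent
y_true = np.array([0] * 980 + [1] * 20)

best_kappa, best_threshold = -1.0, None
# Parameter Optimization Loop: thresholds from 4.5 to 6.5 in steps of 0.1.
for threshold in np.arange(4.5, 6.5 + 1e-9, 0.1):
    y_pred = (mean_length < threshold).astype(int)  # Rule Engine step
    kappa = cohen_kappa_score(y_true, y_pred)
    if kappa > best_kappa:
        best_kappa, best_threshold = kappa, threshold

print(f"best threshold ~ {best_threshold:.1f}, kappa = {best_kappa:.3f}")
```

Cohen's kappa is a sensible objective here because accuracy is misleading on such an imbalanced dataset: predicting "legitimate" for every transaction is already 99.8% accurate but has a kappa of 0.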
