01_Model_Selection_Sampled

Model Selection to Predict Death Occurrences in Car Accidents

This workflow trains several data analytics models and automatically selects the best one to predict deaths in car accidents. The data has been sub-sampled so that the workflow can also run on less powerful machines. The sub-sampling takes place in the metanode Reading Data/Pre-processing and can be removed to run the workflow on all data.
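The "select the best one" step can be illustrated outside KNIME with a small Python sketch. It is not the workflow's implementation: the function names, model names, and toy scores are hypothetical, and the only detail taken from the workflow annotations is that models are compared by AUC. The AUC is computed here with the pairwise (Mann-Whitney) formulation, which is one standard way to obtain it.

```python
from typing import Dict, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Over all (positive, negative) pairs, count how often the positive
    example receives the higher score; ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def select_best_model(
    labels: Sequence[int], model_scores: Dict[str, Sequence[float]]
) -> Tuple[str, Dict[str, float]]:
    """Return the name of the model with the highest AUC, plus all AUCs."""
    aucs = {name: auc(labels, scores) for name, scores in model_scores.items()}
    best = max(aucs, key=aucs.get)
    return best, aucs


# Toy data (hypothetical): 1 = death occurred, 0 = no death.
labels = [1, 0, 1, 0, 1, 0]
model_scores = {
    "random_forest": [0.9, 0.2, 0.8, 0.4, 0.7, 0.1],
    "decision_tree": [0.6, 0.5, 0.4, 0.7, 0.8, 0.3],
}
best_name, aucs = select_best_model(labels, model_scores)
print(best_name, aucs)
```

In this toy example the first model ranks every positive case above every negative one, so it is selected. In practice the scores should come from a held-out test set rather than from the training data.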

Reading Data
- accident, vehicle, and person tables
- selected years

Dataset Evaluation
- 10-fold cross-validation
- stddev and mean of error
- if stddev/mean < 1.0 => GO!
- else "dataset not valid!"

Visual Investigation
- scatter plots (driver height vs. weight) and bar chart
- statistics (notice: car model, latitude, car owner, and class)
- linear correlation: suspicious correlation between HISPANIC and INJ_SEV

Error message in case dataset is not general enough!

Dimensionality Reduction
- if % missing values > 90% => remove column
- if variance < 0.005 => remove column
- if a pair of columns are highly correlated => remove one of the two
- PCA not used because of interpretability loss

Model Selection
- Select the best model in terms of AUC among: Random Forest, my own ensemble model (Naive Bayes, logit, dec. tree), decision tree from R, the current model, and optionally ANN and k-Means
- The decision tree from R is a linked metanode.

Display Results
- Export message (error or success) and display on WebPortal

Node annotations: check for too high correlation with class and INJ_SEV; rm columns; class as String; check quality of dataset; % missing values; low variance; high correlation; reading tables accident, vehicle, person for selected years; driver height vs. weight; basic general stats.

Nodes and metanodes in the workflow: Linear Correlation, Pre-processing, CASE Switch Data (Start), CASE Switch Data (End), Text Output, Table Column to Variable, Dataset Evaluation through X-validation, Dimensionality Reduction, Bag of Models, Prepare Error Message, Clustering, Prepare Message, Reading Data, JavaScript Scatter Plot, Color Manager, Remove Outliers, Statistics, JavaScript Bar Chart, JavaScript ROC Curve, Sampling.
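The Dataset Evaluation rule (stddev/mean of the cross-validation error below 1.0) can be written out as a short Python sketch. The function name and the per-fold error values are hypothetical, and the use of the sample standard deviation is an assumption; the workflow annotations do not say which estimator is used.

```python
import statistics
from typing import Sequence


def dataset_is_valid(fold_errors: Sequence[float], max_ratio: float = 1.0) -> bool:
    """Apply the stddev/mean < max_ratio rule to per-fold error rates."""
    mean_error = statistics.mean(fold_errors)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    std_error = statistics.stdev(fold_errors)
    if mean_error == 0:
        # The ratio is undefined when the mean is zero; treat the dataset
        # as stable only if every fold had the same (zero) error.
        return std_error == 0
    return std_error / mean_error < max_ratio


# Hypothetical error rates from a 10-fold cross-validation.
fold_errors = [0.12, 0.10, 0.11, 0.13, 0.09, 0.12, 0.10, 0.11, 0.12, 0.10]
message = "GO!" if dataset_is_valid(fold_errors) else "dataset not valid!"
print(message)
```

A ratio well below 1.0 means the error varies little from fold to fold relative to its size, which is the sense in which the dataset is considered general enough to continue.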
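The three Dimensionality Reduction filters can be sketched in plain Python as follows. The 90% missing-value and 0.005 variance thresholds come from the workflow annotations. Everything else is an assumption made for illustration: the correlation cut-off of 0.9 (the annotations only say "highly correlated"), the choice to keep the first column of a correlated pair, applying the variance threshold to the values as given, and all function and column names.

```python
import math
from typing import Dict, List, Optional

Column = List[Optional[float]]  # None marks a missing value


def missing_fraction(values: Column) -> float:
    return sum(v is None for v in values) / len(values)


def variance(values: Column) -> float:
    """Sample variance of the non-missing values (0.0 if fewer than two)."""
    xs = [v for v in values if v is not None]
    if len(xs) < 2:
        return 0.0
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


def pearson(a: Column, b: Column) -> float:
    """Pearson correlation over rows where both values are present."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if len(pairs) < 2:
        return 0.0
    mx = sum(x for x, _ in pairs) / len(pairs)
    my = sum(y for _, y in pairs) / len(pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def reduce_columns(
    table: Dict[str, Column],
    max_missing: float = 0.90,    # from the annotations
    min_variance: float = 0.005,  # from the annotations
    max_abs_corr: float = 0.9,    # assumed cut-off for "highly correlated"
) -> List[str]:
    """Return the names of the columns that survive all three filters."""
    kept: List[str] = []
    for name, values in table.items():
        if missing_fraction(values) > max_missing:
            continue
        if variance(values) < min_variance:
            continue
        # Drop the column if it is highly correlated with one already kept.
        if any(abs(pearson(values, table[k])) > max_abs_corr for k in kept):
            continue
        kept.append(name)
    return kept


# Hypothetical toy table.
table = {
    "age": [23.0, 45.0, 31.0, 60.0, 38.0],
    "age_months": [276.0, 540.0, 372.0, 720.0, 456.0],  # 12 * age
    "unrecorded": [None, None, None, None, None],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
    "speed": [50.0, 30.0, 80.0, 45.0, 65.0],
}
kept_columns = reduce_columns(table)
print(kept_columns)
```

Unlike PCA, these filters only drop columns and never combine them, so every remaining feature keeps its original meaning. That matches the interpretability concern stated in the annotations.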
