
Analytics - Model Selection to Predict Flight Departure Delays
This workflow trains a number of data analytics models and automatically selects the best model to predict departure delays from a selected airport. The data is the airline dataset, downloadable from http://stat-computing.org/dataexpo/2009/the-data.html. A departure delay is defined as a delay of more than 15 minutes. The default selected airport is ORD. The workflow demonstrates data reading, data blending, ETL, guided analytics, dimensionality reduction, advanced data mining models, and model selection.

Advanced ETL Functionality & Machine Learning-based Pre-processing
- Outlier Detection
- Dimensionality Reduction
  - Automatic Dimensionality Reduction (SVD, PCA)
  - Machine Learning for Feature Selection
- Feature Generation
- Missing Values
- Discretization
- Normalization

Warning! The dataset used here is just a subset of the original dataset. The final model performance will therefore differ from what is reported in the videos https://youtu.be/IEAsUTN8q68 and https://youtu.be/rvTHhgCKQiw. The full datasets can be found under the following links:
- airline dataset: http://stat-computing.org/dataexpo/2009/the-data.html
- calendar and weather information: https://developers.google.com/google-apps/calendar/ and https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-ghcn

Pre-Processing & Data Partitioning
This part of the workflow covers feature generation, handling of missing values, discretization of nominal columns, partitioning of the data into training and test sets, and data normalization (a Python sketch of these steps follows below).
- Feature Generation
  - Binning (on distance)
  - Math Formula (to extract hours only)
  - Rule Engine (to calculate daily segments)
  - GroupBy (operational features)
- Partitioning
  - By year (2007 for training, 2008 for testing)
  - Using a Partitioning node
- Missing Values and Discretization
  - In the dependent variable
  - In the independent variables
- Normalization
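As a rough illustration of these pre-processing steps outside KNIME, here is a minimal Python/pandas sketch. It assumes the standard column names of the 2009 Data Expo airline dataset (Year, CRSDepTime, Distance, DepDelay, UniqueCarrier); the file name, bin edges, and daily-segment boundaries are illustrative assumptions, not the exact settings of the workflow's nodes.

```python
import pandas as pd

# Read the (subset of the) airline dataset; the file name is illustrative.
df = pd.read_csv("airline_2007_2008_ORD.csv")

# Missing values in the dependent variable: drop rows without a delay value,
# then derive the target. A departure delay is a delay > 15 min.
df = df.dropna(subset=["DepDelay"])
df["Delayed"] = (df["DepDelay"] > 15).astype(int)

# Feature generation
df["DepHour"] = (df["CRSDepTime"] // 100) % 24      # Math Formula: extract hours only
df["DaySegment"] = pd.cut(df["DepHour"],            # Rule Engine: daily segments (illustrative edges)
                          bins=[0, 6, 12, 18, 24],
                          labels=["night", "morning", "afternoon", "evening"],
                          right=False)
df["DistanceBin"] = pd.cut(df["Distance"], bins=5)  # Binning on distance

# GroupBy: an operational feature per carrier
# (illustrative; in practice this would be computed on the training data only)
df["CarrierDelayRate"] = df.groupby("UniqueCarrier")["Delayed"].transform("mean")

# Missing values in the independent variables: simple median imputation
df["Distance"] = df["Distance"].fillna(df["Distance"].median())

# Partitioning by year: 2007 for training, 2008 for testing
train = df[df["Year"] == 2007].copy()
test = df[df["Year"] == 2008].copy()

# Normalization (z-score), fitted on the training set only
for col in ["Distance", "DepHour", "CarrierDelayRate"]:
    mu, sigma = train[col].mean(), train[col].std()
    train[col] = (train[col] - mu) / sigma
    test[col] = (test[col] - mu) / sigma
```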
Automatic Dimensionality Reduction and Feature Selection
This sub-workflow demonstrates several techniques to reduce dimensionality:
- Automatic dimensionality reduction methods:
  - Principal Component Analysis (PCA) (KNIME)
  - Singular Value Decomposition (SVD) (Python)
- Machine learning-based methods to select the most informative features:
  - Backward Feature Elimination
  - Forward Feature Selection
  - Random Forest statistics
- Machine learning-based methods to reduce the number of data rows:
  - k-Means
  - k-NN
The principal components are used to train a bag of models. The SVD-reduced outputs, or the outputs of any of the other feature selection methods, could have been used instead.

Dimensionality Reduction with PCA
PCA (Principal Component Analysis) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. A further explanation of PCA can be found at https://en.wikipedia.org/wiki/Principal_component_analysis (see the PCA/SVD sketch below).
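For illustration, here is a minimal Python sketch of the two automatic dimensionality reduction methods named above: scikit-learn's PCA plays the role of KNIME's PCA Compute / PCA Apply nodes, and numpy's SVD mirrors the Python-based SVD branch. The feature matrix X is a random stand-in for the normalized numeric features from the previous step.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))   # stand-in for the normalized feature matrix

# PCA: keep the principal components that explain 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)      # like PCA Compute + PCA Apply in KNIME

# SVD: the same projection computed directly, X_centered = U @ diag(S) @ Vt
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = X_pca.shape[1]
X_svd = Xc @ Vt[:k].T             # reduced representation with k components

# Up to per-component sign flips, both routes give the same coordinates
assert np.allclose(np.abs(X_pca), np.abs(X_svd), atol=1e-6)
```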
(Workflow canvas: the lower branch iterates over carriers with a Group Loop Start, keeping only carriers with more than 2 rows; in each iteration the data are normalized with Normalizer (PMML), split by year with a Row Splitter (top: 2007 for training, bottom: 2008 for testing), projected with PCA Compute / PCA Apply, and a Decision Tree Learner / Decision Tree Predictor pair is trained and applied per carrier; Capture Workflow Start / Capture Workflow End, Workflow Combiner, and Workflow Writer capture and write out the resulting per-carrier prediction workflow.)
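As a rough Python analogue of that per-carrier loop (hypothetical, since the exact node settings are not given here): group the training data by carrier, skip groups with too few rows, and fit one decision tree per carrier. The train, test, and feature names are reused from the pre-processing sketch above.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

features = ["Distance", "DepHour", "CarrierDelayRate"]
models, scores = {}, {}

# Group Loop Start: one iteration per carrier
for carrier, group in train.groupby("UniqueCarrier"):
    if len(group) <= 2:                # "Keep only > 2 rows per carrier"
        continue

    # Decision Tree Learner: fit on the 2007 rows of this carrier
    tree = DecisionTreeClassifier(max_depth=5, random_state=0)
    tree.fit(group[features], group["Delayed"])
    models[carrier] = tree

    # Decision Tree Predictor: score on the 2008 rows of the same carrier
    test_group = test[test["UniqueCarrier"] == carrier]
    if len(test_group):
        pred = tree.predict(test_group[features])
        scores[carrier] = accuracy_score(test_group["Delayed"], pred)

# Loop End: collect per-carrier accuracies
print(sorted(scores.items(), key=lambda kv: -kv[1])[:5])
```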
