
02_Scaling_Analytics_w_BigData

Big Data Analytics - Model Selection to Predict Flight Departure Delays on Hive & Spark

This workflow trains a number of data analytics models on Hadoop and Spark and automatically selects the best one to predict departure delays from a selected airport. The data is the airline dataset, downloadable from http://stat-computing.org/dataexpo/2009/the-data.html and available inside the workflow at knime://knime.workflow/data/1_Input. A departure delay is defined as a delay > 15 min; the default selected origin airport is ORD. The workflow implements data reading, data blending, ETL, guided analytics, dimensionality reduction, advanced data mining models, and model selection, using Hadoop, Spark, in-memory processing, parallelization, grid computing, multithreading, and/or in-database processing to speed up computationally intensive operations.

Select Options
- Big data platform and remote executors
- Data set for integration
- Origin airport
- Years for training/testing

External Data Sources (for data blending)
- Weather
- Geo-coordinates
- Radar image data
- Storm watch text data
- Calendar
- IBM Watson comments
- Mechanics data from web crawling

Read Data & Apply Constraints (ETL) in Hive
- Local or remote file reading (unzip)
- Constraints: departure delay definition, selected origin airport, removal of cancelled flights (see the sketch after this list)
- Data blending
- Dimensionality reduction: columns with too high correlation, too high a percentage of missing values, or too low variance are dropped
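The workflow itself expresses this ETL step as KNIME nodes running on Hive; purely as an illustration of the equivalent logic, here is a minimal PySpark sketch. The table name airline and the column names DepDelay, Origin, and Cancelled are assumptions based on the public airline dataset's schema, not taken from the workflow:

```python
from pyspark.sql import SparkSession

# Hypothetical session against a Hive metastore.
spark = (SparkSession.builder
         .appName("flight-delay-etl")
         .enableHiveSupport()
         .getOrCreate())

# Apply the workflow's constraints: label departure delays > 15 min,
# keep only the selected origin airport, and drop cancelled flights.
flights = spark.sql("""
    SELECT *,
           CASE WHEN DepDelay > 15 THEN 1 ELSE 0 END AS delayed
    FROM airline
    WHERE Origin = 'ORD'   -- default selected origin airport
      AND Cancelled = 0    -- remove cancelled flights
""")
```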
Move from Hive to Spark 2.0
- Seamlessly convert a Hive query into a Spark RDD, thanks to the integration of the Spark ML library

Pre-processing in Spark 2.0
- Discretization, binning, normalization, partitioning, row reduction, and A/B testing (a PySpark sketch of binning, normalization, and partitioning follows below)

Model Selection in Spark 2.0
- Bag of models:
  - Decision Tree & Random Forest
  - Neural Network & Deep Learning
  - Gradient Boosted Trees
  - Logistic Regression (native & from Python)
  - Your own ensemble model
  - Current model
- Model selection (see the bag-of-models sketch at the end of this section)

Save the Best Model in KNIME Analytics Platform
- Write the model locally

Besides Hive and Spark, it is also possible to execute a KNIME node in streaming mode and to take advantage of GPU computational power for some machine learning algorithms, for example deep learning.
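As a rough illustration of the pre-processing step, the sketch below shows binning, normalization, and partitioning in PySpark. The split points and the feature columns Distance and CRSDepTime are illustrative assumptions, and flights is the DataFrame from the ETL sketch above:

```python
from pyspark.ml.feature import Bucketizer, MinMaxScaler, VectorAssembler

# Binning: discretize the scheduled departure time (hhmm format) into
# four parts of the day; split points are illustrative.
flights_d = flights.withColumn("CRSDepTime",
                               flights["CRSDepTime"].cast("double"))
binner = Bucketizer(splits=[0, 600, 1200, 1800, 2400],
                    inputCol="CRSDepTime", outputCol="DepTimeOfDay",
                    handleInvalid="skip")
binned = binner.transform(flights_d)

# Normalization: rescale the assembled feature vector to [0, 1].
assembler = VectorAssembler(inputCols=["Distance", "DepTimeOfDay"],
                            outputCol="features_raw")
assembled = assembler.transform(binned)
scaler = MinMaxScaler(inputCol="features_raw", outputCol="features")
prepared = scaler.fit(assembled).transform(assembled)

# Partitioning: split into training and test sets.
train, test = prepared.randomSplit([0.8, 0.2], seed=42)
```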

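Finally, a minimal sketch of the bag of models and the selection of the best one. The candidate list mirrors the annotation above (minus the Python-based, neural-network, and ensemble variants); the evaluation metric (AUC on the test set) and the output path are assumptions, and train/test come from the pre-processing sketch:

```python
from pyspark.ml.classification import (DecisionTreeClassifier,
                                       GBTClassifier,
                                       LogisticRegression,
                                       RandomForestClassifier)
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Bag of models: every candidate is trained on the same training set
# and scored on the same test set.
candidates = {
    "decision_tree": DecisionTreeClassifier(labelCol="delayed",
                                            featuresCol="features"),
    "random_forest": RandomForestClassifier(labelCol="delayed",
                                            featuresCol="features"),
    "gradient_boosted_trees": GBTClassifier(labelCol="delayed",
                                            featuresCol="features"),
    "logistic_regression": LogisticRegression(labelCol="delayed",
                                              featuresCol="features"),
}

evaluator = BinaryClassificationEvaluator(labelCol="delayed",
                                          metricName="areaUnderROC")

# Model selection: keep the candidate with the best test-set AUC.
best_name, best_model, best_auc = None, None, -1.0
for name, algo in candidates.items():
    model = algo.fit(train)
    auc = evaluator.evaluate(model.transform(test))
    if auc > best_auc:
        best_name, best_model, best_auc = name, model, auc

# Write the winning model locally (hypothetical path).
best_model.write().overwrite().save("file:///tmp/best_flight_delay_model")
print(best_name, best_auc)
```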