
13_H2O_AutoML_on_Spark

H2O AutoML on Spark

This workflow trains classification models for the Airlines Delay dataset using H2O AutoML on Spark. The dataset is expected to be stored on S3 in Parquet format. It is first read into the Spark cluster and preprocessed with Spark (missing value handling, normalization, etc.). Then Sparkling Water is used to train both binary and multiclass classification models on the dataset with H2O AutoML. Finally, the models are scored on the previously partitioned test data.
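For orientation, a rough Python equivalent of the same pipeline (PySpark for preprocessing, Sparkling Water/pysparkling and H2O for AutoML) might look as follows. This is a sketch only: the S3 path, the ArrDelay-based target rule, and the exact pysparkling API (recent Sparkling Water versions) are assumptions, not part of the KNIME workflow.

# Sketch only: assumes a running Spark session, pysparkling and h2o installed,
# and the Airlines Delay data available as parquet at the (placeholder) S3 path.
from pyspark.sql import SparkSession
from pysparkling import H2OContext
from h2o.automl import H2OAutoML

spark = SparkSession.builder.getOrCreate()

# Read the dataset from S3 into the Spark cluster (path is hypothetical).
airlines = spark.read.parquet("s3a://my-bucket/airlines-delay/")

# Simple Spark-side preprocessing; the KNIME workflow uses the Spark Missing
# Value and Spark Normalizer nodes here.
airlines = airlines.na.fill(0)

# Derive a binary target column (assumed rule: arrival delay > 15 minutes).
airlines = airlines.withColumn("Delayed", (airlines["ArrDelay"] > 15).cast("string"))

# Hand the Spark DataFrame over to H2O via Sparkling Water.
hc = H2OContext.getOrCreate()
frame = hc.asH2OFrame(airlines)
frame["Delayed"] = frame["Delayed"].asfactor()   # mark target as categorical

# 70/30 split (the KNIME H2O Partitioning node additionally uses stratified sampling).
train, test = frame.split_frame(ratios=[0.7], seed=42)

# Train with H2O AutoML, limited to 10 minutes as in the workflow.
feature_cols = [c for c in frame.columns if c not in ("Delayed", "ArrDelay")]
aml = H2OAutoML(max_runtime_secs=600, seed=42)
aml.train(x=feature_cols, y="Delayed", training_frame=train)

# Score the leading model on the held-out test partition.
print(aml.leaderboard.head())
print(aml.leader.model_performance(test_data=test))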

The Airlines Delay dataset and a description of it can be found here: https://www.kaggle.com/giovamata/airlinedelaycauses
You can use the Parquet Writer node to write the dataset to S3 or, for example, replace the Parquet to Spark node with the CSV Reader and Table to Spark nodes (note that using Parquet gives the whole process better performance).
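If the data is not yet on S3 in Parquet format, one option outside KNIME is to convert the Kaggle CSV with Spark. A minimal sketch, in which the CSV file name and the S3 bucket/path are placeholders:

# Convert the downloaded CSV to parquet on S3 (paths are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
csv_df = spark.read.csv("DelayedFlights.csv", header=True, inferSchema=True)
csv_df.write.mode("overwrite").parquet("s3a://my-bucket/airlines-delay/")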

Increasing or removing the runtime limit of the H2O AutoML Learner nodes may yield better models.
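In the H2O Python API, the corresponding settings are max_runtime_secs and max_models; the sketch below shows both ways of relaxing the budget and is not the exact configuration of the KNIME nodes.

from h2o.automl import H2OAutoML

# Give AutoML one hour instead of ten minutes ...
aml = H2OAutoML(max_runtime_secs=3600, seed=42)

# ... or drop the time limit and cap the number of models instead
# (max_runtime_secs=0 means "no limit" when max_models is set).
aml = H2OAutoML(max_runtime_secs=0, max_models=50, seed=42)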

Workflow overview (steps and nodes):
- Connect to S3: Amazon Authentication (specify credentials), Amazon S3 Connector.
- Connect to Spark and load data: Create Spark Context (Livy) (specify the Livy URL and the bucket for the staging area), Parquet to Spark (specify the location of the dataset).
- Preprocessing with Spark nodes: Spark Missing Value, Spark Normalizer, Spark Column Filter (select feature columns).
- Binary and multiclass target creation: two Spark SQL Query nodes create a binary and a multiclass categorical column (see the query sketch below).
- Spark to H2O: Create H2O Sparkling Water Context, Spark to H2O.
- Train/test split with H2O nodes: H2O Partitioning (stratified 70/30 sampling).
- Train and score classification models with H2O AutoML: H2O AutoML Learner (binary and multiclass, each limited to 10 mins), H2O Predictor (Classification), H2O Binomial Scorer, H2O Multinomial Scorer.
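The target columns are derived in the Spark SQL Query nodes. The exact queries are configured inside the workflow; an illustrative Python/Spark SQL sketch, in which the column name (ArrDelay), thresholds, and label names are assumptions, could look like this:

# Illustrative only: derive binary and multiclass targets from ArrDelay.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
airlines = spark.read.parquet("s3a://my-bucket/airlines-delay/")  # placeholder path
airlines.createOrReplaceTempView("airlines")

labeled = spark.sql("""
    SELECT *,
           CASE WHEN ArrDelay > 15 THEN 'delayed' ELSE 'on_time' END AS DelayBinary,
           CASE WHEN ArrDelay <= 0  THEN 'early_or_on_time'
                WHEN ArrDelay <= 15 THEN 'slight_delay'
                WHEN ArrDelay <= 60 THEN 'delay'
                ELSE 'severe_delay'
           END AS DelayClass
    FROM airlines
""")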
