
04 Model Building on Big Data

Exercise 4: Model Building on Big Data

In this exercise you'll train a prediction model in Spark.

1) Create a local big data environment. You can use the default configuration.
2) Read the "Spark/airline_training.parquet" and "Spark/airline_test.parquet" folders into Spark (Parquet to Spark nodes). If the files don't exist, execute the "Write parquet files" metanode first.
3) Filter out the following columns from the training set:
- ArrTime
- DepTime
- All .*Delay columns but the categorical DepartureDelay column
- UniqueCarrier, TailNum, and Origin
4) Train a Random Forest model to predict DepartureDelay
- Apply "entropy" as the quality measure and increase the maximum depth to 10
5) Apply the model to the test set. Check the "Append individual class probabilities" box.
6) Check the confusion matrix of the model
7) Draw an ROC curve of the model

Workflow annotation on the "Write parquet files" metanode: Execute only if training and test sets don't exist in HDFS.

Workflow nodes: Create Local Big Data Environment, Write parquet files, Parquet to Spark (two instances, labeled "test" and "train"), Spark Column Filter, Spark Random Forest Learner, Spark Predictor (Classification), Spark Scorer, Spark to Table, ROC Curve.
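The column rule in step 3 is easy to get wrong, because the ".*Delay" pattern also matches the target column DepartureDelay. The following is a minimal sketch of that selection logic in plain Python, not the Spark Column Filter node's actual implementation. The function name `columns_to_drop` and the `example_columns` schema are assumptions made for illustration; the real parquet schema may contain different columns.

```python
import re

# Columns that step 3 names explicitly.
EXPLICIT_DROPS = {"ArrTime", "DepTime", "UniqueCarrier", "TailNum", "Origin"}

# The prediction target, which must survive the ".*Delay" rule.
TARGET = "DepartureDelay"

# Step 3's wildcard rule, applied to the whole column name.
DELAY_PATTERN = re.compile(r".*Delay")


def columns_to_drop(columns):
    """Return the columns step 3 removes, preserving the input order."""
    return [
        name
        for name in columns
        if name in EXPLICIT_DROPS
        or (DELAY_PATTERN.fullmatch(name) and name != TARGET)
    ]


# Hypothetical schema for illustration only (an assumption, not the real file).
example_columns = [
    "Month", "DayOfWeek", "DepTime", "ArrTime", "UniqueCarrier", "TailNum",
    "Origin", "Dest", "Distance", "ArrDelay", "DepDelay", "WeatherDelay",
    "DepartureDelay",
]

dropped = columns_to_drop(example_columns)
kept = [name for name in example_columns if name not in dropped]
print("dropped:", dropped)
print("kept:", kept)
```

In the KNIME workflow the same effect comes from configuring the Spark Column Filter node. Outside KNIME, a list like `dropped` could be passed to a Spark DataFrame's `drop` method.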
