
04 Model Building on Big Data - Solution

Solution to an L4-BD SELF-PACED COURSE exercise:
- Train an ML model in Spark
- Read the prediction results into KNIME





Exercise 4: Model Building on Big Data

In this exercise you'll train a prediction model in Spark.

1) Create a local big data environment. You can use the default configuration.
2) Read the Spark/airline_training.parquet and Spark/airline_test.parquet folders into Spark (Parquet to Spark nodes). If the files don't exist, execute the Write parquet files metanode first.
3) Filter out the following columns from the training set:
- ArrTime
- DepTime
- All .*Delay columns except the target column DepartureDelay
- UniqueCarrier, TailNum, and Origin
4) Train a Random Forest model to predict departure delay.
- Apply entropy as the quality measure and increase the maximum depth to 10
5) Apply the model to the test set. Append individual class probabilities.
6) Check the confusion matrix of the model.
7) Draw an ROC curve of the model.

Workflow nodes:
- Create Local Big Data Environment
- Write parquet files (execute only if training and test sets don't exist)
- Parquet to Spark (airline_training.parquet)
- Parquet to Spark (airline_test.parquet)
- Spark Column Filter (Delay columns, ArrTime)
- Spark Random Forest Learner (Predict DepartureDelay)
- Spark Predictor (Classification)
- Spark to Table
- Spark Scorer
- ROC Curve
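The column selection in step 3 can be sketched outside KNIME to check which columns survive the filter. This is a minimal plain-Python illustration, not the Spark Column Filter node's implementation. The column list is hypothetical: only ArrTime, DepTime, UniqueCarrier, TailNum, Origin, and DepartureDelay come from the exercise, and the helper name columns_to_keep is made up for this example.

```python
import re

# Hypothetical schema for illustration. Only ArrTime, DepTime, UniqueCarrier,
# TailNum, Origin, and DepartureDelay are named in the exercise; the other
# column names are assumptions.
columns = [
    "Month", "DayOfWeek", "DepTime", "ArrTime", "UniqueCarrier", "TailNum",
    "Origin", "Dest", "Distance", "ArrDelay", "WeatherDelay", "DepartureDelay",
]

TARGET = "DepartureDelay"
EXPLICIT_DROPS = {"ArrTime", "DepTime", "UniqueCarrier", "TailNum", "Origin"}
DELAY_PATTERN = re.compile(r".*Delay")  # the wildcard pattern from step 3


def columns_to_keep(names):
    """Return the columns that remain after the step 3 filter, in input order."""
    kept = []
    for name in names:
        if name == TARGET:
            # The target matches .*Delay as well, so it is kept explicitly.
            kept.append(name)
        elif name in EXPLICIT_DROPS or DELAY_PATTERN.fullmatch(name):
            continue
        else:
            kept.append(name)
    return kept


kept = columns_to_keep(columns)
print(kept)
```

The target is checked before the pattern because DepartureDelay itself matches .*Delay. A filter that applies the pattern alone would remove the column the model is meant to predict.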
