
03 Manipulating Big Data - Solution

Solution to an L4-BD SELF-PACED COURSE exercise:
- Manipulate data on Hive with the DB nodes
- Perform ETL operations in Spark with the Spark nodes





Exercise 3: Manipulating Big Data

In this exercise you'll manipulate big data in Hive and Spark.

1) Execute the workflow below and start from the DB Table Selector node. Filter the airline table to flights that were not cancelled and that have the destination airport "AUS".
2) Filter out the following columns: Cancelled, CancellationCode, Diverted, Year, Month.
3) Add a categorical column "DepartureDelay" with the value "Delay" if DepDelay > 15 and "NoDelay" otherwise. You can use the following query in the DB Query node:

SELECT *,
  CASE WHEN `DepDelay` > 15 THEN "Delay"
  ELSE "NoDelay" END AS DepartureDelay
FROM #table#

4) Convert the result table into a Spark DataFrame.
5) In Spark: Partition the data into training and test sets. Use a 50% split and apply the "Take from top" setting.
6) In Spark: Replace missing integer values in the training set with the most frequent value and missing string values with the fixed value "unknown" (Spark Missing Value node).
7) Apply the missing value handling to the test set, replacing the original columns (Spark Transformations Applier node).
8) Write the training and test sets into Parquet. Use the following folder names: "Spark/airline_training.parquet" and "Spark/airline_test.parquet".

Workflow nodes: Create Local Big Data Environment, Table Reader, DB Table Creator, DB Loader, DB Table Selector, DB Row Filter, DB Query, DB Column Filter, Hive to Spark, Spark Partitioning, Spark Missing Value, Spark Transformations Applier, Spark to Parquet (training set), Spark to Parquet (test set).
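Steps 1-3 (row filter, column filter, delay category) can be sketched in plain Python. This is only an illustration of what the DB nodes push down to Hive, not KNIME node code; the sample records below are invented for the sake of the example.

```python
# Plain-Python sketch of steps 1-3. Sample rows are made up; in the
# exercise this logic runs inside Hive via the DB nodes.
rows = [
    {"Dest": "AUS", "Cancelled": 0, "Diverted": 0, "Year": 2008, "Month": 1, "DepDelay": 30},
    {"Dest": "AUS", "Cancelled": 1, "Diverted": 0, "Year": 2008, "Month": 1, "DepDelay": 0},
    {"Dest": "JFK", "Cancelled": 0, "Diverted": 0, "Year": 2008, "Month": 2, "DepDelay": 5},
]

# Step 1: keep only non-cancelled flights with destination "AUS"
kept = [r for r in rows if r["Cancelled"] == 0 and r["Dest"] == "AUS"]

# Step 2: filter out the listed columns
drop = {"Cancelled", "CancellationCode", "Diverted", "Year", "Month"}
kept = [{k: v for k, v in r.items() if k not in drop} for r in kept]

# Step 3: CASE WHEN `DepDelay` > 15 THEN "Delay" ELSE "NoDelay"
for r in kept:
    r["DepartureDelay"] = "Delay" if r["DepDelay"] > 15 else "NoDelay"

print(kept)  # one remaining row, labelled "Delay"
```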
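Steps 5-7 can likewise be illustrated in plain Python: a 50% "take from top" split, imputation values learned on the training set only (most frequent value for integers, fixed "unknown" for strings), and the same transformation applied to the test set, as the Spark Transformations Applier node does. The columns and values below are invented for illustration.

```python
from collections import Counter

# Invented sample data; None marks a missing value.
data = [
    {"Carrier": "AA", "ArrDelay": 10},
    {"Carrier": "AA", "ArrDelay": None},
    {"Carrier": None, "ArrDelay": 10},
    {"Carrier": "UA", "ArrDelay": 20},
]

# Step 5: 50% split, "Take from top" -> the first half is the training set
half = len(data) // 2
train, test = data[:half], data[half:]

# Step 6: learn imputation values on the TRAINING set only:
# most frequent value for the integer column, fixed "unknown" for the string column
arr_vals = [r["ArrDelay"] for r in train if r["ArrDelay"] is not None]
mode_arr = Counter(arr_vals).most_common(1)[0][0]

def impute(r):
    return {
        "Carrier": r["Carrier"] if r["Carrier"] is not None else "unknown",
        "ArrDelay": r["ArrDelay"] if r["ArrDelay"] is not None else mode_arr,
    }

# Step 7: apply the same learned handling to both sets, replacing the columns
train = [impute(r) for r in train]
test = [impute(r) for r in test]
print(train, test)
```

The key design point mirrored here is that the imputation statistics come only from the training set; the test set is transformed with the already-learned model, never refitted.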
