
03 Manipulating Big Data

Exercise 3: Manipulating Big Data

In this exercise you'll manipulate big data in Hive and Spark.

1) Execute the workflow below, starting from the DB Table Selector node. Filter the airline table to flights that were not cancelled and that have destination airport "AUS".
2) Filter out the following columns: Cancelled, CancellationCode, Diverted, Year, Month.
3) Add a categorical column "DepartureDelay" with the value "Delay" if DepDelay > 15 and "NoDelay" otherwise. You can use the following query in the DB Query node:

   SELECT *,
     CASE WHEN `DepDelay` > 15 THEN "Delay" ELSE "NoDelay" END AS DepartureDelay
   FROM #table#

4) Convert the result table into a Spark DataFrame.
5) In Spark: partition the data into training and test sets. Use a 50% split and apply the "Take from top" setting.
6) In Spark: replace missing integer values in the training set with the most frequent value and missing string values with a fixed value "unknown" (Spark Missing Value node).
7) Apply the missing value handling to the test set, replacing the original columns (Spark Transformations Applier node).
8) Write the training and test sets into Parquet. Use the following folder names: "Spark/airline_training.parquet" and "Spark/airline_test.parquet".

Workflow nodes used: Create Local Big Data Environment, Table Reader, DB Table Creator, DB Loader, DB Table Selector.
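The CASE expression in step 3 simply maps each DepDelay value to a label. As a minimal sketch of that derivation in plain Python (the function name is illustrative, not part of the workflow; note that a NULL DepDelay falls through to the ELSE branch in SQL, which the None check mirrors here):

```python
def departure_delay_label(dep_delay):
    """Mirror the SQL CASE expression from step 3:
    flights departing more than 15 minutes late are labelled "Delay",
    everything else (including missing values) "NoDelay"."""
    return "Delay" if dep_delay is not None and dep_delay > 15 else "NoDelay"

# A few DepDelay values and their resulting DepartureDelay labels
dep_delays = [3, 16, 15, 42, None]
labels = [departure_delay_label(d) for d in dep_delays]
print(labels)  # → ['NoDelay', 'Delay', 'NoDelay', 'Delay', 'NoDelay']
```

Note that 15 itself is labelled "NoDelay", because the CASE condition uses a strict `> 15` comparison.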
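Steps 5 and 6 can be mimicked in plain Python to show what the Spark nodes compute: a 50% "Take from top" split keeps the first half of the rows as the training set, and most-frequent imputation fills missing values with the training set's mode. These helper names are illustrative only, not the node API:

```python
from collections import Counter

def take_from_top_split(rows, fraction=0.5):
    """50% "Take from top": the first half of the rows becomes the
    training set, the remainder the test set (no shuffling)."""
    cut = int(len(rows) * fraction)
    return rows[:cut], rows[cut:]

def impute_most_frequent(values):
    """Replace missing (None) entries with the most frequent
    non-missing value, as the Spark Missing Value node does for integers."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

dep_delays = [10, 20, 10, None, 30, 10, None, 20]
train, test = take_from_top_split(dep_delays)
print(train)                        # → [10, 20, 10, None]
print(impute_most_frequent(train))  # → [10, 20, 10, 10]
```

For string columns the exercise instead uses a fixed replacement value "unknown"; the important point is that the statistics (here, the mode) are computed on the training set only and then re-applied to the test set via the Spark Transformations Applier node.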
