Icon

03 Manipulating Big Data

03 Manipulating Big Data
Exercise 3: Manipulating Big DataIn this exercise you'll manipulate big data in Hive and Spark.1) Execute the workflow below and start from the DB Table Selector node. Filter the "airline" table to flights that were not cancelled and that have destination airport "AUS".2) Filter out the following columns: Cancelled, CancellationCode, Diverted, Year, Month3) Add a categorical column "DepartureDelay" with the value "Delay" if DepDelay > 15 and "NoDelay" otherwise. You can use the following query in the DB Querynode:SELECT *,CASE WHEN `DepDelay` > 15 THEN "Delay"ELSE "NoDelay" END as DepartureDelayfrom #table#4) Convert the result table into a Spark DataFrame5) In Spark: Partition the data into training and test sets. Use 50% split and apply the "Take from top" setting.6) In Spark: Replace missing integer values in the training set with the most frequent value and missing string values with fix value "unknown" (Spark Missing Valuenode)7) Apply the missing value handling to the test set. Replace the original columns. (Spark Transformations Applier node)8) Write the training and test sets into Parquet. Use the following folder names: "Spark/airline_training.parquet" and "Spark/airline_test.parquet" select airline tableInsert data intotablecreate airlineAirlineDataset.tableNode 64Node 65Node 66Node 67Node 68Node 69Node 70Node 71Node 72 DB Table Selector DB Loader Create Local BigData Environment DB Table Creator Table Reader DB Row Filter DB Column Filter DB Query Hive to Spark Spark Partitioning Spark Missing Value Spark TransformationsApplier Spark to Parquet Spark to Parquet Exercise 3: Manipulating Big DataIn this exercise you'll manipulate big data in Hive and Spark.1) Execute the workflow below and start from the DB Table Selector node. Filter the "airline" table to flights that were not cancelled and that have destination airport "AUS".2) Filter out the following columns: Cancelled, CancellationCode, Diverted, Year, Month3) Add a categorical column "DepartureDelay" with the value "Delay" if DepDelay > 15 and "NoDelay" otherwise. You can use the following query in the DB Querynode:SELECT *,CASE WHEN `DepDelay` > 15 THEN "Delay"ELSE "NoDelay" END as DepartureDelayfrom #table#4) Convert the result table into a Spark DataFrame5) In Spark: Partition the data into training and test sets. Use 50% split and apply the "Take from top" setting.6) In Spark: Replace missing integer values in the training set with the most frequent value and missing string values with fix value "unknown" (Spark Missing Valuenode)7) Apply the missing value handling to the test set. Replace the original columns. (Spark Transformations Applier node)8) Write the training and test sets into Parquet. Use the following folder names: "Spark/airline_training.parquet" and "Spark/airline_test.parquet" select airline tableInsert data intotablecreate airlineAirlineDataset.tableNode 64Node 65Node 66Node 67Node 68Node 69Node 70Node 71Node 72 DB Table Selector DB Loader Create Local BigData Environment DB Table Creator Table Reader DB Row Filter DB Column Filter DB Query Hive to Spark Spark Partitioning Spark Missing Value Spark TransformationsApplier Spark to Parquet Spark to Parquet

Nodes

Extensions

Links