02_Spark_Preprocessing

02 Spark Preprocessing Exercise Solution
This workflow implements several data manipulation operations in Spark:

1. Connects to Hive, reads the ss13pme and ss13hme data sets, and transfers the data to Spark.
2. Filters, joins, and aggregates the data with Spark data manipulation nodes.

Make sure you have executed the /2_Hadoop/2_Exercises/00_Setup_Hive_Table workflow during your current KNIME session before running this workflow.

Setup: a Create Local Big Data Environment node connects to the local Big Data environment; two DB Table Selector nodes select * from the ss13hme and ss13pme tables; two Hive to Spark nodes convert each table into a Spark DataFrame.

Spark data manipulation, join branch (see the PySpark sketch below):
- Spark Column Filter on ss13pme to remove the PWGTP* and PUMA* columns
- Spark Joiner to join ss13pme and ss13hme on the serial number
- Spark Sorter to sort by AGEP, descending
- Spark SQL Query with "SELECT * FROM #table# AS t LIMIT 10"
- Spark to Table to load the result into KNIME

Missing values strategy, on ss13pme (sketched after the join-branch code):
- Spark Column Filter to remove the PWGTP* and PUMA* columns
- Spark Row Filter to keep rows where COW is NOT NULL
- Spark Row Filter to keep rows where COW is NULL, followed by a Spark Column Filter to remove the COW column
- Spark GroupBy to calculate the average AGEP for each SEX group
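For reference, here is a minimal PySpark sketch of what the join branch does on the Spark side. This is not part of the exercise itself (the workflow uses KNIME Spark nodes, not hand-written code), and the column names serialno and AGEP are assumptions based on the ACS PUMS tables set up by 00_Setup_Hive_Table.

# Minimal PySpark sketch of the join branch (for reference only; the exercise uses
# KNIME Spark nodes). Table and column names (ss13pme, ss13hme, serialno, AGEP)
# are assumptions based on the ACS PUMS data loaded by 00_Setup_Hive_Table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hive to Spark: read both Hive tables as Spark DataFrames
pme = spark.table("ss13pme")
hme = spark.table("ss13hme")

# Spark Column Filter: remove all PWGTP* and PUMA* columns from ss13pme
drop_cols = [c for c in pme.columns if c.lower().startswith(("pwgtp", "puma"))]
pme_slim = pme.drop(*drop_cols)

# Spark Joiner: join ss13pme and ss13hme on the serial number column
joined = pme_slim.join(hme, on="serialno", how="inner")

# Spark Sorter + Spark SQL Query: sort by AGEP descending and keep the first 10 rows
top10 = joined.orderBy(F.col("AGEP").desc()).limit(10)

# Spark to Table: bring the small result back to the driver (into KNIME)
top10.show()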

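A corresponding sketch of the missing-values branch, reusing pme_slim from the code above. The exercise text does not pin down which of the filtered DataFrames feeds the GroupBy, so applying it to the COW-is-not-null subset is an assumption.

# Minimal PySpark sketch of the missing-values branch, continuing from pme_slim above.

# Spark Row Filter: rows where COW is NOT NULL
cow_present = pme_slim.filter(F.col("COW").isNotNull())

# Spark Row Filter + Spark Column Filter: rows where COW is NULL, then drop COW
cow_missing = pme_slim.filter(F.col("COW").isNull()).drop("COW")

# Spark GroupBy: average AGEP for each SEX group (applied here to the COW-present
# subset; which input the workflow uses is an assumption)
avg_agep_by_sex = cow_present.groupBy("SEX").agg(F.avg("AGEP").alias("avg_AGEP"))
avg_agep_by_sex.show()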
Nodes

Create Local Big Data Environment
DB Table Selector (2x)
Hive to Spark (2x)
Spark Column Filter (3x)
Spark Joiner
Spark Sorter
Spark SQL Query
Spark Row Filter (2x)
Spark GroupBy
Spark to Table