
04_HDI_Hive_Spark

An end-to-end Big Data predictive approach: from Hive, through Spark ETL, to Spark model training.

This workflow reads CENSUS data from a Hive database in HDInsight, performs some ETL operations in Spark, and finally trains a Spark decision tree model to predict COW (class of worker) values from all other attributes. The data come from the CENSUS PUMS dataset, which is publicly available and can be downloaded from http://www.census.gov/programs-surveys/acs/data/pums.html. A full explanation of all attributes can be found at http://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMSDataDict15.pdf.
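Conceptually, the workflow trains on the rows where COW is known and predicts it where it is missing, then recombines the two subsets. The plain-Python sketch below illustrates that pattern only; the majority-class "model" is a stand-in for the Spark Decision Tree Learner, and all rows and column names are made up for illustration.

```python
# Illustrative plain-Python sketch of the workflow's overall pattern:
# train on rows where COW is known, predict it where it is missing,
# then recombine. Not the actual KNIME/Spark implementation.
from collections import Counter

def split_by_cow(rows):
    """Mimic the two Database Row Filter nodes (COW IS NOT NULL / IS NULL)."""
    labeled = [r for r in rows if r["COW"] is not None]
    unlabeled = [r for r in rows if r["COW"] is None]
    return labeled, unlabeled

def train_majority(rows):
    """Stand-in learner: always predicts the most frequent COW value."""
    majority = Counter(r["COW"] for r in rows).most_common(1)[0][0]
    return lambda row: majority

rows = [
    {"AGEP": 40, "COW": 1},
    {"AGEP": 35, "COW": 1},
    {"AGEP": 22, "COW": 2},
    {"AGEP": 58, "COW": None},  # COW missing: to be predicted
]

labeled, unlabeled = split_by_cow(rows)
model = train_majority(labeled)
for row in unlabeled:
    row["COW"] = model(row)       # the "pred_cow -> cow" rename step
combined = labeled + unlabeled    # the Spark Concatenate step
```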

Spark Context

You can define a Spark Context in two ways:
- in the Preferences page under KNIME > KNIME Big Data Extensions > Spark (this becomes the default Spark Context);
- with the Create Spark Context node.

Credentials

Insert <hostname> and <Credentials> where required. Use of Credentials is recommended over a plain username and password. Credentials are defined at the workflow level: right-click the workflow in the KNIME Explorer panel and select "Workflow Credentials". The parameter field might need customization.

Workflow steps (from the annotations on the workflow canvas):
- Connect to Hive on HDInsight (HttpFS connection on port 14000).
- Select the ss13pme table (select * from ss13pme).
- Split the rows into the subsets COW IS NULL and COW IS NOT NULL.
- Remove the puma* and pwgtp* columns.
- Read the file ss13pme.csv with the File Reader.
- Move the data from Hive to Spark.
- Fix missing values and re-base the COW class to start from 0.
- Train a Spark decision tree on the rows with a known COW; the predictor's output column is pred_cow.
- Rename pred_cow to cow and concatenate the two subsets.
- Write the result to Parquet on Spark, write the table back into Hadoop, and bring the data back to KNIME.
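The annotation "fix missing values and start cow class from 0" can be sketched as follows. The concrete choices here are assumptions, not taken from the workflow itself: missing numeric attributes are replaced with 0, and the COW codes, which start at 1, are shifted down by one, since Spark MLlib expects zero-based class labels.

```python
# Sketch of the "fix missing values and start cow class from 0" step.
# The imputation value (0) and the per-row representation are assumptions
# made for this illustration, not the workflow's actual configuration.
def fix_row(row, label="COW"):
    """Replace missing values with 0 and re-base the class label."""
    fixed = {k: (0 if v is None else v) for k, v in row.items()}
    fixed[label] -= 1  # COW codes start at 1; shift them to start at 0
    return fixed

print(fix_row({"AGEP": None, "WAGP": 30000, "COW": 3}))
# -> {'AGEP': 0, 'WAGP': 30000, 'COW': 2}
```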

Nodes

- Database Row Filter
- Database Column Filter
- Database Table Selector
- Database Connection Table Reader
- File Reader
- Fix Missing Values
- HttpFS Connection
- Hive Connector
- Hive to Spark
- Spark Column Filter
- Spark Column Rename
- Spark Concatenate
- Spark Decision Tree Learner
- Spark Predictor
- Spark to Hive
- Spark to Parquet
- Spark to Table
