kn_example_hive_school_of

School of Hive - with KNIME's local Big Data environment (SQL for Big Data)

Demonstrates a collection of Hive functions using KNIME's local Big Data environment, including creating table structures from scratch and from an existing file, and working with partitions.
Partitions are an essential organizing principle of big data systems: they make it easier to store and handle big data tables.
All examples are fully functional. You can swap the local big data environment for your own (e.g. Cloudera).
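To make the partition idea concrete, here is a minimal sketch (not taken from the workflow; sample_p is a hypothetical table name): each value of the partition column becomes its own subdirectory in the table's storage location, so queries that filter on that column only have to read the matching directories.

-- minimal sketch, hypothetical table name: rows land in subdirectories
-- like .../sample_p/partition_date=2019-12-15/ under the warehouse path
CREATE TABLE `default`.`sample_p` (v_sale DOUBLE)
PARTITIONED BY (partition_date STRING)
STORED AS PARQUET;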

This example focuses on Hive SQL scripts run through DB SQL Executor nodes. Similar effects can be achieved with KNIME's dedicated DB nodes.
https://hub.knime.com/mlauber71/spaces/Public/latest/_db_sql_bigdata_hive_spark_meta_collection#externalresources

Create a Hive table from scratch, with a partition and columns of different data types (drop it first, to have a clean start):

DROP TABLE IF EXISTS `default`.`sample_01`;

CREATE TABLE `default`.`sample_01` (
  v_sale DOUBLE COMMENT 'v_sale = amount of sale in $'
, v_cluster BIGINT COMMENT 'v_cluster = cluster of customer'
, v_name STRING COMMENT 'v_name = name of customer'
, due_date DATE COMMENT 'due_date = due date'
, `current_timestamp` TIMESTAMP COMMENT 'current_timestamp = a timestamp'
)
PARTITIONED BY (partition_date STRING)
COMMENT 'this is a sample table created purely from code'
STORED AS PARQUET;

-- 'manually' insert values into a Hive table,
-- declaring the partition in the INSERT statement
-- (one value per non-partition column)
INSERT INTO `default`.`sample_01` PARTITION (partition_date = '2019-12-15')
VALUES
  (45.45, 2, 'Santa', '2020-01-02', '2020-01-02 00:00:00')
, (34.8965, 4, 'Claus', '2020-01-03', '2020-01-03 00:00:00');

-- create a new table based on the structure of an existing one
DROP TABLE IF EXISTS `default`.`sample_02`;

CREATE TABLE `default`.`sample_02` LIKE `default`.`sample_01`;

-- insert sample_00 (uploaded from a file) into the new table sample_02;
-- partition_date is the last column and will serve as the partition
INSERT INTO `default`.`sample_02`
SELECT * FROM `default`.`sample_00`;

-- insert sample_01 (created from scratch) as well; max_row_size might
-- have to be raised if rows get quite large
SET max_row_size=1mb;
INSERT INTO `default`.`sample_02`
SELECT * FROM `default`.`sample_01`;

-- remove a single partition; this can also make sense before an insert,
-- to make sure the partition is unique (if you do not plan on filling
-- a specific partition up)
ALTER TABLE `default`.`sample_02`
DROP IF EXISTS PARTITION (partition_date = '2019-12-15');
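A note on the INSERT ... SELECT statements above: the target table is partitioned, and the partition value comes from the last column of the SELECT. In plain Hive this is dynamic partitioning and, depending on how the environment is configured, may need to be switched on first - a hedged sketch, assuming stock Hive settings (not part of the workflow):

-- enable dynamic partitioning (standard Hive settings; whether they are
-- needed depends on the environment's configuration)
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- the last SELECT column (partition_date) feeds the partition
INSERT INTO `default`.`sample_02` PARTITION (partition_date)
SELECT * FROM `default`.`sample_00`;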
Cloudera Impala - please note:
- KNIME's local Big Data environment does not support Cloudera Impala.
- In Impala it does make sense to compute statistics for your big data tables (even if they come from Hive!) so the system can plan ahead better.
- In KNIME you can combine Hive and Impala operations with flow variables. The operations for creating statistics in Impala would look like this:

INVALIDATE METADATA `default`.`sample_02`;
COMPUTE INCREMENTAL STATS `default`.`sample_02`;
SHOW TABLE STATS `default`.`sample_02`;

For additional examples about KNIME and Big Data please refer to:

A meta collection of KNIME and databases (SQL, Big Data/Hive/Impala and Spark/PySpark)
https://hub.knime.com/mlauber71/spaces/Public/latest/_db_sql_bigdata_hive_spark_meta_collection

KNIME's example workflows on Big Data and Spark
https://hub.knime.com/knime/spaces/Examples/latest/10_Big_Data/

KNIME Big Data Course
https://hub.knime.com/knime/spaces/Education/latest/04%20KNIME%20Big%20Data%20Course/

Combine Big Data, Spark and H2O.ai Sparkling Water
https://kni.me/w/SSefBbXvBELOuj8y

KNIME and Hive - load multiple CSV files at once via external table
https://kni.me/w/O70S2RenR3hEyzLB

Add fields to Hive table
https://kni.me/w/OrbBT7gmGtmdmKOp

Create HIVE table and show the way it was created and describe the structure
https://kni.me/w/tZNSO4VV_rBzJMUU

Big Data Analytics - Model Selection to Predict Flight Departure Delays on Hive & Spark
https://hub.knime.com/knime/spaces/Examples/latest/50_Applications/28_Predicting_Departure_Delays/02_Scaling_Analytics_w_BigData

Useful commands to see what you have done:

SHOW CREATE TABLE `default`.`sample_00`;
DESCRIBE EXTENDED `default`.`sample_00`;
SHOW PARTITIONS `default`.`sample_00`;
-- in Impala you could also use
SHOW TABLE STATS `default`.`sample_00`;
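One more command that fits this list, though it is not used in the workflow: if partition directories are written straight to the file system (for example by Spark or a plain file copy), Hive's metastore does not know about them yet, and SHOW PARTITIONS will not list them until they are registered:

-- register partitions that exist on disk but not yet in the metastore
-- (standard Hive command, added here as a general hint)
MSCK REPAIR TABLE `default`.`sample_00`;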
This workflow has two sub-folders:
/data/ - for the, well, data
/big_data/ - to store the local big data environment
You can inspect the structure and delete the contents of the whole /big_data/ folder to get a clean start (again); if you encounter any problems, close KNIME and delete all data from /big_data/.

Upload sample_00 into the Big Data environment. SHOW CREATE TABLE then shows the way Hive actually stores such a table:

CREATE TABLE `default`.`sample_02` (
  `v_sale` DOUBLE COMMENT 'v_sale = amount of sale in $',
  `v_cluster` BIGINT COMMENT 'v_cluster = cluster of customer',
  `v_name` STRING COMMENT 'v_name = name of customer',
  `due_date` TIMESTAMP COMMENT 'due_date = due date'
)
PARTITIONED BY (`partition_date` STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '1')
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
TBLPROPERTIES ('transient_lastDdlTime' = '1617449177')

I tried to accomplish two things in one workflow:
1. Demonstrate some basic concepts and (SQL) code to deal with big data environments - in this case Hive tables - with emphasis on partitions, and inspect the data structures created (both play important roles once you start using big data tables for real).
2. Do it using a KNIME environment and workflow, since that is a very quick way to get your hands on a big data environment. And it demonstrates that you can easily use KNIME to create and execute HQL/SQL code on big data environments with generic code (or with the built-in DB nodes).
So with KNIME, again, you do not have to choose: you can have it all the way from your humble desktop machine to some really huge enterprise big data environments …

Remove duplicates from a table:

-- create a new table with duplicates
DROP TABLE IF EXISTS `default`.`sample_03`;

CREATE TABLE `default`.`sample_03` AS
SELECT *
  -- add one day to the date
, DATE_ADD(`partition_date`, 1) AS `new_date`
FROM `default`.`sample_00`;

SET max_row_size=1mb;
INSERT INTO `default`.`sample_03`
SELECT *
  -- subtract one day from the date
, DATE_SUB(`partition_date`, 1) AS `new_date`
FROM `default`.`sample_00`;

-- create a new table based on an existing one, keeping only the entry
-- that ends up on top of each group of identical v_name
DROP TABLE IF EXISTS `default`.`sample_04`;

CREATE TABLE `default`.`sample_04` AS
SELECT * FROM (
  SELECT *
  , ROW_NUMBER() OVER (PARTITION BY `v_name` ORDER BY `new_date` DESC) AS `rank_id`
  FROM `default`.`sample_03`
) t1
WHERE `t1`.`rank_id` = 1;
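A hedged aside, not part of the workflow: the ROW_NUMBER() approach keeps exactly one row per v_name even when the other columns differ. If only completely identical rows count as duplicates, a plain DISTINCT would do; sample_04b is a hypothetical table name used for illustration:

-- removes only rows that are identical in every column
-- (sample_04b is a hypothetical name, not part of the workflow)
CREATE TABLE `default`.`sample_04b` AS
SELECT DISTINCT * FROM `default`.`sample_03`;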
Fun with dates and datetimes in Hive:

-- date and time operations
SELECT
  unix_timestamp() AS current_date_as_unix_timestamp_seconds
, to_date(from_unixtime(unix_timestamp())) AS today_as_date
, date_sub(from_unixtime(unix_timestamp(), 'yyyy-MM-dd'), 1) AS yesterday_as_date
, date_add(from_unixtime(unix_timestamp(), 'yyyy-MM-dd'), 1) AS tomorrow_as_date
, to_date('2020-01-01') AS fixed_date_string_to_date
, unix_timestamp('2020-01-01', 'yyyy-MM-dd') AS unix_timestamp_from_fixed_date
, date_sub(to_date('2020-01-01'), 1) AS one_day_back
, add_months('2020-01-31', 1) AS one_month_ahead
, date_add(last_day(add_months('2020-02-29', -1)), 1) AS first_day_of_current_month
, date_add(last_day(add_months('2020-02-29', -2)), 1) AS first_day_of_last_month
, last_day('2020-02-15') AS last_day_of_current_month
, last_day(add_months('2020-02-15', -1)) AS last_day_of_last_month
, from_unixtime(unix_timestamp('2020-02-01', 'yyyy-MM-dd'), 'yyyy-MM-dd') AS fixed_date
, unix_timestamp('10-Jun-2021 10.00.00 AM CEST', 'dd-MMM-yyyy hh.mm.ss a zzzz') AS v_unix_timestamp_seconds_am
, from_unixtime(1623312000, 'yyyy-MM-dd HH.mm.ss zzzZ') AS v_from_unix_timestamp_am
, unix_timestamp('10-Jun-2021 10.00.00 PM CEST', 'dd-MMM-yyyy hh.mm.ss a zzzz') AS v_unix_timestamp_seconds_pm
, from_unixtime(1623355200, 'yyyy-MM-dd HH.mm.ss zzzZ') AS v_from_unix_timestamp_pm;

List the tables (you cannot 'evaluate' these commands in the node dialog, but you can execute them):

SHOW TABLES IN `default`;
SHOW TABLES IN `default` LIKE "sample_*1";

KNIME, Databases and SQL
https://medium.com/p/273e27c9702a

Notes from the workflow annotations:
- First create an (empty) Hive table sample_00 with the DB Table Creator (see the additional options; the structure comes from structure_sample_01.table), load the data into the Hive table default.sample_00 and read it back.
- SHOW CREATE TABLE shows how each table was created, SHOW PARTITIONS `default`.`sample_02` shows which partitions are there (you could extract the partitions and then do something with them), and the SELECT * readers show the results: first the new partition, then, after the drop, that the partition is gone again.
- DESCRIBE EXTENDED `default`.`sample_02` returns information about a big data table; DESCRIBE `default`.`sample_02` shows just the columns and partitions. The outputs can be saved to files such as show_create_sample_02.csv, describe_extended_sample_02.csv and describe_sample_02.csv.
- The last node deletes the whole local big data folder /big_data.
- sample_10_timestamp demonstrates UNIX timestamps in SECONDS: read the data with SELECT * FROM sample_10_timestamp, and SHOW CREATE TABLE `default`.`sample_10_timestamp` shows how the table was created (a small conversion sketch follows below).
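For that last example, a minimal sketch of the conversion, assuming the timestamps are stored in seconds; ts_seconds is a hypothetical column name, not from the workflow:

-- from_unixtime() expects SECONDS; values stored in milliseconds would
-- have to be divided by 1000 first (ts_seconds is a hypothetical column)
SELECT from_unixtime(`ts_seconds`) AS readable_timestamp
FROM `default`.`sample_10_timestamp`;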

Nodes

DB Table Creator, DB Loader, DB Table Selector, DB Reader, DB Query Reader, DB SQL Executor, DB Table Remover, Cell Splitter, CSV Writer, Table Writer, UNIX Timestamp to Date&Time, local big data context create, and the Prepare Data component.
