
m_001_import_hive_csv1

KNIME and Hive - load multiple CSV files at once via external table

KNIME and Hive - load multiple CSV files at once via an external table. Also: experiment with internal and external tables, and then create a Hive table with a partition taken from an existing table.

https://forum.knime.com/t/processing-hundreds-of-millions-of-records/13593/19?u=mlauber71

This workflow demonstrates how to import several CSV files at once, without iteration, using an external Hive table. At the moment the downside is that you have to define the initial table structure manually; a DESCRIBE command might help you extract the initial fields and settings. Due to restrictions with Hive and KNIME, all columns are imported as strings and can later be converted to numbers. The headers will also be imported several times, because Hive/KNIME does not respect the setting that would skip them; they have to be removed in a later step. You can also use a column as a partition, demonstrated in a separate stream in the workflow (the green square).

Please download the complete folder at: https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_bigdata_hive_csv_loader/

The workflow goes through these steps:

- Create sample CSV data and a local big data environment.
- Upload the CSV files to the /big_data/data1 folder (HDFS); list all CSV files in /data/ and extract their positions.
- Load a single CSV file into the Hive table data3 (/big_data/data3), then read and select data3 to check it.
- Create the external table data2, which reads the entire upload path so that all CSV files are imported at once (note: reading the whole path can also pick up stray files such as workflowset.meta). A sketch of this statement follows below.
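A minimal Hive QL sketch of the external-table step, assuming comma-separated files with the columns Var1, Var2, Var3 and my_partition (Var3 and my_partition come from the workflow annotations; the rest are placeholders - adjust to your files). As noted above, the header-skipping property is not honored in this setup, which is why the header rows have to be filtered out later:

    -- external table over the whole upload directory; Hive reads every file in the path
    CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`data2` (
      `Var1` STRING,
      `Var2` STRING,
      `Var3` STRING,          -- imported as STRING, CAST to DOUBLE later
      `my_partition` STRING   -- will later serve as the partition column
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/big_data/data1'
    -- not respected by Hive/KNIME here, hence the cleanup step below
    TBLPROPERTIES ('skip.header.line.count'='1');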
- Turn the external table data2 into the managed table data2_new in PARQUET format; likewise turn data3 into the managed table data3_new.
- Inspect the tables with DESCRIBE EXTENDED `default`.`data3` and DESCRIBE EXTENDED `default`.`data3_new`.
- From data2_new, eliminate the intermediate headers and CAST Var3 as DOUBLE, producing data2_new_clean.
- DROP TABLE IF EXISTS `default`.`data2_new_clean_partitioned`; then recreate data2_new_clean_partitioned with PARTITIONED BY (my_partition STRING) under the additional options of the DB Table Creator. The regex ^(?!my_partition).*$ filters out the column "my_partition", which will serve as (well) the partition later.
- INSERT the external table data2 into data2_new_clean_partitioned, using my_partition as the partition column.
- Verify the result with DESCRIBE EXTENDED `default`.`data2_new_clean_partitioned`. A combined sketch of these statements follows below.
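A minimal Hive QL sketch of the conversion, cleanup, and partitioning steps, under the same column assumptions as above. The WHERE clause that drops the stray header rows and the dynamic-partition settings are assumptions; the workflow itself builds the target table with the DB Table Creator node and feeds it from the external table data2:

    -- 1) turn the external table into a managed table in PARQUET format
    CREATE TABLE `default`.`data2_new` STORED AS PARQUET
    AS SELECT * FROM `default`.`data2`;

    -- 2) remove the repeated header rows and CAST Var3 as DOUBLE
    CREATE TABLE `default`.`data2_new_clean` STORED AS PARQUET AS
    SELECT `Var1`, `Var2`, CAST(`Var3` AS DOUBLE) AS `Var3`, `my_partition`
    FROM `default`.`data2_new`
    WHERE `Var1` <> 'Var1';  -- header rows repeat the column name

    -- 3) recreate the partitioned target; my_partition must not appear
    --    in the regular column list, only in PARTITIONED BY
    DROP TABLE IF EXISTS `default`.`data2_new_clean_partitioned`;
    CREATE TABLE `default`.`data2_new_clean_partitioned` (
      `Var1` STRING,
      `Var2` STRING,
      `Var3` DOUBLE
    )
    PARTITIONED BY (`my_partition` STRING)
    STORED AS PARQUET;

    -- 4) dynamic-partition insert; the partition column goes last in the SELECT
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    INSERT INTO TABLE `default`.`data2_new_clean_partitioned` PARTITION (`my_partition`)
    SELECT `Var1`, `Var2`, `Var3`, `my_partition`
    FROM `default`.`data2_new_clean`;

    -- 5) verify
    DESCRIBE EXTENDED `default`.`data2_new_clean_partitioned`;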

Nodes

Upload
List Files
String to URI
Create Local Big Data Environment
DB SQL Executor
DB Reader
DB Table Selector
DB Query Reader
DB Table Creator
Column Filter
Create sample CSV data
Metadata for Big Data

Extensions

Links

https://forum.knime.com/t/processing-hundreds-of-millions-of-records/13593/19?u=mlauber71
https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_bigdata_hive_csv_loader/