
m_001_import_hive_parquet

KNIME and Hive - load multiple Parquet files at once via external table

This workflow demonstrates how to import several Parquet files at once without iteration using an external HIVE table.

The initial structure will be derived from a sample of one of the files. The rules are very basic: String, Double and Int. You might add rules for BIGINT if you need them.
You could use a column as a partition.
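For illustration only (the column names here are hypothetical), the structure derived from the sample file is a comma-separated list of column definitions that gets injected into the CREATE EXTERNAL TABLE statement shown further down, for example:

`customer_id` INT,
`amount` DOUBLE,
`order_date` STRING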

Please download the complete folder at: https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_bigdata_hive_parquet_loader/

Below is the SQL used to create the external table:

DROP TABLE IF EXISTS `default`.`df_external`;

-- create an external table, that means the Parquet files will stay where they are
-- in HDFS and this table is basically a view on them
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`df_external` (
-- this is the SQL string that defines the structure of the external files
$${Sv_sql_string2}$$
)
COMMENT 'import data from Parquet files with external table'
-- give the path location where the Parquet files are stored:
-- /local_big_data_/data_upload
STORED AS PARQUET LOCATION '$${Sv_path_upload_big_data}$$'
-- this setting was not respected by KNIME/Hive at some point,
-- but it now seems to work for Parquet files
TBLPROPERTIES ("skip.header.line.count"="1");

With Impala it is possible to use a LIKE PARQUET command for external tables
(https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/impala_parquet.html):

CREATE EXTERNAL TABLE ingest_existing_files LIKE PARQUET '/user/etl/destination/datafile1.dat'
STORED AS PARQUET LOCATION '/user/etl/destination';

The workflow goes through the following steps:

- Create a local big data context. If you encounter any problems, close KNIME, delete all data from the /big_data/ folder and start over.
- Create sample Parquet files (metanode), list all Parquet files in /data/ and upload them to the /big_data/data_upload folder (HDFS). (Reading the entire path is a problem because of the workflowset.meta file.)
- Read the first Parquet file and extract its structure (Extract Table Spec).
- Generate the SQL syntax for the external table with a rough rule to derive the column types (you might change that to your needs):
  $Column Type$ LIKE "String*" => "STRING"
  $Column Type$ LIKE "*long*" => "BIGINT"
  $Column Type$ LIKE "*double*" => "DOUBLE"
  $Column Type$ LIKE "*integer*" => "INTEGER"
  $Column Type$ LIKE "*Time*" => "TIMESTAMP"
  $Column Type$ LIKE "*Date*" => "DATE"
  You might have to extend these rules if you have timestamps, BIGINT columns or unknown types.
- Filter out the variable "v_partition" (^(?!v_partition).*$), which will serve as (well) the partition later.
- Create the external table `default`.`df_external` and check that it worked (DESCRIBE EXTENDED `default`.`df_external`).
- DROP TABLE IF EXISTS `default`.`df_internal`, then create the internal table df_internal with the DB Table Creator, setting PARTITIONED BY (v_partition INT) under additional options.
- INSERT the data from `default`.`df_external` INTO `default`.`df_internal` using v_partition (it has to be the last column).
- Check the internal table (DESCRIBE EXTENDED `default`.`df_internal`).

Nodes used in the workflow: String to URI, DB Reader, DB SQL Executor, Column Filter, DB Query Reader, Row Filter, Extract Table Spec, DB Table Selector, List Files/Folders, Transfer Files (Table), Path to String, Table Row to Variable, Parquet Reader, Rule Engine, Merge Variables and DB Table Creator, plus the metanodes "create sample parquet files", "SQL strings" and "local big data context create".
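The internal, partitioned table and the INSERT statement are generated inside the workflow (DB Table Creator and DB SQL Executor). As a rough HiveQL sketch of that step, with the same hypothetical column names standing in for the derived structure, it could look like this:

-- allow dynamic partitioning for the INSERT below
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- internal (managed) table, partitioned by v_partition
CREATE TABLE IF NOT EXISTS `default`.`df_internal` (
  `customer_id` INT,
  `amount` DOUBLE,
  `order_date` STRING
)
PARTITIONED BY (v_partition INT)
STORED AS PARQUET;

-- copy the data over from the external table;
-- the partition column v_partition has to be the last column of the SELECT
INSERT INTO TABLE `default`.`df_internal` PARTITION (v_partition)
SELECT `customer_id`, `amount`, `order_date`, v_partition
FROM `default`.`df_external`;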
