
m_001_import_hive_csv

This workflow demonstrates several methods to import one or many CSV files into Hive.


Demonstrated are direct uploads, where you create a Hive table with KNIME nodes, and the alternative of copying your files to an /upload/ folder and pointing an external table at them. If they all have the same structure, Hive will read them, and you can then use this external table to further handle your files.
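To make the external-table idea concrete before the workflow's own SQL below, here is a minimal sketch; the table name csv_import_demo, its columns, and the location are illustrative assumptions, not taken from the workflow:

-- minimal sketch, assuming pipe-delimited CSV files sitting in /big_data/upload/
-- (csv_import_demo and its columns are illustrative names)
CREATE EXTERNAL TABLE IF NOT EXISTS default.csv_import_demo (
  id     INT,
  label  STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/big_data/upload/';

Because the table is EXTERNAL, dropping it later removes only the table definition; the CSV files stay where they are.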
If the files are very large you might have to use partitions. The Hive drivers shipped with the KNIME installation have a problem with the headers of the CSV files; the workflow also demonstrates how to get rid of them.
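Since the skip.header.line.count table property is not respected here, one possible workaround is to filter the header rows out when copying the data onward. A sketch, assuming the illustrative table above: the header text lands in the INT column, cannot be cast, and is read back as NULL:

-- sketch: header lines put the text 'id' into the INT column,
-- which Hive reads back as NULL, so those rows can be filtered out
CREATE TABLE IF NOT EXISTS default.csv_import_no_header
STORED AS TEXTFILE
AS
SELECT *
FROM default.csv_import_demo
WHERE id IS NOT NULL;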

Please familiarize yourself with the concepts of big data and partitions in order to use this. And please note: KNIME's local big data environment is just there to demonstrate the usage. It might work with your large files, but it is called Big Data for a reason ...
https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_hive_school_of?u=mlauber71

The central piece of SQL from the workflow creates the external table over the uploaded files (the $${...}$$ placeholders are KNIME flow variables):

DROP TABLE IF EXISTS `default`.`db_030_load_from_multiple_csv_file`;

-- create an external table, that means
-- the CSV files will stay where they are
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`db_030_load_from_multiple_csv_file`
-- the SQL string to define the variables is derived from the sample CSV file
-- you can also enter one pattern by hand if you want
($${Sv_sql_string2}$$)
COMMENT 'import data from CSV'
ROW FORMAT DELIMITED
-- add the separator for the CSV files, here it is pipe
FIELDS TERMINATED BY '|'
-- give the path location where the CSV files are stored:
-- /big_data_/upload
STORED AS TEXTFILE
LOCATION '$${Sv_path_upload_big_data}$$'
-- this setting is not respected by KNIME/Hive as of now
TBLPROPERTIES ("skip.header.line.count"="1");

Notes from the workflow annotations:

- One node deletes the whole local big data folder /big_data/. If you encounter any problems, close KNIME and delete all data from the folder /big_data/.
- The first CSV file (Cluster_0.csv) serves as a pattern for the table definition; List Files/Folders lists the CSV files and gives back URLs using the KNIME protocol. Reading the entire path causes a problem with workflowset.meta.
- DROP TABLE IF EXISTS `default`.`db_050_cleaned_and_partitioned`; clears the target table.
- A Column Filter with the pattern ^(?!my_partition).*$ filters the variable "my_partition", which will serve as (well) the partition later.
- The data is INSERTed from the external table db_030_load_from_multiple_csv_file into db_050_cleaned_and_partitioned using my_partition (it has to be the last column).
- DESCRIBE EXTENDED db_050_cleaned_and_partitioned inspects the result.
- A Rule Engine derives the Hive column types from the KNIME table spec:
  $Column Type$ LIKE "String*" => "STRING"
  $Column Type$ LIKE "*long*" => "BIGINT"
  $Column Type$ LIKE "*double*" => "DOUBLE"
  $Column Type$ LIKE "*integer*" => "INTEGER"
  $Column Type$ LIKE "*Time*" => "TIMESTAMP"
  $Column Type$ LIKE "*Date*" => "DATE"
  => you might have to extend that if you have timestamps or BIGINT columns or unknown types.
- The DB Table Creator defines db_050_cleaned_and_partitioned with PARTITIONED BY (my_partition STRING) under the tab "Additional Options".
- The result is selected and read back from db_050_cleaned_and_partitioned; Date and Date_Time columns are converted with String to Date&Time using the format yyyy-MM-dd HH:mm[:ss[.SSS]].

The workflow is organized in these steps:

000 - move CSV file to upload folder
001 - upload CSV file to Hive system (with partitions)
020 - load single CSV file into Hive
035-045 - external table from multiple CSV to clean PARQUET or ORC file
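Steps 035-045 then rewrite the external CSV table into a clean, partitioned ORC (or Parquet) table. A minimal sketch of that step, with illustrative column names alongside the my_partition column named in the annotations above:

-- enable dynamic partitioning so my_partition values create partitions on the fly
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE IF NOT EXISTS default.db_050_cleaned_and_partitioned (
  id     INT,
  label  STRING,
  amount DOUBLE
)
PARTITIONED BY (my_partition STRING)
STORED AS ORC;

-- my_partition has to be the last column in the SELECT list
INSERT OVERWRITE TABLE default.db_050_cleaned_and_partitioned
PARTITION (my_partition)
SELECT id, label, amount, my_partition
FROM default.db_030_load_from_multiple_csv_file;

Storing the cleaned data as ORC (or Parquet) gives Hive a typed, compressed format instead of raw text files.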

Nodes

Create sample CSV data (metanode), CSV Reader, List Files/Folders, Column Filter, Extract Table Spec, Rule Engine, Merge Variables, String to Date&Time, DB SQL Executor, DB Query Reader, DB Table Creator, DB Table Selector, DB Reader, Metanode SQL strings, local big data context create
