
kn_example_hive_csv_gz_external_table

Hive - upload CSV GZIP files to HDFS and bring them together as an EXTERNAL table

You have gzipped CSV files that you upload to an HDFS folder. That folder is then addressed as an EXTERNAL table by your Hive system, and the data is then inserted into a 'real' big data table.
The benefit: you can upload your compressed files in chunks and then just use them as a single data table in your big data system.
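For illustration, the upload folder could then hold several compressed chunks that Hive reads as one table (hypothetical file names):

    ../big_data/upload/data_part_01.csv.gz
    ../big_data/upload/data_part_02.csv.gz
    ../big_data/upload/data_part_03.csv.gz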

The external table is created over the upload folder with a statement like this. The workflow creates the SQL string for the column list in $${Sv_sql_string2}$$ (you can also provide this by hand):

    SET hive.vectorized.execution.enabled=false;
    SET skip.header.line.count = 1;

    CREATE EXTERNAL TABLE IF NOT EXISTS $${Sv_big_data_schema}$$.$${Sv_external_table_name}$$
        ($${Sv_sql_string2}$$)
    COMMENT 'import data from GZIP CSV'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|' -- specify the delimiter used in your file, e.g., ',' for CSV files
    STORED AS TEXTFILE
    LOCATION "$${Sv_path_external_table}$$"
    TBLPROPERTIES ("skip.header.line.count"="1",
        "compression.codec"="org.apache.hadoop.io.compress.GzipCodec",
        "compression.type"="BLOCK");

The final table is stored as Parquet and created with table properties like these:

    -- PARTITIONED BY (d_date STRING)
    STORED AS PARQUET
    COMMENT 'this is my final table'
    TBLPROPERTIES (
        'parquet.compression'='snappy'
      , 'external.table.purge'='true'
      , 'transactional'='false'
      , 'discover.partitions' = 'false'
      , 'table.owner' = 'John Doe'
      , 'table.owner_team' = 'Data Analytics'
      , 'table.housekeeping'='{"info": "housekeeping by daily partition", "attribute": "d_date", "attribute_string_format": "yyyy-MM-dd"}'
      )
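A hypothetical example of what the generated column list in $${Sv_sql_string2}$$ could look like - the actual names and types depend on your CSV file:

    `d_date` DATE, `name` STRING, `amount` DOUBLE, `counter` BIGINT, `ts_load` TIMESTAMP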
The workflow in short:

- Create a sample CSV/GZIP file - you can of course use your own data. The compression runs with the bundled Python version (Python Script node):

    import gzip
    import shutil

    # input_file and output_file are provided by the workflow as flow variables
    with open(input_file, 'rb') as f_in:
        with gzip.open(output_file, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

- Extract the table spec of the CSV file and map the KNIME column types to Hive types with a Rule Engine - you might have to extend that if you have timestamps or BIGINT columns or unknown types (a hypothetical result is sketched above):

    $Column Type$ LIKE "String*" => "STRING"
    $Column Type$ LIKE "*long*" => "BIGINT"
    $Column Type$ LIKE "*double*" => "DOUBLE"
    $Column Type$ LIKE "*integer*" => "INTEGER"
    $Column Type$ LIKE "*Time*" => "TIMESTAMP"
    $Column Type$ LIKE "*Date*" => "DATE"

- Create the path for the connection (CONNECTED, local, ../big_data/upload), create the folder for the external table if it does not exist, search for local .gz files and transfer the .csv.gz file(s) to the upload folder of the HDFS system. List all .gz files in the upload folder; the local .gz files can be cleared afterwards.
- DROP $${Sv_big_data_schema}$$.$${Sv_external_table_name}$$ and re-create it => convert the CSV/GZIP files into an external table.
- Get just the structure of the external table (WHERE 1=2; see the sketch after this list) and create the final table as an empty table: DROP TABLE IF EXISTS $${Sv_big_data_schema}$$.$${Sv_name_final_table}$$; - see also the workflow db_020_load_from_single_csv_file.
- Insert the data from the external table into the final table (a sketch follows below).
- Read all data from the external table => just to check; in a real-world example you would maybe just select from the external table made of CSV (e.g. with LIMIT 1000).
- Inspect the results with DESCRIBE EXTENDED on the external table and on the final table.

Flow variables used:

v_big_data_schema => the name of your big data schema
v_external_table_name => the name of your external table
v_path_external_table, a_external_table_folder => the path of the external table folder
v_name_final_table => the name of the final table
v_csv_gz_file, Path_GZ_files => the .csv.gz file(s) to upload
start_date (=> date format), start_time, timestamp_real_txt => timestamp helpers

!!! The local big data context create node deletes the whole local big data folder /big_data. Decide in the configuration if you want sub-folders or parent folders. If you encounter any problems, close KNIME and delete all data from the folder /big_data/.
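Getting just the structure with WHERE 1=2 returns no rows but the full column layout. A minimal sketch of such a query against the external table (the workflow wires this through a DB Query node):

    SELECT * FROM $${Sv_big_data_schema}$$.$${Sv_external_table_name}$$
    WHERE 1=2 -- no rows, just the structure for the empty final table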
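The insert itself is a plain HiveQL statement. A minimal sketch, assuming the external table and the final table share the same column layout (the exact statement in the workflow may differ):

    INSERT INTO TABLE $${Sv_big_data_schema}$$.$${Sv_name_final_table}$$
    SELECT * FROM $${Sv_big_data_schema}$$.$${Sv_external_table_name}$$;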

Nodes

CSV Reader
Column Rename
Create File/Folder Variables
Create Folder
DB Connection Extractor
DB Connection Table Writer
DB Query
DB Query Reader
DB Reader
DB SQL Executor
DB Table Creator
DB Table Selector
Delete Files/Folders (Table)
Extract Table Spec
Java Edit Variable (simple)
List Files/Folders
Merge Variables
Python Script
Rule Engine
String Input (legacy)
String to Path (Variable)
Transfer Files (Table)

Metanodes: SQL strings, create timestamp, Create sample CSV data, local big data context create

Extensions

Links

KNIME, Databases and SQL: https://medium.com/p/273e27c9702a