In this workflow, we use the NYC taxi dataset to showcase continuous preprocessing and publishing of event data. Instead of using the Group Loop Start node, this workflow could be executed once per week to preprocess and publish all data that has arrived during the week. The result of each run is written as a separate Parquet file within the same folder. To ensure that the file name is unique for each run, we use the year and week of the run as a file prefix, set via a flow variable.
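To illustrate the naming scheme outside of KNIME, here is a minimal Python sketch of one weekly run (the input CSV path, the column names, and the output folder are assumptions for illustration, not part of the workflow): it computes the ISO year-week prefix, which plays the role of the flow variable, and writes one uniquely named Parquet file per run into the shared folder.

```python
from datetime import date
from pathlib import Path

import pandas as pd  # pandas + pyarrow assumed available for Parquet output

# Placeholder input: the event data that arrived since the last run.
events = pd.read_csv("incoming/taxi_events.csv", parse_dates=["pickup_datetime"])

# ISO year and week of the current run, e.g. "2024-W07"; this stands in for
# the flow variable that prefixes the output file name in the workflow.
iso = date.today().isocalendar()
prefix = f"{iso[0]}-W{iso[1]:02d}"

out_dir = Path("output/taxi")
out_dir.mkdir(parents=True, exist_ok=True)

# One new, uniquely named Parquet file per run, all in the same folder.
events.to_parquet(out_dir / f"{prefix}_taxi_events.parquet", index=False)
```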
Since the folder stays the same and Parquet readers process all files within a folder regardless of their file names, this folder can be exposed as an external table (e.g., in Hive or Impala) to power further analysis processes.
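As a sketch of what exposing the folder as an external table could look like, the snippet below prints a minimal Hive DDL statement (the table name, columns, and location are assumptions; adapt them to the actual output folder and schema). Because the table points at the folder rather than at individual files, each new weekly Parquet file becomes visible to queries without any further DDL.

```python
# Minimal external-table DDL sketch; paste it into a Hive or Impala shell,
# or execute it via a client library. All names here are assumptions.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS taxi_trips (
    pickup_datetime  TIMESTAMP,
    dropoff_datetime TIMESTAMP,
    passenger_count  INT,
    trip_distance    DOUBLE
)
STORED AS PARQUET
LOCATION '/data/output/taxi'
"""
print(ddl)
```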
URL: KNIME File Handling Guide, https://docs.knime.com/latest/analytics_platform_file_handling_guide/index.html
To use this workflow, download it from the link below and open it in KNIME:
Download Workflow