Spark Repartition

KNIME Extension for Apache Spark core infrastructure version 4.3.1.v202101261633 by KNIME AG, Zurich, Switzerland

This node returns a Spark DataFrame with an increased or decreased partition count. This is useful for addressing performance issues in the following situations:

  • An uneven distribution of rows over partitions, which causes "straggler" tasks that delay the completion of a stage. A straggler task is a task that takes much longer than the other tasks of the same stage.
  • Too few partitions, which prevents Spark from parallelizing computation.
  • Too many partitions with very little data in each, which causes unnecessary overhead.
  • Spark executors that crash or slow down because they run out of memory, caused by partitions that contain too much data.

The following guidelines apply when repartitioning a DataFrame:

  • Before performing computation on a DataFrame (e.g. preprocessing or learning a model), the partition count should be at least a low multiple of the number of available executor cores in the Spark cluster (see the respective option in the "Settings" tab). This ensures that Spark can properly parallelize computation. For very large data sets, higher multiples of the available executor cores also make sense, in order to avoid memory problems on the Spark executors.
  • Before writing a DataFrame to storage (HDFS, S3, ...), aim for a partition count where each partition has a reasonable size, e.g. 50 MB to 100 MB. This ensures fast writing and reading of the DataFrame.

Notes:

  • This node shuffles data, which can be expensive. See the "Advanced" tab to avoid shuffling where possible.
  • This node requires at least Apache Spark 2.0.

Options

Settings

fixed value
Set partition count to a fixed value. Must be greater than zero.
multiply current partitions by
Multiply the current partition count by a factor. Must be greater than zero.
divide current partitions by
Divide the current partition count by a factor. Must be greater than zero.
multiply available executor cores by
Multiply the number of all available executor cores in the cluster by a given factor. Must be greater than zero.

Advanced

avoid shuffling
If this option is selected and the partition count decreases, coalesce is used to avoid shuffling. If the option is not selected, or the partition count increases, repartition with a full shuffle is used.

Input Ports

Spark DataFrame to repartition.

Output Ports

Repartitioned Spark DataFrame.

Installation

To use this node in KNIME, install KNIME Extension for Apache Spark from the following update site:

KNIME 4.3

A zipped version of the software site can be downloaded here.

Don't know what to do with this link? Read our NodePit Product and Node Installation Guide, which explains in detail how to install nodes to your KNIME Analytics Platform.

Wait a sec! You want to explore and install nodes even faster? We highly recommend our NodePit for KNIME extension for your KNIME Analytics Platform. Browse NodePit from within KNIME, install nodes with just one click and share your workflows with NodePit Space.

Developers

You want to see the source code for this node? Click the following button and we’ll use our super-powers to find it for you.