Spark Repartition

KNIME Extension for Apache Spark core infrastructure version 4.3.1.v202101261633 by KNIME AG, Zurich, Switzerland

This node returns a Spark DataFrame with an increased or decreased partition count. This is useful for addressing performance issues in the following situations:

  • An uneven distribution of rows over partitions, which causes "straggler" tasks that delay the completion of a stage. A straggler task is a task that takes much longer than the other tasks of the same stage.
  • Too few partitions, which prevents Spark from parallelizing computation.
  • Too many partitions with very little data in each, which causes unnecessary overhead.
  • Spark executors that crash or slow down because they run out of memory, caused by partitions that contain too much data.

The following guidelines apply when repartitioning a DataFrame:

  • Before performing computation on a DataFrame (e.g. preprocessing or learning a model), the partition count should be at least a low multiple of the number of available executor cores in the Spark cluster (see the respective option in the "Settings" tab). This ensures that Spark can properly parallelize computation. For very large data sets, higher multiples of the available executor cores also make sense, in order to avoid memory problems on the Spark executors.
  • Before writing a DataFrame to storage (HDFS, S3, ...), aim for a partition count where each partition has a reasonable size, e.g. 50 MB to 100 MB. This ensures fast writing and reading of the DataFrame.

Notes:

  • This node shuffles data, which can be expensive. See the "Advanced" tab to avoid shuffling where possible.
  • This node requires at least Apache Spark 2.0.

Options

Settings

fixed value
Set partition count to a fixed value. Must be greater than zero.
multiply current partitions by
Multiply the current partition count by a factor. Must be greater than zero.
divide current partitions by
Divide the current partition count by a factor. Must be greater than zero.
multiply available executor cores by
Multiply the number of all available executor cores in the cluster by a given factor. Must be greater than zero.

Advanced

avoid shuffling
If this option is selected and the partition count decreases, coalesce is used to avoid shuffling. If the option is not selected, or the partition count increases, repartition with a full shuffle is used.

Input Ports

Spark DataFrame to repartition.

Output Ports

Repartitioned Spark DataFrame.

Installation

To use this node in KNIME, install KNIME Extension for Apache Spark from the following update site:

KNIME 4.3

A zipped version of the software site can be downloaded here.

Don't know what to do with this link? Read our NodePit Product and Node Installation Guide, which explains in detail how to install nodes to your KNIME Analytics Platform.

Wait a sec! You want to explore and install nodes even faster? We highly recommend our NodePit for KNIME extension for your KNIME Analytics Platform. Browse NodePit from within KNIME, install nodes with just one click and share your workflows with NodePit Space.

Developers

You want to see the source code for this node? Click the following button and we’ll use our super-powers to find it for you.