Spark Repartition

This node returns a Spark DataFrame with increased or decreased partition count. This id useful to deal with performance issues in the following situations:

  • An uneven distribution of rows over partitions, which causes "straggler" tasks that delay the completion of a stage. A straggler task is a task that takes much longer than other tasks of the same stage.
  • A too low number of partitions, which prevents Spark from parallelizing computation.
  • A too high number of partitions with very little data in them, which causes unnecessary overhead.
  • Spark executors that crash or are very slow, because they run out of memory, due to partitions that contain too much data.

The following guidelines apply when repartitioning a DataFrame:

  • Before performing computation on a DataFrame (e.g. preprocessing or learning a model), the partition count should be at least a low multiple of the number of available executor cores in the Spark cluster (see respective option in the "Settings" tab). This ensures that Spark can properly parallelize computation. For very large data sets also high multiples of the available executor cores make sense, in order to avoid memory problems on the Spark executors.
  • Before writing a DataFrame to storage (HDFS, S3, ...) it is beneficial to aim for a partition count where partitions have a reasonable size e.g. 50M - 100M. This ensures fast writing and reading of the DataFrame.

Notes:

  • This node shuffles data which might be expensive. See the "Advanced" tab to avoid shuffling if possible.
  • This node requires at least Apache Spark 2.0.

Options

Settings

fixed value
Set partition count to a fixed value. Must be greater than zero.
multiply current partitions by
Multiply the current partition count by a factor. Must be greater than zero.
divide current partitions by
Divide the current partition count by a factor. Must be greater than zero.
multiply available executor cores by
Multiply the number of all available executor cores in the cluster by a given factor. Must be greater than zero.

Advanced

avoid shuffling
If option is selected and the partition count decreases, then coalesce will be used to avoid shuffling. If option is not selected or partition count increases, repartition with shuffling will be used.

Input Ports

Icon
Spark DataFrame to repartition.

Output Ports

Icon
Repartitioned Spark DataFrame.

Views

This node has no views

Workflows

Links

Developers

You want to see the source code for this node? Click the following button and we’ll use our super-powers to find it for you.