Persist Spark DataFrame/RDD

This node persists (caches) the incoming Spark DataFrame/RDD using the specified storage level. The individual storage levels are described in detail in the Spark documentation.

Caching a Spark DataFrame/RDD can speed up operations that access the same DataFrame/RDD several times, e.g. when working with the same DataFrame/RDD within a loop body in a KNIME workflow.
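
For illustration, this is roughly what persisting amounts to on the Spark side. A minimal Scala sketch, in which the input path and the loop are hypothetical stand-ins for the node's input and for downstream KNIME nodes:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("PersistExample").getOrCreate()

    // Hypothetical input: a DataFrame that several downstream steps reuse.
    val df = spark.read.parquet("/tmp/example-input")

    // Mark the DataFrame for caching; the cache is populated lazily,
    // on the first action that touches the DataFrame.
    df.persist(StorageLevel.MEMORY_ONLY)

    // Repeated accesses (e.g. a loop body) now read from the cache
    // instead of recomputing the full lineage each time.
    for (_ <- 1 to 3) {
      println(df.count())
    }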

Options

Storage level
Defines the storage level to use for persisting the incoming Spark DataFrame/RDD (a sketch mapping these levels to Spark's StorageLevel constants follows the list). The available levels are:
  • Memory only: Store DataFrame/RDD as deserialized Java objects in the JVM. If the DataFrame/RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they are needed. This is the default level.
  • Memory and disk: Store DataFrame/RDD as deserialized Java objects in the JVM. If the DataFrame/RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
  • Memory only serialized: Store DataFrame/RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
  • Memory and disk serialized: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
  • Disk only: Store the DataFrame/RDD partitions only on disk.
  • Off heap (experimental): Store DataFrame/RDD in serialized format in Tachyon. Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage-collection overhead and allows executors to be smaller and to share a pool of memory, making it attractive in environments with large heaps or multiple concurrent applications. Furthermore, because the DataFrames/RDDs reside in Tachyon, an executor crash does not lose the in-memory cache. In this mode, the memory in Tachyon is discardable, so Tachyon does not attempt to reconstruct a block that it evicts from memory.
  • Custom: Allows you to define your own persistence level using the custom storage parameters.
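
As a reference, the levels above correspond to the following constants of Spark's org.apache.spark.storage.StorageLevel. This is a sketch; KNIME selects the constant internally based on the dialog setting:

    import org.apache.spark.storage.StorageLevel

    // Dialog option           -> Spark constant
    val memoryOnly       = StorageLevel.MEMORY_ONLY          // Memory only (default)
    val memoryAndDisk    = StorageLevel.MEMORY_AND_DISK      // Memory and disk
    val memoryOnlySer    = StorageLevel.MEMORY_ONLY_SER      // Memory only serialized
    val memoryAndDiskSer = StorageLevel.MEMORY_AND_DISK_SER  // Memory and disk serialized
    val diskOnly         = StorageLevel.DISK_ONLY            // Disk only
    val offHeap          = StorageLevel.OFF_HEAP             // Off heap (experimental)
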
Custom Storage Parameters
  • Use disk: Whether the DataFrame/RDD should be cached on disk.
  • Use memory: Whether the DataFrame/RDD should be cached in memory.
  • Use off heap: Whether the DataFrame/RDD should be cached off heap. This is an experimental option.
  • Deserialized: Whether the DataFrame/RDD should be cached in deserialized form.
  • Replication: The number of cluster nodes the DataFrame/RDD should be cached on (see the sketch after this list).
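
These parameters map one-to-one onto the arguments of Spark's StorageLevel factory method. A minimal sketch, assuming a hypothetical incoming DataFrame df and illustrative parameter values:

    import org.apache.spark.storage.StorageLevel

    // Illustrative values: cache on disk and in memory, no off-heap storage,
    // keep partitions deserialized, cache on two cluster nodes.
    val customLevel = StorageLevel(
      useDisk = true,
      useMemory = true,
      useOffHeap = false,
      deserialized = true,
      replication = 2
    )

    // df.persist(customLevel)  // df: the hypothetical incoming DataFrame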

Input Ports

Spark DataFrame/RDD to persist.

Output Ports

The persisted Spark DataFrame/RDD.

Views

This node has no views
