Persist Spark DataFrame/RDD

This node persists (caches) the incoming Spark DataFrame/RDD using the specified storage level. The individual storage levels are described in detail in the Spark documentation.

Caching a Spark DataFrame/RDD can speed up operations that access the same DataFrame/RDD several times, e.g. when working with the same DataFrame/RDD within a loop body in a KNIME workflow.
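
For illustration, this is roughly what persisting amounts to on the Spark side. A minimal Scala sketch, in which the input path and the loop are hypothetical stand-ins for the node's input and for downstream KNIME nodes:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("PersistExample").getOrCreate()

    // Hypothetical input: a DataFrame that several downstream steps reuse.
    val df = spark.read.parquet("/tmp/example-input")

    // Mark the DataFrame for caching; the cache is populated lazily,
    // on the first action that touches the DataFrame.
    df.persist(StorageLevel.MEMORY_ONLY)

    // Repeated accesses (e.g. a loop body) now read from the cache
    // instead of recomputing the full lineage each time.
    for (_ <- 1 to 3) {
      println(df.count())
    }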

Options

Storage level
Defines the storage level to use for persisting the incoming Spark DataFrame/RDD (a sketch mapping these levels to Spark's StorageLevel constants follows the list). The available levels are:
  • Memory only: Store DataFrame/RDD as deserialized Java objects in the JVM. If the DataFrame/RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they are needed. This is the default level.
  • Memory and disk: Store DataFrame/RDD as deserialized Java objects in the JVM. If the DataFrame/RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
  • Memory only serialized: Store DataFrame/RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
  • Memory and disk serialized: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
  • Disk only: Store the DataFrame/RDD partitions only on disk.
  • Off heap (experimental): Store DataFrame/RDD in serialized format in Tachyon. Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage-collection overhead and allows executors to be smaller and to share a pool of memory, making it attractive in environments with large heaps or multiple concurrent applications. Furthermore, because the DataFrames/RDDs reside in Tachyon, an executor crash does not lose the in-memory cache. In this mode, the memory in Tachyon is discardable, so Tachyon does not attempt to reconstruct a block that it evicts from memory.
  • Custom: Allows you to define your own persistence level using the custom storage parameters.
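
As a reference, the levels above correspond to the following constants of Spark's org.apache.spark.storage.StorageLevel. This is a sketch; KNIME selects the constant internally based on the dialog setting:

    import org.apache.spark.storage.StorageLevel

    // Dialog option           -> Spark constant
    val memoryOnly       = StorageLevel.MEMORY_ONLY          // Memory only (default)
    val memoryAndDisk    = StorageLevel.MEMORY_AND_DISK      // Memory and disk
    val memoryOnlySer    = StorageLevel.MEMORY_ONLY_SER      // Memory only serialized
    val memoryAndDiskSer = StorageLevel.MEMORY_AND_DISK_SER  // Memory and disk serialized
    val diskOnly         = StorageLevel.DISK_ONLY            // Disk only
    val offHeap          = StorageLevel.OFF_HEAP             // Off heap (experimental)
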
Custom Storage Parameters
  • Use disk: Whether the DataFrame/RDD should be cached on disk.
  • Use memory: Whether the DataFrame/RDD should be cached in memory.
  • Use off heap: Whether the DataFrame/RDD should be cached off heap. This is an experimental option.
  • Deserialized: Whether the DataFrame/RDD should be cached in deserialized form.
  • Replication: The number of cluster nodes the DataFrame/RDD should be cached on (see the sketch after this list).
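
These parameters map one-to-one onto the arguments of Spark's StorageLevel factory method. A minimal sketch, assuming a hypothetical incoming DataFrame df and illustrative parameter values:

    import org.apache.spark.storage.StorageLevel

    // Illustrative values: cache on disk and in memory, no off-heap storage,
    // keep partitions deserialized, cache on two cluster nodes.
    val customLevel = StorageLevel(
      useDisk = true,
      useMemory = true,
      useOffHeap = false,
      deserialized = true,
      replication = 2
    )

    // df.persist(customLevel)  // df: the hypothetical incoming DataFrame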

Input Ports

Spark DataFrame/RDD to persist.

Output Ports

The persisted Spark DataFrame/RDD.

Views

This node has no views
