
Persist Spark DataFrame/RDD

KNIME Extension for Apache Spark core infrastructure version 4.2.0.v202007072005 by KNIME AG, Zurich, Switzerland

This node persists (caches) the incoming Spark DataFrame/RDD using the specified persistence level. The different storage levels are described in detail in the Spark documentation.

Caching Spark DataFrames/RDDs can speed up operations that access the same DataFrame/RDD several times, e.g. when the same DataFrame/RDD is used within a loop body in a KNIME workflow.
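
To make the effect concrete, here is a minimal Spark (Scala) sketch, not taken from the node's implementation: a DataFrame is persisted once and then reused by several actions, so the cached partitions are read instead of recomputing the lineage each time. The SparkSession setup and the toy data are assumptions for illustration only.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("persist-sketch").getOrCreate()
    val df = spark.range(1000000L).toDF("value")

    // Persist with the node's default level ("Memory only").
    df.persist(StorageLevel.MEMORY_ONLY)

    // Without persist(), every action below would recompute df from its lineage.
    Seq(10L, 100L, 1000L).foreach { threshold =>
      println(s"rows below $threshold: ${df.filter(df("value") < threshold).count()}")
    }

    // Release the cached partitions once they are no longer needed.
    df.unpersist()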

Options

Storage level
Defines the storage level to use for persisting the incoming Spark DataFrame/RDD. The available levels (see the sketch after this list) are:
  • Memory only: Store DataFrame/RDD as deserialized Java objects in the JVM. If the DataFrame/RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
  • Memory and disk: Store DataFrame/RDD as deserialized Java objects in the JVM. If the DataFrame/RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
  • Memory only serialized: Store DataFrame/RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
  • Memory and disk serialized: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
  • Disk only: Store the DataFrame/RDD partitions only on disk.
  • Off heap (experimental): Store DataFrame/RDD in serialized format in Tachyon. Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage collection overhead and allows executors to be smaller and to share a pool of memory, making it attractive in environments with large heaps or multiple concurrent applications. Furthermore, as the DataFrames/RDDs reside in Tachyon, the crash of an executor does not lead to losing the in-memory cache. In this mode, the memory in Tachyon is discardable. Thus, Tachyon does not attempt to reconstruct a block that it evicts from memory.
  • Custom: Allows you to define your own persistence level using the custom storage parameters described below (see the sketch after the parameter list).
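
The level names above are those shown in the node dialog; presumably they correspond to Spark's predefined StorageLevel constants as listed below. The mapping is an assumption based on the option names, not something stated in the node documentation, and the Scala sketch is for orientation only.

    import org.apache.spark.storage.StorageLevel

    val memoryOnly              = StorageLevel.MEMORY_ONLY          // Memory only (the default)
    val memoryAndDisk           = StorageLevel.MEMORY_AND_DISK      // Memory and disk
    val memoryOnlySerialized    = StorageLevel.MEMORY_ONLY_SER      // Memory only serialized
    val memoryAndDiskSerialized = StorageLevel.MEMORY_AND_DISK_SER  // Memory and disk serialized
    val diskOnly                = StorageLevel.DISK_ONLY            // Disk only
    val offHeap                 = StorageLevel.OFF_HEAP             // Off heap (experimental)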
Custom Storage Parameter
  • Use disk: Whether the DataFrame/RDD should be cached on disk.
  • Use memory: Whether the DataFrame/RDD should be cached in memory.
  • Use off heap: Whether the DataFrame/RDD should be cached off heap. This is an experimental option.
  • Deserialized: Whether the DataFrame/RDD should be cached in deserialized form.
  • Replication: The number of cluster nodes the DataFrame/RDD should be cached on.
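
Taken together, these parameters presumably map one-to-one onto Spark's StorageLevel factory method, which accepts the same five settings. The Scala sketch below illustrates that mapping under this assumption; it is not KNIME code.

    import org.apache.spark.storage.StorageLevel

    // useDisk, useMemory, useOffHeap, deserialized, replication
    val customLevel = StorageLevel(
      /* useDisk      = */ true,
      /* useMemory    = */ true,
      /* useOffHeap   = */ false,
      /* deserialized = */ false,
      /* replication  = */ 2   // keep each cached partition on two cluster nodes
    )
    // A DataFrame/RDD would then be persisted via df.persist(customLevel).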

Input Ports

Spark DataFrame/RDD to persist.

Output Ports

The persisted Spark DataFrame/RDD.

Installation

To use this node in KNIME, install KNIME Extension for Apache Spark from the following update site:

KNIME 4.2

A zipped version of the software site can be downloaded here. Read our FAQs to get instructions about how to install nodes from a zipped update site.
