
Spark RDD Java Snippet (Source)

KNIME Extension for Apache Spark core infrastructure version 4.1.0.v201911281435 by KNIME AG, Zurich, Switzerland

This node allows you to execute arbitrary Java code to create a Spark RDD, e.g. by reading a file from HDFS (see the provided templates). Simply enter the Java code in the text area.

Note that this node also supports flow variables as input to your Spark job. To use a flow variable, simply double-click it in the "Flow Variable List".

It is also possible to use external Java libraries. To include such external jar or zip files, add their locations in the "Additional Libraries" tab using the control buttons. For details, see the "Additional Libraries" tab description below.
The libraries need to be present on your cluster and added to the class path of your Spark job server. They are not uploaded automatically!

You can define reusable templates with the "Create templates..." button. Templates are stored in the user's workspace by default and can be accessed via the "Templates" tab. For details, see the "Templates" tab description below.

Options

Java Snippet

Flow Variable List
The list contains the flow variables that are currently available at the node input. Double-clicking any of the entries inserts the respective identifier at the current cursor position (replacing the selection, if any).
Snippet text area

Enter your Java code here.

The JavaSparkContext can be accessed via the method input parameter sc.
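For illustration, a minimal snippet body that builds a small RDD from in-memory data via sc might look as follows. This is a sketch, not the node's generated code: the column values are purely illustrative, and the imports would go into the snippet's custom-imports section.

```java
// Hypothetical snippet body; sc is the JavaSparkContext provided by the node.
// Assumed custom imports:
//   import java.util.Arrays;
//   import org.apache.spark.sql.Row;
//   import org.apache.spark.sql.RowFactory;

// Build a two-row RDD from an in-memory list; each Row has a String and an int.
JavaRDD<Row> rdd = sc.parallelize(Arrays.asList(
        RowFactory.create("row0", 1),
        RowFactory.create("row1", 2)));
return rdd;
```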

Output Schema:
The schema (i.e. the data table specification) of the returned JavaRDD&lt;Row&gt; is derived automatically by default, by inspecting the top 10 rows of the returned JavaRDD&lt;Row&gt;. However, you can also specify the schema programmatically by overriding the getSchema() method. For an example of how to implement the method, have a look at the "Create result schema manually" template in the "Templates" tab.
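As a rough sketch (the authoritative example is the "Create result schema manually" template; the return type, signature, and field definitions below are assumptions), a manual schema for a two-column result could look like this:

```java
// Hypothetical override in the hidden editor part; assumed custom imports:
//   import java.util.Arrays;
//   import org.apache.spark.sql.types.DataTypes;
//   import org.apache.spark.sql.types.StructType;
@Override
public StructType getSchema() {
    // One StructField per output column: name, Spark data type, nullable flag.
    return DataTypes.createStructType(Arrays.asList(
            DataTypes.createStructField("name", DataTypes.StringType, false),
            DataTypes.createStructField("count", DataTypes.IntegerType, false)));
}
```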

Flow variables:
You can access input flow variables by defining them in the Input table. To define a flow variable, simply double-click it in the "Flow Variable List".
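For example, assuming a String flow variable holding an HDFS path has been added via the Input table, and assuming the node exposes it as a field named v_path (both the variable and the field name are hypothetical), the snippet could use it to read a file:

```java
// v_path is a hypothetical field generated for a String flow variable;
// sc is the JavaSparkContext provided by the node.
JavaRDD<String> lines = sc.textFile(v_path);
```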

You can press Ctrl+Space to open an auto-completion box with all available classes, methods, and fields. When you select a class and press Enter, an import statement is generated if it is missing.

Note that the snippet allows you to define custom global variables and custom imports. To view the hidden editor parts, simply click the plus symbols in the editor.

Input
Define system input fields for the snippet text area. Every field will be populated with the data of the defined input during execution.

Additional Libraries

Allows you to add additional jar files to the Java snippet class path.
The libraries need to be present on your cluster and added to the class path of your Spark job server. They are not uploaded automatically!

Add File(s)
Allows you to include local jar files.
Add KNIME URL...
Allows you to add workflow-relative jar files.

Templates

Provides predefined templates and allows you to define new reusable templates by saving the current snippet state.

Category
Groups templates into different categories.
Apply
Overwrites the current node settings with the template settings.
Java Snippet
Preview of the template code.
Additional Libraries
Preview of the additional jars.

Input Ports

Required Spark context.

Output Ports

The newly created Spark RDD.


Installation

To use this node in KNIME, install KNIME Extension for Apache Spark from the following update site:

KNIME 4.1
