Spark Joiner

This node joins two Spark DataFrame/RDDs in a database-like way. The join is based on the joining columns of both DataFrame/RDDs.

Options

Joiner settings

Join mode: Determines what happens when a row from the top DataFrame/RDD cannot be joined with a row from the bottom DataFrame/RDD, or vice versa.
Inner Join: Only matching rows show up in the output DataFrame/RDD.
Left Outer Join: Every row from the top DataFrame/RDD is kept. If no matching row exists in the bottom DataFrame/RDD, the columns that come from the bottom DataFrame/RDD are filled with missing values.
Right Outer Join: Every row from the bottom DataFrame/RDD is kept. If no matching row exists in the top DataFrame/RDD, the columns that come from the top DataFrame/RDD are filled with missing values.
Full Outer Join: Every row from both inputs is kept. If a row cannot be joined, the columns that come from the other input are filled with missing values.
Joining columns: Select the columns from the top input ('left' table) and the bottom input ('right' table) that should be used for joining. Make sure that the types of the selected columns match.
Match all of the following: A row of the top input DataFrame/RDD and a row of the bottom input DataFrame/RDD match if they match in all specified column pairs.
Match any of the following: A row of the top input DataFrame/RDD and a row of the bottom input DataFrame/RDD match if they match in at least one specified column pair.
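
The join modes and the two matching rules can be illustrated with a small, Spark-free sketch. This is not the node's implementation: it is a plain nested-loop join over lists of dictionaries, written only to show which rows each setting produces. The function name join_rows, the mode strings, and the sample columns are hypothetical, None stands in for a missing value, and the two inputs are assumed to have disjoint column names.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[object]]
MODES = ("inner", "left_outer", "right_outer", "full_outer")


def join_rows(top: List[Row], bottom: List[Row],
              pairs: List[Tuple[str, str]],
              mode: str = "inner", match_any: bool = False) -> List[Row]:
    """Nested-loop join; pairs holds (top_column, bottom_column) tuples."""
    if mode not in MODES:
        raise ValueError(f"unknown join mode: {mode}")
    if not pairs:
        raise ValueError("at least one joining column pair is required")
    # "Match any" needs one equal pair, "match all" needs every pair equal.
    combine = any if match_any else all
    top_cols = sorted({c for r in top for c in r})
    bottom_cols = sorted({c for r in bottom for c in r})
    matched_bottom = set()
    out: List[Row] = []
    for t in top:
        found = False
        for i, b in enumerate(bottom):
            # Simplification: plain == comparison, no SQL-style null handling.
            if combine(t[tc] == b[bc] for tc, bc in pairs):
                out.append({**t, **b})
                matched_bottom.add(i)
                found = True
        if not found and mode in ("left_outer", "full_outer"):
            # Unmatched top row: bottom columns become missing values.
            out.append({**t, **{c: None for c in bottom_cols}})
    if mode in ("right_outer", "full_outer"):
        for i, b in enumerate(bottom):
            if i not in matched_bottom:
                # Unmatched bottom row: top columns become missing values.
                out.append({**{c: None for c in top_cols}, **b})
    return out


# Hypothetical sample data with disjoint column names.
top = [
    {"t_id": 1, "t_region": "EU", "name": "ann"},
    {"t_id": 2, "t_region": "US", "name": "bob"},
    {"t_id": 3, "t_region": "EU", "name": "cy"},
]
bottom = [
    {"b_id": 1, "b_region": "EU", "amount": 10},
    {"b_id": 2, "b_region": "EU", "amount": 20},
    {"b_id": 4, "b_region": "AS", "amount": 40},
]
pairs = [("t_id", "b_id"), ("t_region", "b_region")]

for mode in MODES:
    print(mode, len(join_rows(top, bottom, pairs, mode)))
print("inner, match any:", len(join_rows(top, bottom, pairs, "inner", match_any=True)))
```

With this sample data, "match all" yields a single joined pair (ann with the first bottom row), while "match any" also joins rows that agree only on the id or only on the region. The outer modes add the unmatched rows from one or both sides on top of that.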

Column selection

Column Selection (Top Input ('left' table) and Bottom Input ('right' table)):
Include: This list contains the names of the columns in the input Spark DataFrame/RDD to be included.
Exclude: This list contains the names of the columns in the input Spark DataFrame/RDD to be excluded.
Filter: Use one of these fields to filter either the Include or Exclude list for certain column names or name substrings.
Buttons: Use these buttons to move columns between the Include and Exclude list. Single-arrow buttons will move all selected columns. Double-arrow buttons will move all columns (filtering is taken into account).

Always include all columns: If set, all columns are moved to the Include list, and any columns that appear when the input changes are included as well.
Duplicate column handling: Determines what happens if the Include lists of both input DataFrame/RDDs contain columns with the same name.
Filter duplicates: Only the columns from the top input DataFrame/RDD will show up in the output DataFrame/RDD.
Don't execute: The node cannot be executed if there are duplicate column names in the Include lists.
Append suffix: Append a suffix to the duplicate column names from the bottom input DataFrame/RDD so that they also show up in the output DataFrame/RDD.
Joining columns handling: Allows the top and/or bottom joining columns to be filtered out, i.e. the joining columns defined in the Joiner Settings tab will not show up in the output DataFrame/RDD.
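
The three duplicate column handling options can be sketched as a small name-resolution step. This is a minimal illustration, not the node's actual code: the function resolve_columns, the strategy strings, and the "_right" suffix are hypothetical placeholders. Repeating the suffix until the name is unique is likewise an assumption of this sketch, not documented behaviour of the node.

```python
from typing import List, Tuple


def resolve_columns(top_cols: List[str], bottom_cols: List[str],
                    strategy: str = "filter",
                    suffix: str = "_right") -> List[Tuple[str, str, str]]:
    """Return (source, original_name, output_name) for each output column."""
    if strategy not in ("filter", "fail", "append_suffix"):
        raise ValueError(f"unknown strategy: {strategy}")
    top_set = set(top_cols)
    duplicates = [c for c in bottom_cols if c in top_set]
    if strategy == "fail" and duplicates:
        # Mirrors "Don't execute": refuse to run when names clash.
        raise ValueError(f"duplicate column names: {duplicates}")
    out = [("top", c, c) for c in top_cols]
    used = top_set | set(bottom_cols)
    for c in bottom_cols:
        if c not in top_set:
            out.append(("bottom", c, c))
        elif strategy == "append_suffix":
            # Assumption of this sketch: repeat the suffix until unique.
            new_name = c + suffix
            while new_name in used:
                new_name += suffix
            used.add(new_name)
            out.append(("bottom", c, new_name))
        # strategy == "filter": the bottom duplicate is dropped.
    return out


top_cols = ["id", "name"]
bottom_cols = ["id", "amount"]
for strategy in ("filter", "append_suffix"):
    print(strategy, [name for _, _, name in resolve_columns(top_cols, bottom_cols, strategy)])
```

Keeping the source and original name next to the output name makes it easy to see which input each output column came from after renaming.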

Input Ports

DataFrame/RDD contributing to the left part of the output DataFrame/RDD
DataFrame/RDD contributing to the right part of the output DataFrame/RDD

Output Ports

Joined DataFrame/RDD

Popular Predecessors

Hive to Spark: 15 %
Parquet to Spark: 13 %
CSV to Spark: 10 %
Table to Spark: 8 %
Database to Spark: 6 %

Popular Successors

Spark to Parquet: 10 %
Spark to Table: 7 %
Spark to Hive: 6 %
Spark Category To Number: 5 %
Spark to Parquet: 4 %

Views

This node has no views

Installation

To use this node in KNIME, install the extension KNIME Extension for Apache Spark (legacy) from the update site below, following the NodePit Product and Node Installation Guide:

v5.9

A zipped version of the software site can be downloaded here.

Plugin provider: KNIME AG, Zurich, Switzerland

Plugin version: 5.9.0.v202511131754

On NodePit since: 2025-12-11

Last update: 2026-01-01

KNIME versions: Since v3.6
