
Spark Joiner

KNIME Extension for Apache Spark core infrastructure version 4.1.0.v201911281435 by KNIME AG, Zurich, Switzerland

This node joins two Spark DataFrame/RDDs in a database-like way. The join is based on the joining columns of both DataFrame/RDDs.

Options

Joiner settings

Join mode
If a row from the top DataFrame/RDD cannot be joined with a row from the bottom DataFrame/RDD (or vice versa), there are several options for handling the situation. After an Inner Join, only matching rows appear in the output DataFrame/RDD. A Left Outer Join fills the columns that come from the bottom DataFrame/RDD with missing values if no matching row exists in the bottom DataFrame/RDD. Likewise, a Right Outer Join fills the columns from the top DataFrame/RDD with missing values if no matching row exists in the top DataFrame/RDD. A Full Outer Join fills the columns from both the top and bottom DataFrame/RDD with missing values if a row cannot be joined.
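The four join modes can be illustrated with a minimal plain-Python sketch. It is not the node's or Spark's implementation: the helper `join`, the mode names, and the sample rows are hypothetical, rows are modelled as dictionaries, and `None` stands in for a missing value. It also assumes the two inputs have no column names in common.

```python
def join(top, bottom, top_key, bottom_key, mode="inner"):
    """Join two lists of dict rows on one column pair (hypothetical helper)."""
    # Column names of each side, needed to fill unmatched rows with None.
    top_cols = sorted({c for row in top for c in row})
    bottom_cols = sorted({c for row in bottom for c in row})
    out = []
    matched_bottom = set()  # indices of bottom rows that found a partner
    for t in top:
        # Keys are compared with ==, so SQL null semantics are not modelled.
        hits = [i for i, b in enumerate(bottom) if b[bottom_key] == t[top_key]]
        matched_bottom.update(hits)
        for i in hits:
            out.append({**t, **bottom[i]})
        if not hits and mode in ("left_outer", "full_outer"):
            # Unmatched top row: bottom columns become missing values.
            out.append({**t, **{c: None for c in bottom_cols}})
    if mode in ("right_outer", "full_outer"):
        for i, b in enumerate(bottom):
            if i not in matched_bottom:
                # Unmatched bottom row: top columns become missing values.
                out.append({**{c: None for c in top_cols}, **b})
    return out


# Sample data (made up for illustration).
top = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
bottom = [{"cust_id": 2, "amount": 10}, {"cust_id": 3, "amount": 20}]

inner = join(top, bottom, "id", "cust_id", "inner")
left = join(top, bottom, "id", "cust_id", "left_outer")
right = join(top, bottom, "id", "cust_id", "right_outer")
full = join(top, bottom, "id", "cust_id", "full_outer")
```

With this sample data, only the row with `id` 2 has a partner, so the inner join yields one row, each one-sided outer join yields two, and the full outer join yields three.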
Joining columns
Select the columns from the top input ('left' table) and the bottom input ('right' table) that should be used for joining. Make sure that the types of the selected columns match.
Match all of the following: A row of the top input DataFrame/RDD and a row of the bottom input DataFrame/RDD match if they match in all specified column pairs.
Match any of the following: A row of the top input DataFrame/RDD and a row of the bottom input DataFrame/RDD match if they match in at least one specified column pair.
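The difference between the two matching options can be sketched in plain Python. The function `rows_match`, the column names, and the sample rows are hypothetical and only illustrate the described semantics, not how the node evaluates the condition.

```python
def rows_match(top_row, bottom_row, column_pairs, match_all=True):
    """Return True if the two rows match on the given (top, bottom) column pairs."""
    comparisons = (top_row[t] == bottom_row[b] for t, b in column_pairs)
    # "Match all": every pair must be equal; "Match any": one equal pair suffices.
    return all(comparisons) if match_all else any(comparisons)


# Sample rows (made up): the ids agree, the country/region values do not.
pairs = [("id", "cust_id"), ("country", "region")]
top_row = {"id": 1, "country": "CH"}
bottom_row = {"cust_id": 1, "region": "DE"}

matches_all = rows_match(top_row, bottom_row, pairs, match_all=True)
matches_any = rows_match(top_row, bottom_row, pairs, match_all=False)
```

Here the rows agree on only one of the two column pairs, so they match under "Match any of the following" but not under "Match all of the following".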

Column selection

Column Selection (Top Input ('left' table) and Bottom Input ('right' table))

Include: This list contains the names of those columns in the input Spark DataFrame/RDD to be included.
Exclude: This list contains the names of the columns in the input Spark DataFrame/RDD to be excluded.
Filter: Use one of these fields to filter either the Include or Exclude list for certain column names or name substrings.
Buttons: Use these buttons to move columns between the Include and Exclude list. Single-arrow buttons will move all selected columns. Double-arrow buttons will move all columns (filtering is taken into account).

Always include all columns: If set, all columns are moved to the Include list, and if the input changes, all new columns will be added to the Include list, too.
Duplicate column handling
This option determines what happens if the include lists of both input DataFrame/RDDs contain columns with the same name.
Filter duplicates: Only the columns from the top input DataFrame/RDD will show up in the output DataFrame/RDD.
Don't execute: The node cannot be executed if there are duplicate column names in the include lists.
Append suffix: Append a suffix to the duplicate column names from the bottom input DataFrame/RDD so that they also show up in the output DataFrame/RDD.
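The three duplicate-handling options can be sketched as a plain-Python function over column name lists. Everything here is illustrative: `resolve_duplicates`, the mode names, and the `_right` suffix are assumptions, and the suffix the node actually uses is whatever is configured in its dialog.

```python
def resolve_duplicates(top_cols, bottom_cols, mode, suffix="_right"):
    """Return the output column names for the chosen duplicate handling."""
    duplicates = [c for c in bottom_cols if c in top_cols]
    if mode == "filter":
        # Filter duplicates: keep only the top input's copy of each name.
        return top_cols + [c for c in bottom_cols if c not in top_cols]
    if mode == "fail":
        # Don't execute: refuse to run when names clash.
        if duplicates:
            raise ValueError(f"duplicate column names: {duplicates}")
        return top_cols + bottom_cols
    if mode == "suffix":
        # Append suffix: rename the bottom input's clashing columns.
        # (A suffixed name could itself clash; this sketch does not check.)
        return top_cols + [c + suffix if c in top_cols else c for c in bottom_cols]
    raise ValueError(f"unknown mode: {mode}")


# Sample column lists (made up): both inputs contain a column named "id".
top_cols = ["id", "name"]
bottom_cols = ["id", "amount"]

filtered = resolve_duplicates(top_cols, bottom_cols, "filter")
suffixed = resolve_duplicates(top_cols, bottom_cols, "suffix")
```

With these inputs, `filtered` contains a single `id` column taken from the top input, while `suffixed` keeps both copies and renames the bottom one.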
Joining columns handling
This option allows filtering out the top/bottom joining columns, i.e. the joining columns defined in the Joiner settings tab will not show up in the output DataFrame/RDD.

Input Ports

DataFrame/RDD contributing to the left part of the output DataFrame/RDD
DataFrame/RDD contributing to the right part of the output DataFrame/RDD

Output Ports

Joined DataFrame/RDD

Installation

To use this node in KNIME, install KNIME Extension for Apache Spark from the following update site:

KNIME 4.1