Spark PCA

This node performs a principal component analysis (PCA) on the given data using the Apache Spark implementation. The input data is projected from its original feature space into a space of (possibly) lower dimension with a minimum of information loss.
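To make the idea of projecting onto principal components concrete, here is a minimal plain-Python sketch, not the node's Spark implementation. It assumes the usual PCA recipe (center the data, build the sample covariance matrix, take the dominant eigenvector) and uses power iteration to find only the first component. All function names are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Return (column means, sample covariance matrix) for a list of numeric rows."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    return means, cov

def first_component(cov, iterations=200):
    """Approximate the dominant eigenvector of cov by power iteration.

    Assumes the start vector is not orthogonal to the dominant eigenvector.
    """
    dims = len(cov)
    v = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # zero covariance: no direction carries any variance
            break
        v = [x / norm for x in w]
    return v

def project(rows, means, component):
    """Project each centered row onto the component (one score per row)."""
    return [sum((r[j] - means[j]) * component[j] for j in range(len(component)))
            for r in rows]

# Points lying exactly on the line y = 2x: one dimension captures all variance.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
means, cov = covariance_matrix(data)
pc = first_component(cov)
scores = project(data, means, pc)
```

In this toy data set the two input columns collapse into a single score column without losing information, which is the best case for the dimension reduction described above.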

Options

Fail if missing values are encountered

If checked, execution fails when the selected columns contain missing values. By default, rows containing missing values are ignored and do not contribute to the computation of the principal components.
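The two behaviors can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the node's code: rows are modeled as dictionaries, a missing value as `None`, and `prepare_rows` and its parameters are hypothetical names.

```python
def prepare_rows(rows, selected_columns, fail_on_missing=False):
    """Drop rows with missing values in the selected columns, or fail on the first one."""
    kept = []
    for index, row in enumerate(rows):
        has_missing = any(row.get(col) is None for col in selected_columns)
        if has_missing and fail_on_missing:
            raise ValueError(f"Row {index} has a missing value in the selected columns")
        if not has_missing:
            kept.append(row)
    return kept

rows = [
    {"x": 1.0, "y": 2.0, "label": "a"},
    {"x": None, "y": 4.0, "label": "b"},   # missing value in a selected column
    {"x": 3.0, "y": 6.0, "label": None},   # missing value outside the selection
]

# Default behavior: the second row is skipped, the third is kept.
usable = prepare_rows(rows, ["x", "y"])
```

Note that only the selected columns are checked, so a missing value in a column outside the analysis does not exclude the row.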

Target dimensions

Select the number of dimensions the input data is projected to. Choose one of the following:

Dimensions to reduce to: Directly specify the number of target dimensions. The specified number must be less than or equal to the number of input columns.
Minimum information fraction to preserve (%): Specify, as a percentage, the fraction of information to preserve from the input columns. This option requires Apache Spark 2.0 or higher.
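One plausible reading of the second option is sketched below in plain Python. It assumes that "information" means explained variance and that the smallest number of leading components whose cumulative share reaches the threshold is chosen; neither assumption is confirmed by this page. The function name `min_dimensions` is hypothetical.

```python
from itertools import accumulate

def min_dimensions(explained_variance, min_fraction_percent):
    """Smallest k such that the first k components reach the requested share.

    explained_variance: per-component variances, sorted in descending order.
    min_fraction_percent: threshold in percent, in the range (0, 100].
    """
    if not 0 < min_fraction_percent <= 100:
        raise ValueError("min_fraction_percent must be in (0, 100]")
    total = sum(explained_variance)
    target = min_fraction_percent / 100.0
    for k, cumulative in enumerate(accumulate(explained_variance), start=1):
        # Small tolerance guards against floating-point rounding at the threshold.
        if cumulative / total >= target - 1e-12:
            return k
    return len(explained_variance)

variances = [6.0, 2.5, 1.0, 0.5]      # hypothetical per-component variances
k = min_dimensions(variances, 80)     # first two components cover 85 %
print(k)  # 2
```

With these example variances, asking for 80 % keeps two dimensions, while asking for 100 % keeps all four.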

Replace original data columns

If checked, the projected DataFrame/RDD will not contain columns that were included in the principal component analysis. Only the projected columns and the input columns that were not included in the principal component analysis remain.
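The effect of this option on the output columns can be sketched as follows. The projected column names (`pc_0`, `pc_1`, ...) and their position at the end of the table are assumptions made for illustration; the node's actual naming and ordering may differ.

```python
def output_columns(input_columns, pca_columns, target_dimensions, replace_original):
    """Return the column names of the projected table for the given settings."""
    # Hypothetical names for the projected columns.
    projected = [f"pc_{i}" for i in range(target_dimensions)]
    if replace_original:
        # Keep only the input columns that were not part of the analysis.
        retained = [c for c in input_columns if c not in pca_columns]
    else:
        retained = list(input_columns)
    return retained + projected

columns = ["id", "height", "weight", "age"]
selected = ["height", "weight", "age"]

replaced = output_columns(columns, selected, 2, replace_original=True)
appended = output_columns(columns, selected, 2, replace_original=False)
print(replaced)  # ['id', 'pc_0', 'pc_1']
print(appended)  # ['id', 'height', 'weight', 'age', 'pc_0', 'pc_1']
```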

Columns

Select the columns to include in the principal component analysis, i.e., the original features.

Input Ports

Input Spark DataFrame/RDD

Output Ports

The input DataFrame/RDD projected onto the principal components. Input columns that were not included in the principal component analysis are retained.
A DataFrame/RDD with the principal components matrix.

Views

This node has no views

Installation

To use this node in KNIME, install the KNIME Extension for Apache Spark (legacy) from the update site below, following the NodePit Product and Node Installation Guide:

v5.5

A zipped version of the software site can be downloaded here.

Plugin provider: KNIME AG, Zurich, Switzerland

Plugin version: 5.5.0.v202506051107

On NodePit since: 2025-07-02

Last update: 2025-07-23

KNIME versions: Since v3.6
