Spark PCA

This node performs a principal component analysis (PCA) on the given data using the Apache Spark implementation. The input data is projected from its original feature space into a space of (possibly) lower dimension with a minimum of information loss.

Options

Fail if missing values are encountered
If checked, execution fails when the selected columns contain missing values. By default, rows containing missing values are ignored and are not used to compute the principal components.
Target dimensions
Select the number of dimensions the input data is projected to. Choose one of the following:
  • Dimensions to reduce to: Directly specify the number of target dimensions. The specified number must be less than or equal to the number of input columns.
  • Minimum information fraction to preserve (%): Specify the percentage of information from the input columns that must be preserved. This option requires Apache Spark 2.0 or higher.
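A minimal sketch of how a target dimension could be derived from a minimum information fraction, assuming "information" is measured as the share of total variance explained by each principal component. The function name `components_for_fraction` and its parameters are hypothetical and are not part of the node or of Spark; the node's actual selection logic may differ.

```python
def components_for_fraction(explained_variance, min_percent):
    """Return the smallest number of components whose cumulative share of
    the total variance reaches min_percent (hypothetical helper)."""
    if not 0 < min_percent <= 100:
        raise ValueError("min_percent must be in the range (0, 100]")
    total = sum(explained_variance)
    if total <= 0:
        raise ValueError("total variance must be positive")

    target = min_percent / 100.0
    cumulative = 0.0
    # Principal components are ordered by decreasing variance.
    ordered = sorted(explained_variance, reverse=True)
    for k, variance in enumerate(ordered, start=1):
        cumulative += variance
        if cumulative / total >= target:
            return k
    # Guard against rounding error when min_percent is 100.
    return len(ordered)


# Illustrative variances only, not taken from real data.
variances = [5.0, 3.0, 1.5, 0.5]
k = components_for_fraction(variances, 90)
```

With the illustrative variances above, the first three components together account for 95% of the total, so a 90% threshold selects three dimensions.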
Replace original data columns
If checked, the projected DataFrame/RDD will not contain columns that were included in the principal component analysis. Only the projected columns and the input columns that were not included in the principal component analysis remain.
Columns
Select the columns to include in the principal component analysis, i.e., the original features.
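The interaction between the selected columns and the "Fail if missing values are encountered" option can be pictured with the following sketch. It uses plain Python dictionaries instead of a Spark DataFrame/RDD, and the helper `select_complete_rows` is a hypothetical name introduced for illustration only; it is not how the node is implemented.

```python
def select_complete_rows(rows, columns, fail_on_missing=False):
    """Keep rows that have a value in every selected column.

    rows: list of dicts mapping column name to value (None = missing).
    columns: the columns included in the analysis.
    fail_on_missing: mimic the 'fail' option by raising instead of skipping.
    """
    complete = []
    for index, row in enumerate(rows):
        missing = [name for name in columns if row.get(name) is None]
        if missing:
            if fail_on_missing:
                raise ValueError(
                    f"Row {index} has missing values in columns: {missing}"
                )
            # Default behaviour: ignore the row.
            continue
        complete.append(row)
    return complete


# Illustrative data; "label" is not part of the analysis.
data = [
    {"x": 1.0, "y": 2.0, "label": "a"},
    {"x": None, "y": 3.0, "label": "b"},
    {"x": 4.0, "y": 5.0, "label": None},
]
usable = select_complete_rows(data, ["x", "y"])
```

In this sketch, only missing values in the selected columns matter: the third row is kept even though its unselected `label` column is empty.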

Input Ports

Input Spark DataFrame/RDD

Output Ports

The input DataFrame/RDD projected onto the principal components. Input columns that were not included in the principal component analysis are retained.
A DataFrame/RDD with the principal components matrix.
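The relationship between the two outputs can be sketched as a matrix product: each input row is multiplied by the principal components matrix to obtain its projected coordinates. The sketch below assumes the matrix has one row per input column and one column per principal component, and it applies no centering or scaling; both points are assumptions about the layout, not confirmed behaviour of this node. The function name `project` is hypothetical.

```python
def project(rows, components):
    """Project each row (length d) onto k principal components.

    components: d x k matrix as a list of d lists, where column j holds
    the j-th principal component (assumed layout).
    """
    d = len(components)
    k = len(components[0])
    projected = []
    for row in rows:
        if len(row) != d:
            raise ValueError("row length must match the number of matrix rows")
        projected.append(
            [sum(row[i] * components[i][j] for i in range(d)) for j in range(k)]
        )
    return projected


# Illustrative values: keep only the first of two input dimensions.
points = [[1.0, 2.0], [3.0, 4.0]]
first_axis = [[1.0], [0.0]]
reduced = project(points, first_axis)
```

Here each two-dimensional point is reduced to a single coordinate, because the illustrative matrix has only one column.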

Views

This node has no views.
