Spark Correlation Filter

This node uses the model as generated by a Correlation node to determine which columns are redundant (i.e. correlated) and filters them out. The output will contain the reduced set of columns.

The filtering step works roughly as follows: for each column in the correlation model, the node counts how many other columns are correlated with it, given a threshold value for the correlation coefficient (specified in the dialog). The column with the most correlated columns is chosen to "survive", and all columns correlated with it are filtered out. This procedure is repeated until no more correlated columns can be identified. Finding a minimum set of columns that satisfies the constraints is difficult to solve analytically; the method applied here is, however, known to be a good approximation.
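The greedy procedure described above can be sketched in a few lines of plain Python. This is an illustration only, not the node's actual source: the function name reduce_columns, the use of the absolute correlation value, the ">=" comparison, and the tie-breaking by lowest column index are all assumptions made for the example.

```python
def reduce_columns(names, corr, threshold):
    """Greedy sketch: keep the column with the most correlated partners,
    drop those partners, and repeat until no correlated pair is left."""
    active = set(range(len(names)))  # columns that are still undecided
    kept = set()                     # columns chosen to "survive"
    while True:
        # For every undecided column, list the undecided columns correlated with it.
        # Assumption: a pair counts as correlated when |r| >= threshold.
        partners = {
            i: [j for j in active if j != i and abs(corr[i][j]) >= threshold]
            for i in active
        }
        # Column with the most partners; ties go to the lowest index (assumption).
        best = max(sorted(active), key=lambda i: len(partners[i]), default=None)
        if best is None or not partners[best]:
            break  # no correlated pairs remain
        kept.add(best)
        active.discard(best)
        active.difference_update(partners[best])  # filter out its partners
    kept |= active  # everything left over is uncorrelated and stays
    return [names[i] for i in sorted(kept)]


# Hypothetical 4-column correlation matrix (symmetric, ones on the diagonal).
names = ["a", "b", "c", "d"]
corr = [
    [1.00, 0.95, 0.90, 0.10],
    [0.95, 1.00, 0.50, 0.10],
    [0.90, 0.50, 1.00, 0.10],
    [0.10, 0.10, 0.10, 1.00],
]
reduced = reduce_columns(names, corr, 0.8)
```

With this made-up matrix and a threshold of 0.8, column "a" has two correlated partners ("b" and "c"), so it survives and the partners are dropped; "d" is uncorrelated with everything and is kept as well.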

Options

Columns from Model: Displays the set of columns for which the model has information. These columns must also be present in the input data table. The (automatically) selected elements in the list will be present in the output table. This list cannot be edited.
Correlation Threshold: Choose the correlation threshold here. The higher the value, the fewer columns get filtered out. Press Enter or click the "Calculate" button to see a preview of the filtered columns. The counts of included vs. excluded columns are shown in the label.
Calculate: Click this button to update the statistics. It will determine the reduced set of columns using the procedure outlined above.
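
To see why a higher threshold filters out fewer columns, it can help to count how many column pairs qualify as correlated at different thresholds. The following is a minimal sketch with a made-up correlation matrix; the helper name correlated_pairs and the choice of comparing the absolute coefficient with ">=" are assumptions for illustration, not the node's documented behavior.

```python
def correlated_pairs(names, corr, threshold):
    """Return all column pairs whose absolute correlation reaches the threshold."""
    n = len(names)
    return [
        (names[i], names[j])
        for i in range(n)
        for j in range(i + 1, n)  # upper triangle only, each pair once
        if abs(corr[i][j]) >= threshold
    ]


# Hypothetical 3-column correlation matrix.
names = ["x", "y", "z"]
corr = [
    [1.00, 0.95, 0.85],
    [0.95, 1.00, 0.50],
    [0.85, 0.50, 1.00],
]

# Number of correlated pairs for a few threshold settings.
counts = {t: len(correlated_pairs(names, corr, t)) for t in (0.4, 0.8, 0.9, 0.99)}
```

As the threshold rises, fewer pairs count as correlated, so fewer columns become candidates for removal.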

Input Ports

The model from the correlation node.
Numeric input data to filter. It must contain the set of columns that were used to create the correlation model. (Typically you connect the input data from the correlation node here.)

Output Ports

Filtered data from input.

Views

This node has no views

Installation

To use this node in KNIME, install the extension KNIME Extension for Apache Spark (legacy) from the below update site following our NodePit Product and Node Installation Guide:

v5.3

A zipped version of the software site can be downloaded here.

Plugin provider: KNIME AG, Zurich, Switzerland

Plugin version: 5.3.0.v202406141453

On NodePit since: 2024-07-09

Last update: 2024-07-26

KNIME versions: Since v3.6
