Spark Correlation Filter

This node uses the model as generated by a Correlation node to determine which columns are redundant (i.e. correlated) and filters them out. The output will contain the reduced set of columns.

The filtering step works roughly as follows: For each column in the correlation model, the number of columns correlated with it is determined, given a threshold value for the correlation coefficient (specified in the dialog). The column with the most correlated columns is chosen to "survive" and all columns correlated with it are filtered out. This procedure is repeated until no more correlated columns can be identified. The problem of finding a minimum set of columns that satisfies the constraints is difficult to solve analytically. The method applied here is known to be a good approximation, however.
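
The greedy procedure described above can be sketched in plain Python. This is only an illustration, not the node's actual implementation: the function name `correlation_filter`, the dictionary-based correlation lookup, the use of the absolute coefficient with `>=`, and the tie-breaking rule (first column in input order wins) are all assumptions made for this sketch.

```python
def correlation_filter(columns, corr, threshold):
    """Greedy sketch: return the columns that survive the filter.

    columns   -- list of column names (hypothetical input format)
    corr      -- dict mapping frozenset({a, b}) to a correlation coefficient
    threshold -- assumption: a pair counts as correlated if |coef| >= threshold
    """
    remaining = list(columns)
    while True:
        # Find the remaining column with the most correlated partners.
        best, best_partners = None, []
        for col in remaining:
            partners = [
                other for other in remaining
                if other != col
                and abs(corr.get(frozenset((col, other)), 0.0)) >= threshold
            ]
            # Strict '>' means the first column wins on ties (an assumption).
            if len(partners) > len(best_partners):
                best, best_partners = col, partners
        if best is None:
            # No remaining column has a correlated partner: done.
            return remaining
        # 'best' survives; every column correlated with it is filtered out.
        dropped = set(best_partners)
        remaining = [c for c in remaining if c not in dropped]


# Made-up example data; missing pairs are treated as uncorrelated (0.0).
columns = ["A", "B", "C", "D"]
corr = {
    frozenset(("A", "B")): 0.90,
    frozenset(("A", "C")): 0.85,
    frozenset(("B", "C")): 0.50,
    frozenset(("C", "D")): 0.95,
}

kept_low = correlation_filter(columns, corr, 0.80)
kept_high = correlation_filter(columns, corr, 0.92)
print(kept_low)
print(kept_high)
```

With this made-up data, the lower threshold removes two columns and the higher threshold removes only one, which matches the rule that a higher threshold filters out fewer columns.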

Options

Columns from Model
Displays the set of columns for which the model has information. These columns must also be present in the input data table. The (automatically) selected elements in the list will be present in the output table. This list cannot be edited.
Correlation Threshold
Choose the correlation threshold here. The higher the value, the fewer columns get filtered out. Press Enter or click the "Calculate" button to see a preview of the filtered columns. The counts of included vs. excluded columns are shown in the label.
Calculate
Click this button to update the statistics. It will determine the reduced set of columns using the procedure outlined above.

Input Ports

The model from the correlation node.
Numeric input data to filter. It must contain the set of columns that were used to create the correlation model. (Typically you connect the input data from the correlation node here.)

Output Ports

Filtered data from input.

Views

This node has no views.
