Linear Correlation

Calculates a correlation coefficient for each pair of selected columns, i.e. a measure of the correlation of the two variables.

Which correlation measure is applied depends on the types of the underlying variables:
numeric <-> numeric: Pearson's product-moment coefficient. Missing values in a column are ignored in such a way that for the computation of the correlation between two columns only complete records are taken into account. For instance, if there are three columns A, B and C and a row contains a missing value in column A but not in B and C, then the row will be ignored for computing the correlation between (A, B) and (A, C). It will not be ignored for the correlation between (B, C). This corresponds to the function cor(<data.frame>, use="pairwise.complete.obs") in the R statistics package.
The value of this measure ranges from -1 (strong negative correlation) to 1 (strong positive correlation). A value of 0 represents no linear correlation (the columns might still be highly dependent on each other, though).
The p-value for these columns is the probability that uncorrelated variables would produce a correlation at least as extreme as the one observed, under the assumption that the test statistic follows a t-distribution with df degrees of freedom.
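The pairwise handling of missing values described above can be sketched in plain Python. This is a minimal illustration, not the node's implementation: the function name `pearson_pairwise`, the use of `None` for missing values, and the sample columns A, B and C are assumptions made for this example.

```python
import math

def pearson_pairwise(xs, ys):
    """Return (r, n): Pearson's r over rows where both values are present.

    None marks a missing value (an assumption of this sketch).
    r is None when the correlation is undefined.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None, n
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return None, n  # a constant column has no defined correlation
    return sxy / math.sqrt(sxx * syy), n

# Row 2 has a missing value in A only.
A = [1.0, None, 3.0, 4.0]
B = [2.0, 4.0, 6.0, 8.0]
C = [8.0, 6.0, 5.0, 1.0]

r_ab, n_ab = pearson_pairwise(A, B)  # row 2 is skipped: 3 complete records
r_bc, n_bc = pearson_pairwise(B, C)  # row 2 is used: 4 complete records
```

Because the set of complete records differs per pair, the coefficients in one output table can be based on different numbers of rows.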
nominal <-> nominal: Pearson's chi-square test on the contingency table. This value is then normalized to the range [0,1] using Cramér's V, where 0 represents no correlation and 1 a strong correlation. Missing values in nominal columns are treated as if they were a possible value of their own. If one of the two columns contains more possible values than specified in the dialog (default 50), the correlation will not be computed.
The p-value for these columns indicates the probability of independent variables showing a level of dependence at least as extreme. The value is the same as for a chi-square test of independence of variables in a contingency table.
Correlation measures for other pairs of columns are not available; they are represented by missing values in the output table and by crosses in the accompanying view.
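For the nominal <-> nominal case, the following sketch computes the chi-square statistic of the contingency table and normalizes it with the standard Cramér's V formula, V = sqrt(chi2 / (n * (k - 1))), where k is the smaller of the two category counts. The function name `cramers_v`, the parameter `max_values`, and the use of `None` for missing values are assumptions for illustration; the node's actual implementation may differ in detail.

```python
import math
from collections import Counter

def cramers_v(xs, ys, max_values=50):
    """Cramér's V for two nominal columns, or None if it is not computed.

    None (missing) is counted as a category of its own.
    """
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    # Too many distinct values in either column: skip the pair.
    if len(row_totals) > max_values or len(col_totals) > max_values:
        return None
    k = min(len(row_totals), len(col_totals))
    if n == 0 or k < 2:
        return None  # V is undefined for a single category
    observed = Counter(zip(xs, ys))
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))

v_dependent = cramers_v(["a", "a", "b", "b"], ["x", "x", "y", "y"])
v_independent = cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"])
v_with_missing = cramers_v(["a", None, "a", None], ["x", "y", "x", "y"])
```

In this example the first pair is perfectly associated, the second is independent, and the third shows that a missing value behaves like any other category.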

Options

Manual Selection

Include
This list contains the names of those columns in the input table to be included in the output table.
Exclude
This list contains the names of those columns in the input table to be excluded from the output table.
Filter
Use one of these fields to filter either the Include or Exclude list for certain column names or name substrings.
Buttons
Use these buttons to move columns between the Include and Exclude list. Single-arrow buttons will move all selected columns. Double-arrow buttons will move all columns (filtering is taken into account).
Enforce Inclusion
Select this option to enforce the current inclusion list to stay the same even if the input table specification changes. If some of the included columns are not available anymore, a warning is displayed. (New columns will automatically be added to the exclusion list.)
Enforce Exclusion
Select this option to enforce the current exclusion list to stay the same even if the input table specification changes. If some of the excluded columns are not available anymore, a warning is displayed. (New columns will automatically be added to the inclusion list.)

Wildcard/Regex Selection

Search Pattern
Type a search pattern which matches columns to move into the Include or Exclude list. Which list is used can be specified. You can use either Wildcards ('?' matching any character, '*' matching a sequence of any characters) or Regex. You can specify whether your pattern should be case sensitive.
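The difference between the two pattern styles can be illustrated with Python's standard library. This is only a sketch: the helper `match_columns` and the column names are hypothetical, it assumes that a pattern must match the whole column name, and `fnmatch` additionally gives `[...]` a special meaning that the dialog may not share.

```python
import fnmatch
import re

def match_columns(columns, pattern, use_regex=False, case_sensitive=True):
    """Return the column names matched by a wildcard or regex pattern."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # Wildcards are translated to an equivalent regular expression:
    # '?' becomes "any single character", '*' becomes "any sequence".
    regex = pattern if use_regex else fnmatch.translate(pattern)
    compiled = re.compile(regex, flags)
    return [name for name in columns if compiled.fullmatch(name)]

columns = ["temp_1", "temp_2", "Temp_10", "pressure"]

single_char = match_columns(columns, "temp_?")
any_case = match_columns(columns, "temp_*", case_sensitive=False)
by_regex = match_columns(columns, r"temp_\d+", use_regex=True)
```

Here `temp_?` matches only names with exactly one character after the underscore, while `temp_*` with case sensitivity switched off also picks up `Temp_10`.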
Output column pairs
Select which column pairs of the selected columns should be included in the correlation measure table. If only compatible column pairs are included, numeric <-> nominal pairs will be excluded. If only pairs with a valid correlation are included, all pairs for which the correlation cannot be computed are excluded.
Possible Values Count
Select an upper bound for the number of possible values for each of the nominal columns. If more values are encountered in a nominal column, the column will be ignored (no correlation values will be computed).
p-value
Select which p-value should be computed for Pearson's product-moment coefficient.
  • "two-sided" corresponds to the probability of obtaining a correlation value that is at least as extreme as the observed correlation.
  • "one-sided (right)" corresponds to the probability of obtaining a correlation value that shows even greater positive association.
  • "one-sided (left)" corresponds to the probability of obtaining a correlation value that shows even greater negative association.
Note that the p-value for Pearson's chi-square test is always one-sided.
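The three p-value options can be related to each other with a small sketch. It assumes the usual test statistic t = r * sqrt(df / (1 - r^2)) with df = n - 2 degrees of freedom, and it evaluates the t-distribution by numerical integration (Simpson's rule) to stay within the standard library. The function names are hypothetical, and the integration is only adequate for moderate values of t.

```python
import math

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t), via Simpson's rule on [0, |t|] and symmetry around 0."""
    a = abs(t)
    if a == 0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t > 0 else 0.5 - area

def pearson_p_values(r, n):
    """p-values for an observed correlation r computed from n complete rows."""
    df = n - 2  # assumption: the usual degrees of freedom for this test
    if df < 1 or not -1 < r < 1:
        raise ValueError("need n >= 3 and -1 < r < 1")
    t = r * math.sqrt(df / (1 - r * r))
    cdf = t_cdf(t, df)
    return {
        "two-sided": 2 * min(cdf, 1 - cdf),
        "one-sided (right)": 1 - cdf,  # even greater positive association
        "one-sided (left)": cdf,       # even greater negative association
    }

p = pearson_p_values(math.sqrt(0.5), 3)  # gives t = 1 with df = 1
```

For a positive correlation the right-sided p-value is half the two-sided one, and the left-sided p-value is its complement.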

Input Ports

Numeric input data to evaluate

Output Ports

Correlation variables, p-values and degrees of freedom.
Correlation variables in a matrix representation.
A model containing the correlation measures. This model is appropriate to be read by the Correlation Filter node.

Views

Correlation Matrix
Square table view showing the pair-wise correlation values of all columns. The color range varies from dark red (strong negative correlation) over white (no correlation) to dark blue (strong positive correlation). If a correlation value for a pair of columns is not available, the corresponding cell contains a missing value (shown as a cross in the color view).
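A color scale of this kind can be sketched as a linear interpolation from white towards one of two end colors. The RGB values chosen for "dark red" and "dark blue", the linear blend, and the function name `correlation_color` are assumptions for illustration; the view's actual colors are not specified here.

```python
DARK_RED = (139, 0, 0)    # assumed end color for a correlation of -1
DARK_BLUE = (0, 0, 139)   # assumed end color for a correlation of +1
WHITE = (255, 255, 255)   # no correlation

def correlation_color(value):
    """Map a correlation in [-1, 1] to an RGB tuple; None stays None."""
    if value is None:
        return None  # unavailable value, drawn as a cross in the view
    v = max(-1.0, min(1.0, value))
    end = DARK_BLUE if v > 0 else DARK_RED
    weight = abs(v)
    # Blend each channel from white (weight 0) to the end color (weight 1).
    return tuple(round(w + (e - w) * weight) for w, e in zip(WHITE, end))

no_corr = correlation_color(0.0)
strong_pos = correlation_color(1.0)
strong_neg = correlation_color(-1.0)
```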
