# Correlation Extension

The Correlation Extension node is designed to take an 'Input Correlation Matrix' and intelligently extend it using a list of 'Input Correlation Pairs'.

The Correlation Matrix represents the degree of Horizontal Differentiation between Features, Benefits, Attributes, Levels, and Products. The Correlation Matrix may be used by a downstream node (such as the Matrix Distributions node or the Feature Generation node) to generate a set of Customer Distributions comprising the Willingness To Pay (WTP) of individual Virtual Customers.

For example, the 'Input Correlation Matrix' may be a 3x3 matrix of correlation values (doubles between -1.0 and +1.0) with row names and column names of 'A', 'B', and 'C'. The names A, B, and C may be Features or Products. The matrix describes all the correlations between Customer Distribution A, Customer Distribution B, and Customer Distribution C.

The list of 'Input Correlation Pairs' contains additional relationships used to extend (grow) the original 'Input Correlation Matrix'. These input correlation values change the Correlation Matrix in two ways:
(a) Replace existing correlations already found in the Correlation Matrix, and
(b) Add new rows of correlations to the Correlation Matrix.

For example, if the 'Input Correlation Pairs' list contained the relationship ('A', 'B', 0.7) then the node would simply replace the existing (single) correlation value already found in the original Correlation Matrix.

However, if the 'Input Correlation Pairs' list contained the relationship ('A', 'X', 0.5) then the node would add a whole row to the Correlation Matrix with values for A:X, B:X, and C:X. The correlation for the pair A:X would be set to the 0.5 value found in the 'Input Correlation Pairs' relationship. But the correlation values for B:X and C:X would also be set by multiplying 0.5 by the existing correlations for A:B and A:C.
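The row-extension rule described above can be sketched in a few lines of Python. This is only an illustration of the stated multiplication rule, not the node's actual implementation: the function names `make_symmetric` and `extend_with_pair`, the dictionary layout, and the starting correlations (A:B = 0.8, A:C = 0.6, B:C = 0.4) are all assumptions made for the example.

```python
def make_symmetric(names, upper):
    """Build a full symmetric correlation dict from upper-triangle entries."""
    corr = {(n, n): 1.0 for n in names}
    for (a, b), value in upper.items():
        corr[(a, b)] = value
        corr[(b, a)] = value
    return corr


def extend_with_pair(names, corr, known, new, value):
    """Add distribution `new`, given only its correlation with `known`.

    Hypothetical helper: every other correlation in the new row is
    estimated as value * corr[(known, other)], as described in the text.
    """
    for other in names:
        if other == known:
            estimate = value
        else:
            estimate = value * corr[(known, other)]
        corr[(other, new)] = estimate
        corr[(new, other)] = estimate
    corr[(new, new)] = 1.0
    return names + [new], corr


# Illustrative 3x3 input matrix (values are made up for this example).
names = ["A", "B", "C"]
corr = make_symmetric(names, {("A", "B"): 0.8, ("A", "C"): 0.6, ("B", "C"): 0.4})

# Apply the relationship ('A', 'X', 0.5).
names, corr = extend_with_pair(names, corr, known="A", new="X", value=0.5)
print(corr[("A", "X")], corr[("B", "X")], corr[("C", "X")])
```

With these illustrative inputs, A:X takes the supplied 0.5, while B:X and C:X come out as 0.5 × 0.8 and 0.5 × 0.6 respectively.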

The user does not have to specify each correlation in every new row of the Correlation Matrix - the node will do this automatically. But if the user wishes to specify each correlation value themselves, then they can do so by adding additional rows to the bottom of the 'Input Correlation Pairs' list. The node processes this list of relationships row-by-row, so after adding a new row for [A:X, B:X, C:X] the user could override the calculated correlations for B:X and C:X.

Multiple relationships are used when filling in missing correlation values. For example, if the list contained the relationships ('A', 'X', 0.5) and ('B', 'X', 0.4) then the missing correlation for the pair C:X would be blended from two chained estimates: one through A (A:C combined with A:X) and one through B (B:C combined with B:X).
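As a rough illustration of blending, the sketch below computes one chained estimate per known relationship and then combines them. The text does not state how the node blends the estimates, so the plain average used here is purely an assumption, as are the function name `blended_estimate` and the existing correlations A:C = 0.6 and B:C = 0.4.

```python
def blended_estimate(corr, anchors, target):
    """Estimate the correlation between `target` and a new distribution.

    corr:    dict mapping (row, col) -> existing correlation
    anchors: dict mapping an existing name -> its known correlation
             with the new distribution
    ASSUMPTION: the chained estimates are combined with a plain average;
    the node's real blending rule is not documented here.
    """
    estimates = [value * corr[(anchor, target)] for anchor, value in anchors.items()]
    return sum(estimates) / len(estimates)


# Illustrative existing correlations with C (made-up values).
corr = {("A", "C"): 0.6, ("B", "C"): 0.4}

# Known relationships ('A', 'X', 0.5) and ('B', 'X', 0.4).
c_x = blended_estimate(corr, {"A": 0.5, "B": 0.4}, "C")
print(c_x)
```

Here the estimate through A is 0.5 × 0.6 and the estimate through B is 0.4 × 0.4; under the averaging assumption, C:X lands midway between them.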

Note that the 'Input Correlation Matrix' will be converted into a clean and symmetrical Correlation Matrix when first loaded. That means: (a) the diagonal A:A, B:B, C:C correlations will be set to 1.0; (b) correlation values will be range-limited to between -1.0 and +1.0; (c) missing correlations will be set to 0.0; and (d) the correlation for A:B will be set the same as the correlation for B:A (hence lower-left-triangle and upper-right-triangle correlation matrices can be input).
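The four clean-up steps can be sketched as follows. This is a minimal illustration rather than the node's code: `clean_matrix` and the dictionary layout are hypothetical, and where both A:B and B:A are supplied the sketch keeps the highest non-zero value, which is one reading of the rule given in the Input Ports description.

```python
def clean_matrix(names, raw):
    """Return a clean, symmetric correlation dict.

    names: list of distribution names
    raw:   dict mapping (row, col) -> value; entries may be missing
    """
    def limited(value):
        # (c) missing values become 0.0; (b) others are clipped to [-1, +1]
        if value is None:
            return 0.0
        return max(-1.0, min(1.0, float(value)))

    clean = {}
    for i, a in enumerate(names):
        clean[(a, a)] = 1.0  # (a) force the diagonal to 1.0
        for b in names[i + 1:]:
            candidates = [v for v in (limited(raw.get((a, b))),
                                      limited(raw.get((b, a)))) if v != 0.0]
            value = max(candidates) if candidates else 0.0
            clean[(a, b)] = value  # (d) mirror the value across the diagonal
            clean[(b, a)] = value
    return clean


# Lower-left triangle input with a bad diagonal, an out-of-range value,
# and a missing C:B entry (all values are made up for this example).
raw = {("A", "A"): 0.9, ("B", "A"): -0.3, ("C", "A"): 1.7}
clean = clean_matrix(["A", "B", "C"], raw)
print(clean[("A", "B")], clean[("A", "C")], clean[("B", "C")])
```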

The correlations found in the list of 'Input Correlation Pairs' are not range-limited when first loaded and can be set outside the [-1.0 to +1.0] limit to boost the correlation with existing entries. However, the final correlations found in the output tables are always range-limited to between -1.0 and +1.0.
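A small numeric sketch of this boosting effect, assuming the same multiplication rule used when a new row is added and an illustrative existing correlation A:B = 0.7 (the helper `range_limit` and all values are assumptions for the example):

```python
def range_limit(value):
    """Clip a correlation to the interval [-1.0, +1.0]."""
    return max(-1.0, min(1.0, value))


a_b = 0.7          # existing correlation in the matrix (made-up value)
a_x_input = 1.2    # pair value deliberately set above +1.0

# The out-of-range input is used as-is when estimating B:X ...
b_x = range_limit(a_x_input * a_b)

# ... but A:X itself is range-limited in the output.
a_x = range_limit(a_x_input)

print(a_x, b_x)
```

Under these assumptions B:X is boosted above the 0.7 it would receive from an input of 1.0, while A:X is still reported as 1.0 in the output.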

The purpose of this node is to allow the user to quickly extend an existing Correlation Matrix with new rows given limited available data. For example, a Willingness To Pay (WTP) Matrix may have been calculated by an upstream node for the user's own Products, and now the user wishes to add Competitive Products to this WTP Matrix. The user may know each Competitive Product is a 'Perfect Match' (correlation = 0.9) or a 'Near Match' (correlation = 0.7) to one of their own Products ('Perfect Match' Products tend not to be perfectly correlated as the buying experience from the Competitor's store will still be different). But the user may not know the correlation between the matched Competitive Product and all of the other Products in the Market. If the matched Product is also the most similar Product, then this node can approximate the correlations to all other Products.

More Help: Examples and sample workflows can be found at the Scientific Strategy website: www.scientificstrategy.com.

## Input Ports

Input Correlation Pairs: The input set of correlations as a list of pairs. Each pair should quantify the correlation between a single row and a single column for all unique row-column combinations for the Output Correlation Matrix. The Input Correlation Pairs should include the following columns:
1. From Distribution (string): The name of the first Customer Distribution for a row/column within the Output Correlation Matrix. This name may or may not correspond to a Customer Distribution name in the 'Input Correlation Matrix'.
2. To Distribution (string): The name of the second Customer Distribution for a column/row within the Output Correlation Matrix. This name also may or may not correspond to a Customer Distribution name in the 'Input Correlation Matrix'. If both the 'From Distribution' and the 'To Distribution' names already exist in the 'Input Correlation Matrix' then the 'Correlation' value found in this list will replace the existing value. If just one name exists (typical) then a new row of correlation values will be added to the Correlation Matrix using this 'Correlation' value and the existing correlations already found in the Matrix. If neither name exists then two new rows will be created, but all correlations with existing Customer Distributions will be set to zero.
3. Correlation (double): The degree of correlation between the first Customer Distribution and the second Customer Distribution. Each relationship in the 'Input Correlation Pairs' is processed row-by-row, so correlation values found lower down in the input table will replace earlier correlations.

Input Correlation Matrix: The input set of correlations that define the relationship between Customer Distributions of the same name. The Correlation Matrix must be symmetrical such that the number of data rows matches the number of columns. Each row Distribution Name should be unique and correspond to a column of the same name. The Input Correlation Matrix should include the following columns:
1. Distribution (string): The name of the Customer Distribution. This name should correspond to a column of the same name in the same Input Correlation Matrix. The Distribution column can have any name. If multiple string columns are found then the first column is treated as the Distribution name column and the other string columns are ignored. If no string columns are found then the RowID column is treated as the Distribution name column.
2. Correlation Values (double): The correlation value between each Customer Distribution row and each Customer Distribution column. As the Correlation Matrix is expected to be symmetrical, each row-column value should be the same as each column-row value. If different correlations are provided for A:B and B:A then the highest non-zero correlation will be used. Lower-left or upper-right triangle matrices can also be used. The diagonal values should all be equal to 1.0.
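The three cases described for the 'From Distribution' and 'To Distribution' names (both exist, one exists, neither exists) can be sketched as a simple row-by-row pass over the pairs list. The function `plan_actions` and its action labels are hypothetical; the sketch only classifies each pair and assumes the two names in a pair are different.

```python
def plan_actions(matrix_names, pairs):
    """Classify each input pair in the order the list is processed.

    matrix_names: names already present in the Input Correlation Matrix
    pairs:        list of (from_name, to_name, correlation) tuples
    """
    names = list(matrix_names)
    actions = []
    for from_name, to_name, _value in pairs:
        missing = [n for n in (from_name, to_name) if n not in names]
        if not missing:
            action = "replace existing correlation"
        elif len(missing) == 1:
            action = "add one new row"
        else:
            action = "add two new rows"
        actions.append((from_name, to_name, action))
        # Names added by this pair are known to all later rows.
        names.extend(missing)
    return actions


pairs = [
    ("A", "B", 0.7),   # both names exist
    ("A", "X", 0.5),   # only A exists
    ("B", "X", 0.4),   # X was added by the previous row
    ("Y", "Z", 0.3),   # neither name exists
]
for row in plan_actions(["A", "B", "C"], pairs):
    print(row)
```

Because the list is processed in order, the third pair is treated as a replacement: X already exists by the time that row is reached, which is how a later row can override a correlation calculated by an earlier one.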

## Output Ports

Output Correlation Matrix: The output set of correlations that define the relationship between all Customer Distributions. The Correlation Matrix will be symmetrical such that the number of data rows matches the number of columns. Each row Distribution Name will be unique and correspond to a column of the same name. The Output Correlation Matrix will contain these columns:
1. Distribution: The row name of the first Customer Distribution within the Output Correlation Matrix.
2. Correlated Distributions: The column name of the second Customer Distribution within the Output Correlation Matrix, along with the degree of correlation to the row Customer Distribution. Output correlations will be symmetrical and range-limited to between -1.0 and +1.0.

Output Correlation Repaired Matrix: The repaired output set of correlations that define the relationship between Customer Distributions. Repairing is required when the correlations are unrealistic. For example, if A is highly correlated with B (for example, A:B = +0.99) and A is highly correlated with C (for example, A:C = +0.99) then B must be highly correlated with C (that is, B:C >> 0.0). More precisely, the Correlation Matrix must be positive definite, meaning all of its Eigenvalues must be positive. Note that it is not necessary for downstream nodes that generate Customer Distributions (such as the Matrix Distributions node or the Feature Generation node) to use this Correlation Repaired Matrix, as these downstream nodes will always first self-repair the Input Correlation Matrix. The Output Correlation Repaired Matrix will contain the same columns as the Output Correlation Matrix:
1. Distribution: The row name of the first Customer Distribution within the Output Correlation Repaired Matrix.
2. Correlated Distributions: The column name of the second Customer Distribution within the Output Correlation Repaired Matrix, along with the repaired degree of correlation to the row Customer Distribution. Output correlations will be symmetrical and range-limited to between -1.0 and +1.0.

Output Correlation Error Matrix: The difference between the Output Correlation Matrix and the Output Correlation Repaired Matrix. This is a convenience output to show how the Correlation Matrix needs to be repaired before Customer Distributions can be generated. The Output Correlation Error Matrix will contain the same columns as the Output Correlation Matrix:
1. Distribution: The row name of the first Customer Distribution within the Output Correlation Error Matrix.
2. Correlated Distributions: The column name of the second Customer Distribution within the Output Correlation Error Matrix, along with the difference between the output correlation and the repaired correlation.
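To see why the A:B = +0.99, A:C = +0.99 example needs repairing, the sketch below tests whether a matrix is positive definite by attempting a Cholesky factorization, which succeeds only when every pivot is positive. It does not reproduce the node's repair algorithm, which is not described here; the function `is_positive_definite` and the B:C values are assumptions chosen for illustration.

```python
import math


def is_positive_definite(matrix):
    """Return True if a symmetric matrix admits a Cholesky factorization."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - s
                if pivot <= 0.0:
                    return False  # a non-positive pivot means not positive definite
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return True


# Rows and columns are ordered A, B, C.
unrealistic = [
    [1.00, 0.99, 0.99],
    [0.99, 1.00, 0.00],   # B:C = 0.0 contradicts the two 0.99 values
    [0.99, 0.00, 1.00],
]
realistic = [
    [1.00, 0.99, 0.99],
    [0.99, 1.00, 0.98],   # B:C is high, consistent with A:B and A:C
    [0.99, 0.98, 1.00],
]
print(is_positive_definite(unrealistic), is_positive_definite(realistic))
```

The first matrix fails the check and would need repairing before Customer Distributions could be generated from it; the second passes as-is.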

## Views

This node has no views.