Mahalanobis Distance

The Mahalanobis distance is a metric that measures the distance between two data points while taking into account the variances and covariances of the selected variables.
It is defined as
$$d(x, y) = \sqrt{(x - y)^{T} S^{-1} (x - y)}$$
where x and y are two random vectors drawn from the same distribution with covariance matrix S.
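
The definition can be illustrated with a minimal Python sketch. It assumes NumPy is available; the function name `mahalanobis` and the sample values are chosen for illustration only and are not part of this node.

```python
import numpy as np

def mahalanobis(x, y, S):
    """Mahalanobis distance between vectors x and y for covariance matrix S."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solve S z = diff rather than forming the explicit inverse of S.
    z = np.linalg.solve(np.asarray(S, dtype=float), diff)
    return float(np.sqrt(diff @ z))

# Illustrative covariance: variance 4 along the first axis, 1 along the second.
S = np.array([[4.0, 0.0],
              [0.0, 1.0]])

# A Euclidean offset of 2 along the high-variance axis is one standard deviation.
d = mahalanobis([2.0, 0.0], [0.0, 0.0], S)
```

Here the two points are 2 units apart in the Euclidean sense, but since the standard deviation along that axis is also 2, the Mahalanobis distance is 1.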

Explanation:
Assume that we have a data set in a 2-dimensional Euclidean space and want to estimate the probability that a point P1 = (x, y) belongs to this set. Intuitively, the closer P1 is to the center of mass of the set, the more likely it is to belong to it. We also have to consider the spread of the data: a data set with correlated variables forms an ellipse around the center of mass in 2-dimensional Euclidean space. The probability that a test point belongs to the set therefore also depends on the direction of the axes of that ellipse (or ellipsoid in an N-dimensional Euclidean space). The ellipsoid that best represents the set's probability distribution can be estimated from the covariance matrix of the samples, which is what the Mahalanobis distance uses.
If the covariance matrix is the identity matrix, the variables of the data set are uncorrelated with unit variance, and the Mahalanobis distance reduces to the Euclidean distance.
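
The identity-matrix case can be checked numerically. The following is a small sketch assuming NumPy; the vectors are arbitrary illustrative values.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])
S = np.eye(3)  # identity covariance: uncorrelated variables with unit variance

diff = x - y
# Mahalanobis distance: sqrt(diff^T S^-1 diff)
d_mahalanobis = float(np.sqrt(diff @ np.linalg.solve(S, diff)))
# Euclidean distance: the ordinary vector norm of the difference
d_euclidean = float(np.linalg.norm(diff))
```

Both values agree, because multiplying by the inverse of the identity matrix leaves the difference vector unchanged.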

Use case:
A typical use case is outlier detection. Outliers are, intuitively, points whose Mahalanobis distance is very high compared with that of the other points in the data set.
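
As a rough sketch of this use case, the snippet below estimates the center and covariance from a small reference set and scores two candidate points against it. It assumes NumPy; the data, the variable names, and the cutoff of 3.0 are illustrative assumptions and are not settings of this node.

```python
import numpy as np

# Illustrative reference data: two strongly correlated variables.
reference = np.array([
    [0.0, 0.0],
    [1.0, 1.1],
    [2.0, 1.9],
    [3.0, 3.2],
    [4.0, 3.8],
    [5.0, 5.1],
])

center = reference.mean(axis=0)
S = np.cov(reference, rowvar=False)  # sample covariance of the columns

# Two candidates at a similar Euclidean distance from the center:
# the first follows the correlation, the second lies off the main axis.
candidates = np.array([
    [4.0, 4.1],
    [4.0, 1.0],
])

diffs = candidates - center
# Row-wise sqrt(diff^T S^-1 diff), again solving instead of inverting S.
solved = np.linalg.solve(S, diffs.T).T
distances = np.sqrt(np.sum(diffs * solved, axis=1))

threshold = 3.0  # hypothetical cutoff, chosen only for this example
is_outlier = distances > threshold
```

The second candidate is flagged even though its Euclidean distance to the center is about the same as that of the first, because it lies across the direction in which the data barely varies.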

Options

Column Selection
Choose the numeric columns for which the distance is defined.

Input Ports

Input data.
Optional covariance input table. The matrix must be square, and its rows and columns must refer to the same variables. If this port is unconnected, the covariance matrix is computed from the selected input columns.

Output Ports

The configured distance.
The computed covariance matrix.

Views

This node has no views.
