t-SNE (L. Jonsson)

t-SNE is a manifold learning technique that learns low dimensional embeddings for high dimensional data. It is most often used for visualization because it preserves local relationships between data points and can therefore capture nonlinear structure in the data. Unlike other dimension reduction techniques such as PCA, a learned t-SNE model cannot be applied to new data. The t-SNE algorithm can be roughly summarized as two steps:

  1. Create a probability distribution capturing the relationships between points in the high dimensional space
  2. Find a low dimensional embedding whose corresponding probability distribution resembles the high dimensional one as closely as possible
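
In more concrete terms (following the notation of the original paper, shown here for orientation only): step 1 converts pairwise distances between input points x_i into joint probabilities

    p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n},

where each bandwidth \sigma_i is chosen so that the conditional distribution at x_i has the user-specified perplexity. Step 2 measures similarities between embedding points y_i with a heavy-tailed Student-t kernel,

    q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}},

and moves the y_i by gradient descent to minimize the Kullback-Leibler divergence

    C = \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}.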

For further details, check out this blog post or the original paper. The implementation of this node is based on T-SNE-Java by Leif Jonsson.

Disclaimer:

Depending on the size of the input table, the computation of t-SNE can be very expensive, both in terms of runtime and memory. If you experience memory problems, try to reduce the size of your data, e.g. by using the Row Sampling node. If you have very high-dimensional data, it is also advisable to first reduce the number of dimensions to around 50, e.g. via PCA.
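
If you want to prototype this preprocessing outside of KNIME, a minimal sketch in Python using NumPy and scikit-learn (used here purely for illustration; the node itself is based on T-SNE-Java, and the data below is made up) could look like this:

    # Recommended preprocessing before t-SNE: row sampling, then PCA to
    # around 50 dimensions. scikit-learn is used for illustration only.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(seed=42)
    X = rng.standard_normal((20000, 784))  # toy data: 20000 rows, 784 features

    # 1. Row sampling: keep a random subset to bound runtime and memory.
    idx = rng.choice(X.shape[0], size=5000, replace=False)
    X_sampled = X[idx]

    # 2. PCA: reduce the very high-dimensional data to around 50 components.
    X_reduced = PCA(n_components=50).fit_transform(X_sampled)
    # X_reduced (5000 x 50) is now a much cheaper input for t-SNE.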

Options

Columns
Select the columns to be used by t-SNE, i.e. the original features. Note that currently only numerical columns are supported.
Dimension(s) to reduce to
The number of dimensions of the target embedding (for visualization, typically 2 or 3).
Iterations
The number of learning iterations to perform. Too few iterations might result in a bad embedding, while too many iterations unnecessarily increase the runtime.
Theta
Controls the tradeoff between runtime and accuracy of the Barnes-Hut approximation algorithm for t-SNE. Lower values result in a more accurate approximation at the cost of higher runtimes and memory demands. A theta of zero results in the originally proposed t-SNE algorithm. However, for most datasets a theta of 0.5 does not result in a perceivable loss of quality.
Perplexity
Informally, the perplexity is the number of effective neighbors considered for each data point. Small perplexities focus more on local structure, while larger perplexities take more global relationships into account. In most cases, values in the range [5, 50] are sufficient.
Note: The perplexity must be less than or equal to (number of rows - 1) / 3, e.g. at most 33 for a table with 100 rows.
Number of threads
The number of threads used for parallel computation. The default is the number of cores of your machine and usually doesn't require tuning. Note that no parallelization is used if theta is zero, because the exact t-SNE algorithm isn't parallelizable.
Remove original data columns
Check this box if you want to remove the columns used to learn the embedding from the output table.
Fail if missing values are encountered
If this box is checked, the node fails if it encounters a missing value in one of the columns used for learning. Otherwise, rows containing missing values in the learning columns are ignored during learning, and their embedding consists of missing values.
Seed
Allows you to specify a static seed to enable reproducible results.
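
To make the options above more tangible, the following sketch shows how they correspond, roughly, to the parameters of scikit-learn's TSNE. This is an analogy for orientation only: this node is based on T-SNE-Java, does not expose this API, and the two implementations differ in parameter names and defaults.

    # Rough mapping from the node's options to scikit-learn's TSNE
    # parameters, for orientation only (not this node's actual API).
    import numpy as np
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(seed=42)
    X = rng.standard_normal((300, 50))  # toy data: 300 rows, 50 features

    perplexity = 30.0
    assert perplexity <= (X.shape[0] - 1) / 3  # the constraint noted above

    embedding = TSNE(
        n_components=2,         # "Dimension(s) to reduce to"
        n_iter=1000,            # "Iterations" (renamed max_iter in newer versions)
        angle=0.5,              # "Theta" (Barnes-Hut runtime/accuracy tradeoff)
        perplexity=perplexity,  # "Perplexity"
        n_jobs=-1,              # "Number of threads" (-1 = all available cores)
        random_state=42,        # "Seed"
        method="barnes_hut",    # theta = 0 corresponds to method="exact"
    ).fit_transform(X)
    # embedding has shape (300, 2): the low dimensional embedding.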

Input Ports

Input port for the data for which a low dimensional embedding should be learned

Output Ports

The low dimensional embedding

Views

This node has no views
