t-SNE (L. Jonsson)

t-SNE is a manifold learning technique that learns low-dimensional embeddings for high-dimensional data. It is most often used for visualization because it exploits the local relationships between datapoints and can therefore capture nonlinear structures in the data. Unlike other dimensionality reduction techniques such as PCA, a learned t-SNE model cannot be applied to new data. The t-SNE algorithm can be roughly summarized in two steps:

  1. Create a probability distribution capturing the relationships between points in the high-dimensional space
  2. Find a low-dimensional embedding whose distribution resembles that probability distribution as closely as possible (sketched in the equations below)
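
As a sketch of the standard formulation from the original van der Maaten and Hinton paper (not necessarily the exact form used by this node's implementation): step 1 converts pairwise distances between high-dimensional points x_i into joint probabilities p_ij using Gaussian kernels, whose bandwidths sigma_i are chosen to match the user-specified perplexity; step 2 positions low-dimensional points y_i so that their Student-t based probabilities q_ij match, by minimizing the Kullback-Leibler divergence:

    p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}
                   {\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)},
    \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

    q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}
                  {\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}

    C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}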
For further details, see the original t-SNE paper by van der Maaten and Hinton. The implementation of this node is based on T-SNE-Java by Leif Jonsson.

Disclaimer:

Depending on the size of the input table, the computation of t-SNE can be very expensive in terms of both runtime and memory. If you run into memory problems, try to reduce the size of your data, e.g. with the Row Sampling node. If your data is very high-dimensional, it is also advisable to first reduce the number of dimensions to around 50, e.g. with a PCA; both steps are sketched below.
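
A minimal sketch of this preprocessing in Python with scikit-learn (an assumption for illustration; inside KNIME the Row Sampling and PCA nodes play the same roles):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(42)
    X = rng.normal(size=(10_000, 300))  # stand-in for a large, high-dimensional table

    # Row sampling: keep a random subset of rows to cut runtime and memory.
    rows = rng.choice(X.shape[0], size=2_000, replace=False)
    X_small = X[rows]

    # PCA: reduce very high-dimensional data to around 50 features first.
    X_reduced = PCA(n_components=50).fit_transform(X_small)
    print(X_reduced.shape)  # (2000, 50) -- a much cheaper input for t-SNE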

Options

Columns
Select the columns to be included in t-SNE, i.e. the original features. Note that currently only numerical columns are supported.
Dimension(s) to reduce to
The number of dimensions of the target embedding (typically 2 or 3 for visualization).
Iterations
The number of learning iterations to perform. Too few iterations may result in a poor embedding, while too many needlessly increase the runtime.
Theta
Controls the trade-off between runtime and accuracy of the Barnes-Hut approximation algorithm for t-SNE. Lower values yield a more accurate approximation at the cost of higher runtime and memory demands. A theta of zero corresponds to the originally proposed exact t-SNE algorithm; for most datasets, however, a theta of 0.5 causes no perceivable loss of quality.
Perplexity
Informally, the perplexity is the number of neighbors considered for each datapoint. Small perplexities focus on local structure, while larger perplexities take more global relationships into account. In most cases, values in the range [5, 50] are sufficient.
Note: The perplexity must be less than or equal to (number of rows - 1) / 3; for example, a table with 151 rows supports a perplexity of at most 50.
Number of threads
Number of threads used for parallel computation. The default is set to the number of cores your computer has and usually doesn't require tuning. Note that no parallelization is used if theta is zero because the exact t-SNE algorithm isn't parallelizable.
Remove original data columns
Check this box if you want to remove the columns used to learn the embedding.
Fail if missing values are encountered
If this box is checked, the node fails if it encounters a missing value in one of the columns used for learning. Otherwise, rows containing missing values in the learning columns are ignored during learning, and their embedding consists of missing values.
Seed
Allows you to specify a static seed to enable reproducible results.
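
As a hedged illustration of how these options interact, here is a sketch using scikit-learn's Barnes-Hut t-SNE, whose parameters map closely onto the options above (an assumption for clarity; the node itself is backed by T-SNE-Java, not scikit-learn):

    from sklearn.manifold import TSNE

    tsne = TSNE(
        n_components=2,     # "Dimension(s) to reduce to"
        n_iter=1000,        # "Iterations" (renamed max_iter in newer scikit-learn)
        angle=0.5,          # "Theta"; in the node, 0 means the exact algorithm
        perplexity=30.0,    # "Perplexity"; must satisfy the row-count constraint above
        n_jobs=4,           # "Number of threads"
        random_state=1234,  # "Seed" for reproducible results
    )
    embedding = tsne.fit_transform(X_reduced)  # e.g. the PCA output sketched earlier
    print(embedding.shape)  # (2000, 2)

Note that, matching the caveat in the introduction, this estimator offers no transform for unseen rows: the embedding exists only for the data it was fitted on.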

Input Ports

Input port for the data for which a low-dimensional embedding should be learned.

Output Ports

The low-dimensional embedding.

Views

This node has no views
