Topic Extractor (Parallel LDA)

A simple parallel, multi-threaded implementation of Latent Dirichlet Allocation (LDA), following Newman, Asuncion, Smyth and Welling, "Distributed Algorithms for Topic Models", JMLR (2009), with the SparseLDA sampling scheme and data structure from Yao, Mimno and McCallum, "Efficient Methods for Topic Model Inference on Streaming Document Collections", KDD (2009).

The node uses the topic modeling library from "MALLET: A Machine Learning for Language Toolkit". Note: The current version of MALLET contains a known multi-threading bug that can cause the node to fail with an ArrayIndexOutOfBoundsException. If you encounter this issue, set the number of threads to 1, which should avoid the problem.
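The partition-and-merge idea from the Newman et al. paper cited above can be illustrated with a minimal sketch. It assumes each thread samples on its own copy of the word-topic counts and that the copies are then reconciled by adding every thread's change to the shared counts. The function name `merge_counts` and the nested-list layout are hypothetical, chosen for illustration; this is not MALLET's code.

```python
from typing import List

# counts[word][topic] = number of tokens of `word` currently assigned to `topic`
Counts = List[List[int]]


def merge_counts(global_counts: Counts, local_counts: List[Counts]) -> Counts:
    """Reconcile per-thread counts: N + sum over threads p of (N_p - N).

    Assumption: every thread started its sampling pass from `global_counts`
    and only changed assignments for tokens in its own document partition.
    """
    merged = [row[:] for row in global_counts]  # copy, leave the input untouched
    for local in local_counts:
        for w, row in enumerate(local):
            for k, value in enumerate(row):
                # add only the change this thread made to the shared count
                merged[w][k] += value - global_counts[w][k]
    return merged


# Two words, two topics, two threads.
shared = [[2, 0], [1, 1]]
thread_a = [[1, 1], [1, 1]]  # moved one token of word 0 from topic 0 to topic 1
thread_b = [[2, 0], [2, 0]]  # moved one token of word 1 from topic 1 to topic 0
merged = merge_counts(shared, [thread_a, thread_b])
```

Because each thread only contributes its own changes, the total number of tokens is the same before and after the merge.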

Options

Document column: The column that contains the pre-processed document.
Seed: The seed for the random number generator.
No of topics: The number of topics to detect.
No of words per topic: The number of top words to extract per topic.
No of iterations: Number of iterations to perform (influences the runtime of the algorithm).
Alpha: The Dirichlet prior on the per-document topic distributions, i.e. the prior weight of topic k in a document. The library uses the given alpha for all topics. It is normally a number less than 1, e.g. 0.1, which favors sparse topic distributions, i.e. few topics per document.
Beta: The Dirichlet prior on the per-topic multinomial distributions over words, i.e. the prior weight of word w in a topic. The library uses the given beta for all words. It is normally a number much less than 1, e.g. 0.001, which strongly favors sparse word distributions, i.e. few words per topic.
No of threads: Divides the input document collection into the specified number of partitions, processes each partition in its own thread, and merges the calculated statistics afterwards.
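To make the effect of the Alpha setting more concrete, the following sketch draws topic distributions from a symmetric Dirichlet prior and compares how much weight the largest topic receives on average. It uses only the Python standard library and is independent of the node and of MALLET; the helper names, the number of topics, the sample count and the comparison value 10.0 are assumptions made for illustration.

```python
import random


def sample_dirichlet(alpha: float, num_topics: int, rng: random.Random) -> list:
    """Draw one topic distribution from a symmetric Dirichlet(alpha) prior."""
    # Normalised independent Gamma(alpha, 1) draws follow a Dirichlet distribution.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(alpha: float, num_topics: int = 10,
                       samples: int = 500, seed: int = 42) -> float:
    """Average weight of the most probable topic over many sampled documents."""
    rng = random.Random(seed)
    shares = (max(sample_dirichlet(alpha, num_topics, rng)) for _ in range(samples))
    return sum(shares) / samples


sparse = mean_largest_share(0.1)   # alpha below 1: a few topics dominate
dense = mean_largest_share(10.0)   # alpha above 1: weight spread across topics
print(f"alpha=0.1  -> mean largest topic share {sparse:.2f}")
print(f"alpha=10.0 -> mean largest topic share {dense:.2f}")
```

With a small alpha the largest topic takes most of the probability mass, which is what "few topics per document" means in practice. The same reasoning applies to Beta and the words within a topic.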

Input Ports

Data table with the document collection to analyze. Each row contains one document.

Output Ports

The document collection with topic assignments and the probability for each document to belong to a certain topic.
The topic models with the terms and their weight per topic.
Table with statistics for each iteration.

Views

This node has no views

Installation

To use this node in KNIME, install the KNIME Textprocessing extension from the update site below, following the NodePit Product and Node Installation Guide:

v5.5

Plugin provider: KNIME AG, Zurich, Switzerland

Plugin version: 5.5.0.v202412191419

On NodePit since: 2025-07-02

Last update: 2025-07-11

KNIME versions: Since v3.6
