Topic Extractor (BERTopic)
This node serves as the core of the analysis, identifying hidden themes within the text data using the BERTopic algorithm.
Dimensionality Reduction: Employs UMAP with the cosine distance metric to project the high-dimensional text embeddings into a 2D space.
Clustering (HDBSCAN): Groups similar documents into topics. Note that Euclidean distance is used here instead of cosine distance. HDBSCAN’s high-performance algorithms rely on spatial index structures that need a supported metric, typically one from the Minkowski family such as Euclidean. Cosine distance is not a true metric (it can violate the triangle inequality), so it is not compatible with these acceleration structures, which are necessary to efficiently calculate cluster density and the cluster membership probability of each document.
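The metric argument above can be checked numerically. The following is a minimal, standard-library sketch rather than the node's own code: the vectors are hypothetical toy values, and the helper names are chosen only for this illustration. It shows that cosine distance can violate the triangle inequality, and that on unit-length vectors the squared Euclidean distance equals twice the cosine distance, so the two rank neighbors identically in that case. Note that the node applies Euclidean distance to UMAP's reduced coordinates, which are not necessarily unit length, so the second identity is background only.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Return the Euclidean (Minkowski, p = 2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# 1) Cosine distance can break the triangle inequality.
#    Hypothetical toy vectors at 0, 45, and 90 degrees.
a, b, c = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]
direct = cosine_distance(a, c)
detour = cosine_distance(a, b) + cosine_distance(b, c)
triangle_violated = direct > detour  # going via b is "shorter" than going direct

# 2) On unit vectors: ||u - v||^2 == 2 * cosine_distance(u, v).
u, v = normalize([3.0, 4.0]), normalize([1.0, 2.0])
squared_euclidean = euclidean_distance(u, v) ** 2
twice_cosine = 2.0 * cosine_distance(u, v)
```

Index structures such as KD-trees prune the search using distance bounds, which is why a distance that breaks the triangle inequality cannot be used with them directly.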
Keyword Optimization: Applies Maximal Marginal Relevance (MMR) with a diversity score of 0.3 to extract the most representative and non-redundant keywords for each topic.
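To make the effect of the diversity setting concrete, here is a minimal sketch of a common MMR formulation, written with the standard library only. It is an illustration under stated assumptions, not the node's or BERTopic's actual implementation: the function name `mmr_select`, the 3D toy embeddings, and the candidate words are all hypothetical. Each step picks the candidate that maximizes (1 - diversity) * relevance to the topic minus diversity * similarity to the keywords already chosen.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mmr_select(topic_vec, candidates, top_n=2, diversity=0.3):
    """Pick top_n words from candidates (word -> embedding) using MMR."""
    relevance = {w: cosine_similarity(vec, topic_vec)
                 for w, vec in candidates.items()}
    # Start with the single most relevant word.
    selected = [max(relevance, key=relevance.get)]
    remaining = [w for w in candidates if w not in selected]
    while remaining and len(selected) < top_n:
        best_word, best_score = None, -math.inf
        for w in remaining:
            # Redundancy: highest similarity to any already selected word.
            redundancy = max(cosine_similarity(candidates[w], candidates[s])
                             for s in selected)
            score = (1 - diversity) * relevance[w] - diversity * redundancy
            if score > best_score:
                best_word, best_score = w, score
        selected.append(best_word)
        remaining.remove(best_word)
    return selected

# Hypothetical toy embeddings: "car" and "cars" are near-duplicates.
topic = [1.0, 0.0, 0.0]
words = {
    "car": [1.0, 0.5, 0.0],
    "cars": [1.0, 0.55, 0.0],
    "engine": [1.0, 0.0, -0.6],
}

no_diversity = mmr_select(topic, words, top_n=2, diversity=0.0)
with_diversity = mmr_select(topic, words, top_n=2, diversity=0.3)
```

With these toy vectors, a diversity of 0.0 ranks purely by relevance and keeps the near-duplicate pair, while 0.3 replaces the redundant second keyword with a less similar one.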