
Topic Models from Reviews_BERTopic

<p><strong>Topic Models from Reviews using BERT</strong></p><p>This workflow extracts and models topics from customer reviews using an advanced topic model (BERTopic).</p><p>It begins with data ingestion and preparation for the analysis.</p><p>Next, the BERTopic model is applied with custom parameters, and the workflow displays the topic probabilities along with the average star rating per topic.</p><p>Finally, the workflow presents a series of interpretive visualizations.</p><p><strong>Reference:</strong> F. Villarroel Ordenes &amp; R. Silipo, “Machine learning for marketing on the KNIME Hub: The development of a live repository for marketing applications”, <em>Journal of Business Research</em> 137(1):393-410, DOI: 10.1016/j.jbusres.2021.08.036.</p>

URL: https://doi.org/10.1016/j.jbusres.2021.08.036

Topic Extractor (BERTopic)

This node serves as the core of the analysis, identifying hidden themes within the text data using the BERTopic algorithm.

  • Dimensionality Reduction: Employs UMAP with the cosine distance metric to project high-dimensional text embeddings into a 2D space.

  • Clustering (HDBSCAN): Groups similar documents into topics. Note that Euclidean distance is used here rather than cosine distance. HDBSCAN’s fast algorithms rely on tree-based acceleration structures (such as KD-trees and ball trees) that require a true distance metric, for example one from the Minkowski family. Cosine distance does not satisfy the triangle inequality, so it cannot be used with these structures, which are needed to compute cluster densities and document membership probabilities efficiently.

  • Keyword Optimization: Applies Maximal Marginal Relevance (MMR) with a diversity score of 0.3 to extract the most representative and non-redundant keywords for each topic.
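To make the keyword optimization step more concrete, here is a minimal sketch of Maximal Marginal Relevance in plain Python. It is not the node's or BERTopic's actual code: the function names, the toy 3-dimensional vectors, and the example words are all hypothetical, and a real run would use the embeddings produced by the model. The `diversity` argument plays the role of the 0.3 setting described above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(topic_vec, word_vecs, words, top_n=2, diversity=0.3):
    """Pick top_n words, trading relevance to the topic against redundancy."""
    relevance = [cosine(topic_vec, w) for w in word_vecs]
    # Start with the single most relevant word.
    selected = [max(range(len(words)), key=lambda i: relevance[i])]
    candidates = [i for i in range(len(words)) if i not in selected]

    while candidates and len(selected) < top_n:
        def score(i):
            # Redundancy: similarity to the closest already-selected word.
            redundancy = max(cosine(word_vecs[i], word_vecs[j]) for j in selected)
            return (1 - diversity) * relevance[i] - diversity * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    return [words[i] for i in selected]


# Hypothetical toy data: "staff" and "personnel" are near-duplicates.
topic_vec = [1.0, 1.0, 0.0]
words = ["staff", "personnel", "reception"]
word_vecs = [[1.0, 0.2, 0.0], [1.0, 0.25, 0.0], [0.1, 1.0, 0.0]]

keywords = mmr(topic_vec, word_vecs, words, top_n=2, diversity=0.3)
```

With these toy vectors, a diversity of 0.3 is enough to skip the near-synonym "staff" in favour of "reception", whereas a diversity of 0.0 would rank purely by relevance.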

Analysis Nodes

The output is branched into several nodes for interpretation and visualization:

  • GroupBy (Summary words): Concatenates the top 10 words per topic to provide a readable summary of the identified themes.

  • Bubble Chart UMAP: Generates an interactive visualization in which each bubble's colour represents its topic, its size reflects the number of terms in the document, and its position in the 2D space reflects semantic similarity. If the Topic Extractor is re-run and the number of topics changes, the Color Manager will have no configuration for the new Topic IDs, which causes the Bubble Chart to fail.

    To fix: Open the Color Manager node, click 'OK' to refresh the color mapping, and re-execute the downstream nodes.

  • Hierarchical View: Produces a dendrogram to visualize the relationships and hierarchy between topics, showing how specific clusters (like "Front Desk") relate to broader categories.

  • Topic View / Comparison: Allows for the comparison of topic distributions across different segments, such as individual hotels. Note: update the topic names in the node before running it.
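The two aggregations in this list (summary words per topic and topic distribution per segment) are configured in KNIME nodes rather than written as code, but the logic can be sketched in plain Python. This is an illustration only: the input layouts `(topic_id, word, weight)` and `(segment, topic_id)`, the function names, and the sample rows are assumptions, not the workflow's actual table schema.

```python
from collections import defaultdict


def summary_words(rows, top_k=10):
    """Concatenate the top_k highest-weighted words of each topic.

    rows: iterable of (topic_id, word, weight) tuples (assumed layout).
    """
    by_topic = defaultdict(list)
    for topic_id, word, weight in rows:
        by_topic[topic_id].append((weight, word))
    return {
        topic: ", ".join(
            word for _, word in sorted(pairs, key=lambda p: -p[0])[:top_k]
        )
        for topic, pairs in by_topic.items()
    }


def topic_share_by_segment(docs):
    """Share of documents per topic within each segment (e.g. each hotel).

    docs: iterable of (segment, topic_id) tuples (assumed layout).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for segment, topic_id in docs:
        counts[segment][topic_id] += 1
    return {
        segment: {t: n / sum(topics.values()) for t, n in topics.items()}
        for segment, topics in counts.items()
    }


# Hypothetical sample data for illustration.
rows = [
    (0, "staff", 0.9), (0, "desk", 0.7), (0, "friendly", 0.8),
    (1, "pool", 0.6), (1, "breakfast", 0.95),
]
docs = [("Hotel A", 0), ("Hotel A", 0), ("Hotel A", 1), ("Hotel B", 1)]

summaries = summary_words(rows, top_k=2)
shares = topic_share_by_segment(docs)
```

Here `top_k` defaults to 10 to mirror the "top 10 words per topic" described above; the sample call uses 2 only to keep the toy output short.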

Workflow canvas labels: Doc Creation; TripAdvisor 2 hotels; Excel Reader; Topic Comparison Per Hotel; Topic View; Bubble Chart UMAP; Preprocessing; Hierarchical View; Topic Extractor (BERTopic); Summary words per topic; GroupBy.
