
Topic Models from Reviews_​BERTopic

<p><strong>Topic Models from Reviews using BERT</strong></p><p>This workflow extracts and models topics from reviews using an advanced topic model.</p><p></p><p>It begins with data ingestion and preparation for the analysis.</p><p>The BERTopic model is then applied with custom parameters, and the workflow displays the topic probabilities along with the average number of stars per topic.</p><p>Finally, the workflow presents a series of interpretive visualizations.</p><p></p><p><strong>Reference:</strong> F. Villaroel Ordenes &amp; R. Silipo, “Machine learning for marketing on the KNIME Hub: The development of a live repository for marketing applications”, <em>Journal of Business Research</em> 137(1):393-410, DOI: 10.1016/j.jbusres.2021.08.036.</p>

URL: https://doi.org/10.1016/j.jbusres.2021.08.036

Topic Extractor (BERTopic)

This node serves as the core of the analysis, identifying hidden themes within the text data using the BERTopic algorithm.

  • Dimensionality Reduction: Employs UMAP with the cosine distance metric to project high-dimensional text embeddings into a 2D space.

  • Clustering (HDBSCAN): Groups similar documents into topics. Note that Euclidean distance is used here instead of cosine: HDBSCAN’s high-performance algorithms require a Minkowski metric, and cosine distance is not directly compatible with the acceleration structures they rely on to efficiently calculate cluster density and document probabilities.

  • Keyword Optimization: Applies Maximal Marginal Relevance (MMR) with a diversity score of 0.3 to extract the most representative and non-redundant keywords for each topic.
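The keyword optimization step above can be illustrated with a minimal sketch of Maximal Marginal Relevance in plain Python. This is not the node's actual implementation; the function names (`cosine`, `mmr`), the toy 2D embeddings, and the sample words are all assumptions made for illustration. The sketch scores each remaining candidate as `(1 - diversity) * relevance - diversity * redundancy`, where redundancy is the highest similarity to any keyword already selected.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(topic_vec, candidates, top_n=3, diversity=0.3):
    """Pick top_n keywords, trading relevance against redundancy.

    candidates maps each word to its embedding (hypothetical toy input).
    """
    relevance = {w: cosine(v, topic_vec) for w, v in candidates.items()}
    # Start with the single most relevant word.
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(top_n, len(candidates)):
        best_word, best_score = None, -math.inf
        for word, vec in candidates.items():
            if word in selected:
                continue
            # Redundancy: similarity to the closest already-selected word.
            redundancy = max(cosine(vec, candidates[s]) for s in selected)
            score = (1 - diversity) * relevance[word] - diversity * redundancy
            if score > best_score:
                best_word, best_score = word, score
        selected.append(best_word)
    return selected


# Toy embeddings, invented for this example only.
topic_vec = [1.0, 0.0]
candidates = {
    "room": [1.0, 0.1],
    "rooms": [1.0, 0.12],
    "staff": [0.8, 0.6],
    "pool": [0.1, 1.0],
}

print(mmr(topic_vec, candidates, top_n=2, diversity=0.3))
print(mmr(topic_vec, candidates, top_n=2, diversity=0.7))
```

A low diversity value such as 0.3 still favours relevance, so near-duplicate words can survive; raising the value pushes the selection toward words that differ from those already chosen.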

Analysis Nodes

The output is branched into several nodes for interpretation and visualization:

  • GroupBy (Summary words): Concatenates the top 10 words per topic to provide a readable summary of the identified themes.

  • Bubble Chart UMAP: Generates an interactive visualization in which each colour represents a topic, the bubble size reflects the number of terms in the document, and the position in the 2D space reflects semantic similarity to other topics. If the Topic Extractor is re-run and the number of topics changes, the Color Manager loses its configuration for the new Topic IDs, which causes the Bubble Chart to fail.

    To fix: Open the Color Manager node, click 'OK' to refresh the color mapping, and re-execute the downstream nodes.

  • Hierarchical View: Produces a dendrogram to visualize the relationships and hierarchy between topics, showing how specific clusters (like "Front Desk") relate to broader categories.

  • Topic View / Comparison: Allows for the comparison of topic distributions across different segments, such as individual hotels. Note: update the topic names in the node before running it.
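The GroupBy (Summary words) step described above can be sketched outside KNIME as follows. This is a rough illustration only: the `(topic_id, word, weight)` row layout, the `summary_words` function name, and the sample rows are assumptions, not the actual output schema of the Topic Extractor node.

```python
from collections import defaultdict


def summary_words(rows, top_n=10, sep=", "):
    """Concatenate the top_n highest-weighted words for each topic.

    rows is an iterable of (topic_id, word, weight) tuples
    (a hypothetical layout chosen for this sketch).
    """
    by_topic = defaultdict(list)
    for topic_id, word, weight in rows:
        by_topic[topic_id].append((weight, word))
    summary = {}
    for topic_id, pairs in by_topic.items():
        # Sort by weight, highest first, and keep the leading top_n words.
        ranked = sorted(pairs, key=lambda p: p[0], reverse=True)[:top_n]
        summary[topic_id] = sep.join(word for _, word in ranked)
    return summary


# Invented sample rows for illustration.
rows = [
    (0, "staff", 0.9),
    (0, "friendly", 0.7),
    (0, "desk", 0.8),
    (1, "pool", 0.6),
    (1, "breakfast", 0.75),
]

print(summary_words(rows, top_n=2))
```

With `top_n=10`, as in the workflow, each topic is reduced to a single readable string of its ten strongest terms.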

[Workflow diagram. Node and annotation labels: Doc Creation, TripAdvisor 2 hotels, Excel Reader, Topic Comparison Per Hotel, Topic View, Bubble Chart UMAP, Preprocessing, Hierarchical View, Topic Extractor (BERTopic), Summary words per topic, GroupBy, Create new Review Text column]
