
01_Structural_Topic_Model_(STM)_Example

Structural Topic Modelling (STM) via Verified Components

This example shows how to use the verified components Topic Extractor (STM) and Topic Assigner (STM).

The main difference from the Topic Extractor (Parallel LDA) node is that document metadata can also be provided during training.

The components use the R library 'stm' and require a conda installation, which is used to set up R and the required libraries automatically.
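For orientation, here is a minimal R sketch of what the two components wrap, calling the 'stm' package directly. The input data frames texts_df and holdout_df, the text column, and the source metadata covariate are hypothetical stand-ins; inside KNIME the components handle this plumbing for you.

library(stm)

# Hypothetical input: one row per document, text plus metadata columns
texts_df <- read.csv("texts.csv", stringsAsFactors = FALSE)

# Tokenize and build the document-term representation; metadata stays aligned
processed <- textProcessor(texts_df$text, metadata = texts_df)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# Training (the Topic Extractor (STM) step): unlike plain LDA, topic
# prevalence may depend on document metadata via the 'prevalence' formula
model <- stm(documents = out$documents, vocab = out$vocab,
             K = 6,                  # k = 6, as in this workflow
             prevalence = ~ source,  # 'source' is an assumed metadata column
             data = out$meta,
             init.type = "Spectral")

labelTopics(model, n = 5)            # top terms per topic

# Assignment (the Topic Assigner (STM) step): align new documents to the
# trained vocabulary, then infer their topic proportions
new_proc <- textProcessor(holdout_df$text, metadata = holdout_df)
aligned <- alignCorpus(new = new_proc, old.vocab = model$vocab)
preds <- fitNewDocuments(model = model, documents = aligned$documents)
head(preds$theta)                    # per-document topic proportions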

More information about the 'stm' R library, the KNIME R Integration, and the Verified Components documentation can be found in the links below.

Workflow overview (recovered from the canvas annotations): the workflow is organized into Documents Data Preparation, Training, Predictor, and Interactive Views sections. A Table Reader loads ../data/topic_preprocessing.table (750 preprocessed texts), which is split 90/10 into train and holdout sets; one Topic Extractor (STM) trains with k = 6, while a second runs an "optimal K search" scored by "Exclusivity"; Topic Assigner (STM) produces the holdout predictions; Document Viewer, Column Filter, and Tag Cloud nodes display the top-term topics. Nodes used: Table Reader, Partitioning, Row Sampling, Topic Extractor (STM) ×2, Topic Assigner (STM), Document Viewer ×2, Column Filter, Tag Cloud.

Metrics for choosing the number of topics (per ChatGPT):

Exclusivity
Definition: Exclusivity measures how exclusive a word is to a particular topic; that is, how much a word appears in one topic compared to its appearance in other topics.
Interpretation: A higher exclusivity score for a word in a topic means the word is more specific to that topic and less likely to appear in other topics, which helps distinguish topics from one another.
Cutoff: If you use exclusivity to determine the best number of topics, look for a balance where topics have a good number of exclusive words but are still interpretable. Too-high exclusivity can lead to very narrow topics, while too-low exclusivity can make topics overlap significantly.

Semantic Coherence
Definition: Semantic coherence evaluates how often the top words of a topic co-occur in the corpus, giving an idea of how meaningful and interpretable a topic is.
Interpretation: Higher semantic coherence indicates that the top words of the topic frequently appear together in the documents, suggesting the topic is coherent and likely meaningful.
Cutoff: Higher semantic coherence is generally desirable. When determining the number of topics, look for the point where adding more topics no longer improves coherence, or even reduces it.

Heldout Likelihood
Definition: A measure of how well the topic model predicts unseen documents: a portion of the corpus is "held out" and the trained model's likelihood on this held-out set is computed.
Interpretation: A higher heldout likelihood indicates that the model generalizes well to unseen documents.
Cutoff: Higher is better. When selecting the number of topics, look for the point where the likelihood starts to plateau or decrease, which indicates overfitting.

Residual Variance
Definition: In the context of topic modeling, residual variance refers to the variance in the data left unexplained after accounting for the topics; it measures how much structure the topics fail to capture.
Interpretation: Lower residual variance indicates that the topics capture more of the structure in the data.
Cutoff: Lower is better. When determining the number of topics, look for the point where the decrease in residual variance slows down, suggesting diminishing returns from adding more topics.

Adapted from: Topic Modeling Space, https://hub.knime.com/-/spaces/-/latest/~LHP4VTo-KWY99YYv/
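The four diagnostics above correspond to the "optimal K search" branch on the canvas. With the 'stm' package this can be sketched with searchK(), which refits the model over a grid of K values and reports exclusivity, semantic coherence, held-out likelihood, and residuals. A rough sketch, reusing the hypothetical out object and source covariate from the training sketch above; the K grid is an assumption, not the workflow's exact search range:

library(stm)

# Refit over candidate topic counts and collect the diagnostics
sk <- searchK(out$documents, out$vocab,
              K = c(4, 6, 8, 10),     # assumed candidate grid
              prevalence = ~ source,  # same assumed covariate as above
              data = out$meta,
              heldout.seed = 42)

sk$results   # one row per K: exclus, semcoh, heldout, residual, ...
plot(sk)     # one panel per diagnostic; look for the elbow or plateau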

Nodes

Extensions

Links