
Topic Modeling with Verified Components

In this workflow you can see how these four components can be adopted in combination with the Topic Extractor (Parallel LDA) node.
Right-click a Topic Explorer View node to open its interactive view.

Read more on the component pages at knime.com/verified-components. References are also available below on this workflow page.



Workflow overview (canvas annotations)

Four components are executed with different settings and modes. This should give an overview of how these components can be adopted in different scenarios:
- STM model or LDA model
- scoring the topics of one or multiple models
- visualizing default topic labels or automated/custom labels

Note: make sure to open the views of the components in the blue annotations.

Annotation sections on the canvas:
- 01 Documents Data Preparation
- 02 Documents Data Preparation
- 03 Training - Interactive Views - Predictor
- 04 Interactive Views - Scoring
- 05 Apply the models again on 'unseen' data (adapted from: Topic Modeling Space, https://hub.knime.com/-/spaces/-/latest/~LHP4VTo-KWY99YYv/)
- 06 Train with additional Meta Data
- 07 Interactive Views - assign Labels automatically and individually

Evaluation metrics (explanation adapted from a ChatGPT answer)

Exclusivity
- Definition: measures how exclusive a word is to a particular topic; in other words, how much a word appears in one topic compared to its appearance in other topics.
- Interpretation: a higher exclusivity score for a word in a topic means the word is more specific to that topic and less likely to appear in other topics. This helps distinguish topics from one another.
- Cutoff: if you use exclusivity to determine the best number of topics, look for a balance where topics have a good number of exclusive words but are still interpretable. Too high exclusivity might lead to very narrow topics, while too low might make topics overlap significantly.

Semantic Coherence
- Definition: evaluates how often the top words of a topic co-occur in the corpus, giving an idea of how meaningful and interpretable the topic is.
- Interpretation: higher semantic coherence indicates that the top words of the topic frequently appear together in the documents, suggesting the topic is coherent and likely meaningful.
- Cutoff: higher semantic coherence is generally desirable. When determining the number of topics, look for a point where adding more topics no longer improves coherence, or even reduces it.

Heldout Likelihood
- Definition: measures how well the topic model predicts unseen documents. A portion of the corpus is "held out" (set aside), and the trained model's likelihood on this held-out set is computed.
- Interpretation: a higher heldout likelihood indicates that the model generalizes well to unseen documents.
- Cutoff: you typically want a higher heldout likelihood. When selecting the number of topics, look for a point where the likelihood starts to plateau or decrease, indicating overfitting.

Residual Variance
- Definition: the unexplained variance in the data after accounting for the topics; a measure of how much information the topics are not capturing.
- Interpretation: lower residual variance indicates that the topics capture more of the structure in the data.
- Cutoff: you generally want lower residual variance. Look for a point where the decrease in residual variance slows down, suggesting diminishing returns from adding more topics.

Canvas notes (node and annotation labels):
- decide: optimal K search? (the optimal K search is demanding; you might have to start with a smaller number of lines)
- 90/10 split; add ModelID; manually assign Topic names; prepare INSERT_HERE_PROPER_LABEL
- k = 6, no metadata; k = 10, source + s(day) + B-Spline basis function
- explore STM topics with automated labels on the train corpus; explore STM topics on the holdout corpus
- score multiple models on the holdout corpus; score LDA on the train corpus; score STM on the holdout corpus
- create "Preprocessed Document" (../data/topic_preprocessing.table); use Preprocessed Documents; holdout preds
- join the automated and new labels (../model/meta_label_assigner.table)
- model artifacts: ../model/r_model_topics.zip
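The exclusivity and semantic coherence metrics above can be sketched on toy data. The snippet below uses a simple exclusivity ratio (a word's probability in one topic divided by its total probability mass across topics) and the standard UMass coherence formula; these are illustrative variants, not necessarily the exact formulas the Topic Scorer (Labs) component computes, and all numbers are made up.

```python
import math

# Toy topic-word probabilities (2 topics, 4-word vocabulary) -- hypothetical values.
phi = {
    "t1": {"cat": 0.5, "dog": 0.3, "car": 0.1, "road": 0.1},
    "t2": {"cat": 0.1, "dog": 0.1, "car": 0.4, "road": 0.4},
}

def exclusivity(word, topic, phi):
    """Share of a word's total probability mass that falls in one topic."""
    return phi[topic][word] / sum(t[word] for t in phi.values())

# Toy corpus for UMass coherence: each document as a set of words.
docs = [{"cat", "dog"}, {"cat", "dog", "road"}, {"car", "road"}, {"car", "road"}]

def umass_coherence(top_words, docs):
    """UMass coherence: sum of log((D(wi, wj) + 1) / D(wj)) over ordered word pairs,
    where D counts documents containing the word(s)."""
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            d_wj = sum(wj in d for d in docs)          # docs containing wj
            d_ij = sum(wi in d and wj in d for d in docs)  # docs containing both
            score += math.log((d_ij + 1) / d_wj)
    return score

print(exclusivity("cat", "t1", phi))          # "cat" is concentrated in topic t1
print(umass_coherence(["cat", "dog"], docs))  # co-occurring words score higher
```

A word that appears almost only in one topic gets an exclusivity near 1; a topic whose top words frequently co-occur in the same documents gets a higher (less negative) coherence.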
Additional canvas labels (metadata branch):
- explore STM topics with automated labels on the train corpus; explore STM topics on the holdout corpus
- manually assign Topic names; prepare INSERT_HERE_PROPER_LABEL; join the automated and new labels
- model artifacts: ../model/r_model_topics_meta.zip, ../model/meta_topic_models_terms_weight.table, ../model/meta_topic_models_terms_meta_weight.table, ../model/meta_label_assigner_meta.table

Nodes used in this workflow: Topic Extractor (STM), Topic Extractor (Parallel LDA), Topic Assigner (STM), Topic Explorer View, Topic Scorer (Labs), Rename Topics with a Custom Label, Partitioning, Row Sampling, Constant Value Column, Concatenate, Joiner, Table Editor, Table Reader, Table Writer, R Model Reader, R Model Writer.
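Applying the trained models to 'unseen' data (annotation 05) amounts to evaluating held-out documents under the fitted topic-word distributions. A minimal sketch of a per-word heldout log-likelihood with made-up numbers, assuming fixed document-topic proportions (in practice the Topic Assigner (STM) component estimates these itself):

```python
import math

# Toy fitted model: topic-word probabilities for 2 topics -- hypothetical values.
topics = [
    {"cat": 0.5, "dog": 0.3, "car": 0.2},
    {"cat": 0.1, "dog": 0.2, "car": 0.7},
]

def heldout_loglik(doc_words, theta, topics):
    """Average per-word log-likelihood of a held-out document:
    each word's probability is a theta-weighted mixture over topics."""
    total = sum(
        math.log(sum(t * topic[w] for t, topic in zip(theta, topics)))
        for w in doc_words
    )
    return total / len(doc_words)

print(heldout_loglik(["cat", "car"], [0.5, 0.5], topics))
```

When comparing candidate values of k, the number of topics where this quantity plateaus (or starts to drop) on the holdout corpus is the overfitting signal described above.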
