Topic Scorer (Labs)

This component computes several quality metrics for topics created by the Topic Extractor (Parallel LDA) node and the Topic Extractor (STM) component. The metrics listed below are computed from a table of pre-processed documents and a table of weighted terms for each topic. You can provide the topics of a single model or of multiple models.

Take a look at the example workflows at the bottom of this page to learn how to concatenate topics from different models trained on the same corpus of documents or add a ‘model ID’ to the output of the Topic Extractor (Parallel LDA) node.

DISCLAIMER: this verified component is currently marked as part of KNIME Labs (knime.com/knime-labs). Provide feedback at upskilling@knime.com

Topic Semantic Coherence score:
This component calculates a semantic coherence score for each topic. Semantic coherence measures how coherent a topic is by checking whether the topic's top terms appear together in the same documents more often than not. This experimental implementation is based on the paper by Mimno et al. (2011) [dl.acm.org/doi/10.5555/2145432.2145462].
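The coherence measure from Mimno et al. (2011) can be sketched in a few lines of Python. This is an illustrative sketch, not the component's actual implementation: the function name `semantic_coherence`, the token-list input format and the handling of terms that never occur are assumptions made for the example. For each pair of top terms it adds log((D(v_m, v_l) + 1) / D(v_l)), where D counts the documents containing the given terms.

```python
import math


def semantic_coherence(top_terms, documents):
    """Coherence of one topic in the style of Mimno et al. (2011).

    top_terms: terms ordered from highest to lowest weight.
    documents: iterable of token lists, one per pre-processed document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*terms):
        # Number of documents that contain all of the given terms.
        return sum(1 for doc in doc_sets if all(t in doc for t in terms))

    score = 0.0
    for m in range(1, len(top_terms)):
        for l in range(m):
            d_l = doc_freq(top_terms[l])
            if d_l == 0:
                # Assumption: skip terms absent from the corpus
                # to avoid dividing by zero.
                continue
            d_ml = doc_freq(top_terms[m], top_terms[l])
            score += math.log((d_ml + 1) / d_l)
    return score


docs = [["cat", "dog"], ["cat", "dog"], ["cat", "fish"]]
coherent = semantic_coherence(["cat", "dog"], docs)
incoherent = semantic_coherence(["dog", "fish"], docs)
```

Scores closer to zero indicate that the top terms tend to co-occur; here the "cat"/"dog" topic scores higher than the "dog"/"fish" topic, whose terms never share a document.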

Topic Exclusivity score:
This component calculates the exclusivity of topics. Exclusivity is computed using an experimental implementation of the FREX function by Bischof and Airoldi (2012) [dl.acm.org/doi/10.5555/3042573.3042578]. FREX considers not only how exclusive/unique terms are between different topics (top terms table), but also how rare those terms are in documents of the same topic (documents table).

When comparing multiple models, documents can be assigned to different topics by different models, so exclusivity can be computed only from how unique terms are in the topics' top terms table. Read more in the description of the “Ignore Assigned Topic Column” setting.
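FREX combines two per-term quantities through a weighted harmonic mean of their empirical CDF values. The sketch below shows only that combination step and is not the component's actual code. The names `ecdf` and `frex` are hypothetical, the inputs are assumed to be already computed for the terms of one topic, and the orientation of the weight follows the description of the FREX Weight (w) option on this page (w = 0 uses exclusivity only, w = 1 uses the within-topic document statistic only).

```python
def ecdf(values):
    """Empirical CDF value of each entry: share of entries <= that entry."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]


def frex(exclusivity, within_topic, w):
    """Weighted harmonic mean of the two ECDF-transformed inputs.

    exclusivity:  per-term exclusivity between topics (top terms table).
    within_topic: per-term statistic from the documents of the topic.
    w:            weight in [0, 1]; assumed orientation as on this page.
    """
    excl_rank = ecdf(exclusivity)
    doc_rank = ecdf(within_topic)
    # ECDF values are always > 0, so the divisions below are safe.
    return [
        1.0 / ((1.0 - w) / e + w / d)
        for e, d in zip(excl_rank, doc_rank)
    ]


exclusivity = [0.9, 0.5, 0.1]
within_topic = [1, 2, 3]
only_exclusivity = frex(exclusivity, within_topic, 0.0)
balanced = frex(exclusivity, within_topic, 0.5)
```

With w = 0 the score reduces to the ECDF of the exclusivity values, which matches the behavior described for the case where the assigned topic column is ignored.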

Topic Neighbor Distance score:
This component computes an experimental distance between topics within the same model or between several models. To do this, each topic is represented as a normalized vector obtained by pivoting the table of top terms by topic. The cosine distance between topic vectors is then computed. For each topic, these distances are used to report the closest and farthest topic within one model or across several models.
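The pivot-then-cosine procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the component's implementation: the helper names are hypothetical, the input is assumed to be (topic, term, weight) rows, normalization is assumed to be to unit (L2) length, and at least two topics are assumed to be present.

```python
import math


def topic_vectors(rows):
    """Pivot (topic, term, weight) rows into unit-length term vectors."""
    vocab = sorted({term for _, term, _ in rows})
    index = {term: i for i, term in enumerate(vocab)}
    raw = {}
    for topic, term, weight in rows:
        raw.setdefault(topic, [0.0] * len(vocab))[index[term]] = weight
    vectors = {}
    for topic, vec in raw.items():
        norm = math.sqrt(sum(x * x for x in vec))
        vectors[topic] = [x / norm for x in vec] if norm else vec
    return vectors


def cosine_distance(a, b):
    # Vectors are unit length, so the dot product is the cosine similarity.
    return 1.0 - sum(x * y for x, y in zip(a, b))


def neighbors(vectors):
    """Map each topic to its (closest, farthest) other topic."""
    result = {}
    for topic, vec in vectors.items():
        dists = {
            other: cosine_distance(vec, other_vec)
            for other, other_vec in vectors.items()
            if other != topic
        }
        result[topic] = (min(dists, key=dists.get), max(dists, key=dists.get))
    return result


rows = [
    ("t1", "cat", 0.8), ("t1", "dog", 0.6),
    ("t2", "cat", 0.6), ("t2", "dog", 0.8),
    ("t3", "fish", 1.0),
]
vectors = topic_vectors(rows)
nearest_farthest = neighbors(vectors)
```

Topics that share heavily weighted terms end up close to each other, while topics with no terms in common sit at the maximum distance of 1.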

Options

Normalize Weight by Model: Apply a normalization to ensure all models have weights between 0 and 1.
Ignore Assigned Topic Column: Check this box to score multiple topic models and/or to use a partition of documents where the assigned topic is unknown. DISCLAIMER: if you check this box you still need to select an arbitrary string column below. The component can still compute these metrics without considering how frequent topic terms are within partitions of documents of the same topic. The semantic coherence and neighbor distance metrics are not affected by this setting. The exclusivity returns a score based only on how exclusive/unique terms are between topics, regardless of whether they are rare words or not. That is a FREX score with the weight parameter (w) equal to 0.
Select Topic ID: Select the topic ID column. Default is the usual name provided by the Topic Extractor (Parallel LDA) node or the Topic Extractor (STM) component second table output.
Select Model ID: Select the model ID column. This is already provided by the Topic Extractor (STM) component. You can add it to the output of the Topic Extractor (Parallel LDA) node via a Constant Value Column node.
Select Term: Select the term column. Default is the usual name provided by the Topic Extractor (Parallel LDA) node or the Topic Extractor (STM) component second table output.
Select Weight: Select the weight column. Default is the usual name provided by the Topic Extractor (Parallel LDA) node or the Topic Extractor (STM) component second table output.
Select Assigned Topics to Documents: Select the column with the topic assigned to each document.
Select Document: Select the column with preprocessed documents. Apply the Strings to Document node and any other preprocessing required (stopwords removal, stemming, ...) before this component. These nodes can be found in the KNIME Textprocessing Extension.
FREX Weight (w): A value between 0 and 1. DISCLAIMER: This setting is ignored and hardcoded to 0 if the box "Ignore Assigned Topic Column" above is checked. If 0 is provided, FREX returns a score based only on how unique/exclusive terms are between topics. If 1 is provided, FREX returns a score based only on how rare terms are within the documents of that topic.
Number of Top Terms for Coherence: This setting controls how many top terms per topic from the second input to include in the semantic coherence computation.
Number of Top Terms for Neighbor Distances: This setting controls how many top terms per topic from the second input to include in the neighbor distance computation.
Number of Top Terms for Exclusivity: This setting controls how many top terms per topic from the second input to include in the exclusivity (FREX) computation.

Input Ports

The pre-processed documents from the corpus used to train the topic model. They can be the ones used in training or a hold-out sample. The documents should be in the KNIME Textprocessing format (use the Strings to Document node).
The topics' top terms table created either by the Topic Extractor (Parallel LDA) node or the Topic Extractor (STM) component (second table output). This table should list the top weighted terms for each topic for one or more models. If you concatenate topics from different models, make sure to add a column for the model ID.

Output Ports

A table with the computed metrics for each topic.
