Topic Explorer View

This component visualizes and analyzes the output of a topic model. It works with any topic model that produces a topic-term matrix and a topic-document matrix. We recommend using it downstream of the Topic Extractor (Parallel LDA) node [kni.me/n/w7Vr1wY8Bu8Gfpv7] or the Topic Extractor (STM) component [kni.me/c/DFANPa0NHnZb9tSV]. For more details, see the port documentation below.

The component's interactive view helps validate a chosen topic model solution and shows how similar the extracted topics are to one another.

The Topic Explorer View offers two modes:

- Explore by Topic: explore the topics (second input) in a similarity bubble chart, select topics, and view their coherence and exclusivity scores from the Topic Scorer component (kni.me/c/5_W2h2g6hBY_M0Bc) together with the associated tag cloud. You can also scroll through the topics, each represented as a small bar chart.

- Explore by Document: explore the documents (first input) in a similarity bubble chart, select topics, and view a preview or the full text of the documents with the topic terms highlighted.

Both modes provide a similarity bubble chart in which topics or documents with higher semantic similarity are positioned closer together in 2-dimensional space. This is achieved by combining several analytics techniques:

1) For the “Explore by Topic” mode, we use a Word2Vec model (kni.me/n/QPMbC4vyfvPkfV8F) to calculate the distances between all words within the documents. These word distances are then averaged over the words associated with each topic to build a distance matrix that represents the similarity among all topics.

2) The distance matrix derived from Word2Vec is then processed with Multidimensional Scaling (MDS) (kni.me/n/SCgPuzvfM-9t325D), which reduces it to two dimensions. These two dimensions serve as the coordinates of each topic in 2-dimensional space.

3) The size of each bubble represents the mean probability that the input documents belong to that topic.

4) In the “Explore by Document” mode, each bubble represents a document. We apply a similar approach, using each document's bag of words instead of the topic model's output terms.
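
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the component's actual implementation: the word vectors are hand-made toy values standing in for a trained Word2Vec model, cosine distance is assumed as the word distance, classical (Torgerson) MDS via NumPy stands in for the MDS node, and all names (topic_distance_matrix, classical_mds, word_vectors, topic_terms) are hypothetical.

```python
import math
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def topic_distance_matrix(topic_terms, word_vectors):
    """Average the pairwise word distances between the terms of every two topics."""
    topics = list(topic_terms)
    n = len(topics)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            pairs = [cosine_distance(word_vectors[a], word_vectors[b])
                     for a in topic_terms[topics[i]]
                     for b in topic_terms[topics[j]]]
            dist[i][j] = dist[j][i] = sum(pairs) / len(pairs)
    return topics, dist

def classical_mds(dist, k=2):
    """Embed a distance matrix in k dimensions with classical (Torgerson) MDS."""
    squared = np.asarray(dist, dtype=float) ** 2
    n = squared.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ squared @ centering   # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]            # indices of the k largest
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Toy 2-dimensional "embeddings" (assumption: a real model would supply these).
word_vectors = {
    "goal": [1.0, 0.1], "match": [0.9, 0.2],
    "bank": [0.1, 1.0], "loan": [0.2, 0.9],
    "stock": [0.0, 1.0], "fund": [0.3, 0.8],
}
topic_terms = {
    "topic_0": ["goal", "match"],
    "topic_1": ["bank", "loan"],
    "topic_2": ["stock", "fund"],
}

topics, dist = topic_distance_matrix(topic_terms, word_vectors)
coords = classical_mds(dist)  # one (x, y) row per topic, usable as bubble positions
```

In this toy data the two finance-like topics end up close together in the 2-dimensional embedding, while the sports-like topic lands far from both.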

DISCLAIMER: This data app slows down when dealing with a large number of documents. By default, only the top 250 rows from the first input and the top 10 terms per topic from the second input are considered. You can increase these numbers in the component dialog. To avoid performance issues, we advise applying stratified sampling to the first input, using the assigned topic column in a Row Sampling node (kni.me/n/3o-UY2qMENf5piCd), before the component.
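
To illustrate what stratified sampling on the assigned topic column achieves, here is a small pure-Python sketch. It is not how the Row Sampling node is implemented; the function name stratified_sample, the row layout, and the proportional rounding rule are assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_sample(rows, topic_key, max_rows, seed=0):
    """Sample about max_rows rows while keeping each topic's share of the data."""
    if len(rows) <= max_rows:
        return list(rows)
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for row in rows:
        by_topic[row[topic_key]].append(row)
    fraction = max_rows / len(rows)
    sample = []
    for topic in sorted(by_topic):
        group = by_topic[topic]
        # Keep at least one document per topic; rounding can shift the total slightly.
        k = min(len(group), max(1, round(len(group) * fraction)))
        sample.extend(rng.sample(group, k))
    return sample

# Toy first-input table: 600 / 300 / 100 documents across three topics.
rows = [
    {"Document": f"doc {i}",
     "Assigned Topic": "topic_0" if i < 600 else "topic_1" if i < 900 else "topic_2"}
    for i in range(1000)
]
sampled = stratified_sample(rows, "Assigned Topic", max_rows=250)
counts = defaultdict(int)
for row in sampled:
    counts[row["Assigned Topic"]] += 1
```

Unlike taking the top 250 rows, this keeps the topic proportions of the full table, so small topics are not dropped from the view just because their documents appear late in the table.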

This component can be used as a data app, running either locally or on KNIME Server and KNIME Business Hub.

Options

Automated Topic Rename Using Top Terms
If you activate automated renaming, the current topic labels are disregarded and the topics are renamed using the top terms of each topic. The number of top terms used is set with the "Number of Terms to Be Used for Renaming" option.
Select Document
Select the document column. It should not be pre-processed (no stop word removal, stemming, lemmatization, etc.), because the documents need to be readable in the view.
Select Topic ID
Select the topic ID column. The default is the name usually provided in the second output table of the Topic Extractor (Parallel LDA) node or the Topic Extractor (STM) component.
Select Term
Select the term column. The default is the name usually provided in the second output table of the Topic Extractor (Parallel LDA) node or the Topic Extractor (STM) component.
Select Term Weight
Select the weight column. The default is the name usually provided in the second output table of the Topic Extractor (Parallel LDA) node or the Topic Extractor (STM) component.
Select Predicted Topic for Documents
Select the assigned topic column. Default is based on the column "Assigned Topic" from Topic Extractor node/component.
Word2Vec Number of Embeddings
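Enter the layer size for the Word2Vec algorithm, i.e. the number of neurons in the hidden layer, which is also the dimensionality of the word embeddings.
Word2Vec Context Window Size
Define the context window size for the Word2Vec algorithm, i.e. how many neighboring words around a target word count as its context.
Word2Vec Minimum Word Frequency
The minimum number of occurrences a word needs in order to be considered by the Word2Vec algorithm.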
Max Number of Documents
The visualization requires heavy pre-computation. This parameter controls how many documents from the top of the table are going to be processed and visualized from the first input.
Number of Top Terms per Topic
Filters the number of terms per topic from the second input.
Size of Text for Documents Overview
To give an overview of many documents at once, only the first x characters of each document are displayed in the view. Use this parameter to increase or decrease that number.
Number of Terms to Be Used for Renaming
If you activated automated renaming, this sets how many top terms of each topic are used to build its new name. The current topic labels are disregarded in that case.
Word2Vec Architecture
Select the architecture of the neural network used by Word2Vec. The main difference is that Skip-Gram uses the current word to predict its neighbors, while CBOW uses the context to predict the current word.
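
The difference between the two architectures, and the role of the context window size, can be made concrete by listing the training examples each one would see. The sketch below is only an illustration in plain Python: it builds (input, prediction target) pairs and does not train a model, and the function names skipgram_pairs and cbow_pairs are hypothetical.

```python
def skipgram_pairs(tokens, window):
    """Skip-Gram: the current word is the input, each neighbor is a target."""
    pairs = []
    for i, current in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((current, tokens[j]))
    return pairs

def cbow_pairs(tokens, window):
    """CBOW: the surrounding context is the input, the current word is the target."""
    pairs = []
    for i, current in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tuple(tokens[lo:i] + tokens[i + 1:hi])
        pairs.append((context, current))
    return pairs

tokens = ["topic", "models", "group", "documents"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_pairs(tokens, window=1)
# sg[0] is ("topic", "models"): predict the neighbor from the current word.
# cb[1] is (("topic", "group"), "models"): predict the current word from its context.
```

A larger window produces more, and more distant, context words per position in both variants.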

Input Ports

Table with the document column (ideally not pre-processed, so that documents stay readable) and the topic label assigned by the model. This is usually the first output of the "Topic Extractor (Parallel LDA)" node or the "Topic Extractor (STM)" component. Each row should be a document, and the following columns should be available: the document column (from KNIME Textprocessing), the assigned topic, and the probability columns for each topic.
Second output table of the "Topic Extractor (Parallel LDA)" node or the "Topic Extractor (STM)" component. Each row should be a term, and the following columns should be available: the term (String type), the topic ID, and the weight.

Output Ports

Table with the automatically created topic names.
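
As a rough illustration of how topic names can be derived from the top terms in the second input, consider the following sketch. The exact naming format used by the component is not documented here, so the underscore separator, the function name rename_topics, and the toy rows are all assumptions.

```python
from collections import defaultdict

def rename_topics(term_rows, n_terms=3, separator="_"):
    """Build a name per topic from its n_terms highest-weighted terms."""
    by_topic = defaultdict(list)
    for topic_id, term, weight in term_rows:
        by_topic[topic_id].append((weight, term))
    names = {}
    for topic_id, weighted in by_topic.items():
        # Sort by weight only, highest first, and keep the top n_terms.
        top = sorted(weighted, key=lambda wt: wt[0], reverse=True)[:n_terms]
        names[topic_id] = separator.join(term for _, term in top)
    return names

# Toy second-input rows: (topic ID, term, weight).
term_rows = [
    ("topic_0", "goal", 120), ("topic_0", "match", 95),
    ("topic_0", "team", 80), ("topic_0", "coach", 10),
    ("topic_1", "bank", 200), ("topic_1", "loan", 150),
    ("topic_1", "rate", 90),
]
names = rename_topics(term_rows, n_terms=2)
```

Here n_terms plays the role of the "Number of Terms to Be Used for Renaming" option: raising it yields longer, more descriptive names.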
