Topic Extractor (STM)

The component trains an STM topic model via unsupervised learning. It integrates with the R implementation of Structural Topic Models (STM), following Roberts, Stewart and Tingley, Journal of Statistical Software (2019) (cran.r-project.org/web/packages/stm/vignettes/stmVignette.pdf), via the R library 'stm' (cran.r-project.org/web/packages/stm).

On its first execution, the component automatically installs R and all the required libraries. For this to work you need to install Conda (we recommend via "docs.conda.io/en/latest/miniconda.html"). KNIME Analytics Platform can automatically detect the default Conda installation path. You can verify that KNIME Analytics Platform is using the correct path via "File > Preferences > KNIME > Conda".

DISCLAIMER: this component won't work on Apple M1 systems as the 'stm' package is not available for 'osx-arm64' via 'conda-forge' ("anaconda.org/conda-forge/r-stm"). For Apple Intel systems manual installation of additional software might be required after the Conda Environment Propagation node executes. For details visit: docs.knime.com/latest/r_installation_guide

Use the component settings to select a column of the Document type provided by the KNIME Textprocessing Extension. To create such a column, apply the Strings to Document node and any other required preprocessing (stopword removal, stemming, ...) upstream of this component.

Given K, the number of topics to be created, it returns the predicted topic for each document as well as a set of terms representing each of the K topics.

Optionally you can provide metadata columns and fields to the algorithm. Metadata fields are extracted from the document column type. Metadata columns are simply additional columns you provide at the input.

Make sure to provide an operator (+, -, /, *) for the automated 'Prevalence Formula' when you provide more than one metadata field/column.
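To illustrate how a single operator can turn several metadata variables into a formula, here is a minimal Python sketch. It is not the component's actual implementation: the function name build_prevalence_formula, its parameters, and the exact formula layout are assumptions made for illustration, and the column names in the usage example are invented.

```python
ALLOWED_OPERATORS = {"+", "-", "/", "*"}


def build_prevalence_formula(variables, operator=None, spline=None, numeric=()):
    """Join metadata variable names into a one-sided, R-style formula string.

    Hypothetical helper: all names and the output layout are assumptions.
    `spline` is an optional function name wrapped around numeric variables.
    """
    if not variables:
        return None  # no metadata selected, so no prevalence formula
    if len(variables) > 1 and operator not in ALLOWED_OPERATORS:
        raise ValueError("an operator is required for more than one variable")
    terms = []
    for name in variables:
        if spline and name in numeric:
            terms.append(f"{spline}({name})")  # wrap numeric variables only
        else:
            terms.append(name)
    if len(terms) == 1:
        return "~ " + terms[0]
    return "~ " + f" {operator} ".join(terms)


# Invented column names, for illustration only.
print(build_prevalence_formula(["source", "year"], "+", spline="s", numeric={"year"}))
# prints: ~ source + s(year)
```

With a single variable no operator is needed, which matches the note above that the operator only matters when more than one metadata field/column is provided.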

Options

Enable Optimal K Search (Computationally Expensive): Enable optimal K search via this checkbox. When enabled, K is optimized using a grid search that scores topic modeling performance on the held-out set. Provide the stepping, range and metric for this optimization below. Keep in mind that this might take several minutes. If enabled, the component also offers a composite view of the results (Right Click on Component > "Open Interactive View").
Select Additional Metadata from Input: Select the input columns containing additional metadata that you would like to use in your topic modeling analysis. Only string, double and integer types can be selected.
Select Document: Select the column with preprocessed documents. Apply the Strings to Document node and any other preprocessing required (stopwords removal, stemming, ...) before this component. These nodes can be found in the KNIME Textprocessing Extension.
Sigma Prior: A scalar between 0 and 1 which defaults to 0.
Proportion Held-Out Set: Proportion of documents to hold out in order to compute the scores returned at the last component output.
Number of Topics ( K ): The number of topics the model should find.
Seed: Seed to be adopted in the random number generator. Keep the same value to replicate results on the same input and settings.
Number of Terms per Topic: The number of terms to display for each topic, keeping those with the highest weights. The weight comes from the 'beta matrix' table from the R 'stm_tidiers' function. This parameter is also used to compute the last output via the function 'searchK'.
Optimal K Max: The upper bound of the optimal K search. The search goes from the K provided above to the value provided here.
Optimal K Stepping: Provide here the granularity of the grid search for the optimal K.
Select Metadata Fields from Document Column: This topic modeling algorithm uses metadata fields attached to the document column. Please select which fields you would like to use.
Document Language: Language setting to prepare the vocabulary for topic modeling.
Method of STM Initialization: The method of initialization, by default the spectral initialization.
Gamma Prior: Sets the prior estimation method for the prevalence covariate model.
Kappa Prior: Sets the prior estimation for the content covariate coefficients.
Select Operator for Prevalence Formula: Optionally select here which formula operator you would like to use between the metadata variables selected above. The component automatically creates a simple formula between metadata fields/columns.
Spline Functions for Prevalence Formula: Select a spline function if you want to apply it to numerical metadata fields/columns.
Optimal K Metric: Select which metric should be used to automatically optimize the parameter K. No precise method exists for selecting the best K automatically. Despite this, four metrics can help in making this decision: exclusivity, coherence, residual variance, and held-out likelihood. The higher the exclusivity, the more each topic is composed of terms that are unique to it. The higher the semantic coherence, the more the words included in each topic are similar to one another. The lower the residual variance, the better the model fits. The higher the held-out likelihood, the better the model predicts new documents. Increasing K should decrease coherence, increase exclusivity, and decrease residual variance, but can lead to overfitting, reducing the held-out likelihood.
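The interplay of 'Number of Topics ( K )', 'Optimal K Max', 'Optimal K Stepping' and 'Optimal K Metric' can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names k_grid and pick_best_k are hypothetical, the upper bound is assumed to be inclusive, and the scores in the usage example are made-up numbers, not results from the component.

```python
# Direction of each metric as described above: True means higher is better.
HIGHER_IS_BETTER = {
    "exclusivity": True,
    "coherence": True,
    "residual variance": False,
    "held-out likelihood": True,
}


def k_grid(k_start, k_max, step):
    """Candidate K values from k_start up to k_max (assumed inclusive)."""
    if step < 1:
        raise ValueError("step must be at least 1")
    if k_max < k_start:
        raise ValueError("k_max must not be smaller than k_start")
    return list(range(k_start, k_max + 1, step))


def pick_best_k(scores, metric):
    """Return the K whose score is best for the chosen metric.

    `scores` maps each tested K to a dict of metric name -> value.
    """
    chooser = max if HIGHER_IS_BETTER[metric] else min
    return chooser(scores, key=lambda k: scores[k][metric])


print(k_grid(5, 20, 5))  # prints: [5, 10, 15, 20]

# Made-up scores for illustration only.
example_scores = {
    5: {"residual variance": 2.4, "held-out likelihood": -7.9},
    10: {"residual variance": 1.8, "held-out likelihood": -7.5},
    15: {"residual variance": 1.6, "held-out likelihood": -7.7},
}
print(pick_best_k(example_scores, "held-out likelihood"))  # prints: 10
print(pick_best_k(example_scores, "residual variance"))    # prints: 15
```

The example also shows why the choice of metric matters: different metrics can point to different values of K, which is consistent with the remark that no precise method exists for selecting the best K automatically.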

Input Ports

: Data table with the document collection to analyze in the KNIME Textprocessing column type (use the 'Strings to Document' node first). Each row contains one document. Documents can be pre-processed (stopwords removal, stemming, ...).

Output Ports

: The R object with the trained model. Use the component "Topic Assigner (STM)" to apply this model to new documents.
: The document collection with topic assignments and the probability for each document to belong to a certain topic. These probabilities are taken from the gamma/theta matrix returned by the 'stm_tidiers' R function. Rows with missing text or missing values in the selected metadata fields/columns are assigned missing values.
: The topic models with the terms and their weight per topic. The weight is taken from the beta matrix returned by the 'stm_tidiers' R function. The table lists a maximum number of terms per topic based on the component setting.
: A table listing metrics for the model on an automatically held-out partition of documents. One row for each K tested is provided if the "Optimal K Search" is enabled. No precise method exists for selecting the best K automatically. Despite this, four metrics can help in making this decision: exclusivity, coherence, residual variance, and held-out likelihood. The higher the exclusivity, the more each topic is composed of terms that are unique to it. The higher the semantic coherence, the more the words included in each topic are similar to one another. The lower the residual variance, the better the model fits. The higher the held-out likelihood, the better the model predicts new documents. Increasing K should decrease coherence, increase exclusivity, and decrease residual variance, but can lead to overfitting, reducing the held-out likelihood.
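As a rough illustration of how the second and third outputs relate to the gamma/theta and beta matrices, the following Python sketch assigns each document to its most probable topic and keeps the highest-weighted terms per topic. It is a sketch under assumptions, not the component's code: the function names are hypothetical, picking the most probable topic is assumed to be how the predicted topic is derived, and all matrix values and terms are invented.

```python
def assign_topics(gamma):
    """Index of the most probable topic for each document.

    `gamma` holds one row per document with K topic probabilities.
    Assumption: the predicted topic is the one with the highest probability.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in gamma]


def top_terms(beta, vocab, n):
    """The n highest-weighted terms for each topic.

    `beta` holds one row per topic with one weight per entry in `vocab`.
    """
    result = []
    for weights in beta:
        ranked = sorted(zip(vocab, weights), key=lambda pair: pair[1], reverse=True)
        result.append([term for term, _ in ranked[:n]])
    return result


# Invented values for illustration only.
gamma = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
vocab = ["tax", "goal", "budget"]
beta = [[0.5, 0.1, 0.4], [0.06, 0.9, 0.04]]

print(assign_topics(gamma))       # prints: [0, 2]
print(top_terms(beta, vocab, 2))  # prints: [['tax', 'budget'], ['goal', 'tax']]
```

Here n plays the role of the 'Number of Terms per Topic' setting, which caps how many terms are listed for each topic.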
