Keygraph Keyword Extractor

This node analyses documents and extracts relevant keywords using the graph-based approach described in "KeyGraph: Automatic Indexing by Co-occurrence Graph based on Building Connstruction Metaphor" by Yukio Ohsawa.
First, a predetermined amount of terms are selected based on their frequency (high frequency set, HF) and added as the initial nodes of the graph.
The association strength between each of these terms is then calculated using the following scoring method: assoc(term1, term2) = min(occurrence frequency of term1, occurrence frequency of term2) summed for every sentence in the document. The top |HF|-1 associations are inserted into the graph as edges.
If an edge between two terms is the only path that connects them, it is pruned.
The graph's connected subgraphs are then extracted and considered as "concept" clusters. A new batch of terms is added based on their key score, which is the conditional probability that a term will be used if the author has all the concepts (clusters) in mind (P(UNION(w|g)) where t is the term and the union is done over every cluster g of the set of clusters.
Each of these new terms is then linked to every cluster using the strongest scoring edge amongst the possible ones.
Finally, all the terms in the graph are rated based on this formula: score(t) = summation over every edge connecting t and other terms (w), summation over every sentences, min(freq(t), freq(w)).
Setting the console's output level to DEBUG will make this node display the contents of the clusters after the pruning phase. terms.

Options

Document column: The name of the column which contains the documents to analyse.
Number of keywords to extract: The number of keywords to extract per document.
Size of the high frequency terms set: The number of terms to use for the high frequency terms set. The article this node is based on provides 30 as a rule of thumb.
Since only the terms present in the graph are evaluated as potential keywords, this parameter should be greater than or equal to the number of keywords that you want to extract.
Size of the high key terms set: The number of terms to use for the high key terms set. The article this node is based on provides 12 as a rule of thumb.
Ignore tags: If this option is checked, the node will only compare terms based on their word content. In other words, tags and any other meta information will be ignored. This will not affect the output documents, only the way they are analysed.

Input Ports

: The input table which contains the documents to analyse.

Output Ports

: The output table which contains (keyword term, score, associated document) tuples.

Popular Predecessors

Popular Successors

Views

This node has no views

Workflows

Developers

You want to see the source code for this node? Click the following button and we’ll use our super-powers to find it for you.

Installation

To use this node in KNIME, install the extension KNIME Textprocessing from the below update site following our NodePit Product and Node Installation Guide:

v5.5

A zipped version of the software site can be downloaded here.

Plugin provider: KNIME AG, Zurich, Switzerland

Plugin version: 5.5.0.v202412191419

On NodePit since: 2025-07-02

Last update: 2025-07-30

KNIME versions: Since v3.6

Deploy, schedule, execute, and monitor your KNIME workflows locally, in the cloud or on-premises – with our brand new NodePit Runner.

Try NodePit Runner!