Keygraph Keyword Extractor

This node analyses documents and extracts relevant keywords using the graph-based approach described in "KeyGraph: Automatic Indexing by Co-occurrence Graph based on Building Construction Metaphor" by Yukio Ohsawa.
First, a predetermined number of terms is selected based on frequency (the high frequency set, HF) and added as the initial nodes of the graph.
The association strength between each pair of these terms is then calculated using the following scoring method: assoc(term1, term2) = min(occurrence frequency of term1 in the sentence, occurrence frequency of term2 in the sentence), summed over every sentence in the document. The top |HF|-1 associations are inserted into the graph as edges.
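The association step can be sketched in Python as follows. This is a minimal illustration, not the node's actual implementation: the function names, the representation of a document as a list of token lists, and the choice to skip zero-strength pairs are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def association_strengths(sentences, hf_terms):
    """Return {(term1, term2): assoc} for every pair of high frequency terms.

    sentences: list of sentences, each a list of tokens (assumed format).
    hf_terms:  the terms of the high frequency set (HF).
    """
    hf = sorted(set(hf_terms))
    assoc = Counter()
    for sentence in sentences:
        counts = Counter(sentence)
        for a, b in combinations(hf, 2):
            # min of the two in-sentence frequencies; 0 if either term is absent
            assoc[(a, b)] += min(counts[a], counts[b])
    return assoc


def top_edges(assoc, hf_size):
    """Keep the |HF| - 1 strongest associations as graph edges."""
    ranked = sorted(assoc.items(), key=lambda kv: (-kv[1], kv[0]))
    # Assumption: pairs that never co-occur are not turned into edges.
    return [pair for pair, strength in ranked[: hf_size - 1] if strength > 0]
```

Ties are broken alphabetically here only to keep the output deterministic; the original article does not prescribe a tie-breaking rule.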
If an edge between two terms is the only path that connects them, it is pruned.
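An edge that is the only path between its two endpoints is what graph theory calls a bridge. The sketch below shows one simple way to find and drop such edges and then collect the connected subgraphs. It is an illustration under assumptions: the helper names are hypothetical, edges are plain tuples, and the brute-force search is chosen for readability rather than speed.

```python
from collections import defaultdict, deque


def _adjacency(edges):
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def _reachable(adj, start, goal, skipped):
    """Breadth-first search from start to goal that ignores the edge `skipped`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in adj[node]:
            if {node, nxt} == set(skipped) or nxt in seen:
                continue
            seen.add(nxt)
            queue.append(nxt)
    return False


def prune_bridges(edges):
    """Drop every edge whose endpoints have no alternative path between them."""
    adj = _adjacency(edges)
    return [e for e in edges if _reachable(adj, e[0], e[1], skipped=e)]


def clusters(nodes, edges):
    """Return the connected subgraphs as a list of node sets."""
    adj = _adjacency(edges)
    seen, result = set(), []
    for node in nodes:
        if node in seen:
            continue
        component, queue = {node}, deque([node])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        result.append(component)
    return result
```

For a triangle a-b-c with an extra edge c-d, the edge c-d is pruned and the result is the cluster {a, b, c} plus the isolated term d. Whether the node treats a single isolated term as a cluster is not stated here, so that detail is left open.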
The graph's connected subgraphs are then extracted and treated as "concept" clusters. A new batch of terms is added based on their key score, which is the conditional probability that a term will be used if the author has all the concepts (clusters) in mind: P(UNION(w|g)), where w is the term and the union is taken over every cluster g in the set of clusters.
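One common way to evaluate the probability of such a union is to combine the per-cluster probabilities as 1 minus the product of their complements. The sketch below assumes that the per-cluster values P(w|g) are already available and that the clusters can be treated as independent; both are assumptions of this example, and the node may compute the score differently.

```python
from math import prod


def key_score(cluster_probabilities):
    """Combine per-cluster probabilities P(w|g) into a single key score.

    Assumes the clusters are independent, so the probability that at least
    one cluster leads to the term being used is 1 - prod(1 - P(w|g)).
    """
    return 1.0 - prod(1.0 - p for p in cluster_probabilities)
```

For example, a term with probability 0.5 under each of two clusters gets a key score of 0.75, which is higher than under either cluster alone.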
Each of these new terms is then linked to every cluster using the strongest scoring edge amongst the possible ones.
Finally, all the terms in the graph are rated using this formula: score(t) = the sum, over every edge connecting t to another term w, of the sum over every sentence of min(freq(t), freq(w)).
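The final rating can be sketched as below. This is an illustrative reading of the formula, assuming freq is the in-sentence frequency, sentences are lists of tokens, and edges are pairs of terms; the function names are hypothetical.

```python
from collections import Counter


def term_scores(sentences, edges):
    """score(t) = sum over edges (t, w) of sum over sentences of min(freq(t), freq(w))."""
    sentence_counts = [Counter(s) for s in sentences]
    scores = Counter()
    for t, w in edges:
        strength = sum(min(c[t], c[w]) for c in sentence_counts)
        # Each edge contributes its strength to both of its endpoints.
        scores[t] += strength
        scores[w] += strength
    return scores


def top_keywords(scores, n):
    """Return the n best (term, score) pairs, highest score first."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

Terms that sit on many strong edges score highest, which is why only terms present in the graph can be returned as keywords.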
Setting the console's output level to DEBUG will make this node display the contents of the clusters after the pruning phase.

Options

Document column
The name of the column which contains the documents to analyse.
Number of keywords to extract
The number of keywords to extract per document.
Size of the high frequency terms set
The number of terms to use for the high frequency terms set. The article this node is based on provides 30 as a rule of thumb.
Since only the terms present in the graph are evaluated as potential keywords, this parameter should be greater than or equal to the number of keywords that you want to extract.
Size of the high key terms set
The number of terms to use for the high key terms set. The article this node is based on provides 12 as a rule of thumb.
Ignore tags
If this option is checked, the node will only compare terms based on their word content. In other words, tags and any other meta information will be ignored. This will not affect the output documents, only the way they are analysed.

Input Ports

The input table which contains the documents to analyse.

Output Ports

The output table which contains (keyword term, score, associated document) tuples.

Views

This node has no views
