Chi-Square Keyword Extractor

This node analyses documents and extracts relevant keywords using cooccurrence statistics, as described in "Keyword extraction from a single document using word co-occurrence statistical information" by Y. Matsuo and M. Ishizuka.
First, the most frequent terms (see node settings) are extracted and then clustered together using the pointwise mutual information and a normalized version of the L1 norm as measures of distance between their cooccurrence probability distributions.
A term is considered a member of a cluster if it is similar to every term already in that cluster according to at least one of the similarity measures. If more than one cluster meets this condition, the one with the highest average score is used. If no cluster is similar, a new one is created.
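The assignment rule above can be sketched as follows. This is only an illustration, not the node's actual implementation: the function name `assign_term` and the callables `pmi` and `l1` are hypothetical, the check is assumed to be applied pair by pair (either measure may pass for each pair), and the "average score" is assumed to be the mean of the better of the two measures over the cluster members.

```python
from statistics import mean


def assign_term(term, clusters, pmi, l1, pmi_threshold, l1_threshold):
    """Place `term` in the best matching cluster, or start a new one.

    `clusters` is a list of lists of terms. `pmi` and `l1` are hypothetical
    callables returning a similarity in [0, 1] for a pair of terms.
    Returns the index of the cluster that received the term.
    """
    best_index, best_score = None, -1.0
    for index, members in enumerate(clusters):
        if not members:
            continue
        scores = []
        for member in members:
            p, d = pmi(term, member), l1(term, member)
            if p < pmi_threshold and d < l1_threshold:
                break  # not similar to this member, so the cluster is rejected
            scores.append(max(p, d))  # assumption: keep the better measure
        else:
            # Reached only when the term is similar to every member.
            average = mean(scores)
            if average > best_score:
                best_index, best_score = index, average
    if best_index is None:
        clusters.append([term])  # no similar cluster: create a new one
        return len(clusters) - 1
    clusters[best_index].append(term)
    return best_index
```

Calling `assign_term` once per frequent term, starting from an empty `clusters` list, builds the clusters incrementally; the result may depend on the order in which the terms are processed.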
Once this is done, the terms are ranked in decreasing order of the deviation between their expected cooccurrence with each cluster and the cooccurrence actually observed. The terms with the highest deviation are returned as keywords.
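A minimal sketch of this ranking step, assuming the deviation is a basic chi-square statistic summed over clusters. The function names and input layout are hypothetical, and the total observed count is used as the normalising factor for simplicity; the node and the original article may compute the expected values differently.

```python
def chi_square_deviation(observed, expected_prob):
    """Chi-square deviation of one term.

    observed: dict mapping cluster id -> cooccurrence count with that cluster.
    expected_prob: dict mapping cluster id -> expected cooccurrence probability.
    """
    total = sum(observed.values())
    score = 0.0
    for cluster, probability in expected_prob.items():
        expected = total * probability
        if expected > 0:
            score += (observed.get(cluster, 0) - expected) ** 2 / expected
    return score


def rank_terms(cooccurrences, expected_prob, n_keywords):
    """Return the `n_keywords` terms with the highest deviation, best first."""
    scores = {
        term: chi_square_deviation(observed, expected_prob)
        for term, observed in cooccurrences.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n_keywords]
```

A term that cooccurs with the clusters in the expected proportions scores 0, while a term that is strongly biased towards one cluster scores high and is therefore more likely to be returned as a keyword.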
Setting the console's output level to DEBUG will make this node display the set of frequent terms, the distance between them during the clustering phase, and the final clusters.

Options

Document column
The name of the column which contains the documents to analyse.
Number of keywords to extract
The number of keywords to extract per document.
Percentage of unique terms in the document to use for the chi-square measures
The percentage of the set of unique terms in the document to use to build the term clusters. The article this node is based on provides 30% as a rule of thumb.
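As an illustration of how this setting could translate into a term selection, the sketch below keeps the most frequent share of a document's unique terms. The function name `frequent_terms`, the rounding up, and the minimum of one term are assumptions, not documented behaviour of the node.

```python
import math
from collections import Counter


def frequent_terms(tokens, percentage=30):
    """Return the most frequent `percentage` percent of the unique terms."""
    counts = Counter(tokens)
    # Assumption: round up and always keep at least one term.
    n_terms = max(1, math.ceil(len(counts) * percentage / 100))
    return [term for term, _ in counts.most_common(n_terms)]
```

For a document with 10 unique terms and the default of 30%, this keeps the 3 most frequent terms as the basis for the clusters.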
Ignore tags
If this option is checked, the node will only compare terms based on their word content. In other words, tags and any other meta information will be ignored. This will not affect the output documents, only the way they are analysed.
Pointwise mutual information threshold
Terms whose pointwise mutual information score is greater than or equal to this value are considered similar and thus clustered together.
This similarity measure typically ranges from 0 to infinity but has been normalized to the range 0 to 1 using arctan(value)/(pi/2). It measures the discrepancy between the observed cooccurrence probability and the probability that would be expected if both terms were completely independent.
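The normalization can be sketched as below. Because the raw score is described as ranging from 0 to infinity, this sketch assumes it is the ratio P(a, b) / (P(a) P(b)) without a logarithm; the function name and parameters are hypothetical, and all probabilities are assumed to be strictly positive for the two individual terms.

```python
import math


def normalized_pmi(p_joint, p_a, p_b):
    """Map the cooccurrence ratio to [0, 1) with arctan(value) / (pi / 2)."""
    # Assumption: the raw score is the ratio of the observed joint probability
    # to the joint probability expected under independence.
    ratio = p_joint / (p_a * p_b)
    return math.atan(ratio) / (math.pi / 2)
```

Under this assumption, two independent terms have a ratio of 1 and a normalized score of 0.5, which gives a reference point when choosing the threshold.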
Normalized L1 norm threshold
Terms whose normalized L1 norm score is greater than or equal to this value are considered similar and thus clustered together.
This similarity measure ranges from 0 to 1 inclusive. It compares the cooccurrence probability distributions of the two terms over every term in the document (P(t|first term) vs P(t|second term) for every possible t).
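One way such a score could be computed is sketched below. The L1 distance between two probability distributions lies between 0 and 2, so this sketch assumes the normalization is 1 - distance / 2; the exact formula used by the node is not stated here, and the function name and input layout are hypothetical.

```python
def l1_similarity(dist_a, dist_b):
    """Similarity in [0, 1] between two conditional cooccurrence distributions.

    dist_a and dist_b map each term t to P(t | first term) and
    P(t | second term) respectively; missing terms count as probability 0.
    """
    terms = set(dist_a) | set(dist_b)
    distance = sum(abs(dist_a.get(t, 0.0) - dist_b.get(t, 0.0)) for t in terms)
    # Assumption: rescale the L1 distance (0 to 2) into a similarity (1 to 0).
    return 1.0 - distance / 2.0
```

With this convention, identical distributions score 1 and distributions with no terms in common score 0.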

Input Ports

The input table which contains the documents to analyse.

Output Ports

The output table which contains (keyword term, deviation value, associated document) tuples.

Views

This node has no views.
