This node analyses documents and extracts relevant keywords
using co-occurrence statistics, as described in
"Keyword extraction from a single document using word co-occurrence
statistical information" by Y. Matsuo and M. Ishizuka.

First, the most frequent terms (see node settings) are extracted and
then clustered together, using the pointwise mutual information and
a normalized version of the L1 norm as measures of distance between
their co-occurrence probability distributions.
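
The first step can be sketched roughly as follows. This is a minimal illustration, not the node's actual implementation: it assumes documents are already tokenized into sentences (lists of terms) and that two terms co-occur when they share a sentence, as in the cited article. The function names `frequent_terms` and `cooccurrence_counts` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def frequent_terms(sentences, percentage):
    """Return the most frequent `percentage` percent of the unique terms."""
    counts = Counter(term for sentence in sentences for term in sentence)
    keep = max(1, math.ceil(len(counts) * percentage / 100))
    return [term for term, _ in counts.most_common(keep)]


def cooccurrence_counts(sentences):
    """Count how often each unordered pair of distinct terms shares a sentence."""
    pairs = Counter()
    for sentence in sentences:
        # Sorting gives every pair a canonical (alphabetical) key.
        for a, b in combinations(sorted(set(sentence)), 2):
            pairs[(a, b)] += 1
    return pairs


sentences = [
    ["machine", "learning", "model"],
    ["machine", "learning", "data"],
    ["machine", "statistics"],
]
top = frequent_terms(sentences, 40)
pairs = cooccurrence_counts(sentences)
```

The pair counts are the raw material from which the co-occurrence probability distributions used by the two distance measures would be estimated.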

A term is considered a member of a cluster if it is similar to
all the terms inside it according to at least one of the similarity
measures. If more than one cluster meets this condition, the
one with the highest average score is used. If no cluster
is similar, a new one is created.
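
One possible reading of this assignment rule is sketched below. It assumes that each measure is a function returning a score between 0 and 1, that a measure "accepts" a cluster when the score against every member reaches its threshold, and that the average is taken over that measure's scores. These details and the name `cluster_terms` are assumptions for illustration; the node may resolve them differently.

```python
def cluster_terms(terms, measures):
    """Greedily cluster terms.

    `measures` is a list of (score_function, threshold) pairs, where
    score_function(a, b) returns a similarity in [0, 1].
    """
    clusters = []
    for term in terms:
        best_cluster, best_average = None, -1.0
        for cluster in clusters:
            for score, threshold in measures:
                scores = [score(term, member) for member in cluster]
                # The term must be similar to every member under this measure.
                if all(s >= threshold for s in scores):
                    average = sum(scores) / len(scores)
                    if average > best_average:
                        best_cluster, best_average = cluster, average
        if best_cluster is None:
            clusters.append([term])  # no similar cluster: start a new one
        else:
            best_cluster.append(term)
    return clusters


# Toy similarity table, purely illustrative.
toy = {
    frozenset(("cat", "dog")): 0.9,
    frozenset(("cat", "car")): 0.1,
    frozenset(("dog", "car")): 0.2,
}


def toy_score(a, b):
    return toy.get(frozenset((a, b)), 0.0)


clusters = cluster_terms(["cat", "dog", "car"], [(toy_score, 0.5)])
```

Because the procedure is greedy, the result can depend on the order in which terms are visited.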

Once this is done, the terms are ranked
in decreasing order of the deviation between their expected cluster
co-occurrence and the actual observed co-occurrence value. The terms
with the highest deviation are returned as keywords.
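
The ranking step can be illustrated with a chi-square style score, in the spirit of the cited article. The sketch below is an assumption about the general shape of the computation, not the node's code: `observed` holds a term's co-occurrence counts per cluster, `expected_prob` holds the expected probability of each cluster, and the term's total is taken as the sum of its observed counts. All names are hypothetical.

```python
def chi_square_score(observed, expected_prob, total):
    """Sum of (observed - expected)^2 / expected over all clusters."""
    score = 0.0
    for cluster, probability in expected_prob.items():
        expected = total * probability
        if expected > 0:
            score += (observed.get(cluster, 0) - expected) ** 2 / expected
    return score


def rank_keywords(term_stats, expected_prob, k):
    """Return the k terms whose observed counts deviate most from expectation."""
    scored = {
        term: chi_square_score(observed, expected_prob, sum(observed.values()))
        for term, observed in term_stats.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:k]


expected_prob = {"A": 0.5, "B": 0.5}
term_stats = {
    "uniform": {"A": 5, "B": 5},  # matches the expectation exactly
    "biased": {"A": 9, "B": 1},   # strongly tied to cluster A
}
keywords = rank_keywords(term_stats, expected_prob, 1)
```

A term that co-occurs with the clusters in the expected proportions scores 0, while a term biased towards one cluster scores high and is more likely to be returned as a keyword.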

Setting the console's output level to DEBUG will make this node
display the set of frequent terms, the distance between them during
the clustering phase, and the final clusters.

- Document column: The name of the column which contains the documents to analyse.
- Number of keywords to extract: The number of keywords to extract per document.
- Percentage of unique terms in the document to use for the chi-square measures: The percentage of the set of unique terms in the document to use to build the term clusters. The article this node is based on provides 30% as a rule of thumb.
- Ignore tags: If this option is checked, the node will only compare terms based on their word content. In other words, tags and any other meta information will be ignored. This will not affect the output documents, only the way they are analysed.
- Pointwise mutual information threshold: Terms whose pointwise mutual information score is greater than or equal to this value will be considered similar and thus clustered together. This similarity measure typically ranges from 0 to infinity but has been normalized to the range 0 to 1 using arctan(value)/(pi/2). It measures the discrepancy between the actual co-occurrence probability and the one expected if both terms were completely independent.
- Normalized L1 norm threshold: Terms whose normalized L1 norm score is greater than or equal to this value will be considered similar and thus clustered together. This similarity measure ranges from 0 to 1 inclusive. It compares the co-occurrence probability distributions of the two terms, that is, P(t|first term) vs P(t|second term) for every possible term t.
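
The two normalizations mentioned in the threshold options can be sketched as follows. The arctan mapping is the one stated above. The L1 part is an assumption: the description does not say how the L1 norm is normalized, so this sketch uses 1 - distance/2, which works because the L1 distance between two probability distributions lies between 0 and 2. The function names are hypothetical.

```python
import math


def normalize_unbounded(value):
    """Map a value in [0, infinity) to [0, 1) using arctan(value) / (pi / 2)."""
    return math.atan(value) / (math.pi / 2)


def l1_similarity(p, q):
    """Similarity in [0, 1] between two distributions given as term -> probability.

    Assumed normalization: 1 - (L1 distance / 2).
    """
    terms = set(p) | set(q)
    distance = sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in terms)
    return 1.0 - distance / 2.0


same = l1_similarity({"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5})
disjoint = l1_similarity({"a": 1.0}, {"b": 1.0})
```

Under these assumptions, identical distributions have similarity 1 and distributions with no terms in common have similarity 0, so a threshold closer to 1 yields smaller, tighter clusters.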

- This node has no views

- 01_Emil_the_TeacherBot (KNIME Hub)
- 01_Initial_Model_Training (KNIME Hub)
- 02_AL_First_Try_Assign_Classes_via_Distance (KNIME Hub)
- 02_Document_Classification (KNIME Hub)
- 03_AL_Training_Subset_Uncertain_Classes (KNIME Hub)


To use this node in KNIME, install the KNIME Textprocessing extension from the v4.5 update site.
