This node analyses documents and extracts relevant keywords
using co-occurrence statistics, as described in
"Keyword extraction from a single document using word co-occurrence
statistical information" by Y. Matsuo and M. Ishizuka.
First, the most frequent terms (see node settings) are extracted. These terms are then clustered, using the pointwise mutual information and a normalized version of the L1 norm as measures of distance between their co-occurrence probability distributions.
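The two measures can be sketched roughly as follows. This is an illustrative sketch only, not the node's actual implementation: the function names, the choice of the sentence as the co-occurrence window, and the normalization of the L1 norm by a factor of 0.5 (so that the distance between two probability distributions falls in [0, 1]) are all assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count term occurrences and unordered term pairs per sentence.

    Assumption: two terms co-occur when they appear in the same sentence.
    """
    term_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        terms = set(sentence)
        term_counts.update(terms)
        for a, b in combinations(sorted(terms), 2):
            pair_counts[(a, b)] += 1
    return term_counts, pair_counts


def pmi(a, b, term_counts, pair_counts, n_sentences):
    """Pointwise mutual information (base 2) of two terms."""
    joint = pair_counts.get(tuple(sorted((a, b))), 0)
    if joint == 0:
        return float("-inf")  # the terms never co-occur
    p_ab = joint / n_sentences
    p_a = term_counts[a] / n_sentences
    p_b = term_counts[b] / n_sentences
    return math.log2(p_ab / (p_a * p_b))


def normalized_l1(p, q):
    """L1 distance between two probability distributions (dicts), scaled to [0, 1].

    Assumption: the normalization is a division by 2, the maximum L1
    distance between two probability distributions.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

A high PMI indicates that two terms appear together more often than chance would predict, while a small normalized L1 distance indicates that two terms co-occur with the rest of the vocabulary in a similar way.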
A term is considered a member of a cluster if it is similar to every term in that cluster according to at least one of the similarity measures. If more than one cluster meets this condition, the one with the highest average score is used. If no cluster qualifies, a new one is created.
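The assignment rule described above can be sketched as a greedy procedure. This is a minimal sketch under stated assumptions, not the node's source code: `is_similar` and `score` are hypothetical callables supplied by the caller, where `is_similar(a, b)` is true when at least one similarity measure accepts the pair and `score(a, b)` returns a number for which higher means more similar.

```python
def cluster_terms(terms, is_similar, score):
    """Greedily assign each term to a cluster, in the order given.

    is_similar(a, b) -> bool and score(a, b) -> float are hypothetical
    helpers; thresholds for the measures would live inside is_similar.
    """
    clusters = []
    for term in terms:
        best_cluster, best_avg = None, float("-inf")
        for cluster in clusters:
            # The term must be similar to every member of the cluster.
            if all(is_similar(term, member) for member in cluster):
                avg = sum(score(term, member) for member in cluster) / len(cluster)
                # Among qualifying clusters, keep the highest average score.
                if best_cluster is None or avg > best_avg:
                    best_cluster, best_avg = cluster, avg
        if best_cluster is None:
            clusters.append([term])  # no cluster qualifies: start a new one
        else:
            best_cluster.append(term)
    return clusters
```

For example, with numbers standing in for terms, `cluster_terms([0, 10, 4], lambda a, b: abs(a - b) <= 6, lambda a, b: -abs(a - b))` places 4 with 0 rather than with 10, because both clusters qualify but the first has the higher average score. Note that the result of a greedy pass like this depends on the order in which terms are processed.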
Once clustering is complete, the terms are ranked in decreasing order of the deviation between their expected co-occurrence with the clusters and the co-occurrence actually observed. The terms with the highest deviation are returned as keywords.
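One way to picture the ranking step is with a chi-squared style statistic, as used in the cited paper. The sketch below assumes that form; whether the node uses exactly this formula is not stated here, and the function name and the shape of the inputs are illustrative assumptions.

```python
def rank_by_deviation(observed, cluster_prob, term_totals):
    """Rank terms by how far observed cluster co-occurrence departs from expectation.

    observed[term][cluster] : observed co-occurrence count of term with cluster
    cluster_prob[cluster]   : expected probability of co-occurring with cluster
    term_totals[term]       : total number of co-occurrences involving the term

    Assumption: deviation is measured with a chi-squared sum,
    (observed - expected)^2 / expected, over all clusters.
    """
    scores = {}
    for term, row in observed.items():
        total = term_totals[term]
        chi2 = 0.0
        for cluster, p_c in cluster_prob.items():
            expected = total * p_c
            if expected > 0:
                chi2 += (row.get(cluster, 0) - expected) ** 2 / expected
        scores[term] = chi2
    # Highest deviation first: these are the keyword candidates.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A term that co-occurs with every cluster in proportion to the clusters' overall frequency scores close to zero, while a term that is strongly biased towards particular clusters scores high and is more likely to be returned as a keyword.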
Setting the console's output level to DEBUG makes this node display the set of frequent terms, the distances between them during the clustering phase, and the final clusters.