Unique Term Extractor

This node extracts the set of unique terms across all input documents. Optionally, the output can be restricted to the top k terms ranked by a frequency measure. Three measures are available for filtering: the term frequency, the document frequency and the inverse document frequency.

  • Term Frequency (TF): Overall count of a term in all documents.
  • Document Frequency (DF): Number of documents in which a term occurs.
  • Inverse Document Frequency (IDF): The logarithm of the total number of documents divided by the DF.
More information about term frequencies can be found here.
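
The three definitions above can be illustrated with a minimal Python sketch. It is not the node's implementation: the function name term_statistics, the pre-tokenized toy documents and the choice of the natural logarithm are all assumptions made for the example, since the description does not state which logarithm base is used.

```python
import math
from collections import Counter

def term_statistics(documents):
    """Return {term: (tf, df, idf)} for a list of tokenized documents.

    Hypothetical helper for illustration only.
    """
    tf = Counter()  # overall count of each term in all documents
    df = Counter()  # number of documents in which each term occurs
    for tokens in documents:
        tf.update(tokens)
        df.update(set(tokens))  # count each term at most once per document
    n_docs = len(documents)
    # IDF as described above: log(total documents / DF).
    # The natural logarithm is an assumption.
    return {term: (tf[term], df[term], math.log(n_docs / df[term])) for term in tf}

# Toy corpus of three already tokenized documents (made up for this example).
docs = [["red", "apple", "red"], ["green", "apple"], ["red", "pear"]]
stats = term_statistics(docs)
```

Here "red" has TF 3 and DF 2, while "pear" has TF 1 and DF 1, so "pear" receives the larger IDF.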

Options

Document column
Select the document column to extract the terms from.
Most frequent terms (k)
Check this option to restrict the output table to the top k most frequent terms.
Filter terms by
If the 'Most frequent terms (k)' option is checked, the terms are sorted by the selected frequency method (TF, DF or IDF). Only the top-k most frequent terms are then added to the data table.
Append index column
If checked, the node appends an index column containing a unique index for each term. This is especially useful for replacing words with numbers while preparing documents for deep learning.
Append frequency columns
If checked, the node appends a term frequency (TF), document frequency (DF) and inverse document frequency (IDF) column.
Number of threads
The number of threads used to process the documents.
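
As a rough illustration of how the 'Most frequent terms (k)', 'Filter terms by' and 'Append index column' options combine, the sketch below ranks terms by a chosen measure, keeps the top k, assigns each kept term an index and replaces words with numbers. All names (top_k_terms, build_index, encode), the statistics, the tie-breaking order, the zero-based start index and the -1 placeholder for unknown terms are assumptions made for the example, not the node's documented behavior.

```python
import math

# Hypothetical (TF, DF, IDF) statistics per term for a three-document corpus.
stats = {
    "red": (3, 2, math.log(3 / 2)),
    "apple": (2, 2, math.log(3 / 2)),
    "green": (1, 1, math.log(3 / 1)),
    "pear": (1, 1, math.log(3 / 1)),
}

def top_k_terms(stats, k, method="TF"):
    """Return the k terms with the highest value of the chosen measure."""
    column = {"TF": 0, "DF": 1, "IDF": 2}[method]
    # Sort descending by the selected measure; ties keep dictionary order (assumption).
    ranked = sorted(stats, key=lambda term: stats[term][column], reverse=True)
    return ranked[:k]

def build_index(terms, start=0):
    """Assign a unique integer to each term; the start value is an assumption."""
    return {term: i for i, term in enumerate(terms, start=start)}

def encode(tokens, index, unknown=-1):
    """Replace each token by its index; terms outside the vocabulary map to `unknown`."""
    return [index.get(token, unknown) for token in tokens]

vocabulary = top_k_terms(stats, k=2, method="TF")  # ["red", "apple"]
index = build_index(vocabulary)                    # {"red": 0, "apple": 1}
encoded = encode(["red", "apple", "red", "pear"], index)
print(encoded)  # [0, 1, 0, -1]
```

Integer sequences of this kind are what an embedding layer in a deep learning model typically expects as input, which is the use case the index column is meant to support.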

Input Ports

The input table containing the documents.

Output Ports

An output table containing a unique term column and, if the corresponding options are checked, frequency columns and an index column.

Views

This node has no views.
