Document Preprocessing

Document Preprocessing applies a common sequence of preprocessing steps to clean and prepare text for subsequent analysis and comparison with other text. It expects a column containing documents as input and produces the preprocessed documents as output.

The component requires the following extension:
- KNIME Textprocessing
https://hub.knime.com/knime/extensions/org.knime.features.ext.textprocessing/latest

Options

Erase all punctuations?
If this option is enabled, all punctuation is removed from the documents.
Convert all terms to lower case
Enable or disable (skip) conversion of all words in a document to lower case.
Filter stopwords
Enable or disable (skip) the removal of "stopwords" from documents.
Stemm terms?
Enable or disable (skip) the conversion of words to their word stems in documents.
Document column
Column containing documents to be processed.
Minimum percentage of documents per term
Minimum percentage of documents in which a term must appear to be kept. Terms that appear in a smaller share of the documents are removed.
Maximum percentage of documents per term
Maximum percentage of documents in which a term may appear and still be kept. Terms that appear in a larger share of the documents are removed.
Minimum number of letters per word
Words shorter than this length will be removed from documents.
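
The options above can be sketched in plain Python to show what each one does to the text. This is an illustration only, not the component's implementation: the order of the steps, the tiny STOPWORDS list, the toy naive_stem function, whitespace tokenization, and the inclusive handling of the percentage thresholds are all assumptions made for this example.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny stop word list used only for this sketch.
STOPWORDS = {"the", "a", "an", "and", "of", "is", "are", "on", "in", "to"}


def naive_stem(word):
    """Toy suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(docs, erase_punctuation=True, lower_case=True,
               filter_stopwords=True, stem=True,
               min_doc_pct=0.0, max_doc_pct=100.0, min_letters=1):
    """Return one list of terms per input document."""
    strip_punct = str.maketrans("", "", string.punctuation)
    tokenized = []
    for doc in docs:
        if erase_punctuation:
            doc = doc.translate(strip_punct)
        if lower_case:
            doc = doc.lower()
        terms = doc.split()  # assumption: split on whitespace
        if filter_stopwords:
            terms = [t for t in terms if t.lower() not in STOPWORDS]
        # Minimum number of letters per word.
        terms = [t for t in terms if len(t) >= min_letters]
        if stem:
            terms = [naive_stem(t) for t in terms]
        tokenized.append(terms)

    # Count, for every term, the number of documents that contain it.
    doc_freq = Counter(t for terms in tokenized for t in set(terms))
    n_docs = len(tokenized)

    def keep(term):
        pct = 100.0 * doc_freq[term] / n_docs
        return min_doc_pct <= pct <= max_doc_pct  # assumption: inclusive bounds

    return [[t for t in terms if keep(t)] for terms in tokenized]


docs = ["The cats are running!", "Cats sleep.", "Dogs barked, loudly."]
print(preprocess(docs, min_doc_pct=50))  # [['cat'], ['cat'], []]
```

In this example only "cat" appears in at least 50% of the three documents, so every other term is dropped by the minimum percentage filter. The third document ends up empty.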

Input Ports

The input table which contains the documents to preprocess.

Output Ports

Documents which have passed through the preprocessing steps.
