Document Preprocessing

Document Preprocessing applies a common sequence of preprocessing steps to clean text and prepare it for subsequent analysis and comparison with other text. It expects a column containing documents as input and produces the preprocessed documents as output.

The Component requires the following extensions:
- KNIME Textprocessing
https://hub.knime.com/knime/extensions/org.knime.features.ext.textprocessing/latest

Options

Erase all punctuations?: If this option is enabled, all punctuation marks will be removed from the documents.
Convert all terms to lower case: Enable or disable (skip) conversion of all words in a document to lower case.
Filter stopwords: Enable or disable (skip) the removal of "stopwords" from documents.
Stemm terms?: Enable or disable (skip) the conversion of words to their word stems in documents.
Document column: Column containing documents to be processed.
Minimum percentage of documents per term: Threshold for the minimum percentage of documents in which a word must appear. Words that appear in a smaller share of the documents are removed.
Maximum percentage of documents per term: Threshold for the maximum percentage of documents in which a word may appear. Words that appear in a larger share of the documents are removed.
Minimum number of letters per word: Words shorter than this length will be removed from documents.
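
The options above can be illustrated with a small Python sketch. This is not the component's implementation: the function names, parameter names, the order of the steps, the tiny stopword list, and the crude suffix-stripping stemmer are all assumptions made for illustration, and a real workflow would rely on the KNIME Textprocessing nodes instead.

```python
import string

# Hypothetical, deliberately tiny stopword list (assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(documents, erase_punctuation=True, lower_case=True,
               filter_stopwords=True, stem=True,
               min_doc_pct=0.0, max_doc_pct=100.0, min_letters=1):
    """Return one list of terms per input document.

    The step order used here is an assumption, not the component's order.
    """
    table = str.maketrans("", "", string.punctuation)
    tokenized = []
    for doc in documents:
        if erase_punctuation:
            doc = doc.translate(table)
        if lower_case:
            doc = doc.lower()
        terms = doc.split()
        if filter_stopwords:
            terms = [t for t in terms if t.lower() not in STOPWORDS]
        # Drop words shorter than the minimum number of letters.
        terms = [t for t in terms if len(t) >= min_letters]
        if stem:
            terms = [naive_stem(t) for t in terms]
        tokenized.append(terms)

    # Count in how many documents each term occurs.
    n_docs = len(tokenized)
    doc_freq = {}
    for terms in tokenized:
        for t in set(terms):
            doc_freq[t] = doc_freq.get(t, 0) + 1

    def keep(term):
        pct = 100.0 * doc_freq[term] / n_docs
        return min_doc_pct <= pct <= max_doc_pct

    # Keep only terms whose document percentage lies within the thresholds.
    return [[t for t in terms if keep(t)] for terms in tokenized]
```

Calling `preprocess(docs, min_doc_pct=50)` on a list of strings would, under these assumptions, keep only terms that occur in at least half of the documents; setting a flag such as `stem=False` skips the corresponding step, mirroring the enable/disable options listed above.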
