Text Preprocessing

The Text Preprocessing component uses fast, regex-based processing to remove specific types of characters from a String column and normalize the text without over-processing it.

This component eliminates the need to convert text to a Document type before preprocessing it. It also runs quickly compared with other approaches, which helps it scale to larger tables.

Each option in the configuration is processed independently of the others, but the selected options are applied in a fixed order of operations. You can audit or edit that order by drilling into the component.
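The order of operations matters because one step can create input for another. The following Python sketch illustrates this with two of the options. The function names and regular expressions here are assumptions made for illustration, not the component's actual implementation.

```python
import re

def remove_numbers(text: str) -> str:
    # Hypothetical stand-in for "Remove Numbers": delete every run of ASCII digits.
    return re.sub(r"[0-9]+", "", text)

def normalize_spaces(text: str) -> str:
    # Hypothetical stand-in for "Normalize Spaces": collapse runs of spaces into one.
    return re.sub(r" {2,}", " ", text)

text = "room 101 is open"

# Removing numbers first leaves a double space, which normalization then collapses.
numbers_then_spaces = normalize_spaces(remove_numbers(text))

# Normalizing first has nothing to collapse, so the double space survives.
spaces_then_numbers = remove_numbers(normalize_spaces(text))

print(repr(numbers_then_spaces))  # 'room is open'
print(repr(spaces_then_numbers))  # 'room  is open'
```

If the output of the component looks unexpected, checking the order inside the component is a reasonable first step.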

Options

Convert to Lowercase
Convert the text to lowercase.
Remove Periods
Remove periods.
Remove Apostrophes
Remove apostrophes (e.g. "John's" becomes "Johns").
Remove Numbers
Remove numbers.
Normalize Spaces
Normalize multiple consecutive spaces (e.g. "   ") into single spaces (" ").
Remove Parentheses
Remove parentheses.
Remove All Spaces
Remove all whitespace from the text.
Remove All Punctuation
Remove all punctuation from the text.
Remove Commas
Remove commas.
Column
A column of type String which will have the selected text processing step(s) applied to it.
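The options above can be approximated outside the component with plain regular expressions. The sketch below is a minimal Python illustration under stated assumptions: the option keys, the patterns, and the order in which they run are all hypothetical and may differ from what the component does internally.

```python
import re
import string

# Hypothetical patterns, one per option. Dictionary order is the assumed
# order of operations; the component's real order may differ.
PATTERNS = {
    "remove_periods": re.compile(r"\."),
    "remove_apostrophes": re.compile(r"'"),
    "remove_commas": re.compile(r","),
    "remove_parentheses": re.compile(r"[()]"),
    "remove_numbers": re.compile(r"[0-9]+"),
    "remove_all_punctuation": re.compile("[" + re.escape(string.punctuation) + "]"),
    "normalize_spaces": re.compile(r" {2,}"),
    "remove_all_spaces": re.compile(r"\s+"),
}

def preprocess(text: str, options: set) -> str:
    """Apply only the selected options to a single string value."""
    if "lowercase" in options:
        text = text.lower()
    for name, pattern in PATTERNS.items():
        if name in options:
            # Space normalization substitutes one space; every other step deletes.
            replacement = " " if name == "normalize_spaces" else ""
            text = pattern.sub(replacement, text)
    return text

selected = {"lowercase", "remove_apostrophes", "remove_parentheses", "normalize_spaces"}
print(preprocess("John's  order (No. 42), shipped.", selected))
# johns order no. 42, shipped.
```

Applying the same function to every value of a String column would mirror what the component does to the selected column, with unselected options leaving the text untouched.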

Input Ports

A table of input data with at least one String column.

Output Ports

A table in which the input table's selected column has been processed and replaced.
