String Cleaner

The String Cleaner node provides tools for basic string cleaning operations such as removing whitespace, removing punctuation, or padding. For more complex string manipulation, use the String Manipulation node. All operations in this node are applied in exactly the order in which they appear in the node dialog. The node is Unicode-compatible. For reference on which characters are included in a category, see e.g. the Wikipedia guides to Character categories and Whitespace.

Options

Columns to clean
Select which columns should be cleaned. The strings in these columns will be modified according to the configuration of this node.
Remove accents and diacritics
When enabled, all accents and diacritics are removed from letters, leaving only the underlying letter. For example, Å becomes A, ë becomes e and だ becomes た.
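To illustrate the idea outside the node, here is a minimal Python sketch, assuming that accent removal can be approximated by Unicode decomposition followed by dropping non-spacing combining marks. The function name remove_diacritics is hypothetical, and this is not necessarily how the node itself is implemented.

```python
import unicodedata


def remove_diacritics(text: str) -> str:
    """Strip accents by decomposing characters and dropping combining marks.

    Hypothetical helper for illustration only; the node's own logic may differ.
    """
    # NFD splits e.g. "Å" into "A" followed by U+030A (combining ring above)
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except non-spacing marks (Unicode category "Mn")
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever is left into its canonical composed form
    return unicodedata.normalize("NFC", stripped)
```

Under this approach, the three examples from the description (Å, ë and だ) reduce to A, e and た respectively.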
Remove non-ASCII characters
When enabled, all non-ASCII characters are removed from strings.
Remove non-printable characters
When enabled, all non-printable characters, such as tabulators or non-breaking spaces, are removed from strings.
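The two options above can be sketched in Python as follows. Both function names are hypothetical, and the sketch assumes that Python's str.isprintable() is a close enough stand-in for the node's notion of "printable"; the exact character sets may differ.

```python
def remove_non_ascii(text: str) -> str:
    """Keep only characters in the ASCII range (code points 0 to 127)."""
    return "".join(ch for ch in text if ord(ch) < 128)


def remove_non_printable(text: str) -> str:
    """Drop characters that Python considers non-printable.

    str.isprintable() is False for control characters such as the tab and
    for separators other than the ordinary space, e.g. the non-breaking space.
    This is an assumption about the category, not the node's definition.
    """
    return "".join(ch for ch in text if ch.isprintable())
```

Note that removing non-ASCII characters deletes accented letters entirely, whereas removing accents first keeps the underlying letter.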
Remove letters
Select whether and what category of letters should be removed from strings.
  • None removes no letters.
  • All removes all letters from a string. This includes letters from any language / script.
  • Uppercase removes all uppercase letters from a string.
  • Lowercase removes all lowercase letters from a string.
Note that this step happens before potentially changing the casing of the string.
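As a rough illustration of the four modes, the following Python sketch filters characters by their Unicode general category. The function remove_letters and its mode strings are hypothetical names chosen for this example; the assumption is that "letters" corresponds to the Unicode categories starting with L.

```python
import unicodedata


def remove_letters(text: str, mode: str = "none") -> str:
    """Remove letters according to mode: none, all, uppercase or lowercase."""

    def is_removed(ch: str) -> bool:
        category = unicodedata.category(ch)
        if mode == "all":
            return category.startswith("L")  # letters of any script
        if mode == "uppercase":
            return category == "Lu"
        if mode == "lowercase":
            return category == "Ll"
        return False  # "none": keep everything

    return "".join(ch for ch in text if not is_removed(ch))
```

Because removal happens before any casing change, removing uppercase letters from "Hello" and then converting to uppercase gives "ELLO", whereas the reverse order would leave an empty string.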
Remove numbers
When enabled, all numbers are removed from strings. This includes e.g. 3, Ⅴ or ₅.
Remove punctuation
When enabled, all punctuation is removed from strings. This includes e.g. _, ( or !.
Remove symbols
When enabled, all symbols are removed from strings. This includes e.g. €, = or ♮.
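The number, punctuation and symbol options can be pictured as filters on Unicode general categories. The sketch below assumes that they map to the category groups N, P and S; remove_categories is a hypothetical helper, not part of the node.

```python
import unicodedata


def remove_categories(text: str, prefixes: str) -> str:
    """Remove characters whose Unicode general category starts with any
    letter in prefixes, e.g. "N" (numbers), "P" (punctuation), "S" (symbols).
    """
    return "".join(
        ch for ch in text if unicodedata.category(ch)[0] not in prefixes
    )


# Usage: each call mirrors one checkbox; combine prefixes to apply several
no_numbers = remove_categories("Room 3, level \u2164", "N")
no_punct_or_symbols = remove_categories("price = 5\u20ac (approx.)!", "PS")
```

This also shows why Ⅴ and ₅ count as numbers: they belong to the "letter number" and "other number" categories rather than to the decimal digits.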
Other characters to remove
Here, custom characters can be defined that should be removed from strings. The characters are all interpreted literally and case-sensitive. Note that this step happens before potentially changing the casing of the string.
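"Interpreted literally" means that characters such as . or * are not treated as pattern syntax. A minimal Python sketch of that behavior, using a hypothetical function remove_chars:

```python
def remove_chars(text: str, chars: str) -> str:
    """Delete every occurrence of each character in chars.

    Characters are matched literally and case-sensitively, so "c" removes
    "c" but not "C", and "." removes only the full stop itself.
    """
    # str.maketrans with a third argument maps those characters to None
    return text.translate(str.maketrans("", "", chars))
```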
Remove all whitespace
If enabled, all whitespace is removed. This includes normal space ( ), line breaks (\r\n), tabulators (\t) and all other whitespace.
Remove leading whitespace
If enabled, leading whitespace is removed, that is, all whitespace from the start of the string to the first non-whitespace character.
Remove trailing whitespace
If enabled, trailing whitespace is removed, that is, all whitespace from the last non-whitespace character to the end of the string.
Remove duplicate whitespace
If selected, all occurrences of two or more whitespace characters in a row are replaced by a single standard space.
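The four whitespace options above can be approximated with regular expressions. This is a sketch only: the function names are hypothetical, and it assumes that Python's \s class (which covers Unicode whitespace for str patterns) matches the node's definition of whitespace.

```python
import re


def remove_all_whitespace(text: str) -> str:
    return re.sub(r"\s+", "", text)


def remove_leading_whitespace(text: str) -> str:
    # \A anchors at the very start of the string
    return re.sub(r"\A\s+", "", text)


def remove_trailing_whitespace(text: str) -> str:
    # \Z anchors at the very end of the string
    return re.sub(r"\s+\Z", "", text)


def remove_duplicate_whitespace(text: str) -> str:
    # Only runs of two or more whitespace characters are replaced,
    # so a single tab between words is left untouched
    return re.sub(r"\s{2,}", " ", text)
```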
Line breaks
Select whether to remove line breaks or replace them with a standard space. If Replace by space is selected, \r\n is replaced by only a single space, not two.
Special whitespace
Select whether to remove special whitespace or replace it with a standard space. Special whitespace is all whitespace that is not a standard space, \r or \n.
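A possible way to picture the line break and special whitespace options is shown below. The function names and the replacement parameter are hypothetical; pass an empty string to model "remove" and a space to model "replace by space".

```python
import re


def handle_line_breaks(text: str, replacement: str = " ") -> str:
    # \r\n is listed first so that a Windows line break is treated as one
    # unit and yields a single replacement rather than two
    return re.sub(r"\r\n|\r|\n", replacement, text)


def handle_special_whitespace(text: str, replacement: str = " ") -> str:
    # [^\S \r\n] reads as: whitespace that is not a standard space, \r or \n
    return re.sub(r"[^\S \r\n]", replacement, text)
```

With these definitions, tabs and non-breaking spaces count as special whitespace, while line breaks are left to the separate line break option.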
Change casing
Define the casing of letters in the output string.
  • None makes no changes to the string.
  • Uppercase converts all letters to uppercase.
  • Capitalize first converts all letters to lowercase, then capitalizes according to the defined settings.
  • Lowercase converts all letters to lowercase.
Capitalize after
Define after which characters a character should be capitalized.
  • Whitespace capitalizes letters that immediately follow a whitespace character.
  • Non-Letters capitalizes letters that immediately follow any non-letter character.
  • Custom lets the user define a set of characters after which letters should be capitalized.
Capitalize after characters
Here, custom characters can be defined after which letters should be capitalized. The characters are all interpreted literally.
Capitalize character at the start of the string
If enabled, a letter at the start of the string is always capitalized.
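The interplay of Capitalize first with the three "Capitalize after" modes can be sketched as follows. The function capitalize_first, its parameter names and its mode strings are hypothetical and only mirror the dialog options described above.

```python
def capitalize_first(
    text: str,
    after: str = "whitespace",
    custom_chars: str = "",
    capitalize_start: bool = True,
) -> str:
    """Lowercase everything, then capitalize characters that follow a trigger.

    after is one of "whitespace", "non-letters" or "custom".
    """
    result = []
    previous = None
    for ch in text.lower():
        if previous is None:
            trigger = capitalize_start  # first character of the string
        elif after == "whitespace":
            trigger = previous.isspace()
        elif after == "non-letters":
            trigger = not previous.isalpha()
        else:  # "custom": literal match against the given characters
            trigger = previous in custom_chars
        result.append(ch.upper() if trigger else ch)
        previous = ch
    return "".join(result)
```

For example, with the Non-Letters mode a hyphen or apostrophe also triggers capitalization, so "jean-luc o'neil" becomes "Jean-Luc O'Neil", while the Whitespace mode would give "Jean-luc O'neil".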
Pad
Define whether to pad a string if it doesn't have a certain minimum length.
  • None makes no change to the string.
  • Start adds fill characters to the start of the string.
  • End adds fill characters to the end of the string.
Minimum string length
Define the minimum string length. If a string is shorter than this value, fill characters are added until its length equals this value. Strings that are already long enough are left unchanged.
Fill character
Define the fill character that is used to pad the string.
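The padding options behave much like Python's built-in justification methods. A minimal sketch, with a hypothetical function pad whose side argument mirrors the None, Start and End choices:

```python
def pad(text: str, min_length: int, fill_char: str = "0", side: str = "start") -> str:
    """Pad text with fill_char until it is at least min_length long."""
    if len(fill_char) != 1:
        raise ValueError("fill_char must be exactly one character")
    if side == "start":
        return text.rjust(min_length, fill_char)  # fill on the left
    if side == "end":
        return text.ljust(min_length, fill_char)  # fill on the right
    return text  # "none": leave the string as it is
```

Strings that already meet the minimum length pass through unchanged; padding never truncates.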
Output columns
Define whether to clean the strings in place (i.e. replace the existing values) or to append the cleaned strings as new columns.
Output column suffix
Define a suffix that is appended to the names of the input columns to form the names of the appended columns.

Input Ports

Input table containing string columns

Output Ports

Output table with the selected columns modified

Views

This node has no views
