String Splitter (Regex)

This node splits the string content of a selected column into logical groups using regular expressions. A capturing group is usually identified by a pair of parentheses, whereby the pattern in such parentheses is a regular expression. Optionally, a group can be named. See Pattern for more information. For each input, the capture groups are the output values. Those can be appended to the table in different ways; by default, every group will correspond to one additional output column.

A short introduction to groups and capturing is given in the Java API. Some examples are given below:

Parsing Patent Numbers

Patent identifiers such as "US5443036-X21" consist of a country code of at most two letters ("US"), a patent number ("5443036"), and possibly an application code ("X21") that is separated from the number by a dash or a space character. They can be grouped by the expression ([A-Za-z]{1,2})([0-9]+)[ \-]?(.*$). Each of the parenthesized terms corresponds to one of the aforementioned properties. For named output columns, we can add group names to the pattern:

  • (?<CC>[A-Za-z]{1,2}) is now identified with "CC" in the output.
  • (?<patentNumber>[0-9]+) is now identified with "patentNumber".
  • [ \-]? was never a capturing group, so it remains unchanged.
  • (?<applicationCode>.*$) is now identified with "applicationCode".
Named and unnamed groups can also be mixed in one pattern.
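
The named-group pattern above can be tried outside the node with plain java.util.regex. The following is a minimal sketch, not the node's implementation; the class name PatentSplitDemo and the helper split are hypothetical and exist only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only java.util.regex is real API here.
class PatentSplitDemo {
    // The pattern from the text, with the three named groups.
    static final Pattern PATENT = Pattern.compile(
        "(?<CC>[A-Za-z]{1,2})(?<patentNumber>[0-9]+)[ \\-]?(?<applicationCode>.*$)");

    // Returns {CC, patentNumber, applicationCode}, or null if the input does not match.
    static String[] split(String input) {
        Matcher m = PATENT.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group("CC"), m.group("patentNumber"), m.group("applicationCode")
        };
    }

    public static void main(String[] args) {
        // prints: US | 5443036 | X21
        System.out.println(String.join(" | ", split("US5443036-X21")));
    }
}
```

An identifier without an application code, such as "US5443036", still matches; the last group is then an empty string.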

Strip File URLs

This example is particularly useful when the node is used to parse the file URL of a file reader node (the URL is exposed as a flow variable and then exported to a table using a Variable to Table node). The format of such URLs is similar to "file:c:\some\directory\foo.csv". Using the pattern [A-Za-z]*:(.*[/\\])(?<filename>([^\.]*)\.(.*$)) generates four groups. The first group identifies the directory and is denoted by (.*[/\\]). It consumes all characters up to and including the final slash or backslash; in the example, this is "c:\some\directory\". The second group represents the file name and encapsulates the third and fourth groups. The third group, ([^\.]*), consumes all characters after the directory that are not a dot '.' (which is "foo" in the above example). The pattern then expects a single dot, which is not part of the third or fourth group, and finally the fourth group (.*$), which reads until the end of the string and indicates the file suffix ("csv"). The groups for the above example are:

  1. Group 1: c:\some\directory\
  2. Group filename: foo.csv
  3. Group 3: foo
  4. Group 4: csv
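
The group numbering can be checked with a short java.util.regex sketch. It assumes the pattern is applied to the whole string; the class name FileUrlDemo and the helper groups are hypothetical. Note that a named group also keeps its positional index, so "filename" is group 2 here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the file URL pattern from the text.
class FileUrlDemo {
    // Backslashes are doubled once more because this is a Java string literal.
    static final Pattern FILE_URL = Pattern.compile(
        "[A-Za-z]*:(.*[/\\\\])(?<filename>([^\\.]*)\\.(.*$))");

    // Returns groups 1 to 4 in order, or null if the input does not match.
    static String[] groups(String url) {
        Matcher m = FILE_URL.matcher(url);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // One group per line: directory, file name, base name, suffix.
        for (String g : groups("file:c:\\some\\directory\\foo.csv")) {
            System.out.println(g);
        }
    }
}
```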

Email Address Extraction

Let's consider a scenario where you have a list of email addresses. Using the pattern (?<username>.+)@(?<domain>.+), you can extract the username and domain from the addresses. The groups for the email address "john.doe@example.com" are:

  • Group username: john.doe
  • Group domain: example.com
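
As a quick check of the email pattern, the sketch below applies it with java.util.regex. The class name EmailSplitDemo and the helper split are hypothetical. Because .+ is greedy, an input containing more than one "@" is split at the last one, which may or may not be what is wanted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the email pattern from the text.
class EmailSplitDemo {
    static final Pattern EMAIL = Pattern.compile("(?<username>.+)@(?<domain>.+)");

    // Returns {username, domain}, or null if the input does not match.
    static String[] split(String address) {
        Matcher m = EMAIL.matcher(address);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group("username"), m.group("domain")};
    }

    public static void main(String[] args) {
        String[] parts = split("john.doe@example.com");
        // prints: john.doe / example.com
        System.out.println(parts[0] + " / " + parts[1]);
    }
}
```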

Options

String column
Choose the column containing the strings to split.
Pattern
Define a pattern according to which the input string will be split. The capture groups that are defined in this pattern will correspond to the output values. A group can be defined in one of two ways:
  • For a named group, define (?<groupName>pattern), where groupName is the name of the group and pattern can be replaced by any regular expression that should be matched. Note that group names need to start with a letter and may contain only letters and digits, no spaces.
  • For an unnamed capture group, simply use parentheses around your pattern: (pattern), where again pattern can be replaced by any pattern. Unnamed capture groups are simply identified by their position in the pattern string, and they are enumerated starting at 1.
If you want to use non-capturing groups, construct them with (?:pattern).
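
To illustrate the three kinds of groups side by side, here is a small java.util.regex sketch. The date pattern and the class name GroupKindsDemo are made up for this example. The month is wrapped in a non-capturing group, so it does not count as a capture and does not get an index.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: one named, one non-capturing, and one unnamed group.
class GroupKindsDemo {
    static final Pattern DATE = Pattern.compile("(?<year>\\d{4})-(?:\\d{2})-(\\d{2})");

    // Summarizes the capture groups found in the input.
    static String describe(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.matches()) {
            return "no match";
        }
        // The named group is also reachable by its position (index 1).
        return "groups=" + m.groupCount()
            + " year=" + m.group("year")
            + " first=" + m.group(1)
            + " second=" + m.group(2);
    }

    public static void main(String[] args) {
        // prints: groups=2 year=2024 first=2024 second=15
        System.out.println(describe("2024-03-15"));
    }
}
```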
Case sensitive
Specifies whether matching will distinguish between upper and lower case letters.
By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by enabling Unicode-aware case folding.
Case-insensitive matching may impose a slight performance penalty.
Require whole string to match
If enabled, the provided pattern must match the whole string in order to return any results. Otherwise, the first match in the input string is used.
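
In java.util.regex terms, the difference resembles Matcher.matches() versus Matcher.find(). That mapping is an assumption made for illustration, not a statement about the node's internals; the class name WholeStringDemo and its helpers are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo contrasting whole-string and first-match semantics.
class WholeStringDemo {
    static final Pattern NUMBER = Pattern.compile("([0-9]+)");

    // Whole-string semantics: the pattern must cover the entire input.
    static String wholeMatch(String input) {
        Matcher m = NUMBER.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    // First-match semantics: the first matching substring is used.
    static String firstMatch(String input) {
        Matcher m = NUMBER.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "abc 123 def 456";
        System.out.println(wholeMatch(input)); // prints: null
        System.out.println(firstMatch(input)); // prints: 123
    }
}
```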
If pattern does not match
Define what to do if a pattern can't be matched to the input string:
  • Insert missing value puts missing cell(s) in place of the output column(s). The node will emit a warning when an input string doesn't match.
  • Insert empty string puts empty string(s) in place of the output column(s). The node will emit a warning when an input string doesn't match.
  • Fail causes the node to fail if one of the inputs cannot be matched against the pattern.
Enable Unix lines mode
In this mode, only the '\n' line terminator is recognized in the behavior of ., ^, and $.
Enable multiline mode (^ and $ match at the beginning / end of a line)
In multiline mode the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. By default these expressions only match at the beginning and the end of the entire input sequence.
Enable dotall mode (Dot . also matches newline characters)
In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.
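
The three line-related modes above correspond in name to the java.util.regex flags UNIX_LINES, MULTILINE, and DOTALL. The sketch below shows their effect on small inputs, assuming the node passes these flags through unchanged; the class name LineModeDemo and its helpers are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo of the line-related flags.
class LineModeDemo {
    // True if a line consisting of a single word is found in the input.
    static boolean findsWordLine(String input, int flags) {
        return Pattern.compile("^(\\w+)$", flags).matcher(input).find();
    }

    // True if (.+) can cover the whole input, i.e. the dot matches every character.
    static boolean dotCoversAll(String input, int flags) {
        return Pattern.compile("(.+)", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String twoLines = "first\nsecond";
        // Without MULTILINE, ^ and $ only anchor at the start and end of the input.
        System.out.println(findsWordLine(twoLines, 0));                 // prints: false
        System.out.println(findsWordLine(twoLines, Pattern.MULTILINE)); // prints: true
        // Without DOTALL, the dot stops at line terminators.
        System.out.println(dotCoversAll(twoLines, 0));                  // prints: false
        System.out.println(dotCoversAll(twoLines, Pattern.DOTALL));     // prints: true
        // With UNIX_LINES, '\r' is no longer treated as a line terminator.
        System.out.println(dotCoversAll("a\rb", 0));                    // prints: false
        System.out.println(dotCoversAll("a\rb", Pattern.UNIX_LINES));   // prints: true
    }
}
```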
Enable Unicode-aware case folding
If this option is enabled together with case-insensitive matching, the matching is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched.
Enabling this may impose a performance penalty.
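
The effect is easiest to see with a non-ASCII letter. The sketch below uses the java.util.regex flags CASE_INSENSITIVE and UNICODE_CASE, on the assumption that the two node options map to them; the class name CaseFoldingDemo and its helper are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo of ASCII-only versus Unicode-aware case folding.
class CaseFoldingDemo {
    // Case-insensitive whole-string match, optionally Unicode-aware.
    static boolean matchesIgnoringCase(String regex, String input, boolean unicodeAware) {
        int flags = Pattern.CASE_INSENSITIVE;
        if (unicodeAware) {
            flags |= Pattern.UNICODE_CASE;
        }
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // ASCII letters fold in both modes.
        System.out.println(matchesIgnoringCase("abc", "ABC", false));       // prints: true
        // \u00E4 is a-umlaut, \u00C4 is A-umlaut.
        System.out.println(matchesIgnoringCase("\u00E4", "\u00C4", false)); // prints: false
        System.out.println(matchesIgnoringCase("\u00E4", "\u00C4", true));  // prints: true
    }
}
```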
Enable canonical equivalence
When enabled, two characters will be considered to match if, and only if, their full canonical decompositions match. The expression "a\u030A", for example, will match the string "\u00E5" when this is enabled. By default, matching does not take canonical equivalence into account.
Enable Unicode character classes
When enabled, the predefined character classes and POSIX character classes, which by default cover US-ASCII only, conform to the Unicode Standard.
Enabling this may impose a performance penalty.
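
For example, the predefined class \w only covers ASCII word characters by default. The sketch below uses the java.util.regex flag UNICODE_CHARACTER_CLASS, assuming the option maps to it; the class name UnicodeClassDemo and its helper are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo of ASCII-only versus Unicode-aware \w.
class UnicodeClassDemo {
    // True if the whole input consists of word characters under the given flags.
    static boolean isWord(String input, int flags) {
        return Pattern.compile("\\w+", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // "gr\u00F6\u00DFe" is the German word for "size", with o-umlaut and sharp s.
        String word = "gr\u00F6\u00DFe";
        System.out.println(isWord(word, 0));                               // prints: false
        System.out.println(isWord(word, Pattern.UNICODE_CHARACTER_CLASS)); // prints: true
    }
}
```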
Start group index counting from zero
If enabled, the indices of non-named capturing groups start at zero instead of one. This setting is not meant to be enabled manually; it exists solely for backwards compatibility. Nodes created with earlier versions have it enabled so that they keep behaving as they used to.
Output matched groups as
Define how to output the results:
  • Columns: Each capture group in the defined pattern creates a new column in the output table. The column names correspond to the names of the named capture groups.
  • Rows: Each input row is duplicated by the number of capture groups, and every capture is added to one of those copies.
  • List: The captures are appended to the input as a list of strings.
  • Set (remove duplicates): The captures are appended to the input as a set of strings. Note that duplicates are removed and the order of captures is not preserved.
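
The difference between the List and Set outputs can be pictured with ordinary Java collections. This is only an analogy and does not use the KNIME collection cell API; the pattern, the class name OutputModeDemo, and its helpers are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: the same captures collected as a list and as a set.
class OutputModeDemo {
    static final Pattern TRIPLE = Pattern.compile("(\\w+)-(\\w+)-(\\w+)");

    // All captures in group order; empty if the input does not match.
    static List<String> asList(String input) {
        Matcher m = TRIPLE.matcher(input);
        List<String> captures = new ArrayList<>();
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                captures.add(m.group(i));
            }
        }
        return captures;
    }

    // Duplicates collapse, and a HashSet makes no promise about iteration order.
    static Set<String> asSet(String input) {
        return new HashSet<>(asList(input));
    }

    public static void main(String[] args) {
        System.out.println(asList("a-b-a").size()); // prints: 3
        System.out.println(asSet("a-b-a").size());  // prints: 2
    }
}
```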
Output column prefix
Define what prefix should be used for the output column names:
  • Input column name: The name of the column containing the string to split is used as a prefix.
  • Custom: Define a custom string that shall be used as a prefix.
  • None: No prefix is added.
Custom prefix
Define a custom column prefix.
Output column
Choose whether to append the output column or replace the input column.
Output column name
Choose a name for the output column.
Group labels in output
Define the naming of the output groups:
  • Capture group names or indices: Use the names of the capture groups. For unnamed capture groups, their index is used as their label.
  • Split input column name: Apply the provided pattern to the name of the input column and use the captures as labels.
The impact of this setting depends on the option selected under "Output matched groups as":
  • Columns: The labels will be used as the suffix of the column names.
  • Rows: The labels will be used as the suffix of the row IDs.
  • List and Set: The labels will be used as element names in the collection cell specification.
Remove input column
Remove the input column from the output table.

Input Ports

Input table with string column to be split.

Output Ports

Input table with additional column(s) and potentially duplicated rows representing the pattern groups. See "Output matched groups as" for more details.

Views

This node has no views
