Regex Split

This node splits the string content of a selected column into logical groups using a regular expression. A group is delimited by a pair of parentheses, and the pattern inside the parentheses is itself a regular expression. The content of each group is appended as an individual column. All appended columns contain missing values if the input string is not completely matched by the selected regular expression.
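
The matching rule described above can be sketched in Python. This is an illustration only, not the node's implementation: the helper name `split_groups` is hypothetical, Python's `re` module stands in for the node's own regex engine, and `None` stands in for a missing value. `re.fullmatch` is used to mimic the requirement that the entire input string be matched.

```python
import re

def split_groups(pattern, text, flags=0):
    """Return one value per capturing group, or all None if the
    pattern does not match the entire input string."""
    compiled = re.compile(pattern, flags)
    match = compiled.fullmatch(text)
    if match is None:
        # Mimics every appended column receiving a missing value.
        return [None] * compiled.groups
    return list(match.groups())

print(split_groups(r"(\d+)-(\d+)", "12-34"))    # ['12', '34']
print(split_groups(r"(\d+)-(\d+)", "12-34 x"))  # [None, None]
```

The second call returns only missing values because the trailing " x" is not covered by the pattern, even though the start of the string would match.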

A short introduction to groups and capturing is given in the Java API documentation. Some examples are given below:

Parsing Patent Numbers

Patent identifiers such as "US5443036-X21" consist of a country code of at most two letters ("US"), a patent number ("5443036"), and possibly an application code ("X21") that is separated from the number by a dash or a space character. They can be grouped by the expression ([A-Za-z]{1,2})([0-9]*)[ \-]*(.*$). Each of the parenthesized terms corresponds to one of the aforementioned properties.
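
As a quick check of this expression, the following Python sketch applies it to the identifier from the text. The function name `parse_patent` is hypothetical, and Python's `re` module is used as a stand-in for the node's regex engine; the pattern itself is copied unchanged from the paragraph above.

```python
import re

# Pattern taken verbatim from the example above.
PATENT = re.compile(r"([A-Za-z]{1,2})([0-9]*)[ \-]*(.*$)")

def parse_patent(identifier):
    """Return (country code, patent number, application code),
    or None if the identifier is not matched completely."""
    match = PATENT.fullmatch(identifier)
    return match.groups() if match else None

print(parse_patent("US5443036-X21"))  # ('US', '5443036', 'X21')
print(parse_patent("US5443036"))      # ('US', '5443036', '')
print(parse_patent("5443036"))        # None
```

An identifier without an application code still matches and yields an empty third group, whereas one without a leading country code does not match at all.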

Strip File URLs

This example is particularly useful when the node is used to parse the file URL of a file reader node (the URL is exposed as a flow variable and then exported to a table using a Variable to Table node). The format of such URLs is similar to "file:c:\some\directory\foo.csv". The pattern [A-Za-z]*:(.*[/\\])(([^\.]*)\.(.*$)) generates four groups (count the opening parentheses). The first group, "(.*[/\\])", identifies the directory. It consumes all characters up to and including the final slash or backslash; in the example this is "c:\some\directory\". The second group represents the full file name and encloses the third and fourth groups. The third group, "([^\.]*)", consumes all characters after the directory that are not a dot '.' ("foo" in the above example). The pattern then expects a single dot, which is not captured by the third or fourth group, and finally the fourth group, "(.*$)", which reads to the end of the string and represents the file suffix ("csv"). The groups for the above example are:

  1. c:\some\directory\
  2. foo.csv
  3. foo
  4. csv
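
The grouping can be reproduced with a short Python sketch. The function name `split_file_url` is hypothetical, and Python's `re` module is only a stand-in for the node's regex engine; the pattern and the example URL are taken from the text above.

```python
import re

# Pattern taken verbatim from the example above.
FILE_URL = re.compile(r"[A-Za-z]*:(.*[/\\])(([^\.]*)\.(.*$))")

def split_file_url(url):
    """Return (directory, file name, base name, suffix),
    or None if the URL is not matched completely."""
    match = FILE_URL.fullmatch(url)
    return match.groups() if match else None

groups = split_file_url(r"file:c:\some\directory\foo.csv")
for index, value in enumerate(groups, start=1):
    print(index, value)
```

Running this prints the directory including its trailing backslash, followed by foo.csv, foo and csv. Note that group 2 is the outer pair of parentheses, which is why it contains both group 3 and group 4.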

Options

Ignore Case
Enables case-insensitive matching.
Multiline
Enables multiline mode. This option only matters if the input string has line breaks. When enabled, the expressions ^ and $ match at the beginning and end of each line, respectively. By default, they only match at the beginning and end of the entire input string.
Dot matches all characters
Enables dotall / single-line mode. This option only matters if the input string has line breaks. When enabled, the dot expression . matches the line terminator (\n) and every other character. By default, it matches every character except the line terminator.
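
The effect of the three options can be illustrated with Python's `re.IGNORECASE`, `re.MULTILINE` and `re.DOTALL` flags, which are assumed here to behave analogously to the node's options for these simple patterns. The helper `groups` and the sample strings are made up for this sketch.

```python
import re

def groups(pattern, text, flags=0):
    """Return the captured groups, or None without a complete match."""
    match = re.fullmatch(pattern, text, flags)
    return match.groups() if match else None

text = "first line\nsecond line"

# Ignore Case: "FIRST" only matches "first" when case is ignored.
print(groups(r"(FIRST) (.*)", "first line"))                 # None
print(groups(r"(FIRST) (.*)", "first line", re.IGNORECASE))  # ('first', 'line')

# Multiline: $ and ^ only match at the inner line break when enabled.
print(groups(r"(.*)$\n^(.*)", text))                # None
print(groups(r"(.*)$\n^(.*)", text, re.MULTILINE))  # ('first line', 'second line')

# Dot matches all characters: the dot only crosses "\n" when enabled.
print(groups(r"(.*)", text))             # None
print(groups(r"(.*)", text, re.DOTALL))  # ('first line\nsecond line',)
```

In each pair, the first call fails to match the whole string, which in the node would correspond to missing values in all appended columns.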

Input Ports

Input table with string column to be split.

Output Ports

Input table with additional columns representing the pattern groups.

Views

This node has no views.
