This node splits the string content of a selected column into logical groups using regular expressions. A capturing group is usually identified by a pair of parentheses, whereby the pattern in such parentheses is a regular expression. Optionally, a group can be named. See Pattern for more information. For each input, the capture groups are the output values. Those can be appended to the table in different ways; by default, every group will correspond to one additional output column.
A short introduction to groups and capturing is given in the Java API. Some examples are given below:
Patent identifiers such as "US5443036-X21", consisting of
a country code of at most two letters ("US"), a patent
number ("5443036") and possibly an application code
("X21") that is separated by a dash or a space
character, can be grouped by the expression
([A-Za-z]{1,2})([0-9]+)[ \-]?(.*$)
Each of the parenthesized terms corresponds to one of the
aforementioned properties. For named output columns,
we can add group names to the pattern:
(?<CC>[A-Za-z]{1,2}) is now identified with "CC" in the output.
(?<patentNumber>[0-9]+) is now identified with "patentNumber".
[ \-]? is not, and never was, a capturing group, so it remains unchanged.
(?<applicationCode>.*$) is now identified with "applicationCode".
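As a rough illustration outside of KNIME, the named pattern above can be tried with plain java.util.regex. This is only a sketch: the class name PatentIdSplitter and the method split are hypothetical, and matching the whole string with Matcher.matches() is an assumption, not a statement about how the node is implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatentIdSplitter {
    // Pattern from the text, with the three named capture groups.
    static final Pattern PATENT = Pattern.compile(
        "(?<CC>[A-Za-z]{1,2})(?<patentNumber>[0-9]+)[ \\-]?(?<applicationCode>.*$)");

    // Returns group name -> captured text in pattern order,
    // or an empty map if the input does not match (assumed behavior).
    static Map<String, String> split(String id) {
        Map<String, String> groups = new LinkedHashMap<>();
        Matcher m = PATENT.matcher(id);
        if (m.matches()) {
            groups.put("CC", m.group("CC"));
            groups.put("patentNumber", m.group("patentNumber"));
            groups.put("applicationCode", m.group("applicationCode"));
        }
        return groups;
    }

    public static void main(String[] args) {
        // prints {CC=US, patentNumber=5443036, applicationCode=X21}
        System.out.println(split("US5443036-X21"));
    }
}
```

Each map entry plays the role of one output column. An identifier without an application code, such as "US5443036", still matches; the last group is then an empty string.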
This is particularly useful when this node is used to
parse the file URL of a file reader node (the URL is
exposed as a flow variable and then exported to a table
using a Variable to Table node). The format of such
URLs is similar to "file:c:\some\directory\foo.csv".
Using the pattern
[A-Za-z]*:(.*[/\\])(?<filename>([^\.]*)\.(.*$))
generates four groups. The first group, (.*[/\\]), identifies the
directory. It consumes all characters until the final slash or
backslash is encountered; in the example, this is
"c:\some\directory\". The second group, named "filename",
represents the file name and encapsulates the third and fourth
groups. The third group, ([^\.]*), consumes all characters after
the directory that are not a dot '.' ("foo" in the above example).
The pattern then expects a single dot, which belongs to neither
the third nor the fourth group, and finally the fourth group,
(.*$), which reads until the end of the string and indicates the
file suffix ("csv"). The groups for the above example are:
Group 1: c:\some\directory\
Group 2 (filename): foo.csv
Group 3: foo
Group 4: csv
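The nesting of the groups in the file URL pattern can be checked with a small java.util.regex sketch. The class name FileUrlGroups and the method split are hypothetical, and whole-string matching via Matcher.matches() is an assumption. Note that the backslashes of the pattern have to be doubled again inside a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileUrlGroups {
    // Regex from the text: [A-Za-z]*:(.*[/\\])(?<filename>([^\.]*)\.(.*$))
    static final Pattern FILE_URL = Pattern.compile(
        "[A-Za-z]*:(.*[/\\\\])(?<filename>([^\\.]*)\\.(.*$))");

    // Returns groups 1..4 by position, or an empty array if there is no match.
    static String[] split(String url) {
        Matcher m = FILE_URL.matcher(url);
        if (!m.matches()) {
            return new String[0];
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            groups[i - 1] = m.group(i); // groups are numbered starting at 1
        }
        return groups;
    }

    public static void main(String[] args) {
        // The Java literal below is the string file:c:\some\directory\foo.csv
        for (String g : split("file:c:\\some\\directory\\foo.csv")) {
            System.out.println(g);
        }
    }
}
```

Groups are numbered by the position of their opening parenthesis, which is why the enclosing "filename" group comes before the two groups nested inside it.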
Let's consider a scenario where you have a list of email addresses.
Using the pattern (?<username>.+)@(?<domain>.+), you can extract
the username and domain from the addresses.
The groups for the email address "john.doe@example.com" are:
username: john.doe
domain: example.com
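The email pattern can be tried the same way. This is a minimal sketch with plain java.util.regex; the class name EmailGroups and the method split are hypothetical, and whole-string matching is assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailGroups {
    // Pattern from the text with the named groups "username" and "domain".
    static final Pattern EMAIL = Pattern.compile("(?<username>.+)@(?<domain>.+)");

    // Returns {username, domain}, or an empty array if the input does not match.
    static String[] split(String address) {
        Matcher m = EMAIL.matcher(address);
        if (!m.matches()) {
            return new String[0];
        }
        return new String[] { m.group("username"), m.group("domain") };
    }

    public static void main(String[] args) {
        String[] parts = split("john.doe@example.com");
        // prints john.doe / example.com
        System.out.println(parts[0] + " / " + parts[1]);
    }
}
```

Because .+ is greedy, an input with more than one "@" is split at the last one. The pattern separates the two parts but does not validate that the input is a well-formed email address.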
Named capture group: (?<groupName>pattern), where groupName is the name of the group and pattern can be replaced by any regular expression that should be matched. Note that group names need to start with a letter and may contain only letters and digits, no spaces.
Unnamed capture group: (pattern), where again pattern can be replaced by any pattern. Unnamed capture groups are simply identified by their position in the pattern string, and they are enumerated starting at 1.
Non-capturing group: (?:pattern), which groups part of the pattern but is not captured.
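The difference between the three kinds of groups can be observed in java.util.regex with a small sketch. The date-like pattern, the class name GroupKindsDemo and the method describe are hypothetical and chosen only for illustration; the sketch shows how the Java regex engine counts groups, not how the node names its output columns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupKindsDemo {
    // Hypothetical pattern: a named group, a non-capturing group
    // and an unnamed group, in that order.
    static final Pattern DATE = Pattern.compile("(?<year>[0-9]{4})(?:-|/)([0-9]{2})");

    // Returns {group count, group "year", group 1, group 2} as strings,
    // or an empty array if the input does not match.
    static String[] describe(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        return new String[] {
            String.valueOf(m.groupCount()), // the non-capturing group is not counted
            m.group("year"),                // access by name
            m.group(1),                     // the named group also has position 1
            m.group(2)                      // the unnamed group has position 2
        };
    }

    public static void main(String[] args) {
        // prints 2 | 2024 | 2024 | 05
        System.out.println(String.join(" | ", describe("2024/05")));
    }
}
```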
To use this node in KNIME, install the KNIME Base nodes extension.