Regex Extractor

Palladian for KNIME version 2.2.0.202005151546 by palladian.ws; Philipp Katz, Klemens Muthmann, David Urbansky

Creating regular expressions becomes a breeze. This node lets you build, preview, and test your regexes in real time against your real data. Say goodbye to creating regexes in KNIME by trial and error, or to copying regular expressions and text back and forth between your favorite regex tool and KNIME. The user interface is inspired by RegExr and Regular Expressions 101.

The node uses Java’s Pattern implementation. For each row or text input, it extracts all matches. Each capturing group is automatically mapped to a KNIME column. To define the column names, you can use “named capturing groups”, such as (?<name>[A-Za-z]+).

To exclude groups from the output, define them as “non-capturing groups”: (?:…).
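The difference between named capturing groups and non-capturing groups can be tried directly against java.util.regex, which the node builds on. The following is a minimal sketch, not the node's own code; the class name GroupDemo, the method names, and the sample pattern are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // (?:Mr|Ms) is matched but not captured; (?<name>...) is captured under the name "name".
    static final Pattern PATTERN = Pattern.compile("(?:Mr|Ms)\\. (?<name>[A-Za-z]+)");

    // Returns the value of the named group for the first match, or null if nothing matches.
    static String firstName(String text) {
        Matcher m = PATTERN.matcher(text);
        return m.find() ? m.group("name") : null;
    }

    // Number of capturing groups in the pattern; non-capturing groups are not counted.
    static int groupCount() {
        return PATTERN.matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(firstName("Dear Ms. Smith, welcome")); // Smith
        System.out.println(groupCount());                          // 1
    }
}
```

Only the named group counts as a capturing group here, which corresponds to the node producing a column for it and none for the non-capturing prefix.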

Usage Example: Extract email addresses from text

For this input text: “Hello, world! mail@palladian.ai The quick brown fox jumps over the lazy dog. bob@example.com Lorem ipsum.” and the regex \b(?<Local Part>[A-Z0-9._%+-]+)@(?<Domain>[A-Z0-9.-]+\.[A-Z]{2,})\b (with the CASE_INSENSITIVE flag enabled, because the character classes list only upper-case letters), the node will create a KNIME table as follows:

Full Match | Local Part | Domain
mail@palladian.ai | mail | palladian.ai
bob@example.com | bob | example.com
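The same extraction can be reproduced outside KNIME with plain java.util.regex. This is a minimal sketch under a few assumptions: the class EmailExtractDemo and its method are hypothetical, the group names are written as localPart and domain because plain Java group names may contain only letters and digits, and CASE_INSENSITIVE is set so that the upper-case character classes also match lower-case input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailExtractDemo {
    // Same expression as in the example, with Java-compatible group names (assumption).
    static final Pattern EMAIL = Pattern.compile(
        "\\b(?<localPart>[A-Z0-9._%+-]+)@(?<domain>[A-Z0-9.-]+\\.[A-Z]{2,})\\b",
        Pattern.CASE_INSENSITIVE);

    // One String[] per match: full match, local part, domain.
    static List<String[]> extract(String text) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            rows.add(new String[] { m.group(), m.group("localPart"), m.group("domain") });
        }
        return rows;
    }

    public static void main(String[] args) {
        String text = "Hello, world! mail@palladian.ai The quick brown fox jumps over "
            + "the lazy dog. bob@example.com Lorem ipsum.";
        for (String[] row : extract(text)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Each loop iteration corresponds to one row of the table above: the full match first, then one value per capturing group.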

Standalone Mode

The node can be used with an input table, which provides a string column, or in “standalone” mode (when no input table is connected), where you can enter the text into the node dialog.

Further reading

For a general introduction to regular expressions, have a look at Regular-Expressions.info.

Options

Input
When an input table is connected, select the column with the text from which to extract.
Drop column
Enable to remove the input column from the result table.
Preview row
Page through the first 25 input rows and show their content in the preview.
Output
Specify how the output table should be composed. The following options are available:
  • Rows: Create one new row for each match. If an input row has more than one match, multiple rows are appended for it (i.e. the output table will likely contain more rows than the input table); if it has no match, no row is created (the output table will contain fewer rows than the input table). The table will have an appended column for each capturing group (and one for the entire match).
  • Single Row: Append only the first match (or missing value cells, in case there was no match in the input data). In this case, the input and output tables will contain exactly the same number of rows.
  • Columns: Output one row for each input row and append the matches as separate columns.
  • JSON: Append a JSON object with the extracted results. The JSON contains a matches array with all matches. Each match object contains a groups array. A group has the properties start, end (offsets within the input text), groupIdx (running index), and the value.
  • List: Append a list cell which contains the flattened structure of all groups of all matches. This is useful e.g. when splitting a text into tokens.
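To make the nested (“JSON”) and flattened (“List”) shapes more concrete, here is a minimal sketch in plain Java. It is not the node's implementation: the class OutputShapeDemo and its helper names are hypothetical, the field names merely mirror the properties listed above, and skipping group 0 (the full match) is an assumption made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OutputShapeDemo {
    // One captured group: offsets within the input text, group index, and value.
    static final class Group {
        final int start, end, groupIdx;
        final String value;
        Group(int start, int end, int groupIdx, String value) {
            this.start = start; this.end = end; this.groupIdx = groupIdx; this.value = value;
        }
    }

    // Nested shape: one inner list of groups per match (group 0 is skipped, by assumption).
    static List<List<Group>> matches(String regex, String text) {
        List<List<Group>> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            List<Group> groups = new ArrayList<>();
            for (int i = 1; i <= m.groupCount(); i++) {
                groups.add(new Group(m.start(i), m.end(i), i, m.group(i)));
            }
            result.add(groups);
        }
        return result;
    }

    // Flattened shape: the values of all groups of all matches, in order of appearance.
    static List<String> flatten(List<List<Group>> matches) {
        List<String> values = new ArrayList<>();
        for (List<Group> groups : matches) {
            for (Group g : groups) {
                values.add(g.value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Tokenizing: one capturing group per word.
        System.out.println(flatten(matches("(\\w+)", "The quick fox"))); // [The, quick, fox]
    }
}
```

The flattened form is what makes tokenizing convenient: a pattern with a single group per token yields one list entry per token.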
Name/prefix
A column name (in case of “JSON” or “List”) or prefix (in case of “Rows”, “Single Row” or “Columns”) for the appended columns. In the “Columns” mode, the placeholder $MATCHINDEX for the match index must be used to avoid naming conflicts.
Append RowID
Append the input Row Key when using the “Rows” option. This allows you to map or join the results back to the corresponding input rows.
No Full Match
Enable to not append the “Full Match” column to the output table.
Text
Contains the text preview, with the matches and their groups highlighted in color. In case the node is executed in “standalone” mode, you can input or paste your own text here.
Regex
Enter the regular expression. The “Preview” section updates in real time to show the extraction results.
Flags
Select flags for the regular expression. Select from the following:
  • CANON_EQ: Enables canonical equivalence.
  • CASE_INSENSITIVE: Enables case-insensitive matching.
  • COMMENTS: Permits whitespace and comments in pattern.
  • DOTALL: Enables dotall mode.
  • LITERAL: Enables literal parsing of the pattern.
  • MULTILINE: Enables multiline mode.
  • UNICODE_CASE: Enables Unicode-aware case folding.
  • UNICODE_CHARACTER_CLASS: Enables the Unicode version of Predefined character classes and POSIX character classes.
  • UNIX_LINES: Enables Unix lines mode.
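These flags correspond to the constants of java.util.regex.Pattern, which can be combined with a bitwise OR when compiling an expression. The sketch below shows the effect of three of them; the class FlagsDemo and the helper countMatches are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagsDemo {
    // Counts how often the regex matches in the text under the given flags.
    static int countMatches(String regex, int flags, String text) {
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "alpha\nBeta\nALPHA";

        // Without MULTILINE, ^ and $ anchor to the whole input only.
        System.out.println(countMatches("^alpha$", 0, text));                 // 0
        // MULTILINE lets ^ and $ match at line boundaries.
        System.out.println(countMatches("^alpha$", Pattern.MULTILINE, text)); // 1
        // Flags are combined with bitwise OR.
        System.out.println(countMatches("^alpha$",
            Pattern.MULTILINE | Pattern.CASE_INSENSITIVE, text));             // 2
        // DOTALL makes '.' match line terminators as well.
        System.out.println(countMatches("alpha.Beta", 0, text));              // 0
        System.out.println(countMatches("alpha.Beta", Pattern.DOTALL, text)); // 1
    }
}
```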
Template
Use, save, and delete your favorite regex templates. The node comes with default templates for extracting email addresses, URLs, and IPv4 addresses and for tokenizing texts. The predefined templates are read-only and cannot be modified or deleted. Templates which you create are saved globally (in your KNIME preferences) and are thus available in every node. To back up your templates or copy them between different KNIME installations, use the “Export Preferences” and “Import Preferences” functionality in the “File” menu.

Input Ports

Input table with a string column.

Output Ports

Output table with the extraction results. The structure depends on the “Mapping” as specified in the configuration.

Installation

To use this node in KNIME, install Palladian for KNIME from the following update site:

KNIME 4.2
