Regex Extractor

Create regular expressions with ease. This node lets you build, preview, and test your regexes in real time with your real data. Say goodbye to creating regexes in KNIME by trial and error, or to copying regular expressions and text back and forth between your favorite regex tool and KNIME. The user interface is inspired by RegExr and Regular Expressions 101.

The node uses Java’s Pattern implementation. For each row or text input, it extracts all matches. Each capturing group is automatically mapped to a KNIME column. To define the column names, you can use “named capturing groups”, such as (?<name>[A-Za-z]+).

To exclude a group from the output, define it as a “non-capturing group”: (?:).
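
The following is a minimal sketch in plain Java of how named and non-capturing groups behave in java.util.regex.Pattern. It is not the node's own code. The class name GroupDemo, the group names and the sample pattern are hypothetical and chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Hypothetical pattern: two named capturing groups separated by a
    // non-capturing group that matches the hyphen but is not reported.
    static final Pattern PATTERN =
        Pattern.compile("(?<name>[A-Za-z]+)(?:-)(?<number>[0-9]+)");

    // Describes the first match, or returns null if the text does not match.
    static String firstMatch(String text) {
        Matcher m = PATTERN.matcher(text);
        if (!m.find()) {
            return null;
        }
        return "name=" + m.group("name") + ", number=" + m.group("number");
    }

    // Number of capturing groups in the pattern; (?:...) is not counted.
    static int groupCount() {
        return PATTERN.matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("item-42")); // name=item, number=42
        System.out.println(groupCount());          // 2
    }
}
```

Only the two named groups would end up as columns here; the hyphen is matched but not captured.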

Usage Example: Extract email addresses from text

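For this input text: “Hello, world! mail@palladian.ai The quick brown fox jumps over the lazy dog. bob@example.com Lorem ipsum.” and the regex \b(?<Local Part>[A-Z0-9._%+-]+)@(?<Domain>[A-Z0-9.-]+\.[A-Z]{2,})\b (which needs the CASE_INSENSITIVE flag to match the lowercase addresses, since its character classes list only uppercase letters), the node will create a KNIME table as follows: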

Full Match        | Local Part | Domain
mail@palladian.ai | mail       | palladian.ai
bob@example.com   | bob        | example.com
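
For comparison, the same extraction can be approximated in plain Java. This is a sketch, not the node's implementation. It assumes the CASE_INSENSITIVE flag, and it renames the groups to localPart and domain because plain java.util.regex group names may only contain letters and digits. The class name EmailExtraction is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailExtraction {
    // Same expression as in the example, with Java-compatible group names.
    static final Pattern EMAIL = Pattern.compile(
        "\\b(?<localPart>[A-Z0-9._%+-]+)@(?<domain>[A-Z0-9.-]+\\.[A-Z]{2,})\\b",
        Pattern.CASE_INSENSITIVE);

    // Returns one row per match: { full match, local part, domain }.
    static List<String[]> extract(String text) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            rows.add(new String[] {
                m.group(), m.group("localPart"), m.group("domain")
            });
        }
        return rows;
    }

    public static void main(String[] args) {
        String text = "Hello, world! mail@palladian.ai The quick brown fox "
            + "jumps over the lazy dog. bob@example.com Lorem ipsum.";
        for (String[] row : extract(text)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```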

Standalone Mode

The node can be used with an input table, which provides a string column, or in “standalone” mode (when no input table is connected), where you can enter the text into the node dialog. Click the three dots on the node to switch between standalone mode and input table.

Further reading

For a general introduction to regular expressions, have a look at Regular-Expressions.info.

Options

Input
When an input table is connected, select the column with the text from which to extract.
Drop column
Enable to remove the input column from the result table.
Preview row
Page through the first 25 input rows and show their content in the preview.
Output
Specify how the output table should be composed. The following options are available:
  • Rows: Create one new row for each match. If there is more than one match, this will append multiple rows per input (i.e. the output table will likely contain more rows than the input table). If no match was found, no row is appended (i.e. the output table will contain fewer rows than the input table). The table will have an appended column for each capturing group (and one for the entire match).
  • Rows or Missing: Same as “Rows”, but append a row with missing value cells in case of no match for an input row.
  • Single Row: Append only the first match (or missing value cells, in case there was no match in the input data). In this case, the input and output tables will contain exactly the same number of rows.
  • Columns: Output one row for each input row and append the matches as separate columns.
  • JSON: Append a JSON object with the extracted results. The JSON contains a matches array with all matches. Each match object contains a groups array. A group has the properties start, end (offset within the input text), groupIdx (running index), and the value.
  • List: Append a list cell which contains the flattened structure of all groups of all matches. This is useful e.g. when splitting a text into tokens.
  • Is Match (Boolean): Append a boolean cell which is true if the expression matched, false otherwise.
  • Match Count (Number): Append an integer cell which contains the number of matches.
Note: The output type “Columns” is not available when using this node with “Streaming”.
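
As a rough illustration of how some of these output types differ, the sketch below mimics “Rows”, “Single Row”, “Is Match”, and “Match Count” in plain Java for a single input text. It is an approximation under assumptions, not the node's code. The class OutputModesSketch and its methods are hypothetical, only the full match is considered (no group columns), and null stands in for a missing value cell.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OutputModesSketch {
    // "Rows": every match becomes its own entry; the list may be empty.
    static List<String> rows(Pattern p, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // "Single Row": only the first match, or null (missing) if there is none.
    static String singleRow(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    // "Is Match": true if the expression is found anywhere in the text.
    static boolean isMatch(Pattern p, String text) {
        return p.matcher(text).find();
    }

    // "Match Count": number of matches in the text.
    static int matchCount(Pattern p, String text) {
        return rows(p, text).size();
    }

    public static void main(String[] args) {
        Pattern hashtags = Pattern.compile("#\\w+");
        String text = "Learn #regex with #KNIME";
        System.out.println(rows(hashtags, text));       // [#regex, #KNIME]
        System.out.println(singleRow(hashtags, text));  // #regex
        System.out.println(isMatch(hashtags, text));    // true
        System.out.println(matchCount(hashtags, text)); // 2
    }
}
```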
Name/prefix
A column name (in case of “JSON”, “List”, “Is Match”, or “Match Count”) or prefix (in case of “Rows”, “Single Row”, or “Columns”) for the appended columns. In the “Columns” mode, a placeholder $MATCHINDEX for the index must be used to avoid naming conflicts.
Append RowID
Append the input Row Key when using the “Rows” option. This allows you to map or join the results back to the corresponding input rows.
No Full Match
Enable to not append the “Full Match” column to the output table.
Text
Contains the text preview, with the matches highlighted and their groups shown in color. In case the node is executed in “standalone” mode, you can input or paste your own text here.
Regex
Enter the regular expression. The “Preview” section updates in real time to show the extraction results.
Flags
Select flags for the regular expression. Select from the following:
  • CANON_EQ: Enables canonical equivalence.
  • CASE_INSENSITIVE: Enables case-insensitive matching.
  • COMMENTS: Permits whitespace and comments in pattern.
  • DOTALL: Enables dotall mode.
  • LITERAL: Enables literal parsing of the pattern.
  • MULTILINE: Enables multiline mode.
  • UNICODE_CASE: Enables Unicode-aware case folding.
  • UNICODE_CHARACTER_CLASS: Enables the Unicode version of Predefined character classes and POSIX character classes.
  • UNIX_LINES: Enables Unix lines mode.
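
The flag names correspond to the constants of java.util.regex.Pattern. The sketch below shows in plain Java how a few of them change what matches. How the node combines the selected flags internally is an assumption here, and the class FlagDemo and its helper are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagDemo {
    // Counts the matches of a regex in a text, compiled with the given flags.
    static int count(String regex, int flags, String text) {
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "alpha\nBeta\nalpha";

        // Without MULTILINE, ^ and $ anchor to the whole input only.
        System.out.println(count("^alpha$", 0, text));                 // 0
        // With MULTILINE, ^ and $ also match at line boundaries.
        System.out.println(count("^alpha$", Pattern.MULTILINE, text)); // 2

        // Flags are combined with bitwise OR.
        int flags = Pattern.MULTILINE | Pattern.CASE_INSENSITIVE;
        System.out.println(count("^beta$", flags, text));              // 1

        // With DOTALL, '.' also matches line terminators.
        System.out.println(count("alpha.Beta", 0, text));              // 0
        System.out.println(count("alpha.Beta", Pattern.DOTALL, text)); // 1
    }
}
```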
Template
Use, save, and delete your favorite regex templates. The predefined templates are read-only and cannot be modified or deleted. Templates which you create are saved globally (in your KNIME preferences) and are thus available in every node. To back up your templates or copy them between different KNIME installations, use the “Export Preferences” and “Import Preferences” functionality in the “File” menu. The node comes with the following predefined default templates:
  • Emails
  • Hashtags
  • IPv4 Addresses
  • Link Tags
  • Simple Tokenization
  • URLs (All)
  • URLs (Web)

Input Ports

No description for this port available.
Input table with a string column. If this is removed, the node will work in “standalone” mode, i.e. it allows you to enter the text to extract from in the configuration.

Output Ports

Output table with the extraction results. The structure depends on the “Output” option as specified in the configuration.

Views

This node has no views
