Wildcard Tagger

This node tags terms, that are matching wildcard or regular expressions specified in a string column and assigns a specified tag value and type. Optionally the recognized named entity terms can be set unmodifiable, meaning that the terms are not modified or filtered afterwards by any following node. Matching can be applied case sensitive or case insensitive, on term or on sentence level.

Wildcard matching
Instead of complicated regular expression easy to use wildcards can be specified for matching. Possible wildcards are '*' for any sequence and '?' for any character. It is recommended to use the 'Single term' matching level when using wildcard matching, especially if you are new in the field of regular expressions. If more than one expression matches the last match will override the effects (tagging) of the previous matches. This can lead to "unexpected" behavior, especially when the 'Multi term' matching level is used. If you know what you are doing please don't hesitate to use the 'Multi term' option. For details about the 'Single term' and 'Multi term' option see below.

Example: Sentence "Fulltext1 token1 token2" consists of three terms. Using the two wildcard expressions "Fulltext*" and "token*" in combination with the 'Single term' option results, as expected, in the tagged terms "Fulltext1", "token1", "token2". Using the same wildcard expressions in combination with the 'Multi term' option, results in the tagged term "token1 token2". If this is what you expected, you know how it works and don't have to read any further. If not, here is the explanation. The first expression "Fulltext*" matches on the complete sentence "Fulltext1 token1 token2", tagging the sentence as one single term. The second expression "token*" matches only on the "token1 token2" part of the sentence, which is tagged as a single term as well. Since these two terms are in conflict, the second overrides the first tagging. The order of wildcard (and regular) expressions can be essential to the tagging outcome.

Regular expression matching
Instead of limited wildcards Java regular expressions can be used for matching as well, providing all the flexibility and power for extensive tagging. As the wildcard matching, matching based on regular expressions can be combined with the 'Single term' and 'Multi term' option. With the 'Single term' option the expressions are applied on one single term at a time. If the expression matches completely the term is tagged, otherwise not. With the 'Multi term' option the expressions are applied on complete sentences. If a substring of the sentence matches, and the substring consists of complete terms, the terms are tagged as one term. Thus multi words like "data mining" can be tagged. Furthermore all matches of a sentence are tagged as long as they are not conflicting.

Be aware that based on the specified regular expressions, tagging (matching) can be expensive in terms of processing time due to excessive backtracking. For detailed information about regular expressions in Java see the documentation of java.util.regex.Pattern and the Java tutorial about regular expressions.

Single term level (term based)
Matching is applied on single terms only. Terms are tagged as named entities if at least one of the specified regular expressions matches. If more than one expression is matching the last matching expression overrides previous, conflicting matches. Note: A regular expression must match completely on a term to be tagged. A partial match is not sufficient.

Example: Term "123abc456" matches with regex "\d+[a-z]+\d+" but not with "\d+".
To find terms beginning with a certain string, e.g. "data" use the regular expression "data.*". The expression ".*data" matches to all terms ending with "data". Alternatively when using wildcard expressions "data*" or "*data" can be used.

Multi term level (sentence based):
Matching is applied on sentences. All specified expressions are used for matching. If more than one expression matches, the last matching expression overrides previous, conflicting matches. If multiple terms are matching to one regular expression all terms will be tagged, as long as they do not overlap with the previous term.

Example: Sentence "term1 term2" matches with regex "term\d+" in a way that "term1" and "term2" are tagged as separate terms. Using "[term\d\s]+" as regular expression results in "term1 term2" as one tagged term. Again the regular expression "\d+" would not match.
To find multi words that start with "data", such as "data mining", "data analysis", "data warehouse" for instance the regular expression "data\s+[a-z]+" can be used. With this expression the term "datastore" would not be matching.
The regular expression ".*" matches all terms in a sentence, meaning that the complete sentence will be tagged as one term.

Options

General options

Document column: The column containing the documents to tag.
Replace column: If checked, the documents of the selected document column will be replaced by the new tagged documents. Otherwise the tagged documents will be appended as new column.
Append column: The name of the new appended column, containing the tagged documents.
Word tokenizer: Select the tokenizer used for word tokenization. Go to Preferences -> KNIME -> Textprocessing to read the description for each tokenizer.
Number of maximal parallel tagging processes: Defines the maximal number of parallel threads that are used for tagging. Please note, that for each thread a tagging model will be loaded into memory. If this value is set to a number greater than 1, make sure that enough heap space is available, in order to be able to load the models. If you are not sure how much heap is available for KNIME, leave the number to 1.

Tagger options

Expression column: Specifies the string column containing the wildcard or regular expressions to match.
Set named entities unmodifiable: Sets recognized named entity terms unmodifiable.
Case sensitive: If checked, matching will be done case sensitive, otherwise not.
Matching method: The wildcard matching method allows for easy to use wildcards, such as '*' for any sequence and '?' for any character. Wildcards are internally translated into regular expressions, whereas '*' is translated into '.*' and '?' into '.'.
The regular expression matching allows for fully fledged Java regular expressions. Please be aware that with great power comes great responsibility. Matching can be computationally expensive based on the specified expressions. Choose your expressions wisely!
Matching level: Expressions can be matched on term ('Single term') or on sentence ('Multi term') level. If the single term option is selected, expressions will only be applied on single terms. If a term matches to one of the specified expressions the term will be tagged. Multiple words (terms) can not be tagged with the single term setting. If the multi term option is selected, expressions will be applied on sentence level. If a part of the sentence matches, the corresponding terms will be tagged. Multiple words can thus be tagged.
Tag type: Specifies the tag type of which tag values can be chosen.
Tag value: Specifies the tag value to use for tagging recognized named entities.

Input Ports

: The input table containing the documents to tag.
: The input table containing the string column with the expression to match with.

Output Ports

: An output table containing the tagged documents.

Popular Predecessors

Popular Successors

Views

This node has no views

Workflows

Developers

You want to see the source code for this node? Click the following button and we’ll use our super-powers to find it for you.

Installation

To use this node in KNIME, install the extension KNIME Textprocessing from the below update site following our NodePit Product and Node Installation Guide:

v5.5

A zipped version of the software site can be downloaded here.

Plugin provider: KNIME AG, Zurich, Switzerland

Plugin version: 5.5.0.v202412191419

On NodePit since: 2025-07-02

Last update: 2025-07-31

Tags: Streamable

KNIME versions: Since v3.6

Deploy, schedule, execute, and monitor your KNIME workflows locally, in the cloud or on-premises – with our brand new NodePit Runner.

Try NodePit Runner!