Wildcard Tagger

This node tags terms that match wildcard or regular expressions specified in a string column, and assigns them a specified tag value and type. Optionally, the recognized named entity terms can be set unmodifiable, meaning that they are not modified or filtered by any subsequent node. Matching can be case sensitive or case insensitive, and can be applied on term or on sentence level.

Wildcard matching
Instead of complicated regular expressions, easy-to-use wildcards can be specified for matching. The possible wildcards are '*' for any sequence of characters and '?' for any single character. It is recommended to use the 'Single term' matching level with wildcard matching, especially if you are new to regular expressions. If more than one expression matches, the last match overrides the tagging of previous, conflicting matches. This can lead to unexpected behavior, especially when the 'Multi term' matching level is used. If you know what you are doing, feel free to use the 'Multi term' option. For details about the 'Single term' and 'Multi term' options see below.

Example: The sentence "Fulltext1 token1 token2" consists of three terms. Using the two wildcard expressions "Fulltext*" and "token*" in combination with the 'Single term' option results, as expected, in the tagged terms "Fulltext1", "token1" and "token2". Using the same wildcard expressions in combination with the 'Multi term' option results in the single tagged term "token1 token2". If this is what you expected, you know how it works and don't have to read any further. If not, here is the explanation. On sentence level, '*' matches any sequence of characters, including the spaces between terms. The first expression "Fulltext*" therefore matches the complete sentence "Fulltext1 token1 token2" and tags it as one single term. The second expression "token*" matches only the "token1 token2" part of the sentence, which is tagged as a single term as well. Since these two terms conflict, the second tagging overrides the first. The order of wildcard (and regular) expressions can therefore be essential to the tagging outcome.
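The example above can be reproduced with a small Java sketch. It is not the node's implementation: the class and method names are hypothetical, terms are assumed to be separated by whitespace, and the conflict rule is modeled simply as "a later match removes any earlier match it overlaps with". Wildcards are translated as described under 'Matching method' ('*' to '.*' and '?' to '.'); leaving all other characters unchanged is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WildcardOverrideSketch {

    // '*' becomes ".*" and '?' becomes "."; other characters are passed through
    // unchanged (an assumption of this sketch).
    static String toRegex(String wildcard) {
        return wildcard.replace("*", ".*").replace("?", ".");
    }

    // 'Single term': an expression has to match a whole term.
    // Terms are obtained by splitting on whitespace (assumption).
    static List<String> tagSingleTerms(String sentence, List<String> wildcards) {
        List<String> tagged = new ArrayList<>();
        for (String term : sentence.trim().split("\\s+")) {
            for (String w : wildcards) {
                if (Pattern.matches(toRegex(w), term)) {
                    tagged.add(term);
                    break;
                }
            }
        }
        return tagged;
    }

    // 'Multi term': each expression is searched in the whole sentence, in order.
    // A later match removes every earlier match it overlaps with.
    static List<String> tagMultiTerms(String sentence, List<String> wildcards) {
        List<int[]> spans = new ArrayList<>();
        for (String w : wildcards) {
            Matcher m = Pattern.compile(toRegex(w)).matcher(sentence);
            while (m.find()) {
                if (m.start() == m.end()) {
                    continue; // ignore empty matches
                }
                int[] span = {m.start(), m.end()};
                spans.removeIf(old -> old[0] < span[1] && span[0] < old[1]);
                spans.add(span);
            }
        }
        List<String> tagged = new ArrayList<>();
        for (int[] s : spans) {
            tagged.add(sentence.substring(s[0], s[1]));
        }
        return tagged;
    }

    public static void main(String[] args) {
        String sentence = "Fulltext1 token1 token2";
        List<String> wildcards = Arrays.asList("Fulltext*", "token*");
        System.out.println(tagSingleTerms(sentence, wildcards)); // [Fulltext1, token1, token2]
        System.out.println(tagMultiTerms(sentence, wildcards));  // [token1 token2]
    }
}
```

Swapping the two expressions in this sketch leaves the complete sentence as the only tagged term, which illustrates why the order of expressions matters on the 'Multi term' level.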

Regular expression matching
Instead of the limited wildcards, Java regular expressions can be used for matching as well, providing all the flexibility and power needed for extensive tagging. Like wildcard matching, matching based on regular expressions can be combined with the 'Single term' and 'Multi term' options. With the 'Single term' option, the expressions are applied to one single term at a time. If an expression matches the term completely, the term is tagged, otherwise not. With the 'Multi term' option, the expressions are applied to complete sentences. If a substring of the sentence matches, and the substring consists of complete terms, these terms are tagged as one term. Thus multi-word terms like "data mining" can be tagged. Furthermore, all matches within a sentence are tagged as long as they do not conflict.

Be aware that, depending on the specified regular expressions, tagging (matching) can be expensive in terms of processing time due to excessive backtracking. For detailed information about regular expressions in Java, see the documentation of java.util.regex.Pattern and the Java tutorial about regular expressions.
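As an illustration of the backtracking issue, the following sketch contrasts a pattern with nested quantifiers with a variant that uses a possessive quantifier, a feature of java.util.regex.Pattern. The patterns and the class name are made up for this example and are not taken from the node.

```java
import java.util.regex.Pattern;

final class BacktrackingSketch {

    // Nested quantifiers: for a long run of 'a' that is not followed by 'b',
    // the engine may try a very large number of ways to split the run
    // before it reports that there is no match.
    static final Pattern NESTED = Pattern.compile("(a+)+b");

    // Possessive quantifier: "a++" never gives characters back,
    // so there is nothing to retry when the trailing 'b' is missing.
    static final Pattern POSSESSIVE = Pattern.compile("(a++)+b");

    public static void main(String[] args) {
        String matching = "aaaab";
        String nonMatching = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"; // 40 x 'a', no 'b'

        // Both patterns accept the same matching input.
        System.out.println(NESTED.matcher(matching).matches());
        System.out.println(POSSESSIVE.matcher(matching).matches());

        // Only the possessive variant is applied to the non-matching input here,
        // because the nested variant may take very long on it.
        System.out.println(POSSESSIVE.matcher(nonMatching).matches());
    }
}
```

Expressions with nested or overlapping quantifiers are the usual candidates for this kind of slowdown, so it can pay off to simplify them before using them for tagging.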

Single term level (term based)
Matching is applied to single terms only. Terms are tagged as named entities if at least one of the specified regular expressions matches. If more than one expression matches, the last matching expression overrides previous, conflicting matches. Note: A regular expression must match a term completely for the term to be tagged. A partial match is not sufficient.

Example: Term "123abc456" matches with regex "\d+[a-z]+\d+" but not with "\d+".
To find terms beginning with a certain string, e.g. "data", use the regular expression "data.*". The expression ".*data" matches all terms ending with "data". Alternatively, when using wildcard matching, the expressions "data*" or "*data" can be used.
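The complete-match rule corresponds to the behavior of Matcher.matches() in java.util.regex. The following sketch illustrates this with the examples above. The class and method names are hypothetical, and mapping the 'Case sensitive' option to the Pattern.CASE_INSENSITIVE flag is an assumption made for illustration.

```java
import java.util.regex.Pattern;

final class SingleTermSketch {

    // Returns true if the regular expression matches the complete term.
    // caseSensitive = false is modeled with Pattern.CASE_INSENSITIVE (assumption).
    static boolean tagsTerm(String regex, String term, boolean caseSensitive) {
        int flags = caseSensitive ? 0 : Pattern.CASE_INSENSITIVE;
        return Pattern.compile(regex, flags).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(tagsTerm("\\d+[a-z]+\\d+", "123abc456", true)); // true
        System.out.println(tagsTerm("\\d+", "123abc456", true));           // false, partial match only
        System.out.println(tagsTerm("data.*", "database", true));          // true, starts with "data"
        System.out.println(tagsTerm(".*data", "metadata", true));          // true, ends with "data"
        System.out.println(tagsTerm("data.*", "Database", false));         // true, case is ignored
    }
}
```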

Multi term level (sentence based)
Matching is applied to sentences. All specified expressions are used for matching. If more than one expression matches, the last matching expression overrides previous, conflicting matches. If one regular expression matches several times within a sentence, all matches are tagged, as long as they do not overlap with a previous match.

Example: The sentence "term1 term2" matches the regex "term\d+" in such a way that "term1" and "term2" are tagged as separate terms. Using "[term\d\s]+" as regular expression results in "term1 term2" as one tagged term. As on the single term level, the regular expression "\d+" does not tag anything, because it matches only parts of terms.
To find multi-word terms that start with "data", such as "data mining", "data analysis" or "data warehouse", the regular expression "data\s+[a-z]+" can be used. This expression does not match the term "datastore".
The regular expression ".*" matches all terms in a sentence, meaning that the complete sentence will be tagged as one term.
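The sentence-level behavior described above can be approximated with Matcher.find() plus a check that each match starts and ends on a term boundary. This is only a sketch under simplifying assumptions: terms are separated by whitespace (the node uses the selected word tokenizer instead), conflicts between different expressions are not handled, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MultiTermSketch {

    // True if the range [start, end) is non-empty and consists of complete,
    // whitespace-separated terms of the sentence.
    static boolean coversCompleteTerms(String s, int start, int end) {
        if (start >= end) {
            return false;
        }
        boolean startsTerm = start == 0 || Character.isWhitespace(s.charAt(start - 1));
        boolean endsTerm = end == s.length() || Character.isWhitespace(s.charAt(end));
        boolean trimmed = !Character.isWhitespace(s.charAt(start))
                && !Character.isWhitespace(s.charAt(end - 1));
        return startsTerm && endsTerm && trimmed;
    }

    // Collects all non-overlapping matches of one expression that cover complete terms.
    static List<String> tag(String regex, String sentence) {
        List<String> tagged = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(sentence);
        while (m.find()) {
            if (coversCompleteTerms(sentence, m.start(), m.end())) {
                tagged.add(m.group());
            }
        }
        return tagged;
    }

    public static void main(String[] args) {
        System.out.println(tag("term\\d+", "term1 term2"));        // [term1, term2]
        System.out.println(tag("[term\\d\\s]+", "term1 term2"));   // [term1 term2]
        System.out.println(tag("\\d+", "term1 term2"));            // []
        System.out.println(tag("data\\s+[a-z]+", "we use data mining and a datastore")); // [data mining]
        System.out.println(tag(".*", "term1 term2"));              // [term1 term2]
    }
}
```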

Options

General options

Document column
The column containing the documents to tag.
Replace column
If checked, the documents of the selected document column will be replaced by the new tagged documents. Otherwise the tagged documents will be appended as a new column.
Append column
The name of the newly appended column containing the tagged documents.
Word tokenizer
Select the tokenizer used for word tokenization. Go to Preferences -> KNIME -> Textprocessing to read the description for each tokenizer.
Number of maximal parallel tagging processes
Defines the maximum number of parallel threads that are used for tagging. Please note that a tagging model will be loaded into memory for each thread. If this value is set to a number greater than 1, make sure that enough heap space is available to load the models. If you are not sure how much heap is available for KNIME, leave the number at 1.

Tagger options

Expression column
Specifies the string column containing the wildcard or regular expressions to match.
Set named entities unmodifiable
Sets recognized named entity terms unmodifiable.
Case sensitive
If checked, matching is case sensitive, otherwise it is case insensitive.
Matching method
The wildcard matching method allows for easy-to-use wildcards, such as '*' for any sequence of characters and '?' for any single character. Wildcards are internally translated into regular expressions: '*' is translated into '.*' and '?' into '.'.
The regular expression matching allows for fully fledged Java regular expressions. Please be aware that with great power comes great responsibility. Matching can be computationally expensive based on the specified expressions. Choose your expressions wisely!
Matching level
Expressions can be matched on term ('Single term') or on sentence ('Multi term') level. If the single term option is selected, expressions will only be applied to single terms. If a term matches one of the specified expressions, the term will be tagged. Multiple words (terms) cannot be tagged as one term with the single term setting. If the multi term option is selected, expressions will be applied on sentence level. If a part of the sentence matches, the corresponding terms will be tagged. Multiple words can thus be tagged as one term.
Tag type
Specifies the tag type from which tag values can be chosen.
Tag value
Specifies the tag value to use for tagging recognized named entities.

Input Ports

The input table containing the documents to tag.
The input table containing the string column with the expressions to match.

Output Ports

An output table containing the tagged documents.

Views

This node has no views.