This node tags terms that match wildcard or regular expressions specified in a string column, and assigns a specified tag value and type to them. Optionally, the recognized named entity terms can be set unmodifiable, meaning that they are not modified or filtered by any subsequent node. Matching can be case sensitive or case insensitive, and can be applied on term level or on sentence level.
Instead of complicated regular expressions, easy-to-use wildcards can be specified for matching. The available wildcards are '*' for any sequence of characters and '?' for any single character. The 'Single term' matching level is recommended for wildcard matching, especially if you are new to regular expressions. If more than one expression matches, the last match overrides the tagging of the previous matches. This can lead to unexpected behavior, especially when the 'Multi term' matching level is used, so choose that level only if you are familiar with how it works. For details about the 'Single term' and 'Multi term' options, see below.
Example: The sentence "Fulltext1 token1 token2" consists of three terms. Using the two wildcard expressions "Fulltext*" and "token*" with the 'Single term' option results, as expected, in the tagged terms "Fulltext1", "token1" and "token2". Using the same wildcard expressions with the 'Multi term' option results in the single tagged term "token1 token2". The reason is as follows. The first expression "Fulltext*" matches the complete sentence "Fulltext1 token1 token2" and tags the sentence as one single term. The second expression "token*" matches only the "token1 token2" part of the sentence, which is tagged as a single term as well. Since these two terms conflict, the second tagging overrides the first. The order of wildcard (and regular) expressions can therefore be essential to the tagging outcome.
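The example above can be reproduced with a small model. The following is only a sketch under stated assumptions, not the node's implementation: it uses Python's re module as a stand-in for Java regular expressions, treats terms as whitespace-separated, and the helper names wildcard_to_regex, tag_single_term and tag_multi_term are hypothetical.

```python
import re

def wildcard_to_regex(wildcard):
    """Translate '*' to '.*' and '?' to '.', escaping every other character."""
    parts = []
    for ch in wildcard:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

def tag_single_term(terms, wildcards):
    """Keep every term that at least one wildcard matches completely."""
    patterns = [re.compile(wildcard_to_regex(w)) for w in wildcards]
    return [t for t in terms if any(p.fullmatch(t) for p in patterns)]

def tag_multi_term(sentence, wildcards):
    """Apply the expressions in order; a later match replaces earlier,
    overlapping matches (assumed conflict rule, modeled on the text above)."""
    spans = []
    for w in wildcards:
        for m in re.finditer(wildcard_to_regex(w), sentence):
            start, end = m.span()
            if start == end:
                continue  # ignore empty matches
            # Drop previously tagged spans that overlap the new one.
            spans = [(s, e) for (s, e) in spans if e <= start or s >= end]
            spans.append((start, end))
    return [sentence[s:e] for (s, e) in sorted(spans)]

sentence = "Fulltext1 token1 token2"
wildcards = ["Fulltext*", "token*"]
print(tag_single_term(sentence.split(), wildcards))
print(tag_multi_term(sentence, wildcards))
print(tag_multi_term(sentence, list(reversed(wildcards))))
```

In this model, reversing the order of the two expressions leaves the complete sentence as the only tagged term, which illustrates why the order of expressions matters on the 'Multi term' level.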
Regular expression matching
Instead of the limited wildcards, Java regular expressions can be used for matching as well, providing full flexibility and power for extensive tagging. Like wildcard matching, regular expression matching can be combined with the 'Single term' and 'Multi term' options. With the 'Single term' option, the expressions are applied to one term at a time. A term is tagged only if the expression matches it completely. With the 'Multi term' option, the expressions are applied to complete sentences. If a substring of the sentence matches, and that substring consists of complete terms, those terms are tagged as one term. This way, multi-word terms such as "data mining" can be tagged. Furthermore, all matches in a sentence are tagged as long as they do not conflict.
Be aware that, depending on the specified regular expressions, tagging (matching) can be expensive in terms of processing time due to excessive backtracking. For detailed information about regular expressions in Java, see the documentation of java.util.regex.Pattern and the Java tutorial on regular expressions.
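To illustrate what excessive backtracking looks like, the sketch below times a pattern with nested quantifiers against an equivalent flat pattern on input that cannot match. It uses Python's re module as a stand-in, on the assumption that it behaves like java.util.regex in this respect since both are backtracking engines. The patterns and the helper time_full_match are illustrative only and are not taken from the node.

```python
import re
import time

# Nested quantifiers: before failing, the engine tries many ways of
# splitting the run of 'a' characters between the inner and outer '+'.
NESTED = re.compile(r"(a+)+b")
# Accepts the same strings on a full match, without the nesting.
FLAT = re.compile(r"a+b")

def time_full_match(pattern, text):
    """Return the match result and the elapsed time in seconds."""
    start = time.perf_counter()
    result = pattern.fullmatch(text)
    return result, time.perf_counter() - start

for n in (12, 14, 16, 18):
    text = "a" * n  # no trailing 'b', so both patterns must fail
    _, nested_time = time_full_match(NESTED, text)
    _, flat_time = time_full_match(FLAT, text)
    print(f"n={n}: nested {nested_time:.4f}s, flat {flat_time:.6f}s")
```

On a backtracking engine, the time for the nested variant typically grows much faster with n than the time for the flat variant. Avoiding nested or overlapping quantifiers is one common way to keep tagging fast.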
Single term level (term based)
Matching is applied to single terms only. A term is tagged as a named entity if at least one of the specified regular expressions matches it. If more than one expression matches, the last matching expression overrides previous, conflicting matches. Note: A regular expression must match a term completely for the term to be tagged. A partial match is not sufficient.
Example: The term "123abc456" matches the regex "\d+[a-z]+\d+" but not "\d+".
To find terms beginning with a certain string, e.g. "data", use the regular expression "data.*". The expression ".*data" matches all terms ending with "data". Alternatively, the wildcard expressions "data*" or "*data" can be used.
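The complete-match rule can be checked with a few lines of code. This is a minimal sketch that assumes Python's re.fullmatch behaves like a complete match in Java for these simple patterns. The helper matches_term and the sample terms "database" and "metadata" are hypothetical.

```python
import re

def matches_term(regex, term):
    """True only if the expression matches the complete term."""
    return re.fullmatch(regex, term) is not None

print(matches_term(r"\d+[a-z]+\d+", "123abc456"))  # True
print(matches_term(r"\d+", "123abc456"))           # False: only a partial match
print(matches_term(r"data.*", "database"))         # True: begins with "data"
print(matches_term(r".*data", "metadata"))         # True: ends with "data"
print(matches_term(r"data.*", "metadata"))         # False: does not begin with "data"
```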
Multi term level (sentence based)
Matching is applied to sentences. All specified expressions are used for matching. If more than one expression matches, the last matching expression overrides previous, conflicting matches. If a regular expression matches multiple terms, all of them are tagged, as long as they do not overlap with the previously tagged term.
Example: The sentence "term1 term2" matches the regex "term\d+" in such a way that "term1" and "term2" are tagged as separate terms. Using "[term\d\s]+" as the regular expression results in "term1 term2" as one tagged term. Again, the regular expression "\d+" would not match, because it covers only part of a term.
To find multi-word terms that start with "data", such as "data mining", "data analysis" or "data warehouse", the regular expression "data\s+[a-z]+" can be used. The term "datastore" would not match this expression.
The regular expression ".*" matches all terms in a sentence, meaning that the complete sentence will be tagged as one term.
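The sentence-level examples above can be modeled as follows. This is a simplified sketch, not the node's implementation: it assumes that terms are separated by whitespace, uses Python's re module in place of Java regular expressions, and handles a single expression only, so overriding between expressions is not covered. The function tag_sentence and the sample sentence containing "datastore" are hypothetical.

```python
import re

def tag_sentence(regex, sentence):
    """Return the matches that consist of complete terms only."""
    # Character offsets at which a whitespace-separated term starts or ends.
    term_spans = [m.span() for m in re.finditer(r"\S+", sentence)]
    starts = {s for s, _ in term_spans}
    ends = {e for _, e in term_spans}
    tagged = []
    for m in re.finditer(regex, sentence):
        # Keep a match only if it begins at a term start and stops at a
        # term end; partial-term and empty matches are dropped.
        if m.start() in starts and m.end() in ends:
            tagged.append(m.group())
    return tagged

print(tag_sentence(r"term\d+", "term1 term2"))       # ['term1', 'term2']
print(tag_sentence(r"[term\d\s]+", "term1 term2"))   # ['term1 term2']
print(tag_sentence(r"\d+", "term1 term2"))           # []
print(tag_sentence(r"data\s+[a-z]+", "data mining and datastore"))  # ['data mining']
print(tag_sentence(r".*", "term1 term2"))            # ['term1 term2']
```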
To use this node in KNIME, install the KNIME Textprocessing extension.