String Character Class and Emoji Filter
Filter out emoji and other classes of characters from a string using built-in regular expressions. This component can be used as a replacement for my original String Emoji Filter.
v1 - 17 September 2022 - @takbb Brian Bates
v1.1 - 23 September 2022 - bug fix: "Connector Punctuation" was removing spaces
For examples of the different "Unicode Character Classes" see
https://www.compart.com/en/unicode/category
The component uses a Java Snippet that calls Java's regex replaceAll, with built-in regular expressions, to filter characters from a string based on predefined classes.
Choose the class or classes of characters to be filtered from the list provided. The component converts the selected class names into regex character classes and then removes the matching characters in the Java Snippet.
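The conversion described above can be sketched in plain Java. This is a minimal illustration and not the component's actual snippet code: the class name CharClassFilter, the helper methods buildPattern and filter, and the four-entry mapping are assumptions made for the example. Only the display names and \p{..} patterns are taken from the list on this page.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CharClassFilter {
    // Hypothetical subset of the display-name -> Unicode category mapping.
    private static final Map<String, String> CLASSES = new LinkedHashMap<>();
    static {
        CLASSES.put("Symbols - Emoji and Other", "\\p{So}");
        CLASSES.put("Symbols - Currency", "\\p{Sc}");
        CLASSES.put("Connector Punctuation", "\\p{Pc}");
        CLASSES.put("Dash Punctuation", "\\p{Pd}");
    }

    // Combine the selected categories into one regex character class,
    // e.g. [\p{So}\p{Sc}].
    static String buildPattern(List<String> selected) {
        StringBuilder sb = new StringBuilder("[");
        for (String name : selected) {
            String category = CLASSES.get(name);
            if (category == null) {
                throw new IllegalArgumentException("Unknown class: " + name);
            }
            sb.append(category);
        }
        return sb.append("]").toString();
    }

    // Remove every character that belongs to any of the selected classes.
    static String filter(String input, List<String> selected) {
        if (input == null || selected.isEmpty()) {
            return input; // nothing to filter
        }
        return input.replaceAll(buildPattern(selected), "");
    }

    public static void main(String[] args) {
        // \u20AC is the euro sign (Sc), \u263A is a smiley symbol (So).
        String text = "Price: \u20AC5 \u263A snake_case";
        List<String> selected =
                Arrays.asList("Symbols - Emoji and Other", "Symbols - Currency");
        System.out.println(filter(text, selected));
    }
}
```

Joining the categories into a single bracketed character class means the string is scanned once, however many classes are selected.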
Additional character classes and/or regex patterns may be added over time. Please let me know if specific character classes don't appear to work.
This is based on a subset of the character classes described in the "Unicode Categories" section of https://www.regular-expressions.info/unicode.html
The categories currently implemented are listed below with their Unicode regex equivalents. See online regex documentation for a description of each category. Because this component uses Java, it is Java's implementation of these patterns that applies.
Unassigned Characters                    \p{Cn}
Formatting Indicators                    \p{Cf}
Control Characters                       \p{C}\p{Cc}
Half of UTF-16 Surrogate pair            \p{Cs}
Codepoints Reserved for Private Use      \p{Co}
Punctuation - Other                      \p{Po}
Symbols                                  \p{S}
Symbols - Emoji and Other                \p{So}
Symbols - Currency                       \p{Sc}
Symbols - Modifiers                      \p{Sk}
Mathematical Symbols                     \p{Sm}
Letters                                  \p{L}
Letters - Upper Case                     \p{Lu}
Letters - Lower case                     \p{Ll}
Numbers                                  \p{N}
Character Marks                          \p{M}
Non Spacing Marks                        \p{Mn}
Enclosing Marks                          \p{Me}
Separators                               \p{Z}
Space Separator                          \p{Zs}
Line Separator                           \p{Zl}
Paragraph Separators                     \p{Zp}
Other Numbers (e.g. superscript digits)  \p{No}
Punctuation                              \p{P}
Dash Punctuation                         \p{Pd}
Connector Punctuation                    \p{Pc}
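To check which category a particular character falls into before choosing a filter, a small probe like the following may help. It is a sketch only: the class CategoryProbe and its inCategory helper are hypothetical names and not part of the component. It relies on java.util.regex.Pattern accepting the two-letter Unicode category names listed above.

```java
import java.util.regex.Pattern;

public class CategoryProbe {
    // Returns true if the given code point belongs to the Unicode
    // category with the given short name (e.g. "So", "Pc", "Zs").
    static boolean inCategory(int codePoint, String category) {
        // Build a one-code-point string; this also handles characters
        // outside the Basic Multilingual Plane, such as most emoji.
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches("\\p{" + category + "}", s);
    }

    public static void main(String[] args) {
        System.out.println("'_' in Pc: " + inCategory('_', "Pc"));
        System.out.println("' ' in Pc: " + inCategory(' ', "Pc"));
        System.out.println("' ' in Zs: " + inCategory(' ', "Zs"));
        System.out.println("U+1F600 in So: " + inCategory(0x1F600, "So"));
        System.out.println("U+00B2 in No: " + inCategory(0x00B2, "No"));
    }
}
```

A plain space belongs to Space Separator (\p{Zs}), not Connector Punctuation (\p{Pc}), which covers characters such as the underscore. Selecting Connector Punctuation alone should therefore leave spaces intact.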
Please contact @takbb on the forum if you have suggestions for improvements or additional useful filter-classes
To use this component, download it and open it in KNIME.