String Emoji and Character Class Filter

String Character Class and Emoji Filter

Filter out emoji and other classes of characters from a string using built-in regular expressions. This component can be used as a replacement for my original String Emoji Filter.


v1 - 17 September 2022 @takbb Brian Bates
v1.1 23 September 2022 - bug fix - "Connector Punctuation" was removing spaces

For examples of the different "Unicode Character Classes" see
https://www.compart.com/en/unicode/category

This uses a Java Snippet with a Java regex replaceAll call and built-in regular expressions to filter characters from a string based on predefined classes.

Choose the class or classes of characters to be filtered from the list provided. The filter converts the selected class names into regex character classes and then removes the matching characters using a Java Snippet.
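The conversion described above can be sketched roughly as follows. This is only an illustration, not the component's actual snippet code (which is not shown here): the class name CharClassFilterSketch, the lookup map and the method names are all assumptions, and only three of the categories are included.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names are illustrative, not taken from the component.
class CharClassFilterSketch {

    // Assumed lookup from display name to Java regex Unicode category.
    static final Map<String, String> CLASSES = new LinkedHashMap<>();
    static {
        CLASSES.put("Symbols - Emoji and Other", "\\p{So}");
        CLASSES.put("Symbols - Currency", "\\p{Sc}");
        CLASSES.put("Dash Punctuation", "\\p{Pd}");
    }

    // Combine the selected categories into one character class, e.g. [\p{So}\p{Sc}].
    static String buildPattern(List<String> selected) {
        StringBuilder sb = new StringBuilder("[");
        for (String name : selected) {
            String category = CLASSES.get(name);
            if (category == null) {
                throw new IllegalArgumentException("Unknown class: " + name);
            }
            sb.append(category);
        }
        return sb.append("]").toString();
    }

    // Remove every character that belongs to any of the selected classes.
    static String filter(String input, List<String> selected) {
        if (selected.isEmpty()) {
            return input; // "[]" is not a valid pattern, so skip filtering
        }
        return input.replaceAll(buildPattern(selected), "");
    }

    public static void main(String[] args) {
        // \u20AC is the euro sign; \uD83D\uDE00 is an emoji (U+1F600)
        String text = "Price: \u20AC5 \uD83D\uDE00 ok";
        System.out.println(filter(text,
                List.of("Symbols - Emoji and Other", "Symbols - Currency")));
    }
}
```

Combining the selections into a single character class means the string is scanned once, rather than once per selected class.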

Additional character classes and/or regex patterns may be added over time. Please let me know if specific character classes don't appear to work.

This is based on a subset of the character classes described here: https://www.regular-expressions.info/unicode.html in the section "Unicode Categories"

The categories currently implemented are listed below with their Unicode regex equivalents. Please consult regex documentation for a description of each category. Because this component uses Java, it is the Java implementation of these regex patterns that applies.

Unassigned Characters                      \p{Cn}
Formatting Indicators                      \p{Cf}
Control Characters                         \p{C}\p{Cc}
Half of UTF-16 Surrogate pair              \p{Cs}
Codepoints Reserved for Private Use        \p{Co}
Punctuation - Other                        \p{Po}
Symbols                                    \p{S}
Symbols - Emoji and Other                  \p{So}
Symbols - Currency                         \p{Sc}
Symbols - Modifiers                        \p{Sk}
Mathematical Symbols                       \p{Sm}
Letters                                    \p{L}
Letters - Upper Case                       \p{Lu}
Letters - Lower case                       \p{Ll}
Numbers                                    \p{N}
Character Marks                            \p{M}
Non Spacing Marks                          \p{Mn}
Enclosing Marks                            \p{Me}
Separators                                 \p{Z}
Space Separator                            \p{Zs}
Line Separator                             \p{Zl}
Paragraph Separators                       \p{Zp}
Other Numbers (e.g. superscript digits)    \p{No}
Punctuation                                \p{P}
Dash Punctuation                           \p{Pd}
Connector Punctuation                      \p{Pc}
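To give a feel for what a few of these categories match in Java, here is a small sketch that applies one category at a time with replaceAll. The class name CategoryExamplesSketch and the strip helper are made up for illustration and are not part of the component.

```java
// Hypothetical sketch; names are illustrative, not taken from the component.
class CategoryExamplesSketch {

    // Remove all characters matching a single Unicode category pattern.
    static String strip(String input, String category) {
        return input.replaceAll(category, "");
    }

    public static void main(String[] args) {
        // Connector Punctuation matches the underscore but not spaces
        System.out.println(strip("snake_case name", "\\p{Pc}")); // snakecase name

        // Other Numbers matches superscript two (\u00B2) but not the digit 1
        System.out.println(strip("x\u00B2 + 1", "\\p{No}"));     // x + 1

        // Space Separator matches the plain space and the no-break space (\u00A0)
        System.out.println(strip("a b\u00A0c", "\\p{Zs}"));      // abc

        // Letters - Upper Case leaves lower-case letters in place
        System.out.println(strip("Hello World", "\\p{Lu}"));     // ello orld
    }
}
```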



Please contact @takbb on the forum if you have suggestions for improvements or additional useful filter classes.

Options

Category Filters to be applied (included) If limited set (or no classes) displayed, click (?) HELP button below and scroll down to DIALOG OPTIONS
Include all category filters that are to be applied to the column.
If you do not wish to apply a particular filter, move it to the left (EXCLUDE the filter).

If you wish to apply a filter, move it to the right (INCLUDE the filter).
NOTE:
If only a limited set of character classes is displayed (or none), make sure you have selected a column, then close this configuration dialog, execute the node, and open the dialog again. This is a limitation of the way the component populates the selection list box.
Column to filter
Choose the column to be processed. All selected regex filters will be applied to the column resulting in removal of the chosen character classes.
Output Column Name
Specify the name of a new output column

Input Ports

The data table to be filtered

Output Ports

The data table with filtering applied
