String Emoji Filter

** EXPERIMENTAL ** You are welcome to use this component at your own risk. Please check back for improvements to the filters.

Filters emoji characters out of a string using a built-in regular expression. This is a proof-of-concept demonstration component. A future version may allow the regular expression to be changed.

23 April 2021 @takbb Brian Bates
This component uses a Java Snippet with a Java regex replaceAll call and one of the regular expressions listed below to identify emoji. It is currently experimental, and several "filter types" are available.

Please contact @takbb on the forum if you have suggestions for improving the regex or the techniques used.

FILTER TYPE 1
**************
Filters using the following regular expression, which appears to provide only limited emoji filtering:
[\\p{C}\\p{So}\uFE00-\uFE0F\\x{E0100}-\\x{E01EF}]
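As a rough illustration of how such an expression might be applied with replaceAll, here is a minimal sketch. The class and method names (FilterType1Sketch, stripEmoji) are hypothetical and are not taken from the component; only the expression itself is copied from above. Note that \p{C} also covers control characters such as tabs and newlines, so those are removed along with the emoji.

```java
final class FilterType1Sketch {
    // Filter Type 1 expression as quoted above, written as a Java string literal.
    static final String TYPE1_REGEX =
        "[\\p{C}\\p{So}\uFE00-\uFE0F\\x{E0100}-\\x{E01EF}]";

    // Hypothetical helper: replaces every match with the empty string.
    static String stripEmoji(String input) {
        return input == null ? null : input.replaceAll(TYPE1_REGEX, "");
    }

    public static void main(String[] args) {
        // Sample contains U+1F600 (grinning face) and U+2600 (sun), both category So.
        String sample = "Hello \uD83D\uDE00 world \u2600";
        System.out.println("[" + stripEmoji(sample) + "]");
    }
}
```

Both sample characters are in the "Symbol, other" category, so this expression removes them. Emoji built from characters outside the listed classes may be left partly in place.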

FILTER TYPE 2
**************
Uses the following regular expression and is currently the most extensive of the filters:
emojiRegex="(?:[\\u2700-\\u27bf]|" +
"(?:[\\ud83c\\udde6-\\ud83c\\uddff]){2}|" +
"[\\ud800\\udc00-\\uDBFF\\uDFFF]|[\\u2600-\\u26FF])[\\ufe0e\\ufe0f]?(?:[\\u0300-\\u036f\\ufe20-\\ufe23\\u20d0-\\u20f0]|[\\ud83c\\udffb-\\ud83c\\udfff])?" +
"(?:\\u200d(?:[^\\ud800-\\udfff]|" +
"(?:[\\ud83c\\udde6-\\ud83c\\uddff]){2}|" +
"[\\ud800\\udc00-\\uDBFF\\uDFFF]|[\\u2600-\\u26FF])[\\ufe0e\\ufe0f]?(?:[\\u0300-\\u036f\\ufe20-\\ufe23\\u20d0-\\u20f0]|[\\ud83c\\udffb-\\ud83c\\udfff])?)*|" +
"[\\u0023-\\u0039]\\ufe0f?\\u20e3|\\u3299|\\u3297|\\u303d|\\u3030|\\u24c2|[\\ud83c\\udd70-\\ud83c\\udd71]|[\\ud83c\\udd7e-\\ud83c\\udd7f]|\\ud83c\\udd8e|[\\ud83c\\udd91-\\ud83c\\udd9a]|[\\ud83c\\udde6-\\ud83c\\uddff]|[\\ud83c\\ude01-\\ud83c\\ude02]|\\ud83c\\ude1a|\\ud83c\\ude2f|[\\ud83c\\ude32-\\ud83c\\ude3a]|[\\ud83c\\ude50-\\ud83c\\ude51]|\\u203c|\\u2049|[\\u25aa-\\u25ab]|\\u25b6|\\u25c0|[\\u25fb-\\u25fe]|\\u00a9|\\u00ae|\\u2122|\\u2139|\\ud83c\\udc04|[\\u2600-\\u26FF]|\\u2b05|\\u2b06|\\u2b07|\\u2b1b|\\u2b1c|\\u2b50|\\u2b55|\\u231a|\\u231b|\\u2328|\\u23cf|[\\u23e9-\\u23f3]|[\\u23f8-\\u23fa]|\\ud83c\\udccf|\\u2934|\\u2935|[\\u2190-\\u21ff]";
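Some alternatives in this expression match symbols that are not always thought of as emoji, such as the copyright sign, the trademark sign and the arrows block. The sketch below applies only a short excerpt of the expression (four alternatives copied from its last line) to show the effect. It is not the full expression, and the class and method names are hypothetical.

```java
final class FilterType2ExcerptSketch {
    // Excerpt only: copyright, registered, trademark, and the arrows block U+2190..U+21FF.
    static final String EXCERPT = "\\u00a9|\\u00ae|\\u2122|[\\u2190-\\u21ff]";

    // Hypothetical helper: removes everything the excerpt matches.
    static String stripExcerpt(String input) {
        return input.replaceAll(EXCERPT, "");
    }

    public static void main(String[] args) {
        // The copyright sign and the right arrow are both removed.
        String sample = "\u00a9 2021 Acme \u2192 next";
        System.out.println("[" + stripExcerpt(sample) + "]");
    }
}
```

If such symbols need to be kept, the corresponding alternatives would have to be dropped from the expression, which the component does not currently expose as an option.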


FILTER TYPE 3
**************
This filter does not use regular expressions. Instead, it looks for "surrogate pairs" of characters, which indicate that the character is likely to be an emoji. It filters many common emoji but does not find all of them.
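The surrogate-pair idea can be sketched as follows. This is an assumption about the general technique, not the component's actual code, and the names SurrogateFilterSketch and dropSurrogates are hypothetical. Java strings are UTF-16, so characters above U+FFFF (which include most pictographic emoji) are stored as two surrogate chars; the sketch drops every surrogate char.

```java
final class SurrogateFilterSketch {
    // Hypothetical helper: keeps only chars that are not part of a surrogate pair.
    static String dropSurrogates(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (!Character.isSurrogate(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is stored as a surrogate pair and is removed.
        System.out.println("[" + dropSurrogates("Hi \uD83D\uDE00!") + "]");
        // U+2600 (sun) is a single char in the Basic Multilingual Plane and is kept.
        System.out.println("[" + dropSurrogates("Sun \u2600") + "]");
    }
}
```

The second call shows why an approach like this misses some emoji: symbols such as U+2600 fit in a single char and never form a surrogate pair. It also removes non-emoji characters above U+FFFF.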

Options

Specify Column to Filter
The column to which the emoji filter is applied. A new String column with the specified name is created. The source column remains unaffected.
Name of output column to append
Give the name of the output column. The default name is "Emoji Free".
Filter Type (1,2,3)
Specify the filter type to use (see the component description for details). The default is 2, which currently appears to offer the best filtering but may be overzealous.

Input Ports

Data source containing the String column to be parsed

Output Ports

Pass-through of the input data with an additional column containing the values of the specified column with emoji characters removed.
