N-Gram Extractor

This node extracts n-grams from a given string. In contrast to the “NGram Creator” node available through the Text Processing plugin, this node works with simple strings and does not require a Document cell type, which makes it easier to use when the sophisticated Text Processing infrastructure is not needed.

This node uses exactly the same logic as the Palladian Text Classifier nodes.

Options

Text input
Input column which contains text for which to create n-grams.
Drop input column
Enable to exclude the input column in the output table.
Set output column name (*)
Override the default name for the appended output column. Leave empty to auto-generate the name based on the feature settings.
n-gram type
The type of n-grams to be used. n-grams can be created on character and word level.
Min. n-gram length
The minimum length of n-grams to create (i.e. the number of characters or words, depending on n-gram type).
Max. n-gram length
The maximum length of n-grams to create.
Min. term length
(Only effective for n-gram type “word”) The minimum length of a word n-gram in characters to be considered.
Max. term length
(Only effective for n-gram type “word”) The maximum length of a word n-gram in characters to be considered.
Max. term count
The maximum number of terms to extract from each document (useful to speed up processing of huge documents, or to reduce model size in general).
Case sensitive
Activate to treat text documents case sensitively (can improve accuracy in certain cases, but increases model size).
Border padding
Create padded character n-grams at the beginning and end of the text document. For example, for a document starting with “The” and an n-gram length of 3, the features “##T” and “#Th” are created additionally. This setting can improve accuracy when classifying very short phrases, but increases model size.
Create skip-grams
When in word mode and the n-gram length is >= 3, additionally create skip-grams. Skip-grams model gaps between word groups by leaving out words inside the n-gram. For example, for the consecutive 3-gram “the quick brown”, the skip-gram is “the brown”.
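
The options above can be illustrated with a minimal Python sketch. It is not the Palladian implementation; the function names `char_ngrams` and `word_ngrams` are hypothetical, and two behaviors are assumptions: padding adds n-1 “#” characters on each side, and a skip-gram keeps only the first and last word of the n-gram (which matches the 3-gram example above, but may differ from the node for longer n-grams).

```python
from typing import List


def char_ngrams(text: str, min_n: int, max_n: int, pad: bool = False) -> List[str]:
    """Return character n-grams of every length from min_n to max_n."""
    grams: List[str] = []
    for n in range(min_n, max_n + 1):
        # Assumption: border padding adds n-1 '#' characters on each side.
        padding = "#" * (n - 1)
        s = (padding + text + padding) if pad else text
        grams.extend(s[i:i + n] for i in range(len(s) - n + 1))
    return grams


def word_ngrams(text: str, min_n: int, max_n: int, skip_grams: bool = False) -> List[str]:
    """Return word n-grams, splitting the text on whitespace."""
    words = text.split()
    grams: List[str] = []
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            window = words[i:i + n]
            grams.append(" ".join(window))
            if skip_grams and n >= 3:
                # Assumption: the skip-gram drops all inner words of the window.
                grams.append(window[0] + " " + window[-1])
    return grams


print(char_ngrams("The", 3, 3, pad=True))
# ['##T', '#Th', 'The', 'he#', 'e##']
print(word_ngrams("the quick brown", 3, 3, skip_grams=True))
# ['the quick brown', 'the brown']
```

With min. and max. n-gram length set to different values, n-grams of every length in that range are produced, for example `char_ngrams("abc", 1, 2)` yields the three 1-grams followed by the two 2-grams.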

Language-specific settings

Language
The language to use for the language-specific processing (see below). Only applies when the n-gram type is “word”.
Remove stopwords
Removes stopwords based on a predefined stopword list for the given language.
Stem
Performs stemming using the Snowball stemmer.
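
As a rough illustration of stopword removal in word mode, the following Python sketch filters words before any n-grams would be built. The stopword set `STOPWORDS_EN` and the function `remove_stopwords` are hypothetical stand-ins for the node's predefined per-language lists, and applying the filter before n-gram creation is an assumption. Stemming is left out here because the Snowball stemmer is not part of the Python standard library.

```python
from typing import Iterable, List, Set

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS_EN: Set[str] = {"the", "a", "an", "of", "and", "over"}


def remove_stopwords(words: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Drop every word whose lowercase form is in the stopword set."""
    return [w for w in words if w.lower() not in stopwords]


words = remove_stopwords("The quick brown fox jumps over the lazy dog".split(), STOPWORDS_EN)
print(words)
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```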

Input Ports

Input table with a string column for which to extract the n-grams.

Output Ports

Table with appended list column which contains the n-grams.

Views

This node has no views.
