N-Gram Extractor

Streamable

Palladian Nodes for KNIME Workbench version 1.8.0.201907271536 by palladian.ws; Philipp Katz, Klemens Muthmann, David Urbansky.

This node extracts n-grams from a given string. In contrast to the “NGram Creator” node available through the Text Processing plugin, this node works with simple strings and does not require a Document cell type, which makes it easier to use where the full Text Processing infrastructure is not needed.

This node uses exactly the same logic as the Palladian Text Classifier nodes.

Options

Text input
Input column which contains text for which to create n-grams.
n-gram type
The type of n-grams to be used. n-grams can be created on character and word level.
Min. n-gram length
The minimum length of n-grams to create (i.e. the number of characters or words, depending on n-gram type).
Max. n-gram length
The maximum length of n-grams to create.
Min. term length
(Only effective for n-gram type “word”) The minimum length of a word n-gram in characters to be considered.
Max. term length
(Only effective for n-gram type “word”) The maximum length of a word n-gram in characters to be considered.
Max. term count
The maximum number of terms to extract from each document (useful to speed up processing of huge documents, or to reduce model size in general).
Case sensitive
Activate to treat text documents case sensitively (can improve accuracy in certain cases, but increases model size).
Border padding
Create padded character n-grams at the beginning and end of the text (e.g. for a document starting with “The” and n-gram length 3, the features “##T” and “#Th” are created additionally). This setting can improve accuracy when classifying very short phrases, but increases model size.
Create skip-grams
When in word mode and n-gram length >= 3, additionally create skip-grams. Skip-grams model gaps between word groups by leaving out words inside the n-gram. E.g. for the consecutive 3-gram “the quick brown”, the skip-gram is “the brown”.
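
The n-gram options above can be illustrated with a minimal Python sketch. This is not the node's implementation: the function names, the whitespace tokenization, and the choice to leave out one interior word at a time for skip-grams longer than 3 are assumptions made for illustration. Only the “##T”/“#Th” padding example and the “the brown” skip-gram example come from the description above.

```python
def char_ngrams(text, min_n, max_n, padding=False, pad_char="#"):
    """Return character n-grams of length min_n..max_n in order of appearance."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Assumption: border padding adds n-1 pad characters on each side.
        padded = pad_char * (n - 1) + text + pad_char * (n - 1) if padding else text
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


def word_ngrams(text, min_n, max_n, skip_grams=False):
    """Return word n-grams of length min_n..max_n, optionally with skip-grams."""
    words = text.split()  # assumption: simple whitespace tokenization
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            window = words[i:i + n]
            grams.append(" ".join(window))
            if skip_grams and n >= 3:
                # Assumption: leave out one interior word at a time.
                for j in range(1, n - 1):
                    grams.append(" ".join(window[:j] + window[j + 1:]))
    return grams


print(char_ngrams("The", 3, 3, padding=True))
# ['##T', '#Th', 'The', 'he#', 'e##']
print(word_ngrams("the quick brown", 3, 3, skip_grams=True))
# ['the quick brown', 'the brown']
```

With padding disabled, only “The” would be produced for the first call, which shows why padding helps for very short phrases at the cost of more features.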

Language-specific settings

Language
The language to use for language-specific processing (see below; only applies when the n-gram type is “word”).
Remove stopwords
Removes stopwords based on a predefined stopword list for the given language.
Stem
Performs stemming using the Snowball stemmer.
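
As a rough illustration of stopword removal, the following Python sketch filters words against a list before any word n-grams are built. The tiny stopword set, the function name, and the assumption that filtering happens before n-gram creation are all hypothetical; the node uses its own predefined per-language lists, and stemming is not shown here.

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS_EN = {"the", "a", "an", "of", "and", "over"}


def remove_stopwords(text, stopwords=STOPWORDS_EN):
    """Drop stopwords (compared case-insensitively) from whitespace-separated text."""
    return " ".join(w for w in text.split() if w.lower() not in stopwords)


print(remove_stopwords("The quick brown fox jumps over the lazy dog"))
# quick brown fox jumps lazy dog
```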

Input Ports

Input table with a string column for which to extract the n-grams.

Output Ports

Table with an appended list column which contains the n-grams.

Installation

To use this node in KNIME, install Palladian for KNIME from the following update site:

KNIME 4.1