
TextClassifierLearner

Streamable

Palladian Nodes for KNIME Workbench version 1.7.0.v201807041014 by palladian.ws; Philipp Katz, Klemens Muthmann, David Urbansky.

This node builds a dictionary from a list of pre-categorized text documents. The dictionary can then be used to categorize new, uncategorized text documents. The learner builds a weighted term lookup table that records how probable each n-gram is for a given category. This lookup table is used by the corresponding predictor node.
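The idea of a weighted term lookup table can be sketched as follows. This is a minimal illustration and not Palladian's actual implementation: the helper names `char_ngrams`, `train` and `category_probabilities`, the fixed character 3-grams, and the relative-frequency scoring are all assumptions made for this example. The per-document weight mirrors the optional "Weight input" column described under Options.

```python
from collections import defaultdict


def char_ngrams(text, n=3):
    """Return all character n-grams of length n (assumed feature setting)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(documents):
    """Build a term -> category -> weighted count table.

    `documents` is a list of (text, category, weight) tuples. Hypothetical
    helper; the real node may weight and store terms differently.
    """
    table = defaultdict(lambda: defaultdict(float))
    for text, category, weight in documents:
        for term in char_ngrams(text.lower()):
            # A weight of 2 has the same effect as adding the document twice.
            table[term][category] += weight
    return table


def category_probabilities(table, term):
    """Relative frequency of each category among the occurrences of a term."""
    counts = table.get(term, {})
    total = sum(counts.values())
    if not total:
        return {}
    return {category: count / total for category, count in counts.items()}


training = [
    ("red running shoes", "fashion", 1),
    ("blue running jacket", "fashion", 1),
    ("running linux server", "computers", 2),
]
table = train(training)
probabilities = category_probabilities(table, "run")
```

In this toy data set the 3-gram "run" occurs in two "fashion" documents of weight 1 and in one "computers" document of weight 2, so both categories end up equally probable for that term.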

This classifier won the first Research Garden competition, in which the goal was to classify product descriptions into eight different categories. See press release.

Options

Text input
Column in the input table with the text documents.
Category input
Column in the input table with the pre-assigned categories.
Weight input
(Optional) Column in the input table that assigns a weight to each training document. A value of 1 means normal weight; for a higher value n, the learner behaves as if the document had been added n times.
n-gram type
The type of n-grams to be used. n-grams can be created on character and word level.
Min. n-gram length
The minimum length of n-grams to create (i.e. the number of characters or words, depending on n-gram type).
Max. n-gram length
The maximum length of n-grams to create.
Min. term length
(Only effective for n-gram type “word”) The minimum length of a word n-gram in characters to be considered.
Max. term length
(Only effective for n-gram type “word”) The maximum length of a word n-gram in characters to be considered.
Max. term count
The maximum number of terms to extract from each document (useful to speed up processing of huge documents, or to reduce model size in general).
Case sensitive
Activate to treat text documents case sensitively (can improve accuracy in certain cases, but increases model size).
Border padding
Create padded character n-grams at the beginning and end of each text document (e.g. for a document starting with “The” and an n-gram length of 3, the features “##T” and “#Th” are created additionally). This setting can improve accuracy when classifying very short phrases, but increases model size.
Create skip-grams
When the n-gram type is “word” and the n-gram length is >= 3, additionally create skip-grams. Skip-grams model gaps between word groups by leaving out words inside the n-gram. E.g. for the consecutive 3-gram “the quick brown”, the skip-gram is “the brown”.
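The n-gram options above can be illustrated with a short sketch. It is not taken from Palladian's code: the function names `char_ngrams` and `word_ngrams` and their parameters are hypothetical. Two further assumptions are made: border padding adds n - 1 "#" characters on each side, and a skip-gram drops exactly one interior word at a time. Both assumptions reproduce the examples given in the option descriptions, but the node may behave differently for longer n-grams.

```python
def char_ngrams(text, min_len, max_len, pad=False):
    """Character n-grams for every length between min_len and max_len."""
    grams = []
    for n in range(min_len, max_len + 1):
        # Assumption: padding adds n - 1 '#' characters on each side.
        source = "#" * (n - 1) + text + "#" * (n - 1) if pad else text
        grams.extend(source[i:i + n] for i in range(len(source) - n + 1))
    return grams


def word_ngrams(text, min_len, max_len, skip_grams=False):
    """Word n-grams, optionally with skip-grams for lengths >= 3."""
    words = text.split()
    grams = []
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            window = words[i:i + n]
            grams.append(" ".join(window))
            if skip_grams and n >= 3:
                # Assumption: leave out one interior word at a time.
                for j in range(1, n - 1):
                    grams.append(" ".join(window[:j] + window[j + 1:]))
    return grams


padded = char_ngrams("The", 3, 3, pad=True)
skipped = word_ngrams("the quick brown", 3, 3, skip_grams=True)
```

With padding enabled, the three-character document "The" yields five 3-grams instead of one, which shows why the setting helps with very short phrases and why it increases model size.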

Language-specific settings

Language
The language to use for the language-specific processing (see below; only effective if the n-gram type is “word”).
Remove stopwords
Removes stopwords based on a predefined stopword list for the given language.
Stem
Performs stemming using the Snowball stemmer.
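The effect of stopword removal on the extracted words can be sketched as below. The stopword list, the language key "en", and the function name `remove_stopwords` are hypothetical and chosen only for illustration; the node uses its own predefined lists. Stemming is left out of the sketch because it would require a Snowball stemmer implementation.

```python
# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"en": {"the", "a", "an", "of", "and", "is"}}


def remove_stopwords(text, language="en"):
    """Drop words found in the stopword list for the given language."""
    stopwords = STOPWORDS.get(language, set())
    return [word for word in text.lower().split() if word not in stopwords]


tokens = remove_stopwords("The price of the red shoes")
```

Only the remaining words would then be combined into word n-grams, which keeps very frequent but uninformative terms out of the model.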

Expert settings

Do not stop on memory warnings
Do not listen to KNIME’s memory warnings. Attention: In case this option is enabled, KNIME will become unresponsive when the model size exceeds the memory limit.

Input Ports

Input with pre-categorized text documents. The category has to be given by a separate String column.

Output Ports

The model data of the trained classifier.

Update Site

To use this node in KNIME, install Palladian Nodes for KNIME Workbench from the following update site:
