Text Classifier Learner

This node builds a dictionary from a pre-categorized list of text documents, which can then be used to categorize new, uncategorized text documents. The learner builds a weighted term lookup table that records how probable each n-gram is for a given category. This lookup table is used by the corresponding predictor node.
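
To picture the idea of such a weighted term lookup table, the following minimal sketch accumulates per-category weights for each n-gram while scanning the training documents. It is an illustration only; the class and method names are hypothetical and this is not Palladian's actual model format.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal sketch of a weighted term lookup table; not Palladian's actual model format. */
public class DictionarySketch {

    /** n-gram -> (category -> accumulated weight). */
    private final Map<String, Map<String, Double>> termWeights = new HashMap<>();

    /** Count an n-gram for a category, optionally scaled by a document weight. */
    public void add(String nGram, String category, double weight) {
        termWeights.computeIfAbsent(nGram, k -> new HashMap<>())
                   .merge(category, weight, Double::sum);
    }

    /** Relative weight of a category for the given n-gram (used later by the predictor). */
    public double probability(String nGram, String category) {
        Map<String, Double> weights = termWeights.getOrDefault(nGram, Map.of());
        double total = weights.values().stream().mapToDouble(Double::doubleValue).sum();
        return total == 0 ? 0 : weights.getOrDefault(category, 0.0) / total;
    }
}
```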

This classifier won the first Research Garden competition, in which the goal was to classify product descriptions into eight different categories. See the press release (on archive.org).

Feature Settings

Features are the input for a classifier. In text classification, the input is a long string from which features must be derived during preprocessing. Palladian’s text classifier works with n-grams. An n-gram is a sequence of n tokens, created by sliding a “window” of length n over the given text. The PalladianTextClassifierLearner node can create features using character- or word-based n-grams (a short code sketch follows the examples below). As an example, consider the text “the quick brown fox”:
  • The set of word-based 2-grams would contain the following entries: {“the quick”, “quick brown”, “brown fox”}.
  • The set of character-5-grams consists of the following entries: {“the q”, “he qu”, “e qui”, “ quic”, “quick”, …}.
  • It is possible to combine n-grams of different lengths. For example, the set of character-4-6-grams contains the union of the sets of 4-, 5-, and 6-grams.
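
The sketch below shows how such character- and word-based n-grams can be derived. It is an illustration only, assuming a simple sliding window over characters or whitespace-separated tokens, and is not Palladian’s actual implementation.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Illustrative n-gram extraction; not Palladian's actual implementation. */
public class NGramSketch {

    /** Character n-grams of length n, created by sliding a window over the text. */
    static Set<String> characterNGrams(String text, int n) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    /** Word n-grams of length n, created by sliding a window over the token sequence. */
    static Set<String> wordNGrams(String text, int n) {
        String[] tokens = text.split("\\s+");
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return grams;
    }

    public static void main(String[] args) {
        String text = "the quick brown fox";
        System.out.println(wordNGrams(text, 2));      // {"the quick", "quick brown", "brown fox"}
        System.out.println(characterNGrams(text, 5)); // {"the q", "he qu", "e qui", " quic", "quick", ...}
    }
}
```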

Options

Text input
Column in the input table with the text documents.
Category input
Column in the input table with the pre-assigned categories.
Weight input
(Optional) Column in the input table which allows weighting individual training documents (a value of 1 means normal weight; a value of n makes the learner behave as if the document had been added n times).
n-gram type
The type of n-grams to be used. n-grams can be created on character and word level.
Min. n-gram length
The minimum length of n-grams to create (i.e. the number of characters or words, depending on n-gram type).
Max. n-gram length
The maximum length of n-grams to create.
Min. term length
(Only effective for n-gram type “word”) The minimum length of a word n-gram in characters to be considered.
Max. term length
(Only effective for n-gram type “word”) The maximum length of a word n-gram in characters to be considered.
Max. term count
The maximum number of terms to extract from each document (useful to speed up processing of huge documents, or to reduce model size in general).
Case sensitive
Activate to treat text documents case sensitively (can improve accuracy in certain cases, but increases model size).
Border padding
Create padded character n-grams at the text document’s beginning and end (e.g. for a document starting with “The” and an n-gram length of 3, the features “##T” and “#Th” are additionally created; this setting can improve accuracy when classifying very short phrases, but increases model size).
Create skip-grams
When in word mode and the n-gram length is >= 3, additionally create skip-grams. Skip-grams model gaps between word groups by leaving out words inside the n-gram; e.g. for the consecutive 3-gram “the quick brown”, the skip-gram is “the brown” (see the sketch after this list).
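
As a rough illustration of the skip-gram example above (the exact generation in Palladian may differ), the following sketch keeps the first and last word of a word n-gram and leaves out the words in between.

```java
/** Illustrative skip-gram construction; the exact behavior in Palladian may differ. */
public class SkipGramSketch {

    /** For a word n-gram with n >= 3, keep the first and last word, leaving out the words in between. */
    static String skipGram(String[] nGramTokens) {
        if (nGramTokens.length < 3) {
            throw new IllegalArgumentException("skip-grams require n >= 3");
        }
        return nGramTokens[0] + " " + nGramTokens[nGramTokens.length - 1];
    }

    public static void main(String[] args) {
        System.out.println(skipGram(new String[] {"the", "quick", "brown"})); // "the brown"
    }
}
```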

Language-specific settings

Language
The language to use for the language-specific processing (see below; only applicable if the n-gram type is “word”).
Remove stopwords
Removes stopwords based on a predefined stopword list for the given language.
Stem
Performs stemming using the Snowball stemmer (a sketch of both language-specific steps follows below).
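
As a rough illustration of these two steps (not the node’s actual code), the sketch below filters tokens against a small, hypothetical stopword list and stems the remaining tokens with the Snowball Java bindings; the englishStemmer class from the org.tartarus.snowball package is an assumed dependency.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

import org.tartarus.snowball.ext.englishStemmer;

/** Illustration of stopword removal and Snowball stemming; not the node's actual code. */
public class PreprocessingSketch {

    // Tiny, hypothetical stopword list; the node ships predefined lists per language.
    private static final Set<String> STOPWORDS = Set.of("the", "a", "an", "of", "and");

    public static List<String> preprocess(List<String> tokens) {
        englishStemmer stemmer = new englishStemmer();
        return tokens.stream()
                .map(String::toLowerCase)
                .filter(token -> !STOPWORDS.contains(token))   // remove stopwords
                .map(token -> {                                 // stem remaining tokens
                    stemmer.setCurrent(token);
                    stemmer.stem();
                    return stemmer.getCurrent();
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(preprocess(List.of("the", "quick", "brown", "foxes")));
        // e.g. [quick, brown, fox]
    }
}
```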

Expert settings

Do not stop on memory warnings
Ignore KNIME’s memory warnings. Attention: if this option is enabled, KNIME will become unresponsive when the model size exceeds the memory limit.

Input Ports

Input with pre-categorized text documents. The category has to be given by a separate String column.

Output Ports

The model data of the trained classifier.

Views

This node has no views
