Document Vector Hashing

Streamable
KNIME Textprocessing Plug-in version 4.2.1.v202008251908 by KNIME AG, Zurich, Switzerland

This node creates a document vector for each document, representing it in the term space. The values of the feature vectors can be specified as boolean values, or as either the relative or the absolute frequency of the terms. The advantage of using this node instead of the normal document vector node is that the dimension of the vectors is always fixed, which makes this node streamable.
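The fixed dimension follows from hashing each term to an index instead of building a vocabulary first. The following is a minimal Python sketch of that idea, not the node's actual implementation: the helper names `term_index` and `hash_vector` are hypothetical, and MD5 merely stands in for whichever hashing function the node offers.

```python
import hashlib


def term_index(term: str, dimension: int, seed: int = 0) -> int:
    """Map a term to an index in [0, dimension).

    Hypothetical helper: MD5 is an assumption used for illustration only.
    """
    digest = hashlib.md5(f"{seed}:{term}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % dimension


def hash_vector(terms, dimension: int, seed: int = 0):
    """Build a fixed-length count vector without knowing the vocabulary."""
    vector = [0.0] * dimension
    for term in terms:
        vector[term_index(term, dimension, seed)] += 1.0
    return vector


# Two documents with unrelated vocabularies still yield vectors of equal length,
# so each document can be processed on its own as it arrives.
docs = [
    ["streaming", "text", "mining"],
    ["completely", "different", "vocabulary", "here"],
]
vectors = [hash_vector(doc, dimension=16) for doc in docs]
```

Because no global term list is needed, a document's vector depends only on that document, the dimension, and the seed, which is what allows row-by-row processing.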

Options

Document column
The column containing the documents to use.
Dimension
The dimension of the output document vector. The larger the dimension, the less likely it is that different terms collide, that is, hash to the same index. However, be aware of the curse of dimensionality.
Seed
Seed for the hashing function.
Hashing function
Choose which hashing function should be used to hash the document terms.
Vector type
There are three ways to fill the values in the document vector.
Binary : The vector will be a bit vector.
TF-Absolute : At each index that a term is hashed to, the absolute term frequency of that term will be calculated and stored.
TF-Relative : At each index that a term is hashed to, the relative term frequency of that term will be calculated and stored.
As collection cell
If checked, all vector entries will be stored in a single collection cell consisting of double cells. If not checked, each entry will be stored as a double cell in its own column. The advantage of the column representation is that most of the regular algorithms in KNIME can be applied. Its disadvantage, which the collection representation avoids, is that subsequent nodes will be slowed down by the many columns that are created (depending on the input data, of course).
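The three vector types can be sketched in Python as follows. This is an illustration under stated assumptions rather than the node's code: `bucket` and `hashed_document_vector` are hypothetical names, CRC32 stands in for the selectable hashing function, relative frequency is assumed to mean the term count divided by the number of terms in the document, and colliding terms are assumed to have their frequencies summed.

```python
import zlib
from collections import Counter


def bucket(term: str, dimension: int, seed: int = 0) -> int:
    """Hash a term to an index in [0, dimension). CRC32 is an assumption."""
    return zlib.crc32(f"{seed}:{term}".encode("utf-8")) % dimension


def hashed_document_vector(terms, dimension, vector_type="binary", seed=0):
    """Fill a fixed-length vector according to the chosen vector type."""
    counts = Counter(terms)
    total = len(terms)
    vector = [0.0] * dimension
    for term, count in counts.items():
        index = bucket(term, dimension, seed)
        if vector_type == "binary":
            vector[index] = 1.0  # the term is present
        elif vector_type == "tf-absolute":
            vector[index] += count  # assumed: colliding terms are summed
        elif vector_type == "tf-relative":
            vector[index] += count / total  # assumed: count / document length
        else:
            raise ValueError(f"unknown vector type: {vector_type}")
    return vector


terms = ["knime", "text", "knime", "hashing"]
binary = hashed_document_vector(terms, 32, "binary")
absolute = hashed_document_vector(terms, 32, "tf-absolute")
relative = hashed_document_vector(terms, 32, "tf-relative")
```

Under these assumptions the absolute vector sums to the number of terms in the document and the relative vector sums to 1, regardless of how many terms collide. A smaller dimension makes collisions more frequent, while a larger one produces sparser, wider vectors.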

Input Ports

The input table containing the documents.

Output Ports

An output table containing the input documents with the corresponding document vectors.
The model output containing the specifications that have been used for document vector creation.

Installation

To use this node in KNIME, install KNIME Textprocessing from the following update site:

KNIME 4.2

A zipped version of the software site can be downloaded here. Read our FAQs to get instructions about how to install nodes from a zipped update site.
