Document vector hashing

This node is deprecated. It is kept for backwards compatibility, but its use in new workflows is no longer recommended. The documentation below might contain more information.

This node creates a document vector for each document, representing it in the term space. The values of the feature vectors can be specified as boolean values or as either the relative or the absolute frequency of the terms. The advantage of using this node instead of the normal document vector node is that the dimension of the vectors is always fixed, so this node is streamable.

Options

Document column
The column containing the documents to use.
Dimension
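The dimension of the output document vector. The larger the dimension, the less likely it is that two different terms are hashed to the same index (a collision). However, be aware of the curse of dimensionality: very large vectors are sparse and can make subsequent processing harder.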
Seed
Seed for the hashing function.
Hashing function
Choose which hashing function should be used to hash the document terms.
Vector type
There are three ways to fill the values in the document vector.
Binary : The vector will be a bit vector, with a 1 at each index that a term is hashed to and a 0 everywhere else.
TF-Absolute : At each index that a term is hashed to, the absolute term frequency of that term will be calculated and stored.
TF-Relative : At each index that a term is hashed to, the relative term frequency of that term will be calculated and stored.
As collection cell
If checked, all vector entries will be stored in a single collection cell consisting of double cells. If not checked, each entry will be stored as a double cell in its own column. The advantage of the column representation is that most of the regular algorithms in KNIME can be applied. Its disadvantage, which is conversely the advantage of the collection representation, is that subsequent nodes will be slowed down by the many columns that are created (one per vector dimension).
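
The options above can be illustrated with a minimal Python sketch of the hashing idea. This is not the node's implementation. The function names, the use of MD5 as the hashing function, the way the seed is mixed in, and the choice to add up the frequencies of terms that collide are all assumptions made for illustration only.

```python
import hashlib
from collections import Counter


def term_index(term, dimension, seed=0):
    """Map a term to an index in [0, dimension) with a seeded hash.

    MD5 is only a stand-in here; the node lets you choose the hashing function.
    """
    digest = hashlib.md5(f"{seed}:{term}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % dimension


def hash_vector(terms, dimension, seed=0, vector_type="binary"):
    """Build a fixed-length document vector from a list of terms.

    vector_type is one of "binary", "tf_absolute", "tf_relative"
    (hypothetical names mirroring the three options described above).
    """
    counts = Counter(terms)
    total = sum(counts.values())
    vector = [0.0] * dimension  # length never depends on the vocabulary
    for term, count in counts.items():
        idx = term_index(term, dimension, seed)
        if vector_type == "binary":
            vector[idx] = 1.0
        elif vector_type == "tf_absolute":
            # Assumption: colliding terms have their counts added together.
            vector[idx] += count
        elif vector_type == "tf_relative":
            vector[idx] += count / total
        else:
            raise ValueError(f"unknown vector type: {vector_type}")
    return vector


if __name__ == "__main__":
    doc = ["knime", "hashing", "vector", "hashing"]
    for kind in ("binary", "tf_absolute", "tf_relative"):
        print(kind, hash_vector(doc, dimension=8, seed=42, vector_type=kind))
```

Because each document is converted using only its own terms and the fixed dimension, no vocabulary has to be collected over the whole table first, which is what makes a streaming execution possible. A smaller dimension makes it more likely that two terms share an index, in which case their contributions can no longer be told apart.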

Input Ports

The input table containing the documents.

Output Ports

An output table containing the input documents with the corresponding document vectors.

Views

This node has no views
