Create Bit Vector

Generates a bit vector for each row of the input table. The bit vectors are created from multiple numerical or string columns, from a string column containing the bit positions to set, from hexadecimal or binary strings, or from a collection column. To adjust the node settings, first select the source column type, i.e. whether the bit vector should be created from multiple string/numerical columns or from a single string/collection column. Depending on the selected option, the corresponding dialog elements are enabled.

Bit vectors from multiple columns

In the case of multiple columns, the bit positions in the resulting bit vector correspond to the positions of the selected columns in the input table. For example, if the second and third columns of an input table are selected and the first column is omitted, the bit vector of each row will have length 2. The first bit is set if the value of the second column matches the selected criterion; likewise, the second bit is set if the value of the third column matches the selected criterion. The columns to consider when creating the bit vector can be specified in the multiple column selection section. Using the enforce exclusion/inclusion option, the node can be configured to handle previously unknown columns. If the enforce exclusion option is selected, all unknown columns are automatically added to the include list, whereas if the enforce inclusion option is selected, all unknown columns are added to the exclude list. The columns to include can also be defined by a pattern if the Wildcard/Regex Selection option is selected.

Multiple string columns

A bit of the vector is set if the corresponding column value matches or does not match the specified pattern, depending on the "Set bit if pattern does match/does not match" option. The pattern may contain the wildcards '?' (matching any one character) and '*' (matching any sequence of characters, including none). It can also be a regular expression.
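The matching rule described above can be sketched in a few lines of Python. This is an illustration only, not the node's implementation: the helper names `wildcard_to_regex` and `row_to_bits` are hypothetical, and the sketch assumes that the pattern must match the whole cell value and that a missing value (`None`) always yields a 0.

```python
import re

def wildcard_to_regex(pattern: str) -> str:
    """Translate '*' and '?' wildcards into an equivalent regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any sequence of characters, including none
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

def row_to_bits(row, columns, pattern, set_if_match=True, case_sensitive=True):
    """Return one bit per selected column, in the order the columns are given."""
    flags = 0 if case_sensitive else re.IGNORECASE
    regex = re.compile(wildcard_to_regex(pattern), flags)
    bits = []
    for col in columns:
        value = row.get(col)
        if value is None:
            bits.append(0)  # assumption: a missing value never sets the bit
        else:
            matched = regex.fullmatch(value) is not None
            bits.append(1 if matched == set_if_match else 0)
    return bits

# The first column is omitted, so the resulting vector has length 2.
row = {"c1": "ignored", "c2": "foobar", "c3": "bar"}
print(row_to_bits(row, ["c2", "c3"], "foo*"))
```

With the "does not match" variant (`set_if_match=False`) the same row yields the inverted bits for all non-missing cells.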

Multiple numeric columns

There are two options to determine if the bit is set for the value in the corresponding column or not:
  • either a global threshold is defined, in which case all values greater than or equal to the threshold are converted into set bits and all other bit positions remain 0, or
  • a certain percentage of the mean of each column is used as the threshold, in which case all values greater than or equal to that percentage of the column mean are converted into set bits. For example, suppose the mean percentage is set to 50%, the mean of col1 is 2 and the mean of col2 is 8. Then the bit for col1 is set if the value is greater than or equal to 1, and the bit for col2 is set if the value is greater than or equal to 4.
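The two threshold modes can be sketched as follows. The function names are hypothetical, and the sketch assumes that missing values (`None`) are skipped when the column mean is computed; the node's actual handling of missing values in the mean is not stated here.

```python
def global_threshold_bits(row, threshold):
    """Set a bit for every value that is >= the global threshold."""
    return [1 if v is not None and v >= threshold else 0 for v in row]

def mean_percentage_bits(rows, percentage):
    """Set a bit for every value that is >= percentage% of its column mean."""
    n_cols = len(rows[0])
    thresholds = []
    for c in range(n_cols):
        values = [r[c] for r in rows if r[c] is not None]
        if values:
            mean = sum(values) / len(values)
            thresholds.append(mean * percentage / 100.0)
        else:
            thresholds.append(float("inf"))  # assumption: no data, no set bits
    return [
        [1 if r[c] is not None and r[c] >= thresholds[c] else 0
         for c in range(n_cols)]
        for r in rows
    ]

# col1 has mean 2 and col2 has mean 8, so with 50% the thresholds are 1 and 4.
rows = [[0.5, 4], [3.5, 3], [2, 17]]
print(mean_percentage_bits(rows, 50))   # [[0, 1], [1, 0], [1, 1]]
print(global_threshold_bits([0.5, 4], 3))  # [0, 1]
```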

Bit vectors from a single column

In the case of a single input column only the selected single column to be parsed is considered for the generation of the bit vectors.

Single string column

In the case of a string input only the column containing the string is considered for the generation of the bit vectors. The string is parsed and converted into a bit vector. There are three valid input formats which can be parsed and converted:
  • Hexadecimal strings: strings consisting only of the characters 0-9 and A-F (case is not important). The represented hexadecimal number is converted into a binary number, which is represented by the resulting bit vector.
  • Binary strings: strings consisting only of 0s and 1s are parsed and converted into the corresponding bit vectors.
  • ID strings: strings consisting of numbers (separated by spaces) where the numbers refer to those positions in the bit vector which should be set. (Typical input format for association rule mining).
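A minimal sketch of the three input formats is shown below. The function names are hypothetical, and two points are assumptions rather than documented behavior: the rightmost character of a hexadecimal or binary string is treated as bit position 0, and the length of a vector built from an ID string is taken as the highest position plus one. Each function returns the vector length and the set of bit positions that are set.

```python
import re

def parse_hex(text):
    """Parse a hexadecimal string; each hex digit contributes 4 bits."""
    if not re.fullmatch(r"[0-9A-Fa-f]+", text):
        raise ValueError(f"not a hexadecimal string: {text!r}")
    value = int(text, 16)
    length = 4 * len(text)
    return length, {i for i in range(length) if (value >> i) & 1}

def parse_binary(text):
    """Parse a string of 0s and 1s; each character is one bit."""
    if not re.fullmatch(r"[01]+", text):
        raise ValueError(f"not a binary string: {text!r}")
    value = int(text, 2)
    length = len(text)
    return length, {i for i in range(length) if (value >> i) & 1}

def parse_ids(text):
    """Parse space-separated bit positions that should be set."""
    positions = {int(token) for token in text.split()}
    if any(p < 0 for p in positions):
        raise ValueError("bit positions must be non-negative")
    length = max(positions) + 1 if positions else 0  # assumed length rule
    return length, positions

print(parse_hex("1f"))       # (8, {0, 1, 2, 3, 4})
print(parse_binary("0110"))  # (4, {1, 2})
print(parse_ids("0 3 7"))    # (8, {0, 3, 7})
```

Raising `ValueError` stands in for an unparsable string; in the node such a cell results in a missing value or a failure, depending on the "Fail on invalid input" option.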

Single collection column

In the case of a single collection column, each unique collection element is assigned a bit position. The length of the bit vectors corresponds to the number of unique elements across all collection cells. For example, if the input table contains two rows with the collections {a,b} and {b,c}, there are three unique elements and the corresponding bit vectors will be [110] and [011].
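The {a,b} / {b,c} example can be reproduced with the short sketch below. The function name is hypothetical, and assigning positions in order of first appearance is an assumption that happens to agree with the example; collections are modeled as lists, and missing elements (`None`) are ignored.

```python
def collection_bits(cells):
    """Map each unique element to a bit position and encode every cell."""
    index = {}
    for cell in cells:
        for element in cell:
            if element is not None and element not in index:
                index[element] = len(index)  # next free bit position
    vectors = []
    for cell in cells:
        bits = [0] * len(index)
        for element in cell:
            if element is not None:
                bits[index[element]] = 1
        vectors.append(bits)
    return vectors

print(collection_bits([["a", "b"], ["b", "c"]]))  # [[1, 1, 0], [0, 1, 1]]
```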

Missing values

For numeric data, incoming missing values result in 0s. For multiple string columns, a missing value also results in a 0. For a single string column, a missing value results in a missing value in the output table. If a string cannot be parsed, it also results in a missing cell in the output table, and an error message with detailed information is printed to the console. For a collection column, all missing collection elements are ignored.

Options

Create bit vectors from multiple string columns

Pattern
The pattern to search for in the data value
Contains wildcards
Select this option to use wildcards in the pattern. Wildcard patterns contain '*' (matching any sequence of characters) and '?' (matching any one character).
Regular expression
Select this option to specify a regular expression. Examples of regular expressions are given below. "^foo.*" matches anything that starts with "foo". The '^' character stands for the beginning of the string, the dot matches any (one) character, and the asterisk allows any number (including zero) of repetitions of the preceding character.
"[0-9]*" matches any string of digits (including the empty string). The [ ] define a set of characters (they could be added individually like [0123456789], or by range). This set matches any (one) character included in the set.
For a complete explanation of regular expressions see e.g. the JavaDoc of the java.util.regex.Pattern class.
Case sensitive match
Case-sensitive matching is performed if this option is selected.
Set bit if pattern
Depending on the selected option, the corresponding bit in the bit vector is set if the pattern either does match or does not match.
Multiple column selection panel
Select the string columns to convert to a bit vector
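The two regular expression examples given for the "Regular expression" option can be tried out with the sketch below. It uses Python's `re` module as a stand-in for `java.util.regex.Pattern`; these simple patterns behave the same way in both. The helper `matches` is hypothetical and assumes that the pattern must match the whole value.

```python
import re

def matches(pattern, value, case_sensitive=True):
    """Return True if the whole value matches the regular expression."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.fullmatch(pattern, value, flags) is not None

print(matches("^foo.*", "foobar"))   # True: starts with "foo"
print(matches("^foo.*", "a foo"))    # False: does not start with "foo"
print(matches("[0-9]*", "2024"))     # True: only digits
print(matches("[0-9]*", ""))         # True: the empty string is allowed
print(matches("[0-9]*", "12a"))      # False: contains a non-digit
print(matches("^foo.*", "FOObar", case_sensitive=False))  # True
```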

Create bit vectors from multiple numeric columns

Threshold
If the "numeric input" is checked, specify the global threshold. All values which are above or equal to this threshold will result in a 1 in the bit vector.
Use percentage of the mean
Select this option if a percentage of the mean of each column should serve as the threshold at or above which the bits are set.
Percentage
Specify the percentage of the column mean that a value must reach for its bit to be set.
Multiple column selection panel
Select the numeric columns to convert to a bit vector

Create bit vectors from a single string column

Kind of string representation
Select one of the three valid input formats: HEX (hexadecimal), ID (bit positions) or BIT (binary strings). See description above.
Single column to be parsed
The string column to parse

Create bit vectors from a single collection column

Single column to be parsed
The collection column to parse

General options

Remove column(s) used for bit vector creation
If checked, the column(s) used to generate the bit vectors (the included columns if numeric input was used, or the selected string column) are removed from the output. If unchecked, the generated bit vectors are appended to the input table.
Output column
Name of the output column.
Fail on invalid input
If selected, the node will fail during execution if a data cell could not be converted to a bit set. If unselected, the node will skip these invalid entries and insert a missing value instead.
Bit vector type
The dense vector type is the default. It stores each vector position with a single bit and thus requires the same number of bits for all vectors, i.e. the vector length in bits. The sparse vector type stores only the indices of the set bits, so the space required to store a bit vector depends on the number of set bits. Each set bit requires between 32 and 64 bits depending on the operating system. Therefore the sparse option should only be selected if the majority of the bit vectors contain only a few set bits, e.g. less than 10%.
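The storage trade-off can be estimated with simple arithmetic. The sketch below is a rough model built only on the figures stated above (one bit per position for dense vectors, 32 to 64 bits per set bit for sparse vectors); it ignores any per-object overhead, and the function names are hypothetical.

```python
def dense_bits(length: int) -> int:
    """Dense storage: one bit per vector position."""
    return length

def sparse_bits(set_bits: int, bits_per_index: int = 64) -> int:
    """Sparse storage: one stored index per set bit."""
    return set_bits * bits_per_index

length = 1024
for set_bits in (4, 8, 16):
    low = sparse_bits(set_bits, 32)
    high = sparse_bits(set_bits, 64)
    print(f"{set_bits} set bits: dense={dense_bits(length)} bits, "
          f"sparse={low} to {high} bits")
```

Under this model, the fewer bits are set relative to the vector length, the larger the saving from the sparse type.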

Input Ports

Data table with numerical data or a string column to be parsed.

Output Ports

Data table with the generated bit vectors.

Views

Statistics View
Provides information about the generation of the bit vectors from the data. In particular this is the number of processed rows, the total number of generated zeros and ones and the resulting ratio of 1s to 0s.
