Row Sampling

Sampling creates a subset of an original table. Sometimes this is a temporary measure to shorten workflow execution times during development. In other cases the analysis itself calls for a sample of the data.

The Row Sampling node's configuration settings are identical to those of the Partitioning node. The only difference is that the Partitioning node provides two outputs, whereas Row Sampling provides only one.

There are two ways to specify how many records flow to the output:

- Absolute: You choose a specific number of records
- Relative: You choose a specific percentage of records
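The difference between the two settings can be sketched in a few lines of Python. This is only an illustration, not the node's actual implementation: the function name `sample_size`, its parameters, and the rounding and clamping behavior are all assumptions made for the example.

```python
def sample_size(total_rows, absolute=None, relative=None):
    """Return how many rows to keep (hypothetical helper, not a KNIME API).

    absolute: a fixed number of rows.
    relative: a percentage of the table, from 0 to 100.
    """
    if (absolute is None) == (relative is None):
        raise ValueError("specify exactly one of absolute or relative")
    if absolute is not None:
        # Assumption: a request larger than the table is clamped to the table size.
        return max(0, min(absolute, total_rows))
    if not 0 <= relative <= 100:
        raise ValueError("relative must be between 0 and 100")
    # Assumption: fractional row counts are rounded to the nearest integer.
    return round(total_rows * relative / 100)
```

For a 200-row table, `sample_size(200, absolute=50)` and `sample_size(200, relative=25)` ask for the same number of rows in two different ways.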

Once you determine how many records to pass to the output port, there are four methods by which records can be chosen:

- Take from top: The specified number or percentage of records is taken from the top of the table, starting with the first record.

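- Linear sampling: Always includes the first and last rows and selects the remaining rows at regular intervals (every Nth record), so the sample is spread evenly across the table. The number of rows comes from the absolute/relative selection above.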

- Draw randomly: Records are chosen at random using a random number generator. Set a specific random seed to make the selection reproducible.

- Stratified: Select a column, and the sampled output will approximately preserve the distribution of values in that column.
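The four methods can be approximated in plain Python to make their differences concrete. This is a minimal sketch under stated assumptions, not how the node is implemented: the function names are hypothetical, a table is modeled as a list of rows, the sampled rows are returned in their original order, and the stratified version rounds each group's share independently, so the total can differ slightly from the requested count.

```python
import random
from collections import defaultdict

def take_from_top(rows, n):
    """Keep the first n rows."""
    return rows[:n]

def linear_sample(rows, n):
    """Keep n evenly spaced rows, always including the first and the last."""
    if n <= 0 or not rows:
        return []
    if n == 1:
        return [rows[0]]
    n = min(n, len(rows))
    last = len(rows) - 1
    # Integer arithmetic maps i = 0 to the first row and i = n - 1 to the last.
    return [rows[(i * last) // (n - 1)] for i in range(n)]

def draw_randomly(rows, n, seed=None):
    """Keep n rows chosen at random; a fixed seed gives a reproducible result."""
    rng = random.Random(seed)
    picked = rng.sample(range(len(rows)), min(n, len(rows)))
    # Assumption: sampled rows keep their original table order.
    return [rows[i] for i in sorted(picked)]

def stratified_sample(rows, n, key, seed=None):
    """Keep roughly n rows while preserving the distribution of key(row)."""
    if not rows or n <= 0:
        return []
    rng = random.Random(seed)
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[key(row)].append(i)
    fraction = min(n, len(rows)) / len(rows)
    picked = []
    for indices in groups.values():
        # Each group contributes its proportional share, rounded to whole rows.
        k = min(round(len(indices) * fraction), len(indices))
        picked.extend(rng.sample(indices, k))
    return [rows[i] for i in sorted(picked)]
```

As a usage example, with a table of 80 rows labeled "a" and 20 rows labeled "b", `stratified_sample(table, 10, key=lambda r: r[0], seed=7)` keeps 8 "a" rows and 2 "b" rows, whereas `take_from_top(table, 10)` would return only "a" rows if the table is sorted by label.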
