Sampling creates a subset of an original table. Sometimes this is a temporary measure to shorten workflow execution times during development. In other cases the analysis itself requires the data to be sampled in a particular way.
The Row Sampling node's configuration settings are identical to those of the Partitioning node. The only difference is that the Partitioning node provides two outputs, whereas Row Sampling provides only one.
There are two techniques to determine how many records flow into each output:
- Absolute: You choose a specific number of records
- Relative: You choose a specific percentage of records
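As a rough illustration of the two sizing techniques, the sketch below turns either setting into a row count. It is not KNIME code: the function name `sample_size`, its parameters, and the rounding rule for percentages are assumptions made for this example.

```python
def sample_size(total_rows, absolute=None, relative=None):
    """Return how many rows to keep (hypothetical helper, not a KNIME API).

    absolute: a fixed number of rows.
    relative: a percentage of the table, from 0 to 100.
    Exactly one of the two must be given.
    """
    if (absolute is None) == (relative is None):
        raise ValueError("give exactly one of absolute or relative")
    if absolute is not None:
        # A table cannot yield more rows than it contains.
        return max(0, min(absolute, total_rows))
    if not 0 <= relative <= 100:
        raise ValueError("relative must be between 0 and 100")
    # Assumption: round to the nearest whole row.
    return round(total_rows * relative / 100)


print(sample_size(200, absolute=50))  # 50 rows
print(sample_size(200, relative=25))  # 25% of 200 rows
```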
Once you determine how many records to pass through each output port, there are four methods by which records can be chosen:
- Take from top: The specified number or percentage of records will come from the first record on down.
- Linear sampling: Includes the first and last rows and then takes every Nth record in between, where N follows from the sample size chosen above (absolute/relative).
- Draw randomly: Based on a random number generator (or the specific seed set below), records are chosen at random. Pick a specific random seed to ensure reproducibility.
- Stratified: Select a column and the output will approximately match the distribution of values in the selected column.
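The four methods above can be approximated in plain Python. This is a minimal sketch for intuition only, not the node's actual implementation: the function names are hypothetical, and details such as how linear sampling spaces its indices, whether random draws keep the original row order, and how stratified shares are rounded are assumptions.

```python
import random
from collections import defaultdict


def take_from_top(rows, n):
    # The first n rows, in their original order.
    return rows[:max(n, 0)]


def linear_sample(rows, n):
    # n evenly spaced rows; includes the first and last row when n >= 2.
    n = min(n, len(rows))
    if n <= 0:
        return []
    if n == 1:
        return [rows[0]]
    last = len(rows) - 1
    return [rows[(i * last) // (n - 1)] for i in range(n)]


def draw_randomly(rows, n, seed=None):
    # A fixed seed makes the draw reproducible.
    rng = random.Random(seed)
    picked = rng.sample(range(len(rows)), min(max(n, 0), len(rows)))
    # Assumption: keep the original row order in the output.
    return [rows[i] for i in sorted(picked)]


def stratified_sample(rows, n, key, seed=None):
    # Draw from each group in proportion to its share of the table.
    rng = random.Random(seed)
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[key(row)].append(index)
    picked = []
    for indices in groups.values():
        # Rounding per group means the total can differ slightly from n.
        share = round(n * len(indices) / len(rows))
        picked.extend(rng.sample(indices, min(share, len(indices))))
    return [rows[i] for i in sorted(picked)]


table = [("a", i) for i in range(80)] + [("b", i) for i in range(20)]
sample = stratified_sample(table, 10, key=lambda row: row[0], seed=1)
print(sum(1 for row in sample if row[0] == "a"), "rows from group a")
print(sum(1 for row in sample if row[0] == "b"), "rows from group b")
```

In this usage example the table is 80% "a" and 20% "b", so a stratified sample of 10 rows keeps that ratio, which a purely random draw would only match on average.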