Row Sampling

Sampling a table creates a subset of the original table. Sometimes this is a temporary measure to shorten workflow execution times during development. In other cases the analysis itself calls for a sample of the data.

The Row Sampling node's configuration settings are identical to those of the Partitioning node. The only difference is that the Partitioning node provides two outputs, whereas Row Sampling provides only one.
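
The relationship between the two nodes can be pictured with a small Python sketch. It is only an illustration and not KNIME code: `partition` and `row_sample` are hypothetical helper names, rows are taken from the top for simplicity, and the second output is assumed to hold the rows that were not selected.

```python
def partition(rows, n):
    """Hypothetical stand-in for Partitioning: return two outputs.

    The first output holds the selected rows; the second is assumed
    to hold everything that was not selected.
    """
    return rows[:n], rows[n:]


def row_sample(rows, n):
    """Hypothetical stand-in for Row Sampling: same selection, one output."""
    selected, _unselected = partition(rows, n)
    return selected
```

With the same settings, both helpers select the same rows; the sampling variant simply discards the rest.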

There are two techniques to determine how many records flow into each output:

- Absolute: You choose a specific number of records
- Relative: You choose a specific percentage of records
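
As a rough illustration of the two techniques, the sketch below turns either setting into a row count. The function name `rows_to_keep` and its parameters are assumptions made for this example, and the rounding rule for the relative case is a guess; the node's exact rounding is not stated here.

```python
def rows_to_keep(total_rows: int, mode: str, value: float) -> int:
    """Return how many rows a sample should contain (illustrative only)."""
    if mode == "absolute":
        # A sample cannot hold more rows than the input table.
        return min(int(value), total_rows)
    if mode == "relative":
        # value is a percentage between 0 and 100; rounding is an assumption.
        return round(total_rows * value / 100)
    raise ValueError(f"unknown mode: {mode}")
```

For a 200-row table, `rows_to_keep(200, "absolute", 10)` gives 10 rows and `rows_to_keep(200, "relative", 70)` gives 140 rows.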

Once you determine how many records to pass through each output port, there are four methods by which records can be chosen:

- Take from top: The specified number or percentage of records will come from the first record on down.

- Linear sampling: Includes the first and last rows and then samples every N records based on the selection above (absolute/relative).

- Draw randomly: Based on a random number generator (or the specific seed set below), records are chosen at random. Pick a specific random seed to ensure reproducibility.

- Stratified: Select a column and the output will approximately match the distribution of values in the selected column.
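
The four methods above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the node's implementation: the function names are hypothetical, rows are held in a list, sampled rows keep their original order, and the stratified version samples each group at the same fraction.

```python
import random
from collections import defaultdict


def take_from_top(rows, n):
    """Return the first n rows."""
    return rows[:n]


def linear_sample(rows, n):
    """Return n rows at equal intervals, including the first and last row."""
    if n <= 0 or not rows:
        return []
    if n == 1:
        return [rows[0]]
    n = min(n, len(rows))
    last = len(rows) - 1
    # Integer arithmetic keeps index 0 and index `last` in the result.
    return [rows[i * last // (n - 1)] for i in range(n)]


def draw_randomly(rows, n, seed=None):
    """Return n rows chosen at random; a fixed seed repeats the same choice."""
    rng = random.Random(seed)
    picked = rng.sample(range(len(rows)), min(n, len(rows)))
    return [rows[i] for i in sorted(picked)]


def stratified_sample(rows, fraction, key, seed=None):
    """Sample each group of key(row) at the same fraction (0.0 to 1.0)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[key(row)].append(index)
    picked = []
    for indices in groups.values():
        # Per-group rounding is why the distribution is only approximate.
        picked.extend(rng.sample(indices, round(len(indices) * fraction)))
    return [rows[i] for i in sorted(picked)]
```

Calling `draw_randomly(rows, 70, seed=42)` twice returns the same rows both times, which is the reproducibility the seed is meant to provide.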

Workflow annotations (Night Heron Data, 2023):

- Table Creator: Input some data.

- Row Sampling (sample from top with absolute rows): Using an absolute row count taken from the top ensures the output will always contain the first N records from the top of the input table.

- Row Sampling (absolute rows, linear sampling): Absolute records (10 rows) sampled linearly. The first and last records are returned, with 8 records also output at equal intervals from the original table.

- Row Sampling (relative rows drawn randomly): 70% of records are drawn randomly. Set a random seed to ensure the random sample is the same every time the workflow is executed. This is helpful when you need to reproduce the results of your workflow.

- Row Sampling (relative rows, stratified sampling): Take a 50% relative row sample using stratified sampling based on the Country field. This means that, of the 50% of records returned in the sample, the distribution of the Country field will be approximately what it was within the original table.
