
Partitioning

Splitting a table into two outputs with a set number or percentage of records is a common step, especially when preparing data for predictive modeling. A typical case is a random split into two tables holding 70% and 30% of the original records, frequently referred to as the 'training' and 'testing' sets. The Partitioning node is the easiest way to do this.

There are two techniques to determine how many records flow into each output:

- Absolute: You choose a specific number of records
- Relative: You choose a specific percentage of records
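As a rough illustration of the two techniques, the sketch below computes how many rows would go to the first output for a given table size. The function name `first_output_size`, the rounding rule, and the clamping to the table size are assumptions made for this example; they are not taken from the Partitioning node itself.

```python
def first_output_size(total_rows: int, mode: str, value: float) -> int:
    """Return how many rows go to the first output (hypothetical helper)."""
    if mode == "absolute":
        # The value is a row count.
        size = int(value)
    elif mode == "relative":
        # The value is a percentage between 0 and 100.
        # Rounding to the nearest row is an assumption of this sketch.
        size = round(total_rows * value / 100)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Never return more rows than the table holds, and never a negative count.
    return max(0, min(size, total_rows))


if __name__ == "__main__":
    print(first_output_size(200, "absolute", 10))
    print(first_output_size(200, "relative", 70))
```

All remaining rows go to the second output, so the two outputs always add up to the original table.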

Once you determine how many records to pass through each output port, there are four methods by which records can be chosen:

- Take from top: The specified number or percentage of records will come from the first record on down.

- Linear sampling: Always includes the first and last rows, then takes the remaining rows at equal intervals across the table until the number or percentage chosen above (absolute/relative) is reached.

- Draw randomly: Records are chosen at random by a random number generator. Set a specific random seed to make the selection reproducible.

- Stratified: Select a column and the output will approximately match the distribution of values in the selected column.
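The first three methods can be sketched in plain Python to make the differences concrete. This is a minimal illustration under stated assumptions, not the node's implementation: the function names are hypothetical, the linear spacing uses integer division, and the random draw uses Python's `random` module rather than whatever generator KNIME uses.

```python
import random


def _split(rows, chosen):
    """Split rows into (selected, remaining), keeping the original order."""
    upper = [r for i, r in enumerate(rows) if i in chosen]
    lower = [r for i, r in enumerate(rows) if i not in chosen]
    return upper, lower


def take_from_top(rows, n):
    """The first n rows go to the first output, the rest to the second."""
    n = max(0, min(n, len(rows)))
    return list(rows[:n]), list(rows[n:])


def linear_sample(rows, n):
    """Pick n rows at roughly equal intervals, including first and last."""
    n = max(0, min(n, len(rows)))
    if n == 0:
        return [], list(rows)
    if n == 1:
        return _split(rows, {0})
    last = len(rows) - 1
    # Integer spacing is an assumption; i = 0 gives the first row,
    # i = n - 1 gives the last row.
    chosen = {(i * last) // (n - 1) for i in range(n)}
    return _split(rows, chosen)


def draw_randomly(rows, n, seed=None):
    """Pick n rows at random; a fixed seed makes the draw repeatable."""
    n = max(0, min(n, len(rows)))
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(rows)), n))
    return _split(rows, chosen)


if __name__ == "__main__":
    table = list(range(100))  # stand-in for 100 input rows
    print(take_from_top(table, 10)[0])
    print(linear_sample(table, 10)[0])
    train, test = draw_randomly(table, 70, seed=42)
    print(len(train), len(test))
```

Calling `draw_randomly` twice with the same seed returns the same split, which is the reproducibility benefit of fixing the seed.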

Workflow annotations (Night Heron Data, 2023):

- Split data from top with absolute rows: Using an absolute row count taken from the top ensures the upper port will always contain the first N records from the top of the input table. All other records are sent to the lower output table.

- Absolute rows, linear sampling: Absolute records (10 rows) sampled linearly. The first and last records are returned, with 8 records also output at equal intervals from the original table.

- Relative rows drawn randomly: 70% of records are drawn randomly. Set a random seed to ensure the random sample is the same every time the workflow is executed. This is helpful when you need to reproduce the results of your workflow.

- Relative rows, stratified sampling: Take a 50% relative row sample using stratified sampling based on the Country field. This means that, of the 50% of records returned in the sample, the distribution of the Country field will be approximately what it was within the original table.

Nodes in the workflow: Table Creator ("Input some data") and four Partitioning nodes.
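The stratified 50% sample on the Country field can be illustrated with a small Python sketch. The function name `stratified_split`, the sample country values, and the per-group rounding are assumptions made for this example; the node's own algorithm may differ in detail.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, key, fraction, seed=None):
    """Sample `fraction` of the rows within each value of `key`.

    Sampling per group keeps the distribution of `key` in the first
    output close to its distribution in the input.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[row[key]].append(idx)

    chosen = set()
    for indices in groups.values():
        # Rounding the per-group count is an assumption of this sketch.
        k = round(len(indices) * fraction)
        chosen.update(rng.sample(indices, k))

    upper = [r for i, r in enumerate(rows) if i in chosen]
    lower = [r for i, r in enumerate(rows) if i not in chosen]
    return upper, lower


if __name__ == "__main__":
    # Hypothetical input: 60 rows for DE, 30 for US, 10 for FR.
    countries = ["DE"] * 60 + ["US"] * 30 + ["FR"] * 10
    table = [{"id": i, "Country": c} for i, c in enumerate(countries)]

    sample, remainder = stratified_split(table, "Country", 0.5, seed=1)
    print(Counter(r["Country"] for r in sample))
    print(len(sample), len(remainder))
```

With this input the sample keeps the 60/30/10 proportions of the original table, which is the behavior the stratified option is meant to approximate.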
