Equal Size Sampling

Removes rows from the input data set so that the values in a categorical column are equally distributed. This can be useful, for instance, if a learning algorithm is sensitive to unequal class distributions and you want to downsize the data set so that each class value occurs equally often.

The node removes random rows belonging to the majority classes. The output contains all rows of the minority class(es) and a random sample from each of the majority classes, where each sample contains as many rows as the minority class.
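
The behaviour described above can be illustrated with a minimal Python sketch. It is not the node's actual implementation; the function name `equal_size_sample` and the `class_of` accessor are hypothetical names chosen for this example, and `None` stands in for a missing value.

```python
import random
from collections import Counter


def equal_size_sample(rows, class_of, seed=None):
    """Downsample `rows` so every class has as many rows as the smallest class.

    `rows` is a list, `class_of` maps a row to its class value
    (None is treated as its own category, like a missing value).
    """
    if not rows:
        return []
    rng = random.Random(seed)

    # First pass: count the occurrences of each class value.
    counts = Counter(class_of(row) for row in rows)
    target = min(counts.values())  # size of the minority class

    # Collect the row indices belonging to each class.
    indices_by_class = {}
    for i, row in enumerate(rows):
        indices_by_class.setdefault(class_of(row), []).append(i)

    # Draw exactly `target` indices per class; the minority class keeps all rows.
    keep = set()
    for indices in indices_by_class.values():
        keep.update(rng.sample(indices, target))

    # Second pass: emit the kept rows in their original order.
    return [row for i, row in enumerate(rows) if i in keep]


if __name__ == "__main__":
    data = [("x", i) for i in range(6)] + [("y", i) for i in range(2)]
    for row in equal_size_sample(data, lambda r: r[0], seed=1):
        print(row)
```

In this sketch both rows of class `y` are always kept, and two of the six rows of class `x` are drawn at random.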

Options

Nominal Column
Select the class column here. The node passes over the data set once to count the occurrences of each value in the selected column and then filters the rows in a second pass. Note that missing values in this column are treated as a separate category, which can also form the minority class.
Use exact sampling
If selected, the number of output rows per class is determined up front: each class will have the same number of instances in the output table. This sampling is slightly more memory-intensive, as each class needs to be represented by a bit set that marks the corresponding rows. In most cases it is safe to select this option unless you have very large data with many different class labels.
Use approximate sampling
If selected, the output is determined on the fly. The number of occurrences of each class may differ slightly, as the final number cannot be determined beforehand.
Enable static seed
If selected, the removal of rows is driven by a static seed, so the result is reproducible.
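
To illustrate why approximate sampling yields slightly varying class counts, and what a static seed buys you, here is a minimal Python sketch. It assumes that each row is kept independently with probability (minority class size / size of the row's class); this is one plausible on-the-fly scheme, not necessarily what the node does internally. The function name `approximate_equal_size_sample` and the `class_of` accessor are hypothetical.

```python
import random
from collections import Counter


def approximate_equal_size_sample(rows, class_of, seed=None):
    """Keep each row with probability minority_size / size_of_its_class.

    Assumption for illustration only: rows are decided one by one, so the
    per-class output counts are only approximately equal.
    """
    rng = random.Random(seed)  # a fixed seed makes the result reproducible

    # First pass: count the occurrences of each class value.
    counts = Counter(class_of(row) for row in rows)
    if not counts:
        return []
    target = min(counts.values())

    # Second pass: decide on the fly whether to keep each row.
    # For the minority class the probability is 1.0, so all its rows are kept.
    return [
        row
        for row in rows
        if rng.random() < target / counts[class_of(row)]
    ]


if __name__ == "__main__":
    data = [("x", i) for i in range(100)] + [("y", i) for i in range(10)]
    first = approximate_equal_size_sample(data, lambda r: r[0], seed=7)
    second = approximate_equal_size_sample(data, lambda r: r[0], seed=7)
    print(first == second)  # same seed, same selection
    print(Counter(r[0] for r in first))
```

With the same seed the two runs select identical rows; without a seed, the number of `x` rows kept will typically vary around 10 from run to run.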

Input Ports

Arbitrary input data.

Output Ports

The input data with fewer rows.

Views

This node has no views.
