SMOTE

This node oversamples the input data (i.e. adds artificial rows) to enrich the training data. The applied technique is called SMOTE (Synthetic Minority Over-sampling Technique) by Chawla et al.

Some supervised learning algorithms (such as decision trees and neural nets) work best with a roughly balanced class distribution and may otherwise generalize poorly, i.e. give poor classification performance. If the input data is unbalanced, for instance with only a few objects of the "active" class but many of the "inactive" class, this node adjusts the class distribution by adding artificial rows (in the example, rows for the "active" class).

The algorithm works roughly as follows: it creates synthetic rows by interpolating between a real object of a given class (in the above example "active") and one of its nearest neighbors of the same class. It picks a random point on the line segment between these two objects and uses the coordinates of that point as the attributes (cell values) of the new object.
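The interpolation step can be sketched in Python as follows. This is an illustrative sketch only, not the node's source code: the function name synthesize, its parameters, the use of NumPy, and the choice of Euclidean distance are all assumptions made for the example.

```python
import numpy as np

def synthesize(minority, k, n_new, seed=None):
    """Create n_new synthetic rows from the rows of one class (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two rows of the class to interpolate")
    k = min(k, n - 1)  # a row cannot be its own neighbor

    # Pairwise Euclidean distances within the class (assumed distance measure).
    diff = minority[:, None, :] - minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # exclude each row from its own neighbor list
    neighbors = np.argsort(dist, axis=1)[:, :k]

    out = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                  # a real object of the class
        nb = neighbors[base, rng.integers(k)]   # one of its k nearest neighbors
        gap = rng.random()                      # random position in [0, 1) on the segment
        out[i] = minority[base] + gap * (minority[nb] - minority[base])
    return out

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = synthesize(points, k=2, n_new=10, seed=1)
```

Every synthetic row lies on a segment between two real rows of the same class, so it stays inside the region spanned by that class. Passing the same seed twice yields the same rows.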

Options

Class Column
Pick the column that contains the class information.
Nearest neighbor
The number of nearest neighbors to consider. The algorithm picks an object from the target class, randomly selects one of its nearest neighbors, and places the new synthetic example on the line between the object and that neighbor.
Oversample by
Select this option to oversample every class by the same factor. The value specifies how much synthetic data is added relative to the existing rows: a value of 2, for example, adds two more portions for each class (if the input table contains 50 rows labeled "A", the output will contain 150 rows belonging to "A").
Oversample minority classes
This option adds synthetic examples to all classes that are not the majority class. The output contains the same number of rows for each of the possible classes.
Enable static seed
Check this option if you want to use a seed for the random number generator. This will cause consecutive runs of the node to produce the same output data. If unchecked, each run of the node generates a new seed. Use "Draw new seed" to randomly draw a new seed.
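The row counts implied by the two oversampling options can be worked out with a few lines of Python. This is a sketch of the arithmetic described above, not the node's implementation; the function synthetic_rows_needed and its mode names "by" and "minority" are hypothetical.

```python
from collections import Counter

def synthetic_rows_needed(labels, mode="minority", factor=1):
    """Return the number of synthetic rows to add per class (hypothetical helper)."""
    counts = Counter(labels)
    if mode == "by":
        # "Oversample by": add `factor` extra portions of every class.
        return {cls: n * factor for cls, n in counts.items()}
    if mode == "minority":
        # "Oversample minority classes": fill every class up to the largest one.
        largest = max(counts.values())
        return {cls: largest - n for cls, n in counts.items()}
    raise ValueError(f"unknown mode: {mode}")

labels = ["A"] * 50 + ["B"] * 200
by_two = synthetic_rows_needed(labels, mode="by", factor=2)
balanced = synthetic_rows_needed(labels, mode="minority")
```

With a factor of 2, class "A" receives 100 synthetic rows, giving 150 rows in total as in the example above. In the minority mode, "A" receives 150 synthetic rows and "B" none, so both classes end up with 200 rows.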

Input Ports

Table containing labeled data for oversampling.

Output Ports

Oversampled data (input table with appended rows).

Views

This node has no views.
