Feature Selection Loop Start (2:2)

This node is the start of the feature selection loop. The feature selection loop allows you to select, from all the features in the input data set, the subset of features that is best for model construction. With this node you determine (i) which features/columns are to be held fixed in the selection process (these constant or "static" features/columns are included in each loop iteration and are exempt from elimination); (ii) which selection strategy is to be used on the other (variable) features/columns; and (iii) the specific settings of the selected strategy. This node has two input ports and two output ports. In each case, the first port is intended for training data and the second port for test data. The same filter is applied to both tables, so they always contain the same columns.

The following feature selection strategies are available:

  • Forward Feature Selection is an iterative approach. It starts with having no feature selected. In each iteration, the feature that improves the model the most is added to the feature set.
  • Backward Feature Elimination is an iterative approach. It starts with having all features selected. In each iteration, the feature whose removal has the least impact on the model's performance is removed.
  • Genetic Algorithm is a stochastic approach that bases its optimization on the mechanics of biological evolution and genetics. Similar to natural selection, different solutions (individuals) are carried and mutated from generation to generation based on their performance (fitness). This approach converges to a local optimum, so enabling early stopping might be recommended. See, e.g., this article for more insights.
  • Random is a simple approach that selects feature combinations randomly. It does not converge. Because (one of) the best feature combinations may be drawn by chance in an early iteration, early stopping might be recommended.
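
The two greedy strategies listed above can be sketched in a few lines of Python. This is an illustration only, not the node's implementation: the names forward_selection, backward_elimination and score are hypothetical, and score stands in for whatever the loop body does to train and evaluate a model on a feature subset (higher is assumed to be better).

```python
from typing import Callable, FrozenSet, List, Sequence

# Hypothetical scoring callback: maps a feature subset to a model score.
Score = Callable[[FrozenSet[str]], float]


def forward_selection(features: Sequence[str], score: Score,
                      max_features: int) -> List[str]:
    """Start with no features; in each round add the one that helps most."""
    selected: List[str] = []
    remaining = list(features)
    while remaining and len(selected) < max_features:
        # Candidate whose addition gives the highest score.
        best = max(remaining, key=lambda f: score(frozenset(selected + [f])))
        selected.append(best)
        remaining.remove(best)
    return selected


def backward_elimination(features: Sequence[str], score: Score,
                         min_features: int) -> List[str]:
    """Start with all features; in each round drop the least useful one."""
    selected = list(features)
    while len(selected) > min_features:
        # Feature whose removal leaves the highest score behind.
        worst = max(
            selected,
            key=lambda f: score(frozenset(x for x in selected if x != f)),
        )
        selected.remove(worst)
    return selected


# Toy score (an assumption for the demo): sum of fixed per-feature weights.
WEIGHTS = {"a": 3.0, "b": 2.0, "c": 0.1, "d": -1.0}


def toy_score(subset: FrozenSet[str]) -> float:
    return sum(WEIGHTS[f] for f in subset)


print(forward_selection(list(WEIGHTS), toy_score, max_features=2))
print(backward_elimination(list(WEIGHTS), toy_score, min_features=2))
```

With the toy weights, both strategies end up with the features a and b. With a real model the two strategies can disagree, because each one only ever looks one feature ahead.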

Options

Static and Variable Features
Columns can be selected manually or by means of regular expressions. The columns in the left pane are the static columns, those in the right pane the variable columns. If you want to learn a supervised model (e.g. classification or regression), at least one static column and more than one variable column will be needed. For an unsupervised model (e.g. clustering), no static column is needed, only variable columns. Columns can be moved from one pane to the other by clicking on the appropriate button in the middle.
Feature selection strategy
Here you can choose between the selection strategies: Forward Feature Selection, Backward Feature Elimination, Genetic Algorithm and Random.
Use threshold for number of features
[Forward Feature Selection, Backward Feature Elimination] Check this option if you want to set a bound for the number of selected features. Since Forward Feature Selection adds features while Backward Feature Elimination subtracts them, this will be an upper bound for Forward Feature Selection and a lower bound for Backward Feature Elimination.
Select threshold for number of features
[Forward Feature Selection, Backward Feature Elimination] Set the upper or lower bound for the number of selected features.
Use lower bound for number of features
[Genetic Algorithm, Random] Check this option if you want to set a lower bound for the number of selected features.
Use upper bound for number of features
[Genetic Algorithm, Random] Check this option if you want to set an upper bound for the number of selected features.
Population size
[Genetic Algorithm] Set the number of individuals in each population. Changing this value directly influences the maximal number of loop iterations, which is Population size * (Number of generations + 1). This is just an upper bound; usually fewer iterations will be necessary.
Max. number of generations
[Genetic Algorithm] Set the number of generations. Changing this value directly influences the maximal number of loop iterations, which is Population size * (Number of generations + 1). This is just an upper bound; usually fewer iterations will be necessary.
Max. number of iterations
[Random] Set the number of iterations. This is an upper bound. If the same feature subset is randomly generated a second time, it won't be processed again but will still be counted as an iteration. Furthermore, if early stopping is enabled, the algorithm may stop before the max. number of iterations is reached.
Use static random seed
[Genetic Algorithm, Random] Choose a seed to get reproducible results.
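
The iteration bound for the Genetic Algorithm and the duplicate handling of the Random strategy described above can be made concrete with a short Python sketch. The function names are hypothetical, and the way a random subset is drawn here (uniform size between the bounds, then a uniform sample) is an assumption, not a description of the node's internals.

```python
import random
from typing import FrozenSet, Iterator, Optional, Sequence


def max_ga_iterations(population_size: int, generations: int) -> int:
    """Upper bound on loop iterations: Population size * (generations + 1)."""
    return population_size * (generations + 1)


def random_subsets(features: Sequence[str], max_iterations: int, seed: int,
                   lower: int = 1,
                   upper: Optional[int] = None) -> Iterator[FrozenSet[str]]:
    """Yield distinct random feature subsets to evaluate.

    A subset drawn a second time still uses up one iteration but is not
    yielded again, so fewer than max_iterations subsets may be evaluated.
    """
    rng = random.Random(seed)  # static seed gives reproducible draws
    upper = len(features) if upper is None else upper
    seen = set()
    for _ in range(max_iterations):
        size = rng.randint(lower, upper)  # assumed: uniform subset size
        subset = frozenset(rng.sample(list(features), size))
        if subset in seen:
            continue  # counted as an iteration, not processed again
        seen.add(subset)
        yield subset


print(max_ga_iterations(population_size=20, generations=10))  # → 220
for subset in random_subsets(["a", "b", "c", "d"], max_iterations=10, seed=42):
    print(sorted(subset))
```

Running the Random sketch twice with the same seed produces the same sequence of subsets, which is the point of the static random seed option.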

Advanced Options

Selection strategy
[Genetic Algorithm] Choose the strategy to use for the selection of offspring.
Fraction of survivors
[Genetic Algorithm] Set the fraction of survivors during evaluation of the next generation. The remainder, 1 - fraction of survivors, is the fraction of offspring that is evaluated for the next generation.
Elitism rate
[Genetic Algorithm] Set the fraction of the best individuals within a generation that are transferred to the next generation without alteration.
Crossover strategy
[Genetic Algorithm] Choose the strategy to use for crossover.
Crossover rate
[Genetic Algorithm] Set the crossover rate used to alter offspring.
Mutation rate
[Genetic Algorithm] Set the mutation rate used to alter offspring.
Enable early stopping
[Genetic Algorithm, Random] Check this option if you want to enable early stopping, which means that the algorithm stops after a specified number of generations/iterations without improvement. For the Random strategy, this is based on a moving average whose window size equals the specified number of iterations. If the ratio of improvement is lower than a specified tolerance, the search stops.
Number of generations/iterations without improvement
[Genetic Algorithm, Random] Set the number of generations/iterations without improvement (or, for the Random strategy, with less improvement than the specified tolerance) used for early stopping. For the Random strategy it also defines the size of the moving window.
Tolerance
[Random] The tolerance used for early stopping which defines the threshold for the ratio of improvement. If the ratio is lower than the threshold, the strategy stops.
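
The two early stopping rules above can be sketched as follows. This is a rough illustration under stated assumptions, not the node's actual code: the function names are hypothetical, higher scores are assumed to be better, and the "ratio of improvement" is read here as the relative change between two consecutive moving averages, which the description above does not spell out.

```python
from typing import Sequence


def stop_after_no_improvement(best_per_generation: Sequence[float],
                              patience: int) -> bool:
    """Genetic Algorithm style: stop if the last `patience` generations
    did not beat the best score seen before them."""
    if len(best_per_generation) <= patience:
        return False  # not enough history yet
    earlier_best = max(best_per_generation[:-patience])
    recent_best = max(best_per_generation[-patience:])
    return recent_best <= earlier_best


def stop_on_moving_average(scores: Sequence[float], window: int,
                           tolerance: float) -> bool:
    """Random style (assumed reading): stop if the moving average over the
    last `window` scores improved by less than `tolerance`, relative to
    the moving average one step earlier."""
    if len(scores) <= window:
        return False  # need two full windows, shifted by one score
    current = sum(scores[-window:]) / window
    previous = sum(scores[-window - 1:-1]) / window
    if previous == 0:
        return False  # ratio undefined; keep searching
    ratio = (current - previous) / abs(previous)
    return ratio < tolerance


print(stop_after_no_improvement([0.5, 0.6, 0.7, 0.7, 0.7], patience=2))
print(stop_on_moving_average([0.5, 0.5, 0.5, 0.5], window=3, tolerance=0.01))
```

In both functions a single parameter plays the role of the "Number of generations/iterations without improvement" setting: it is the patience in the first rule and the window size in the second.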

Input Ports

A data table containing all features and static columns needed for the feature selection. (Training data)
A data table containing all features and static columns needed for the feature selection. (Test data)

Output Ports

The input table with some columns filtered out. (Training data)
The input table with some columns filtered out. (Test data)

Views

This node has no views.
