Gradient Boosted Trees Learner (Regression)

Learns Gradient Boosted Trees with the objective of regression. The algorithm uses very shallow regression trees and a special form of boosting to build an ensemble of trees. The implementation follows the algorithm in section 4.4 of the paper "Greedy Function Approximation: A Gradient Boosting Machine" by Jerome H. Friedman (1999).

In a regression tree the predicted value for a leaf node is the mean target value of the records within the leaf. Hence the predictions are best (with respect to the training data) if the variance of target values within a leaf is minimal. This is achieved by splits that minimize the sum of squared errors in their respective children.
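The split criterion described above can be sketched in a few lines of Python. This is a minimal illustration for a single numeric attribute, not the node's actual implementation; the function names `sse` and `best_split` are hypothetical.

```python
def sse(values):
    """Sum of squared errors around the mean (the mean is the leaf prediction)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_sse) of the best 'x <= threshold' split."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal attribute values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (pairs[i - 1][0], total)
    return best


xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
threshold, error = best_split(xs, ys)
print(threshold, error)  # the split separates the low targets from the high ones
```

Each candidate boundary is scored by the summed squared error of its two children, and the boundary with the lowest score wins.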

Sampling

This node lets you perform row sampling (bagging) and attribute sampling (attribute bagging), similar to the random forest* and tree ensemble nodes. When sampling is used, the method is usually referred to as Stochastic Gradient Boosted Trees. The respective settings can be found in the Advanced Options tab.


(*) RANDOM FORESTS is a registered trademark of Minitab, LLC and is used with Minitab’s permission.

Options

Target Column
Select the column containing the value to be learned. Rows with missing values in this column will be ignored during the learning process.
Attribute Selection

Select the attributes on which the model should be learned. You can choose from two modes.

Fingerprint attribute: Uses a fingerprint/vector column (bit, byte and double are possible) to learn the model by treating each entry of the vector as a separate attribute (e.g. a bit vector of length 1024 is expanded into 1024 binary attributes). The node requires all vectors to be of the same length.

Column attributes: Uses ordinary columns in your table (e.g. String, Double, Integer, etc.) as attributes to learn the model on. The dialog lets you select the columns manually (by moving them to the right panel) or via a wildcard/regex selection (all columns whose names match the wildcard/regex are used for learning). In case of manual selection, the behavior for new columns (i.e. columns that are not available at the time you configure the node) can be specified as either Enforce exclusion (new columns are excluded and therefore not used for learning) or Enforce inclusion (new columns are included and therefore used for learning).
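To illustrate the fingerprint mode, the following sketch expands bit vectors into one binary attribute per position. It assumes the fingerprints are given as strings of "0" and "1", which is a simplification; the function name `expand_fingerprints` and the generated column names are hypothetical.

```python
def expand_fingerprints(fingerprints, prefix="bit"):
    """Turn equal-length bit strings into (column names, rows of 0/1 values)."""
    lengths = {len(fp) for fp in fingerprints}
    if len(lengths) != 1:
        # Mirrors the requirement that all vectors have the same length.
        raise ValueError("all fingerprints must have the same length")
    n = lengths.pop()
    names = [f"{prefix}_{i}" for i in range(n)]
    rows = [[int(ch) for ch in fp] for fp in fingerprints]
    return names, rows


names, rows = expand_fingerprints(["1010", "0110"])
print(names)
print(rows)
```

A bit vector of length 1024 would yield 1024 such columns in the same way.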

Limit number of levels (tree depth)
Number of tree levels to be learned. For instance, a value of 1 would only split the (single) root node (decision stump). For gradient boosted trees usually a depth in the range 4 to 10 is sufficient. Larger trees will quickly lead to overfitting.
Number of models
The number of decision trees to learn. A "reasonable" value can range from very few (say 10) to many thousands for small data sets. Unlike the random forest algorithm, gradient boosted trees tend to overfit if the number of models is set too high and the learning rate is not low enough.
Learning rate
The learning rate controls how much influence a single model has on the ensemble result. Usually a value of 0.1 is a good starting point, but the best learning rate also depends on the number of models. The more models the ensemble contains, the lower the learning rate should be.
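The interplay between the number of models and the learning rate can be sketched as follows. This is a deliberately simplified illustration that assumes a single numeric attribute, depth-1 trees, and plain squared-error loss; the node itself follows Friedman's section 4.4 variant. All function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (decision stump) to the residuals."""
    pairs = sorted(zip(xs, residuals))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [r for _, r in pairs[:i]]
        right = [r for _, r in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, pairs[i - 1][0], left_mean, right_mean)
    if best is None:  # no valid split: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_models, learning_rate):
    """Build an additive ensemble; each stump is scaled by the learning rate."""
    init = sum(ys) / len(ys)
    stumps = []
    preds = [init] * len(ys)
    for _ in range(n_models):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: init + learning_rate * sum(s(x) for s in stumps)


def training_mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
for n_models, rate in [(5, 0.1), (50, 0.1), (5, 0.5)]:
    print(n_models, rate, training_mse(boost(xs, ys, n_models, rate), xs, ys))
```

With a low learning rate each tree corrects only a small part of the remaining error, so more trees are needed to reach the same training fit. A higher rate fits faster but makes overfitting more likely.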

Advanced Options

Use mid points splits (only for numeric attributes)
For numerical splits, uses the midpoint between the two values on either side of the split boundary. If unselected, the split value is the smaller of the two values, with a "<=" relationship.
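The difference between the two settings only shows for values that fall between the two boundary values. A small sketch, with hypothetical function names:

```python
def split_value(lower, upper, use_midpoint):
    """Threshold for a split between two adjacent attribute values."""
    return (lower + upper) / 2.0 if use_midpoint else lower


def goes_left(x, threshold):
    """Rows with x <= threshold are sent to the left child."""
    return x <= threshold


# The training data has a boundary between the values 3.0 and 10.0.
mid = split_value(3.0, 10.0, use_midpoint=True)
low = split_value(3.0, 10.0, use_midpoint=False)
print(mid, low)

# A previously unseen value of 5.0 is routed differently by the two thresholds.
print(goes_left(5.0, mid), goes_left(5.0, low))
```

Both thresholds separate the training rows identically; they differ only in how new values inside the gap are routed.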
Use binary splits for nominal columns
If this option is checked (this is the default), then nominal columns are split in a binary way using set based splits. The algorithm for determining the best binary split is described in section 8.8 of "Classification and Regression Trees" by Breiman et al. (1984). If this option is unchecked, the algorithm will produce a child for each possible value of the nominal column.
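For squared-error regression, a well-known shortcut for set-based binary splits is to order the nominal values by their mean target and to test only the boundaries in that ordering. The sketch below illustrates the idea; that the node uses exactly this procedure is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict


def sse(values):
    """Sum of squared errors around the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_nominal_binary_split(categories, ys):
    """Return (set of values sent left, total_sse) for the best binary split."""
    groups = defaultdict(list)
    for c, y in zip(categories, ys):
        groups[c].append(y)
    # Order the nominal values by their mean target value.
    ordered = sorted(groups, key=lambda c: sum(groups[c]) / len(groups[c]))
    best = (None, float("inf"))
    for i in range(1, len(ordered)):
        left = [y for c in ordered[:i] for y in groups[c]]
        right = [y for c in ordered[i:] for y in groups[c]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (set(ordered[:i]), total)
    return best


categories = ["a", "b", "c", "a", "b", "c"]
ys = [1.0, 10.0, 2.0, 1.2, 11.0, 2.2]
left_set, total = best_nominal_binary_split(categories, ys)
print(left_set, total)
```

With k distinct values this checks k - 1 candidate splits instead of all possible subsets. A multiway split would instead create one child per value.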
Missing value handling
Specifies the preferred missing value handling. The following options are available:
  • XGBoost - If this is selected (it is also the default), the learner determines which direction is best suited for missing values by sending the missing values in each direction of a split. The direction that yields the best result (i.e. the largest gain) is then used as the default direction for missing values. This method works with both binary and multiway splits.
  • Surrogate - For each split, this approach calculates alternative splits that best approximate the chosen split. The method was first described in the book "Classification and Regression Trees" by Breiman et al. (1984). NOTE: This method can only be used with binary nominal splits.
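The default-direction idea behind the XGBoost option can be sketched as follows for one binary numeric split. The sketch scores each direction by the summed squared error of the children, which is a simplification of the gain computation; missing values are represented by `None`, and the function names are hypothetical.

```python
def sse(values):
    """Sum of squared errors around the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def default_direction(xs, ys, threshold):
    """Send the rows with a missing attribute value left and right, keep the better side."""
    left = [y for x, y in zip(xs, ys) if x is not None and x <= threshold]
    right = [y for x, y in zip(xs, ys) if x is not None and x > threshold]
    missing = [y for x, y in zip(xs, ys) if x is None]
    err_if_left = sse(left + missing) + sse(right)
    err_if_right = sse(left) + sse(right + missing)
    # The parent error is the same in both cases, so the lower child error
    # corresponds to the larger gain.
    return "left" if err_if_left <= err_if_right else "right"


xs = [1.0, 2.0, None, 10.0, 11.0, None]
ys = [5.0, 6.0, 20.5, 20.0, 21.0, 19.5]
print(default_direction(xs, ys, threshold=2.0))
```

Here the rows with missing values have targets close to those in the right child, so sending them right gives the lower error.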
Alpha
Alpha controls what percentage of the data is considered outliers. The higher Alpha, the smaller the fraction of outliers. If Alpha is set to 1.0, the algorithm considers no point to be an outlier. This is discouraged, however, because outliers can severely distort the regression.
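In Friedman's robust regression variant, Alpha is the quantile of the absolute residuals beyond which a point is treated as an outlier, and the residuals of outliers are clipped to that quantile. The sketch below shows this clipping step under the assumption of a simple nearest-rank quantile; the exact quantile method used by the node is not documented here, and the function names are hypothetical.

```python
import math


def huber_delta(residuals, alpha):
    """Alpha-quantile of the absolute residuals (nearest-rank method)."""
    abs_res = sorted(abs(r) for r in residuals)
    k = max(0, math.ceil(alpha * len(abs_res)) - 1)
    return abs_res[k]


def clipped_residuals(residuals, alpha):
    """Keep residuals up to the threshold, clip larger ones to +/- the threshold."""
    delta = huber_delta(residuals, alpha)
    return [r if abs(r) <= delta else math.copysign(delta, r) for r in residuals]


residuals = [0.5, -1.0, 1.5, -2.0, 40.0]
print(clipped_residuals(residuals, alpha=0.8))  # the residual 40.0 is clipped
print(clipped_residuals(residuals, alpha=1.0))  # nothing is treated as an outlier
```

Clipping limits how strongly a single extreme target value can pull the next tree.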
Data Sampling (Rows)
Sampling the rows is also known as bagging, a very popular ensemble learning strategy. This option controls the sampling of the data rows for each individual tree: if disabled, each tree learner gets the full data set; otherwise each tree is learned on a different data sample. A data fraction of 1 (=100%) chosen "with replacement" is called bootstrapping. For sufficiently large data sets, such a bootstrap sample contains about 63% (roughly 2/3) of the distinct input rows, some of which are replicated multiple times. Rows that are not used in the training of a tree are called out-of-bag.
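The share of distinct rows in a bootstrap sample follows from a short calculation: a given row is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e for large n. The sketch below evaluates this; the function name is hypothetical.

```python
import math


def expected_unique_fraction(n):
    """Expected share of distinct rows in a bootstrap sample of n out of n rows."""
    return 1.0 - (1.0 - 1.0 / n) ** n


for n in (10, 1000, 1000000):
    print(n, expected_unique_fraction(n))

# Limit for large data sets: 1 - 1/e
print(1.0 - math.exp(-1.0))
```

The remaining share of the rows, roughly 37% for large n, is out-of-bag for that tree.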
Attribute Sampling (Columns)
Attribute sampling is also called the random subspace method or attribute bagging. Its most famous application is random forests, but it can also be used for gradient boosted trees. This option specifies the sample size:

All columns (no sampling): Each sample consists of all columns, which corresponds to no sampling at all.

Sample (square root): Use the square root of the total number of attributes as sample size. This method is typically used in random forests.

Sample (linear fraction): Use the specified linear fraction of the total number of attributes as sample size. A linear fraction of 0.5 corresponds to using 50% of all attributes.

Sample (absolute value): Use the specified number as sample size.
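The four modes can be summarized in a small helper. The mode names, the rounding, and the clamping to the range from 1 to the number of attributes are assumptions made for this sketch, not documented behavior of the node.

```python
import math


def attribute_sample_size(n_attributes, mode, value=None):
    """Number of attributes per sample for a given (hypothetical) mode name."""
    if mode == "all":
        size = n_attributes
    elif mode == "sqrt":
        size = round(math.sqrt(n_attributes))
    elif mode == "linear":
        size = round(value * n_attributes)
    elif mode == "absolute":
        size = value
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Assumption: keep at least one attribute and at most all of them.
    return max(1, min(n_attributes, int(size)))


print(attribute_sample_size(100, "all"))
print(attribute_sample_size(100, "sqrt"))
print(attribute_sample_size(100, "linear", 0.5))
print(attribute_sample_size(100, "absolute", 7))
```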

Attribute Selection
In this context attribute selection refers to the scale at which attributes are sampled (per tree vs. per tree node). Note that this only takes effect if attribute sampling is enabled.

Use same set of attributes for each tree: With this option the attributes are sampled per tree. That means that we draw an attribute sample and use it to learn an individual tree, so every node of this tree sees the same attributes.

Use different set of attributes for each tree node: This strategy draws a new attribute sample per tree node. A random forest typically uses this strategy to make the trees more diverse. (Note that diversity is not important for gradient boosted trees, so the effect won't be as large.)

Use static random seed
Choose a seed to get reproducible results.
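The two sampling scales and the effect of a static seed can be illustrated with Python's standard `random` module. The sketch only plans which attributes each tree node would see; the function names and the fixed number of nodes per tree are hypothetical simplifications.

```python
import random


def sample_attributes(attributes, size, rng):
    """Draw a sample of attribute names without replacement."""
    return sorted(rng.sample(attributes, size))


def plan_sampling(attributes, size, n_trees, nodes_per_tree, per_node, seed):
    """Return, for every tree, the attribute sample seen by each of its nodes."""
    rng = random.Random(seed)  # a static seed makes the plan reproducible
    plan = []
    for _ in range(n_trees):
        if per_node:
            # A fresh sample for every tree node.
            tree = [sample_attributes(attributes, size, rng)
                    for _ in range(nodes_per_tree)]
        else:
            # One sample per tree, shared by all of its nodes.
            tree_sample = sample_attributes(attributes, size, rng)
            tree = [tree_sample] * nodes_per_tree
        plan.append(tree)
    return plan


attributes = ["a", "b", "c", "d", "e", "f"]
per_tree = plan_sampling(attributes, 2, n_trees=3, nodes_per_tree=3, per_node=False, seed=42)
per_node = plan_sampling(attributes, 2, n_trees=3, nodes_per_tree=3, per_node=True, seed=42)
print(per_tree)
print(per_node)
```

Calling `plan_sampling` twice with the same seed returns the same plan, which is what makes results reproducible.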

Input Ports

The data to learn from. It must contain at least one numeric target column and either a fingerprint (bitvector) column or another numeric or nominal column.

Output Ports

The trained model.

Views

This node has no views
