XGBoost Tree Ensemble Learner (Regression)

Learns a tree-based XGBoost model for regression. XGBoost is a popular machine learning library that is based on the idea of boosting. Check out the official documentation for tutorials on how XGBoost works. Since XGBoost requires its features to be single precision floats, we automatically cast double precision values to float, which can cause problems for extreme values.
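For instance, a number that is representable as a double may overflow or lose precision when cast down to single precision. A minimal NumPy sketch (illustrative only, not part of the node):

  import numpy as np

  x64 = np.array([1e39, 1.0000000001], dtype=np.float64)   # fine as 64-bit doubles
  x32 = x64.astype(np.float32)                              # what XGBoost effectively sees
  print(x32)   # [inf 1.] -- 1e39 overflows float32, and the small offset in 1.0000000001 is lost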

Options

Objective
One of
  • linear
  • logistic
  • gamma
  • poisson
  • tweedie
Tweedie regression variance
Controls the variance of the Tweedie distribution. Must lie in the range (1, 2); the default is 1.5.
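In terms of the underlying XGBoost library, the objective and the Tweedie variance power are ordinary training parameters. A sketch with the XGBoost Python API (illustrative only; the exact objective aliases, e.g. reg:squarederror vs. the older reg:linear, depend on the XGBoost version):

  import numpy as np
  import xgboost as xgb

  X = np.random.rand(100, 4)
  y = np.random.rand(100) + 0.1          # strictly positive target for tweedie/gamma/poisson
  dtrain = xgb.DMatrix(X, label=y)

  params = {
      "objective": "reg:tweedie",        # alternatives: reg:squarederror, reg:logistic,
                                         # reg:gamma, count:poisson
      "tweedie_variance_power": 1.5,     # only used by the tweedie objective, range (1, 2)
  }
  booster = xgb.train(params, dtrain, num_boost_round=100)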
Target column
The column containing the regression target.
Weight column
The column containing the row weights (also called sample weights or instance weights). Note that the selected column must not contain missing values.
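Row weights correspond to instance weights in the underlying library; a minimal sketch with the XGBoost Python API (illustrative only, not the node's internal code):

  import numpy as np
  import xgboost as xgb

  X = np.random.rand(100, 4)
  y = np.random.rand(100)
  w = np.random.rand(100)                # one non-missing weight per row

  # The weights scale each row's contribution to the training loss.
  dtrain = xgb.DMatrix(X, label=y, weight=w)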
Feature columns
Allows you to select which columns are used as features in training. Note that the domain of nominal features must contain the possible values; otherwise the node can't be executed. Use the Domain Calculator node to calculate any missing possible value sets.
Boosting rounds
The number of models to train in the boosting ensemble.
Base score
The initial prediction score of all instances; this global bias will have little effect for a sufficiently large number of iterations.
Use static random seed
If checked, the seed displayed in the text field is used as seed for randomized operations such as sampling. Otherwise a new seed is generated for each node execution.
Manual number of threads
Allows you to specify the number of threads to use for training. If the checkbox is not selected, the number of available cores is used.
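As a rough sketch of how these options correspond to XGBoost library parameters (Python API shown for illustration; the node's internal parameter passing may differ):

  import numpy as np
  import xgboost as xgb

  X = np.random.rand(200, 3)
  y = X.sum(axis=1)
  dtrain = xgb.DMatrix(X, label=y)

  params = {
      "objective": "reg:squarederror",
      "base_score": 0.5,     # initial prediction for every instance (global bias)
      "seed": 42,            # static random seed for randomized operations such as sampling
      "nthread": 4,          # manual number of threads; defaults to all available cores
  }
  booster = xgb.train(params, dtrain, num_boost_round=50)   # 50 boosting rounds = 50 trees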

Booster

Eta
Also known as learning rate. Step size shrinkage used in updates in order to prevent overfitting. A smaller Eta value results in a more conservative boosting process.
Lambda
L2 regularization term on leaf weights. Increasing this value will make the model more conservative.
Alpha
L1 regularization term on leaf weights. Increasing this value will make the model more conservative.
Gamma
Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger Gamma is, the more conservative the algorithm will be.
Maximum delta step
Maximum delta step we allow each leaf output to be. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when classes are extremely imbalanced. Setting it to a value of 1-10 might help control the update.
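These regularization options correspond to standard XGBoost booster parameters; a sketch of the parameter dictionary using the library's Python naming (illustrative only):

  params = {
      "eta": 0.1,              # learning rate / step size shrinkage
      "lambda": 1.0,           # L2 regularization on leaf weights
      "alpha": 0.0,            # L1 regularization on leaf weights
      "gamma": 0.0,            # minimum loss reduction required for a further split
      "max_delta_step": 0,     # 0 means no constraint on the leaf output update
  }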
Booster
Select either the default tree booster or the DART booster.
Maximum depth
Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 indicates no limit. Note that a limit is required when the grow policy is set to depthwise.
Minimum child weight
Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. The larger min_child_weight is, the more conservative the algorithm will be.
Tree method
The tree construction algorithm used in XGBoost. Can be one of
  • Auto: Use a heuristic to choose the fastest method.
  • Exact: Exact greedy algorithm.
  • Approx: Approximate greedy algorithm using quantile sketch and gradient histogram.
  • Hist: Fast histogram optimized approximate greedy algorithm. It uses some performance improvements such as bin caching.
Sketch Epsilon
Only used for the approximate tree method. Usually does not have to be set manually, but consider setting it to a lower value for a more accurate enumeration of split candidates.
Scale positive weight
Controls the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances).
Grow policy
Controls the way new nodes are added to the trees. Currently only supported for tree method hist. One of
  • Depthwise: Split at nodes closest to the root.
  • Lossguide: Split at nodes with highest loss change.
Maximum number of leaves
Maximum number of nodes to be added. Only relevant for grow policy lossguide.
Maximum number of bins
Only used for tree method hist. Maximum number of discrete bins to bucket continuous features. Increasing this number improves the optimality of splits at the cost of higher computation time.
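A sketch of the corresponding tree construction parameters in XGBoost's Python naming (illustrative only; sketch_eps exists only in older XGBoost versions):

  params = {
      "tree_method": "hist",         # auto, exact, approx, or hist
      "max_depth": 6,                # tree depth limit; 0 disables the limit
      "min_child_weight": 1.0,       # minimum hessian sum required in a child
      "grow_policy": "lossguide",    # depthwise or lossguide; lossguide requires tree_method=hist
      "max_leaves": 64,              # only relevant for grow_policy=lossguide
      "max_bin": 256,                # number of histogram bins; only for tree_method=hist
      "sketch_eps": 0.03,            # quantile sketch accuracy; only for tree_method=approx
      "scale_pos_weight": 1.0,       # balance of positive and negative weights
  }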
Sample type
Only relevant for DART booster. Uniform will drop trees uniformly while weighted will drop trees in proportion to weight.
Normalize type
Only relevant for DART booster.
  • Tree: New trees have the same weight as each of the dropped trees. Weights of new trees are 1 / (k + eta). Dropped trees are scaled by a factor of k / (k + eta).
  • Forest: New trees have the same weight as the sum of the dropped trees. Weights of new trees are 1 / (1 + eta). Dropped trees are scaled by a factor of 1 / (1 + eta).
Dropout rate
Only relevant for DART booster. Fraction of previous trees to drop during the dropout.
Drop at least one tree
Only relevant for DART booster. When this flag is enabled, at least one tree is always dropped during the dropout.
Skip dropout rate
Only relevant for DART booster. Probability of skipping the dropout procedure during a booster iteration. If a dropout is skipped, new trees are added in the same manner as for the vanilla tree booster. Note that a non-zero skip rate has a higher priority than the "drop at least one tree" flag.
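The DART options above map onto XGBoost's DART booster parameters; a sketch in the library's Python naming (illustrative only):

  params = {
      "booster": "dart",
      "sample_type": "uniform",      # or "weighted": drop trees in proportion to their weight
      "normalize_type": "tree",      # or "forest"
      "rate_drop": 0.1,              # dropout rate: fraction of previous trees to drop
      "one_drop": 1,                 # always drop at least one tree during a dropout
      "skip_drop": 0.0,              # probability of skipping the dropout in an iteration
  }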
Subsampling rate
Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees. This is equivalent to bagging and can help to reduce overfitting. Subsampling will occur once in every boosting iteration.
Column sampling rate by tree
Subsample ratio of columns/features when constructing each tree. Subsampling will occur once in every boosting iteration.
Column sampling rate by level
Subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the current tree.
Column sampling rate by node
Subsample ratio of columns for each node (split). Subsampling occurs once every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the current level.
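A sketch of the row and column subsampling parameters in XGBoost's Python naming (illustrative only):

  params = {
      "subsample": 0.8,              # row subsampling rate, drawn once per boosting round
      "colsample_bytree": 0.8,       # column subsampling per tree
      "colsample_bylevel": 1.0,      # column subsampling per depth level, within the tree's columns
      "colsample_bynode": 1.0,       # column subsampling per split, within the level's columns
  }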

Input Ports

The data to learn from.

Output Ports

The trained model.

The feature importance measures for the training features. If the values are missing, then the feature isn't used by the model at all.
  • Feature name column: The column containing feature names.
  • Weight column: The weight of a feature is the number of times the feature is used to split the data across all trees.
  • Gain column: The gain is the average gain across all splits the feature is used in. A higher value compared to another feature implies it is more important for generating a prediction.
  • Cover column: The cover of a feature is the average coverage across all splits the feature is used in.
  • Total gain column: The total gain sums up the gain across all splits the feature is used in.
  • Total cover column: The total cover sums up the coverage across all splits the feature is used in.
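The same importance measures are exposed by the XGBoost library itself; a sketch with the Python API for comparison (illustrative only, on toy data):

  import numpy as np
  import xgboost as xgb

  X = np.random.rand(200, 3)
  y = 2 * X[:, 0] + X[:, 1]
  booster = xgb.train({"objective": "reg:squarederror"},
                      xgb.DMatrix(X, label=y), num_boost_round=20)

  for importance_type in ("weight", "gain", "cover", "total_gain", "total_cover"):
      # Features that are never used for a split do not appear in the returned dict.
      print(importance_type, booster.get_score(importance_type=importance_type))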

Views

This node has no views
