XGBoost Tree Ensemble Learner (Regression)

Learns a tree-based XGBoost model for regression. XGBoost is a popular machine learning library based on the idea of boosting. Check out the official documentation for tutorials on how XGBoost works. Because XGBoost requires its features to be single-precision floats, the node automatically casts double-precision values to float, which can cause problems for extreme values.
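
The effect of the cast can be checked outside KNIME. The sketch below is a minimal illustration, assuming the cast behaves like a standard IEEE 754 double-to-float conversion. The helper name to_float32 is hypothetical and not part of KNIME or XGBoost.

```python
import math
import struct


def to_float32(value):
    """Round a Python float (double precision) to the nearest single-precision value."""
    try:
        return struct.unpack("f", struct.pack("f", value))[0]
    except OverflowError:
        # struct refuses finite values beyond the float32 range. An IEEE 754
        # cast would saturate to infinity instead, so that is mimicked here.
        return math.copysign(math.inf, value)


print(to_float32(16777217.0))  # integers above 2**24 are no longer all exact
print(to_float32(1e39))        # beyond the float32 maximum (about 3.4e38)
print(to_float32(1e-46))       # below the smallest positive float32
```

If a feature contains values like these, rescaling it before training may avoid the loss of information.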

Options

Objective

One of

linear
logistic
gamma
poisson
tweedie

Tweedie regression variance

Controls the variance of the Tweedie distribution. Must be in the range (1, 2) and is by default set to 1.5.
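
As a rough illustration of what this parameter changes: for a Tweedie distribution the variance is proportional to the mean raised to the variance power. The sketch below assumes a dispersion of 1, and the function name tweedie_variance is hypothetical.

```python
def tweedie_variance(mean, power, dispersion=1.0):
    """Variance of a Tweedie distribution: dispersion * mean ** power."""
    return dispersion * mean ** power


# Compare how quickly the variance grows with the mean for several powers.
for power in (1.1, 1.5, 1.9):
    variances = [round(tweedie_variance(m, power), 2) for m in (1.0, 4.0, 16.0)]
    print(power, variances)
```

With a power close to 1 the variance grows roughly like the mean, as for a Poisson distribution. Close to 2 it grows roughly like the squared mean, as for a gamma distribution.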

Target column

The column containing the regression target.

Weight column

The column containing the row weights (also called sample weights or instance weights). Note that the selected column must not contain missing values.

Feature columns

Selects the columns to use as features in training. Note that the domain of nominal features must contain the possible values; otherwise the node can't be executed. Use the Domain Calculator node to calculate any missing possible value sets.

Boosting rounds

The number of models to train in the boosting ensemble.

Base score

The initial prediction score of all instances; this global bias will have little effect for a sufficiently large number of iterations.

Use static random seed

If checked, the seed displayed in the text field is used as seed for randomized operations such as sampling. Otherwise a new seed is generated for each node execution.

Manual number of threads

Specifies the number of threads to use for training. If the checkbox is not selected, the number of available cores is used.

Booster

Eta

Also known as learning rate. Step size shrinkage used in updates in order to prevent overfitting. A smaller Eta value results in a more conservative boosting process.
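
The interplay of Boosting rounds, Base score and Eta can be sketched with a toy booster. This is a simplified illustration using unregularized depth-1 trees (stumps) and squared error, not XGBoost's actual algorithm. All function names and the data are made up for the example.

```python
def fit_stump(xs, residuals):
    """Find the single split on xs that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def boost(xs, ys, rounds, eta, base_score):
    """Start from base_score and add `rounds` stumps, each shrunk by eta."""
    predictions = [base_score] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        predictions = [p + eta * (left_value if x <= threshold else right_value)
                       for x, p in zip(xs, predictions)]
    return predictions


def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 1.2, 3.8, 4.1, 4.0, 7.9, 8.2]

for rounds in (1, 10, 100):
    for base_score in (0.5, 10.0):
        preds = boost(xs, ys, rounds, eta=0.3, base_score=base_score)
        print(rounds, base_score, round(mean_squared_error(ys, preds), 4))
```

In this toy setting the mean residual shrinks by a factor of (1 - eta) in every round, so the influence of the base score on the average prediction fades geometrically. A smaller eta slows this down and typically calls for more boosting rounds.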

Lambda

L2 regularization term on leaf weights. Increasing this value makes the model more conservative.

Alpha

L1 regularization term on leaf weights. Increasing this value makes the model more conservative.

Gamma

Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger Gamma is, the more conservative the algorithm will be.
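
How Lambda, Alpha and Gamma make the model more conservative can be sketched with the leaf weight and split gain formulas from the XGBoost paper. This is a sketch under that assumption: the library's internal scaling may differ, and the function names are hypothetical. The gradient and hessian sums are the sums of first and second derivatives of the loss over the rows in a node.

```python
def soft_threshold(gradient_sum, alpha):
    """Shrink the gradient sum toward zero by alpha (effect of the L1 term)."""
    if gradient_sum > alpha:
        return gradient_sum - alpha
    if gradient_sum < -alpha:
        return gradient_sum + alpha
    return 0.0


def leaf_weight(gradient_sum, hessian_sum, reg_lambda, reg_alpha=0.0):
    """Leaf output; lambda enlarges the denominator, alpha shrinks the numerator."""
    return -soft_threshold(gradient_sum, reg_alpha) / (hessian_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma):
    """Loss reduction of a split minus gamma; a split is only worthwhile if positive."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


print(leaf_weight(-10.0, 5.0, reg_lambda=0.0))                 # unregularized
print(leaf_weight(-10.0, 5.0, reg_lambda=5.0))                 # smaller weight
print(leaf_weight(-10.0, 5.0, reg_lambda=0.0, reg_alpha=4.0))  # smaller weight
print(split_gain(-6.0, 3.0, 6.0, 3.0, reg_lambda=0.0, gamma=0.0))
print(split_gain(-6.0, 3.0, 6.0, 3.0, reg_lambda=3.0, gamma=7.0))  # negative: no split
```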

Maximum delta step

Maximum delta step allowed for each leaf output. A value of 0 means there is no constraint. A positive value can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced. Setting it to a value between 1 and 10 might help control the update.

Booster

Select either the default tree booster or the DART booster.

Maximum depth

Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 indicates no limit. Note that a limit is required when the grow policy is set to depthwise.

Minimum child weight

Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. The larger min_child_weight is, the more conservative the algorithm will be.
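
What the sum of instance weight means depends on the objective. The sketch below assumes the usual second derivatives: 1 per row for squared error, and p * (1 - p) for a logistic loss. The function names are hypothetical, and row weights are assumed to multiply the hessians.

```python
import math


def hessian_squared_error():
    """Second derivative of 0.5 * (prediction - label) ** 2 is constant."""
    return 1.0


def hessian_logistic(margin):
    """Second derivative of the logistic loss with respect to the raw margin."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return p * (1.0 - p)


def child_is_allowed(hessians, row_weights, min_child_weight):
    """A child is kept only if its weighted hessian sum reaches the minimum."""
    total = sum(w * h for w, h in zip(row_weights, hessians))
    return total >= min_child_weight


# With squared error and unit row weights the sum is just the row count.
three_rows = [hessian_squared_error()] * 3
print(child_is_allowed(three_rows, [1.0, 1.0, 1.0], min_child_weight=3.0))
print(child_is_allowed(three_rows, [1.0, 1.0, 1.0], min_child_weight=4.0))

# With a logistic loss, confident predictions contribute very little.
print(hessian_logistic(0.0), hessian_logistic(6.0))
```

Under these assumptions, the parameter acts as a minimum number of rows per leaf for the linear objective with unit weights.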

Tree method

The tree construction algorithm used in XGBoost. Can be one of

Auto: Uses a heuristic to choose the fastest method.
Exact: Exact greedy algorithm.
Approx: Approximate greedy algorithm using quantile sketch and gradient histogram.
Hist: Fast histogram optimized approximate greedy algorithm. It uses some performance improvements such as bin caching.

Sketch Epsilon

Only used for the approximate tree method. Usually does not have to be set manually, but consider setting it to a lower value for a more accurate enumeration of split candidates.

Scale positive weight

Controls the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances).

Grow policy

Controls the way new nodes are added to the trees. Currently only supported for tree method hist. One of

Depthwise: Split at nodes closest to the root.
Lossguide: Split at nodes with highest loss change.

Maximum number of leaves

Maximum number of nodes to be added. Only relevant for grow policy lossguide.

Maximum number of bins

Only used for tree method hist. Maximum number of discrete bins to bucket continuous features. Increasing this number improves the optimality of splits at the cost of higher computation time.
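
The idea of bucketing a continuous feature can be sketched as follows. This is a simplified illustration using plain quantiles. XGBoost itself builds its bins with a weighted quantile sketch, so the edges it produces may differ. The function names are hypothetical.

```python
import bisect


def quantile_bin_edges(values, max_bins):
    """Return at most max_bins - 1 increasing edges taken at evenly spaced quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    edges = []
    for i in range(1, max_bins):
        edge = ordered[(i * n) // max_bins]
        # Skip repeated edges so that heavily tied values do not create empty bins.
        if not edges or edge > edges[-1]:
            edges.append(edge)
    return edges


def bin_index(value, edges):
    """Index of the bin a value falls into; only the edges are split candidates."""
    return bisect.bisect_right(edges, value)


values = list(range(100))
edges = quantile_bin_edges(values, max_bins=4)
print(edges)
print([bin_index(v, edges) for v in (0, 30, 99)])
```

More bins mean more candidate split points per feature, which is why both split quality and computation time grow with this setting.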

Sample type

Only relevant for DART booster. Uniform will drop trees uniformly while weighted will drop trees in proportion to weight.

Normalize type

Only relevant for DART booster.

Tree: New trees have the same weight as each of the dropped trees. Weights of new trees are 1 / (k + eta). Dropped trees are scaled by a factor of k / (k + eta).
Forest: New trees have the same weight as the sum of the dropped trees. Weights of new trees are 1 / (1 + eta). Dropped trees are scaled by a factor of 1 / (1 + eta).
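
The two normalization schemes can be written out directly from the factors above, where k is the number of dropped trees and eta is the learning rate. The function name dart_scaling is hypothetical.

```python
def dart_scaling(normalize_type, num_dropped, eta):
    """Return (weight of the new tree, scale factor applied to dropped trees)."""
    k = num_dropped
    if normalize_type == "tree":
        return 1.0 / (k + eta), k / (k + eta)
    if normalize_type == "forest":
        return 1.0 / (1.0 + eta), 1.0 / (1.0 + eta)
    raise ValueError("normalize_type must be 'tree' or 'forest'")


for kind in ("tree", "forest"):
    new_weight, dropped_factor = dart_scaling(kind, num_dropped=3, eta=0.5)
    print(kind, round(new_weight, 4), round(dropped_factor, 4))
```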

Dropout rate

Only relevant for DART booster. Fraction of previous trees to drop during the dropout.

Drop at least one tree

Only relevant for DART booster. When this flag is enabled, at least one tree is always dropped during the dropout.

Skip dropout rate

Only relevant for DART booster. Probability of skipping the dropout procedure during a boosting iteration. If a dropout is skipped, new trees are added in the same manner as for the vanilla tree booster. Note that a non-zero skip rate has a higher priority than the "drop at least one tree" flag.
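
How Dropout rate, Drop at least one tree and Skip dropout rate interact in one boosting iteration can be sketched as below. This is a minimal illustration of the priority rule, assuming the uniform sample type. The function name and its parameters are hypothetical, not the library's implementation.

```python
import random


def select_dropped_trees(num_trees, rate_drop, skip_drop, one_drop, rng):
    """Return the indices of the previous trees dropped in this iteration."""
    # Skipping is decided first, so it overrides the "at least one" flag.
    if rng.random() < skip_drop:
        return []
    # Uniform sample type: every tree is dropped with the same probability.
    dropped = [i for i in range(num_trees) if rng.random() < rate_drop]
    if not dropped and one_drop and num_trees > 0:
        dropped = [rng.randrange(num_trees)]
    return dropped


rng = random.Random(0)
print(select_dropped_trees(10, rate_drop=0.2, skip_drop=0.0, one_drop=True, rng=rng))
print(select_dropped_trees(10, rate_drop=0.2, skip_drop=1.0, one_drop=True, rng=rng))
```

With a skip rate of 1 no tree is ever dropped, even though the flag is enabled.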

Subsampling rate

Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees. This is equivalent to bagging and can help to reduce overfitting. Subsampling will occur once in every boosting iteration.

Column sampling rate by tree

Subsample ratio of columns/features when constructing each tree. Subsampling will occur once in every boosting iteration.

Column sampling rate by level

Subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the current tree.

Column sampling rate by node

Subsample ratio of columns for each node (split). Subsampling occurs once every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the current level.
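
The three column sampling rates act cumulatively: each stage samples from the columns that survived the previous one. The sketch below illustrates this with 64 features and a rate of 0.5 at every stage. The rounding rule (truncate, but keep at least one column) is an assumption for the example, and the function name is hypothetical.

```python
import random


def sample_columns(columns, rate, rng):
    """Sample a fraction of the given columns without replacement."""
    count = max(1, int(rate * len(columns)))
    return sorted(rng.sample(columns, count))


rng = random.Random(42)
all_columns = list(range(64))

tree_columns = sample_columns(all_columns, 0.5, rng)     # once per tree
level_columns = sample_columns(tree_columns, 0.5, rng)   # once per depth level
node_columns = sample_columns(level_columns, 0.5, rng)   # once per split

print(len(tree_columns), len(level_columns), len(node_columns))  # 32 16 8
```

With all three rates at 0.5, only 8 of the 64 features are candidates at any single split.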

Input Ports

The data to learn from.

Output Ports

The trained model.

The feature importance measures for the training features. If the values are missing, then this indicates that the feature isn't used by the model at all.

Feature name column: The column containing feature names.
Weight column: The weight of a feature is the number of times a feature is used to split the data across all trees.
Gain column: The gain is the average gain across all splits the feature is used in. A higher value than that of another feature implies that the feature is more important for generating a prediction.
Cover column: The cover of a feature is the average coverage across all splits the feature is used in.
Total gain column: The total gain sums up the gain across all splits the feature is used in.
Total cover column: The total cover sums up the total coverage across all splits the feature is used in.
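
The relationship between the five measures can be sketched by aggregating over a list of splits. The example below is an illustration with made-up data: each split is a hypothetical (feature, gain, cover) tuple, and the function name is not part of KNIME or XGBoost.

```python
from collections import defaultdict


def feature_importance(splits):
    """Aggregate per-split gain and cover into the five importance measures."""
    totals = defaultdict(lambda: {"weight": 0, "total_gain": 0.0, "total_cover": 0.0})
    for feature, gain, cover in splits:
        totals[feature]["weight"] += 1
        totals[feature]["total_gain"] += gain
        totals[feature]["total_cover"] += cover

    result = {}
    for feature, t in totals.items():
        result[feature] = {
            "weight": t["weight"],
            "gain": t["total_gain"] / t["weight"],
            "cover": t["total_cover"] / t["weight"],
            "total_gain": t["total_gain"],
            "total_cover": t["total_cover"],
        }
    return result


splits = [("age", 10.0, 100.0), ("age", 4.0, 40.0), ("income", 6.0, 60.0)]
importance = feature_importance(splits)
for feature, measures in importance.items():
    print(feature, measures)

# A feature that never appears in a split has no entry at all, which
# corresponds to the missing values described above.
print("height" in importance)
```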

Views

This node has no views

Installation

To use this node in KNIME, install the extension KNIME XGBoost Integration from the below update site following our NodePit Product and Node Installation Guide:

v5.5

A zipped version of the software site can be downloaded here.

Plugin provider: KNIME AG, Zurich, Switzerland

Plugin version: 5.5.0.v202412191419

On NodePit since: 2025-07-02

Last update: 2025-07-21

KNIME versions: Since v4.6
