H2O Gradient Boosting Machine Learner (Regression)

Learns a Gradient Boosting Machine (GBM) regression model using H2O.

Options

General Settings

Target Column
Select the target column. It must be numeric for regression problems.
Column selection
Select columns used for model training.
Ignore constant columns
Select to ignore constant columns.
Number of levels (tree depth)
Specify the maximum tree depth (max_depth).
Number of models
Specify the number of trees (ntrees).
Learning rate
Specify the learning rate. The range is 0.0 to 1.0 (learn_rate).
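
The options above correspond to parameters of H2O's GBM estimator. As a rough illustration only (not what the node executes internally), a minimal sketch using H2O's Python API could look like the following; the file name and column names are placeholders:

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Placeholder training data; in KNIME this arrives as an H2O Frame at the input port.
train = h2o.import_file("train.csv")

gbm = H2OGradientBoostingEstimator(
    max_depth=5,     # "Number of levels (tree depth)"
    ntrees=50,       # "Number of models"
    learn_rate=0.1,  # "Learning rate", range 0.0 to 1.0
)

# The target column must be numeric for regression; the remaining columns are the predictors.
predictors = [c for c in train.columns if c != "target"]
gbm.train(x=predictors, y="target", training_frame=train)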

Algorithm Settings

Min (weighted) observations
Specify the minimum number of observations for a leaf (min_rows).
Min relative improvement rate
Specify the minimum relative improvement in squared error reduction required for a split to happen. When properly tuned, this option can help reduce overfitting. Optimal values are typically in the 1e-10 to 1e-3 range (min_split_improvement).
Row sample rate (per tree)
Specify the row sampling rate per tree. The range is 0.0 to 1.0. Higher values may improve training accuracy. Test accuracy improves when either columns or rows are sampled. For details, refer to “Stochastic Gradient Boosting” (sample_rate).
Column sample rate (per tree)
Specify the column sample rate per tree. This can be a value from 0.0 to 1.0. Note that it is multiplicative with col_sample_rate, so setting both parameters to 0.8, for example, results in 64% of columns being considered at any given node to split (col_sample_rate_per_tree).
Relative change of column sample rate per level
Specify the relative change of the column sampling rate as a function of the depth in the tree (col_sample_rate_change_per_level).
Histogram type
By default (AUTO) GBM bins from min...max in steps of (max-min)/N. Random split points or quantile-based split points can be selected as well (histogram_type).
Number of histogram bins (numerical)
Specify the number of bins for the histogram to build, then split at the best point (nbins).
Number of histogram bins (categorical)
Specify the number of bins for the histogram to build, then split at the best point. Higher values can lead to more overfitting. The levels are ordered alphabetically; if there are more levels than bins, adjacent levels share bins. This value has a more significant impact on model fitness than nbins. Larger values may increase runtime, especially for deep trees and large clusters, so tuning may be required to find the optimal value for your configuration (nbins_cats).
Number of root histogram bins (numerical)
Specify the number of bins at the root level to use to build the histogram. This number will then be decreased by a factor of two per level, whereby nbins controls when to stop dividing (nbins_top_level).
Column sample rate (global)
Specify the column sampling rate. The acceptable value range is 0.0 to 1.0. Higher values may improve training accuracy (col_sample_rate).
Learning rate annealing
Specify a factor by which to reduce the learn_rate after every tree. For N trees, GBM starts with learn_rate and ends with learn_rate * learn_rate_annealing^N. For example, instead of using learn_rate=0.01, you can use learn_rate=0.05 with learn_rate_annealing=0.99. This method should converge much faster with almost the same accuracy. Use caution not to overfit (learn_rate_annealing); a short numeric example appears in the sketch after this section.
Distribution
Specify the distribution (i.e., the loss function) (distribution).
Max absolute value of leaf node prediction
This option reduces overfitting by limiting the maximum absolute value of a leaf node prediction. If set to 0, it is treated as Double.MAX_VALUE (max_abs_leafnode_pred).
Bandwidth of Gaussian multiplicative noise
The bandwidth (sigma) of Gaussian multiplicative noise ~N(1,sigma) for tree node predictions. If this parameter is specified with a value greater than 0, then every leaf node prediction is randomly scaled by a number drawn from a normal distribution centered around 1 with a bandwidth given by this parameter (pred_noise_bandwidth).
Quantile Alpha
(Only applicable if Quantile is specified as distribution) Specify the quantile to be used for Quantile Regression (quantile_alpha).
Tweedie Power
(Only applicable if Tweedie is specified as distribution) Specify the Tweedie power. The range is from 1 to 2 (exclusive) (tweedie_power).
Huber Alpha
(Only applicable if Huber is specified as distribution) Specify the desired quantile for Huber/M-regression (the threshold between quadratic and linear loss). This value must be between 0 and 1 (huber_alpha).
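
To make the sampling, annealing, and distribution-specific options above more concrete, here is a hedged sketch of the corresponding H2O Python parameters; the values are arbitrary examples, not recommendations:

from h2o.estimators import H2OGradientBoostingEstimator

gbm = H2OGradientBoostingEstimator(
    ntrees=200,
    learn_rate=0.05,
    learn_rate_annealing=0.99,     # rate applied to tree k is roughly 0.05 * 0.99^k
    sample_rate=0.8,               # row sample rate per tree
    col_sample_rate=0.8,           # column sample rate
    col_sample_rate_per_tree=0.8,  # multiplicative with col_sample_rate: 0.8 * 0.8 = 64% of columns
    min_rows=10,                   # minimum (weighted) observations per leaf
    min_split_improvement=1e-5,    # minimum relative squared-error reduction for a split
    distribution="huber",          # loss function; enables huber_alpha below
    huber_alpha=0.9,               # threshold quantile between quadratic and linear loss
)

For the annealing example: with learn_rate=0.05 and learn_rate_annealing=0.99, the rate applied to the 200th tree is roughly 0.05 * 0.99^200 ≈ 0.0067.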

Advanced Settings

Select categorical encoding
Specify the encoding scheme used for handling categorical features (categorical_encoding).
Early Stopping
Select to activate early stopping.
Stopping metric
Specify the metric to use for early stopping (stopping_metric).
Stopping tolerance
Specify the relative tolerance for the metric-based stopping to stop training if the improvement is less than this value (stopping_tolerance).
Number of last seen rows for moving average
Stops training when the metric selected for stopping_metric doesn’t improve for the specified number of training rounds, based on a simple moving average. To disable this feature, specify 0. The metric is computed on the validation data (if provided); otherwise, training data is used (stopping_rounds). See the configuration sketch at the end of this section.
Size of validation set (in %)
Specify the size of the validation dataset used to evaluate early stopping criteria.
Max runtime in seconds
Maximum allowed runtime in seconds for model training (max_runtime_secs).
Use static random seed
Select to use a static seed for randomization.
Weights column (optional)
Select a column to use for the observation weights, which are used for bias correction (weights_column).
Offset column selection
Specify a column to use as the offset. Note: Offsets are per-row “bias values” that are used during model training (offset_column).
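
The early-stopping and reproducibility options above roughly correspond to the following H2O parameters. This is a sketch with arbitrary values; the weights and offset column names are hypothetical, and the node's validation-set split would amount to passing a separate validation_frame to train():

from h2o.estimators import H2OGradientBoostingEstimator

gbm = H2OGradientBoostingEstimator(
    ntrees=1000,                  # upper bound; early stopping usually ends training sooner
    stopping_rounds=5,            # moving-average window; 0 disables early stopping
    stopping_metric="deviance",   # metric monitored for improvement
    stopping_tolerance=1e-3,      # minimum relative improvement to continue training
    max_runtime_secs=3600,        # hard wall-clock limit for training
    seed=42,                      # static seed for reproducible sampling
    categorical_encoding="auto",  # encoding scheme for categorical features
    weights_column="row_weight",  # hypothetical column holding observation weights
    offset_column="row_offset",   # hypothetical column holding per-row offsets
)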

Input Ports

H2O Frame with training data.

Output Ports

Variable importance in tabular format.
H2O Gradient Boosting Machine regression model.
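
For reference, the variable importance delivered at the first output port corresponds to what H2O exposes on a trained model; a minimal sketch in the Python API:

# After gbm.train(...) has finished:
varimp = gbm.varimp(use_pandas=True)  # columns: variable, relative/scaled/percentage importance
print(varimp.head())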

Views

This node has no views
