Deprecated

**KNIME H2O Machine Learning Integration** version **4.1.1.v202001312017** by **KNIME AG, Zurich, Switzerland**

Learns a Gradient Boosting Machine (GBM) regression model using H2O.

- Target Column
- Select target column. Must be numeric for regression problems.
- Column selection
- Select columns used for model training.
- Ignore constant columns
- Select to ignore constant columns.
- Number of levels (tree depth)
- Specify the maximum tree depth (max_depth).
- Number of models
- Specify the number of trees (ntrees).
- Learning rate
- Specify the learning rate. The range is 0.0 to 1.0 (learn_rate).
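
As a rough illustration of how the basic options above relate to the H2O parameter names given in parentheses, the following sketch collects them into a plain dictionary and checks the documented learning-rate range. The helper `basic_gbm_params` is hypothetical, the values 5, 50 and 0.1 are arbitrary examples rather than the node's defaults, and `ignore_const_cols` is assumed to be the H2O name behind "Ignore constant columns", since the text above does not state it.

```python
def basic_gbm_params(max_depth, ntrees, learn_rate, ignore_const_cols=True):
    """Collect the basic options under their H2O parameter names (hypothetical helper)."""
    # The option list above states a learning-rate range of 0.0 to 1.0.
    if not 0.0 <= learn_rate <= 1.0:
        raise ValueError(f"learn_rate must be in [0.0, 1.0], got {learn_rate}")
    return {
        "max_depth": max_depth,    # Number of levels (tree depth)
        "ntrees": ntrees,          # Number of models
        "learn_rate": learn_rate,  # Learning rate
        "ignore_const_cols": ignore_const_cols,  # assumed name, not given in the text
    }


# Arbitrary example values, not the node's defaults.
params = basic_gbm_params(max_depth=5, ntrees=50, learn_rate=0.1)
print(params)
```

A dictionary like this only mirrors the dialog settings; whether the node passes exactly these keys to H2O is not confirmed by this page.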

- Min (weighted) observations
- Specify the minimum number of observations for a leaf (min_rows).
- Min relative improvement rate
- Specifies the minimum relative improvement in squared error reduction required for a split to happen. When properly tuned, this option can help reduce overfitting. Optimal values would be in the 1e-10 to 1e-3 range (min_split_improvement).
- Row sample rate (per tree)
- Specify the row sampling rate (x-axis, i.e. over rows). The range is 0.0 to 1.0. Higher values may improve training accuracy. Test accuracy improves when either columns or rows are sampled. For details, refer to “Stochastic Gradient Boosting” (sample_rate).
- Class specific sample rate (per tree)
- When building models from imbalanced datasets, this option specifies that each tree in the ensemble should sample from the full training dataset using a per-class-specific sampling rate rather than a global sample factor (as with sample_rate). The range for this option is 0.0 to 1.0. If this option is specified along with sample_rate, then only the first option that GBM encounters will be used (sample_rate_per_class).
- Column sample rate (per tree)
- Specify the column sample rate per tree. This can be a value from 0.0 to 1.0. Note that it is multiplicative with col_sample_rate, so setting both parameters to 0.8, for example, results in 64% of columns being considered at any given node to split (col_sample_rate_per_tree).
- Column sample rate (global)
- Specify the column sampling rate (y-axis, i.e. over columns). The acceptable value range is 0.0 to 1.0. Higher values may improve training accuracy (col_sample_rate).
- Relative change of column sample rate per level
- Changes the column sampling rate as a function of the depth in the tree (col_sample_rate_change_per_level).
- Histogram type
- By default (AUTO), GBM bins from min to max in steps of (max-min)/N. Random split points or quantile-based split points can be selected as well (histogram_type).
- Min number histogram bins (numerical)
- Specify the number of bins for the histogram to build, then split at the best point (nbins).
- Max number root histogram bins (numerical)
- Specify the minimum number of bins at the root level to use to build the histogram. This number will then be decreased by a factor of two per level (nbins_top_level).
- Learn rate annealing
- Reduces the learn_rate by this factor after every tree. So for N trees, GBM starts with learn_rate and ends with learn_rate * learn_rate_annealing^N. For example, instead of using learn_rate=0.01, you can now try learn_rate=0.05 and learn_rate_annealing=0.99. This method would converge much faster with almost the same accuracy. Use caution not to overfit (learn_rate_annealing).
- Distribution
- Specify the distribution (i.e., the loss function). The options are AUTO, bernoulli, multinomial, gaussian, poisson, gamma, laplace, quantile, huber, or tweedie (distribution).
- Max absolute value of leaf node prediction
- When building a GBM classification model, this option reduces overfitting by limiting the maximum absolute value of a leaf node prediction. This option defaults to Double.MAX_VALUE (max_abs_leafnode_pred).
- Bandwidth of Gaussian multiplicative noise
- The bandwidth (sigma) of Gaussian multiplicative noise ~N(1,sigma) for tree node predictions. If this parameter is specified with a value greater than 0, then every leaf node prediction is randomly scaled by a number drawn from a Normal distribution centered around 1 with a bandwidth given by this parameter (pred_noise_bandwidth).
- Quantile alpha
- (Only applicable if Quantile is specified for distribution) Specify the quantile to be used for Quantile Regression (quantile_alpha).
- Tweedie power
- (Only applicable if Tweedie is specified for distribution) Specify the Tweedie power. The range is from 1 to 2. For a normal distribution, enter 0. For Poisson distribution, enter 1. For a gamma distribution, enter 2. For a compound Poisson-gamma distribution, enter a value greater than 1 but less than 2. For more information, refer to Tweedie distribution (tweedie_power).
- Huber alpha
- Specify the desired quantile for Huber/M-regression (the threshold between quadratic and linear loss). This value must be between 0 and 1 (huber_alpha).
- Class specific sampling factors
- Specify the per-class (in lexicographical order) over/under-sampling ratios. By default, these ratios are automatically computed during training to obtain the class balance (class_sampling_factors).
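
Two of the options above involve simple arithmetic that is easy to check by hand: the learn rate after N trees is learn_rate * learn_rate_annealing^N, and the two column sample rates multiply. The sketch below only reproduces those formulas as stated in the descriptions; the function names are hypothetical and no H2O code is called.

```python
def annealed_learn_rate(learn_rate, learn_rate_annealing, n_trees):
    """Learn rate after n_trees trees: learn_rate * learn_rate_annealing ** n_trees."""
    return learn_rate * learn_rate_annealing ** n_trees


def effective_column_fraction(col_sample_rate_per_tree, col_sample_rate):
    """Fraction of columns considered at a node when both rates are set."""
    return col_sample_rate_per_tree * col_sample_rate


# Example from the description: learn_rate=0.05 with learn_rate_annealing=0.99.
for n in (0, 50, 100, 200):
    print(n, round(annealed_learn_rate(0.05, 0.99, n), 5))

# Setting both column sample rates to 0.8 gives roughly 64% of the columns.
print(round(effective_column_fraction(0.8, 0.8), 2))
```

With these example settings the annealed rate falls below the fixed alternative of 0.01 only after well over 100 trees, which is one way to see why the annealed schedule can converge faster early on.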

- Select categorical encoding
- Specify the encoding scheme used for handling categorical features (categorical_encoding).
- Weight column selection
- Select a column to use for the observation weights, which are used for bias correction (weights_column).
- Offset column selection
- Specify a column to use as the offset. Note: Offsets are per-row “bias values” that are used during model training (offset_column).
- Max Runtime?
- Maximum allowed runtime in seconds for model training (max_runtime_secs).
- Use static random seed
- Select to use static seed for randomization.
- Early Stopping?
- Select to activate early stopping.
- Stopping metric
- Specify the metric to use for early stopping (stopping_metric).
- Stopping tolerance
- Specify the relative tolerance for metric-based stopping: training stops if the improvement is less than this value (stopping_tolerance).
- Number of last seen rows for moving average
- Stops training when the option selected for stopping_metric doesn’t improve for the specified number of training rounds, based on a simple moving average. To disable this feature, specify 0. The metric is computed on the validation data (if provided); otherwise, training data is used (stopping_rounds).
- Size of validation set (in %)
- Specify the size of the validation data set used to evaluate early stopping criteria.
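
The interplay of stopping_rounds and stopping_tolerance described above can be sketched as a moving-average check. This is a simplified illustration and not H2O's actual implementation: it assumes a positive metric where lower is better (such as RMSE or deviance), and the function name `should_stop` and the exact comparison rule are assumptions made for the example.

```python
def should_stop(history, stopping_rounds, stopping_tolerance):
    """Simplified moving-average early stopping (lower metric values are better).

    history: metric values, one per scoring round, oldest first.
    Returns True if the best of the last `stopping_rounds` moving averages
    improved on the reference moving average by less than the tolerance.
    """
    k = stopping_rounds
    if k == 0 or len(history) < 2 * k:
        return False  # disabled, or not enough scoring rounds yet
    recent = history[-2 * k:]
    # k + 1 simple moving averages of window length k over the recent rounds.
    averages = [sum(recent[i:i + k]) / k for i in range(k + 1)]
    reference, best_recent = averages[0], min(averages[1:])
    return best_recent > reference * (1.0 - stopping_tolerance)


improving = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]
plateau = [1.0, 0.5, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]
print(should_stop(improving, stopping_rounds=3, stopping_tolerance=0.01))
print(should_stop(plateau, stopping_rounds=3, stopping_tolerance=0.01))
```

In this sketch the steadily improving series keeps training, while the series that has flattened out triggers a stop; setting stopping_rounds to 0 disables the check, matching the description above.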

- Table to H2O (53 %)
- H2O Partitioning (13 %)
- Parameter Optimization Loop Start (13 %)
- Partitioning (7 %)
- ~~H2O Cross Validation Loop Start~~ (7 %) Deprecated
- H2O Generalized Low Rank Models (Missing Value Impute) (7 %)

- H2O Predictor (Regression) (77 %)
- H2O Model to MOJO (10 %)
- ~~Variable to Table Column~~ (3 %) Streamable, Deprecated
- Logistic Regression Learner (3 %)
- Joiner (3 %)
- ~~H2O Cluster Assigner~~ (3 %) Deprecated

To use this node in KNIME, install KNIME H2O Machine Learning Integration from the following update site:

KNIME 4.1
