H2O Random Forest Learner

Deprecated

KNIME H2O Machine Learning Integration version 4.1.0.v201911271254 by KNIME AG, Zurich, Switzerland

Learns a Distributed Random Forest (DRF) classification model, which is a special version of the random forest* algorithm provided by H2O.

(*) RANDOM FORESTS is a registered trademark of Minitab, LLC and is used with Minitab’s permission.

Options

General Settings

Target Column
Select target column. Must be nominal for classification problems.
Column selection
Select columns used for model training.
Ignore constant columns
Select to ignore constant columns.
Number of levels (tree depth)
Specify the maximum tree depth (max_depth).
Number of models
Specify the number of trees (ntrees).
Use static random seed
Select to use a static seed for randomization, which makes training runs reproducible.
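
As a rough illustration, the general settings above could be collected under their H2O parameter names as follows. This is a minimal sketch: max_depth and ntrees are the names given in the option descriptions, while ignore_const_cols and seed are the parameter names used by H2O's own API and are an assumption about how this node maps its options. All values are illustrative only.

```python
# Hypothetical mapping of the node's general settings to H2O DRF parameters.
# The values are examples, not recommendations or the node's defaults.
general_settings = {
    "ignore_const_cols": True,  # "Ignore constant columns" (assumed H2O name)
    "max_depth": 20,            # "Number of levels (tree depth)"
    "ntrees": 50,               # "Number of models"
    "seed": 42,                 # "Use static random seed" (assumed H2O name)
}

# With the h2o Python package installed and a cluster running, a dictionary
# like this could be passed as keyword arguments, for example:
#   from h2o.estimators import H2ORandomForestEstimator
#   model = H2ORandomForestEstimator(**general_settings)
for name, value in sorted(general_settings.items()):
    print(f"{name} = {value}")
```

Fixing the seed only matters when reproducibility is needed. Leaving it unset lets each training run draw different row and column samples.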

Algorithm Settings

Min (weighted) observations
Specify the minimum number of observations for a leaf (min_rows).
Min relative improvement rate
Specify the minimum relative improvement in squared error reduction required for a split to happen. When properly tuned, this option can help reduce overfitting. Optimal values would be in the 1e-10...1e-3 range (min_split_improvement).
Row sample rate (per tree)
Specify the row sampling rate (x-axis). The range is 0.0 to 1.0. Higher values may improve training accuracy. Test accuracy improves when either columns or rows are sampled. For details, refer to “Stochastic Gradient Boosting” (sample_rate).
Class specific sample rate (per tree)
When building models from imbalanced datasets, this option specifies that each tree in the ensemble should sample from the full training dataset using a per-class-specific sampling rate rather than a global sample factor (as with sample_rate). The range for this option is 0.0 to 1.0. If this option is specified along with sample_rate, then only the first option that DRF encounters will be used (sample_rate_per_class).
Column sample rate (per tree)
Specify the column sample rate per tree. This can be a value from 0.0 to 1.0. Note that it is multiplicative with col_sample_rate, so setting both parameters to 0.8, for example, results in 64% of columns being considered at any given node to split (col_sample_rate_per_tree).
Column sample rate (global)
Specify the column sampling rate (y-axis). The acceptable value range is 0.0 to 1.0. Higher values may improve training accuracy (col_sample_rate).
Relative change of column sample rate per level
Changes the column sampling rate as a function of the depth in the tree (col_sample_rate_change_per_level).
Histogram type
By default (AUTO), DRF bins from min...max in steps of (max-min)/N. Random split points or quantile-based split points can be selected as well (histogram_type).
Min number histogram bins (numerical)
Specify the number of bins for the histogram to build, then split at the best point (nbins).
Max number root histogram bins (numerical)
Specify the minimum number of bins at the root level to use to build the histogram. This number will then be decreased by a factor of two per level (nbins_top_level).
Number of bins histogram (categorical)
Specify the maximum number of bins for the histogram to build, then split at the best point. Higher values can lead to more overfitting. The levels are ordered alphabetically; if there are more levels than bins, adjacent levels share bins. This value has a more significant impact on model fitness than nbins. Larger values may increase runtime, especially for deep trees and large clusters, so tuning may be required to find the optimal value for your configuration (nbins_cats).
M Tries
Specify the number of columns to randomly select at each level. If the default value of -1 is used, the number of variables is the square root of the number of columns for classification and p/3 for regression (where p is the number of predictors). Valid values are -1 or any value >= 1 (mtries).
Binominal double trees
(Binary classification only) Build twice as many trees (one per class). Enabling this option can lead to higher accuracy, while disabling can result in faster model building. This option is disabled by default (binomial_double_trees).
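
The arithmetic behind three of the options above can be sketched in a few lines of Python. The function names are hypothetical and exist only for illustration. Two details are assumptions not stated in the descriptions: fractional results for the default mtries are rounded down with a minimum of 1, and the number of bins per level never drops below nbins.

```python
import math


def effective_column_fraction(col_sample_rate_per_tree, col_sample_rate):
    """Fraction of columns considered at a node when both rates are set."""
    for rate in (col_sample_rate_per_tree, col_sample_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("sample rates must lie in the range 0.0 to 1.0")
    # The two rates are multiplicative.
    return col_sample_rate_per_tree * col_sample_rate


def default_mtries(num_predictors, classification=True):
    """Columns tried per split when mtries is left at -1."""
    if classification:
        value = math.isqrt(num_predictors)  # square root, rounded down (assumption)
    else:
        value = num_predictors // 3         # p/3, rounded down (assumption)
    return max(1, value)                    # at least one column (assumption)


def bins_at_depth(nbins_top_level, nbins, depth):
    """Histogram bins at a tree level, halving once per level from the root."""
    # Flooring the result at nbins is an assumption of this sketch.
    return max(nbins, nbins_top_level // (2 ** depth))


print(effective_column_fraction(0.8, 0.8))
print(default_mtries(100), default_mtries(100, classification=False))
print([bins_at_depth(1024, 20, d) for d in range(8)])
```

With both column rates at 0.8 the first function reproduces the 64% figure from the description of the per-tree column sample rate.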

Advanced Settings

Select categorical encoding
Specify the encoding scheme used for handling categorical features (categorical_encoding).
Weight column selection
Select a column to use for the observation weights, which are used for bias correction (weights_column).
Max Runtime?
Maximum allowed runtime in seconds for model training (max_runtime_secs).
Early Stopping?
Select to activate early stopping.
Stopping metric
Specify the metric to use for early stopping (stopping_metric).
Stopping tolerance
Specify the relative tolerance for the metric-based stopping to stop training if the improvement is less than this value (stopping_tolerance).
Number of last seen rows for moving average
Stops training when the option selected for stopping_metric doesn’t improve for the specified number of training rounds, based on a simple moving average. To disable this feature, specify 0. The metric is computed on the validation data (if provided); otherwise, training data is used (stopping_rounds).
Size of validation set (in %)
Specify the size of the validation dataset used to evaluate early stopping criteria.
Balance classes?
Oversample the minority classes to balance the class distribution. This option is not enabled by default and can increase the data frame size. This option is only applicable for classification (balance_classes).
Define max number of rows after balancing
This specifies the maximum relative size of the training data after balancing class counts (max_after_balance_size).
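
The interplay of stopping_rounds and stopping_tolerance can be illustrated with a simplified sketch. It is not H2O's exact implementation: it assumes a metric where lower is better (such as logloss) and simply compares the average of the last stopping_rounds scores with the average of the stopping_rounds scores before them. The function name should_stop is hypothetical.

```python
def should_stop(scores, stopping_rounds, stopping_tolerance):
    """Return True if training should stop early (simplified illustration).

    scores: metric values per scoring round, oldest first, lower is better.
    """
    k = stopping_rounds
    # A value of 0 disables early stopping; two full windows are needed to compare.
    if k == 0 or len(scores) < 2 * k:
        return False
    recent = sum(scores[-k:]) / k             # moving average of the last k rounds
    previous = sum(scores[-2 * k:-k]) / k     # moving average of the k rounds before
    if previous == 0:
        return True                           # a perfect score cannot improve further
    relative_improvement = (previous - recent) / abs(previous)
    return relative_improvement < stopping_tolerance


history = [0.9, 0.7, 0.60, 0.60, 0.599, 0.599]
print(should_stop(history, stopping_rounds=2, stopping_tolerance=0.01))
print(should_stop([0.9, 0.7, 0.5, 0.4], stopping_rounds=2, stopping_tolerance=0.01))
```

In the first call the moving average improves by well under 1%, so the sketch reports that training would stop. In the second call the metric is still improving strongly, so training would continue.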

Input Ports

H2O Frame with training data.

Output Ports

H2O Distributed Random Forest classification model.

Installation

To use this node in KNIME, install KNIME H2O Machine Learning Integration from the following update site:

KNIME 4.1