Spark Random Forest Learner (Regression)

A random forest* is an ensemble of decision trees. Learning a random forest model means training a set of independent decision trees in parallel. This node uses Spark's random forest implementation to train a regression model. The target column must be numeric, whereas the feature columns can be either nominal or numeric.

Use the Spark Predictor (Regression) node to apply the learned model to unseen data.

Please refer to the Spark documentation for a full description of the underlying algorithm.

This node requires at least Apache Spark 2.0.

(*) RANDOM FORESTS is a registered trademark of Minitab, LLC and is used with Minitab’s permission.



Target column
A numeric column that contains the values to train with. Rows with missing values in this column will be ignored during model training.
Feature Columns
The feature columns to learn the model with. Both nominal and numeric columns are supported. The dialog lets you select the columns manually (by moving them to the right panel) or via a wildcard/regex selection (all columns whose names match the wildcard/regex are used for learning). In case of manual selection, the behavior for new columns (i.e. columns that are not available at the time you configure the node) can be specified as either Enforce exclusion (new columns are excluded and therefore not used for learning) or Enforce inclusion (new columns are included and therefore used for learning).
Number of models
The number of decision trees in the forest. Increasing this number will make the random forest model less likely to overfit, but also directly increase training time.
Max tree depth
Maximum depth of the decision trees. Must be >= 1.
Min rows per tree node
Minimum number of rows each tree node must have. If a split causes the left or right child node to have fewer rows, the split will be discarded as invalid. Must be >= 1.
Min information gain per split
Minimum information gain for a split to be considered. Note that for regression, the information gain is always calculated with a variance-based quality measure.
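The two split constraints above can be illustrated with a minimal sketch. It assumes the variance-based gain is the parent's variance minus the row-weighted variance of the two children; the function names are hypothetical and this is not the node's actual code.

```python
from statistics import pvariance


def variance_gain(parent, left, right):
    """Variance reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * pvariance(left) + (len(right) / n) * pvariance(right)
    )
    return pvariance(parent) - weighted_children


def split_is_valid(parent, left, right, min_rows=1, min_gain=0.0):
    """Apply 'Min rows per tree node' and 'Min information gain per split'."""
    # Discard splits that leave either child with too few rows.
    if len(left) < min_rows or len(right) < min_rows:
        return False
    # Discard splits whose gain falls below the configured minimum.
    return variance_gain(parent, left, right) >= min_gain
```

For example, splitting the target values [1, 2, 9, 10] into [1, 2] and [9, 10] removes almost all of the variance, so the split passes unless the minimum gain or the minimum row count is set very high.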
Max number of bins
Number of bins to use when discretizing continuous features. Increasing the number of bins means that the algorithm will consider more split candidates and make more fine-grained decisions on how to split. However, it also increases the amount of computation and communication that needs to be performed and hence increases training time. Additionally, the number of bins must be at least the maximum number of distinct values for any nominal feature.
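The last constraint (the number of bins must cover the largest nominal feature) can be checked up front. The sketch below is an illustration only; the helper names and the dictionary layout (column name mapped to its values) are assumptions, not part of the node.

```python
def min_required_bins(nominal_columns):
    """Largest number of distinct values found in any nominal column."""
    return max((len(set(values)) for values in nominal_columns.values()), default=0)


def check_max_bins(max_bins, nominal_columns):
    """Return max_bins if it is large enough, otherwise raise ValueError."""
    required = min_required_bins(nominal_columns)
    if max_bins < required:
        raise ValueError(
            f"Max number of bins is {max_bins}, but a nominal feature "
            f"has {required} distinct values"
        )
    return max_bins
```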


Data sampling (rows)
Sampling the rows is also known as bagging, a very popular ensemble learning strategy. If sampling is disabled (default), then each decision tree is trained on the full data set. Otherwise each tree is trained with a different data sample that contains the configured fraction of rows of the original data.
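A minimal sketch of row sampling, assuming each tree draws its sample without replacement and with an exact sample size. Spark's own sampling may differ in both respects, and the function name is hypothetical.

```python
import random


def sample_rows_per_tree(rows, fraction, num_trees, seed=42):
    """Draw one row sample per tree, each holding `fraction` of the rows."""
    rng = random.Random(seed)  # a fixed seed makes the samples reproducible
    sample_size = max(1, round(fraction * len(rows)))
    # Each call to rng.sample advances the generator, so trees get different samples.
    return [rng.sample(rows, sample_size) for _ in range(num_trees)]
```

With a fraction of 0.5 and 100 input rows, every tree in this sketch is trained on 50 rows, and rerunning with the same seed yields the same samples.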
Feature sampling
Feature sampling is also called random subspace method or attribute bagging. This option specifies the sample size for each split at a tree node:
  • Auto (default): If "Number of models" is one, then this is the same as "All", otherwise "Square root" will be used.
  • All: Each sample contains all features.
  • Square root: Sample size is sqrt(number of features).
  • Log2: Sample size is log2(number of features).
  • One third: Sample size is 1/3 of the features.
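The strategies listed above can be written as a small function. This is a sketch: the strategy keys are hypothetical labels for the dialog options, and rounding fractional sizes up to the next integer is an assumption.

```python
import math


def feature_sample_size(strategy, num_features, num_models):
    """Number of features considered at each split for a given strategy."""
    if strategy == "auto":
        # Auto as described above: "all" for a single model, else "sqrt".
        strategy = "all" if num_models == 1 else "sqrt"
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return math.ceil(math.sqrt(num_features))
    if strategy == "log2":
        return max(1, math.ceil(math.log2(num_features)))
    if strategy == "onethird":
        return math.ceil(num_features / 3)
    raise ValueError(f"Unknown feature sampling strategy: {strategy}")
```

With 16 feature columns, "sqrt" and "log2" both consider 4 features per split, while "onethird" considers 6.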
Use static random seed
Seed for generating random numbers. Randomness is used when sampling rows and features, as well as when binning numeric features during splitting.

Input Ports

Input Spark DataFrame with training data.

Output Ports

Table with estimates of the importance of each feature. The features are listed in order of decreasing importance and are normalized to sum up to 1.
Spark ML random forest model (regression)
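The layout of the feature importance table (normalized to sum to 1, sorted by decreasing importance) can be reproduced with a short sketch. The raw scores and the function name are hypothetical, and how the raw importances are computed is not covered here.

```python
def importance_table(raw_importances):
    """Normalize raw scores to sum to 1 and sort by decreasing importance."""
    total = sum(raw_importances.values())
    if total <= 0:
        raise ValueError("At least one feature needs a positive importance")
    normalized = {name: value / total for name, value in raw_importances.items()}
    # Sort (feature, importance) pairs so the most important feature comes first.
    return sorted(normalized.items(), key=lambda item: item[1], reverse=True)
```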


This node has no views



