Spark Decision Tree Learner (Regression)

This node uses the spark.ml implementation to train a decision tree regression model in Spark. The underlying algorithm performs a recursive binary partitioning of the feature space. At each tree node, the split that maximizes the information gain is chosen from a set of candidate splits. Information gain is calculated with a variance-based quality measure. The target column must be numerical, whereas the feature columns can be either nominal or numerical.
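To make the variance-based quality measure concrete, the following is a minimal sketch in plain Python rather than the node's or Spark's actual code. It assumes the usual definition for regression trees: the gain of a split is the variance of the parent node minus the row-weighted variances of the two child nodes. The function name variance_gain and the sample values are hypothetical.

```python
from statistics import pvariance


def variance_gain(parent, left, right):
    """Information gain of a binary split under a variance impurity.

    Assumed definition: parent variance minus the child variances,
    each weighted by the fraction of parent rows it receives.
    """
    n = len(parent)
    weighted_children = (
        (len(left) / n) * pvariance(left)
        + (len(right) / n) * pvariance(right)
    )
    return pvariance(parent) - weighted_children


# Hypothetical target values of the rows that reach one tree node.
targets = [1.0, 1.0, 5.0, 5.0]

# Splitting between the two groups removes all variance from the children.
gain = variance_gain(targets, targets[:2], targets[2:])
print(gain)
```

A split that separates the low targets from the high targets leaves both child nodes with zero variance, so the gain equals the full parent variance.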

Use the Spark Predictor (Regression) node to apply the learned model to unseen data.

Please refer to the Spark documentation for a full description of the underlying algorithm.

This node requires at least Apache Spark 2.0.

Options

Settings

Target column: A numeric column that contains the values to train with. Rows with missing values in this column will be ignored during model training.
Feature Columns: The feature columns to learn the model with. Both nominal and numeric columns are supported. In the dialog, you can select the columns manually (by moving them to the right panel) or via a wildcard/regex selection (all columns whose names match the wildcard/regex are used for learning). In case of manual selection, the behavior for new columns (i.e. columns that are not available at the time you configure the node) can be specified as either Enforce exclusion (new columns are excluded and therefore not used for learning) or Enforce inclusion (new columns are included and therefore used for learning).
Max tree depth: Maximum depth of the decision tree. Must be >= 1.
Min rows per tree node: Minimum number of rows each tree node must have. If a split causes the left or right child node to have fewer rows, the split will be discarded as invalid. Must be >= 1.
Min information gain per split: Minimum information gain for a split to be considered.
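The interaction between Min rows per tree node and Min information gain per split can be sketched as follows. This is an illustrative plain-Python search over a single numeric feature, not Spark's implementation. The function best_split, its parameters min_rows and min_gain, and the sample data are hypothetical names chosen for this example, and the variance-based gain is assumed to be the parent variance minus the row-weighted child variances.

```python
from statistics import pvariance


def best_split(values, targets, min_rows=1, min_gain=0.0):
    """Return (threshold, gain) of the best valid split, or None if there is none."""
    rows = sorted(zip(values, targets))
    n = len(rows)
    parent_impurity = pvariance([t for _, t in rows])
    best = None
    for i in range(1, n):
        if rows[i - 1][0] == rows[i][0]:
            continue  # no threshold can separate equal feature values
        left = [t for _, t in rows[:i]]
        right = [t for _, t in rows[i:]]
        if len(left) < min_rows or len(right) < min_rows:
            continue  # a child node would have too few rows: split is invalid
        gain = (
            parent_impurity
            - (len(left) / n) * pvariance(left)
            - (len(right) / n) * pvariance(right)
        )
        if gain < min_gain:
            continue  # the split does not improve the tree enough
        threshold = (rows[i - 1][0] + rows[i][0]) / 2
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best


feature = [1.0, 2.0, 3.0, 4.0]
target = [1.0, 1.0, 5.0, 5.0]

unconstrained = best_split(feature, target)
too_few_rows = best_split(feature, target, min_rows=3)
gain_too_small = best_split(feature, target, min_gain=5.0)
print(unconstrained, too_few_rows, gain_too_small)
```

With only four rows, requiring three rows per child leaves no valid split, and a minimum gain above the parent variance rejects every candidate. In both cases the tree node would remain a leaf.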

Advanced

Max number of bins: Number of bins to use when discretizing continuous features. Increasing the number of bins means that the algorithm will consider more split candidates and make more fine-grained decisions on how to split. However, it also increases the amount of computation and communication that needs to be performed and hence increases training time. Additionally, the number of bins must be at least the maximum number of distinct values for any nominal feature.
Use static random seed: Seed for generating random numbers. Randomness is used when binning numeric features during splitting.
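The effect of Max number of bins can be illustrated with a simplified sketch. It assumes equal-frequency binning over the full column, whereas Spark derives its bins from a sample of the data, so this is an approximation of the idea and not the actual procedure. The helper names candidate_thresholds and max_bins_is_sufficient are hypothetical.

```python
def candidate_thresholds(values, max_bins):
    """Pick at most max_bins - 1 split thresholds for one numeric feature."""
    distinct = sorted(set(values))
    if len(distinct) <= max_bins:
        # Few distinct values: every midpoint between neighbours is a candidate.
        return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    ordered = sorted(values)
    n = len(ordered)
    # Otherwise take values at roughly equal-frequency positions.
    cuts = {ordered[(k * n) // max_bins] for k in range(1, max_bins)}
    return sorted(cuts)


def max_bins_is_sufficient(max_bins, nominal_cardinalities):
    """Check that every nominal feature's distinct values fit into max_bins."""
    return all(max_bins >= count for count in nominal_cardinalities)


coarse = candidate_thresholds(list(range(100)), max_bins=4)
fine = candidate_thresholds(list(range(100)), max_bins=10)
print(coarse)
print(len(fine))
print(max_bins_is_sufficient(32, [5, 40]))
```

More bins yield more candidate thresholds per feature, which is why training time grows with this setting. The second helper mirrors the stated constraint: a nominal feature with 40 distinct values does not fit into 32 bins.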

Input Ports

Input Spark DataFrame with training data.

Output Ports

Table with estimates of the importance of each feature. The features are listed in order of decreasing importance, and the importances are normalized to sum to 1. Note that feature importances for single decision trees can have high variance due to correlated predictor variables. Consider using the Spark Random Forest Learner to determine feature importance instead.
Spark ML Decision Tree model (regression).
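The layout of the feature importance table (normalized to sum to 1, sorted by decreasing importance) can be mimicked with a short sketch. The raw scores and feature names below are made-up values for illustration, and normalize_importances is a hypothetical helper, not part of the node or of Spark.

```python
def normalize_importances(raw):
    """Scale raw importance scores to sum to 1 and sort them in decreasing order."""
    total = sum(raw.values())
    if total == 0:
        # No split used any feature: all importances stay at zero.
        return [(name, 0.0) for name in raw]
    scaled = [(name, score / total) for name, score in raw.items()]
    return sorted(scaled, key=lambda pair: pair[1], reverse=True)


# Hypothetical raw scores, e.g. the total gain contributed by each feature.
raw_scores = {"age": 1.0, "income": 3.0, "region": 0.0}

table = normalize_importances(raw_scores)
for name, importance in table:
    print(name, importance)
```

Because the values are relative shares, they are only comparable within one model, and a feature with importance 0 was simply never chosen for a split in that tree.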

Popular Predecessors

CSV to Spark: 33 %
Spark Category To Number: 33 %
Table to Spark: 33 %

Views

This node has no views

Installation

To use this node in KNIME, install the extension KNIME Extension for Apache Spark (legacy) from the update site below, following the NodePit Product and Node Installation Guide:

v5.6

A zipped version of the software site can be downloaded here.

Plugin provider: KNIME AG, Zurich, Switzerland

Plugin version: 5.6.0.v202507151409

On NodePit since: 2025-08-15

Last update: 2025-08-21

KNIME versions: Since v4.0
