
**KNIME Extension for Apache Spark core infrastructure** version **4.3.1.v202101261633** by **KNIME AG, Zurich, Switzerland**

Gradient Boosted Trees are ensembles of Decision Trees. Learning a Gradient Boosted Trees model means
training a sequence of Decision Trees one-by-one, in order to minimize a loss function. This node uses the
spark.ml Gradient Boosted Trees
implementation to train a classification model in Spark, using a
logistic loss function.
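
To make the sequential training concrete, here is a minimal sketch in plain Python. It is not the spark.ml implementation, whose exact loss formulation and tree learner may differ. It assumes a single numeric feature, labels encoded as 0/1, and depth-1 regression trees (stumps); the names `fit_stump` and `train_gbt` are hypothetical. Each round fits a stump to the negative gradient of the logistic loss and adds it to the ensemble, scaled by a learning rate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, targets):
    """Fit a depth-1 regression tree (one threshold) by least squares.

    Assumes xs contains at least two distinct values.
    """
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((t - left_mean) ** 2 for t in left)
        error += sum((t - right_mean) ** 2 for t in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def train_gbt(xs, ys, num_models=20, learning_rate=0.1):
    """Train stumps one by one; return a function that predicts P(label == 1)."""
    stumps = []
    scores = [0.0] * len(xs)  # raw ensemble output per training row
    for _ in range(num_models):
        # Negative gradient of the logistic loss with respect to the score.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump(x) for s, x in zip(scores, xs)]

    def predict_proba(x):
        return sigmoid(sum(learning_rate * stump(x) for stump in stumps))

    return predict_proba


model = train_gbt([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
print(model(2), model(5))
```

Because every stump depends on the scores produced by the stumps before it, the rounds cannot run in parallel, which is why training time grows with the number of models.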

Note that only binary classification is supported. The target column must be nominal (with two distinct values), whereas the feature columns can be either nominal or numerical.

Use the
*Spark Predictor (Classification)*
node to apply the learned model to unseen data.

Please refer to the Spark documentation for a full description of the underlying algorithm.

*This node requires at least Apache Spark 2.0.*

- Target column
- A nominal column that contains the labels to train with. Note that the spark.ml algorithm only supports binary classification, hence the target column can only have two distinct values. Rows with missing values in this column will be ignored during model training.
- Feature Columns
- The feature columns to learn the model with. Both nominal and numeric columns are supported. The dialog lets you select the columns manually (by moving them to the right panel) or via a wildcard/regex selection (all columns whose names match the wildcard/regex are used for learning). For manual selection, the behavior for new columns (i.e., columns that are not available when you configure the node) can be set to either Enforce exclusion (new columns are excluded and therefore not used for learning) or Enforce inclusion (new columns are included and therefore used for learning).
- Number of models
- The number of Decision Tree models in the ensemble model. Increasing this number makes the model more expressive and improves training data accuracy. However, increasing it too much may lead to overfitting. Also, increasing this number directly increases the time required to train the ensemble, because the trees need to be trained sequentially.
- Quality measure
- Measure to use for information gain calculation when evaluating splits. Available methods are "gini" (recommended) or "entropy". For more details on the available methods see the Spark documentation.
- Max tree depth
- Maximum depth of the Decision Trees. Must be >= 1.
- Min rows per tree node
- Minimum number of rows each tree node must have. If a split causes the left or right child node to have fewer rows, the split will be discarded as invalid. Must be >= 1.
- Min information gain per split
- Minimum information gain for a split to be considered.
- Max number of bins
- Number of bins to use when discretizing continuous features. Increasing the number of bins means that the algorithm will consider more split candidates and make more fine-grained decisions on how to split. However, it also increases the amount of computation and communication that needs to be performed and hence increases training time. Additionally, the number of bins must be at least the maximum number of distinct values for any nominal feature.
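
The interplay of "Quality measure", "Min rows per tree node" and "Min information gain per split" can be sketched as follows. This is an illustration using the textbook definitions of Gini impurity and entropy, not the node's or Spark's actual code; the function names and the exact comparison against the thresholds are assumptions.

```python
import math
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def information_gain(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


def is_valid_split(parent, left, right, min_rows_per_node=1,
                   min_info_gain=0.0, impurity=gini):
    # A split is discarded if either child would have too few rows ...
    if len(left) < min_rows_per_node or len(right) < min_rows_per_node:
        return False
    # ... or if it does not reduce impurity enough (assumed: gain >= threshold).
    return information_gain(parent, left, right, impurity) >= min_info_gain


parent = ["yes", "yes", "no", "no"]
left, right = ["yes", "yes"], ["no", "no"]
print(information_gain(parent, left, right, gini))
print(is_valid_split(parent, left, right, min_rows_per_node=3))
```

Raising either threshold prunes more candidate splits, which tends to produce smaller trees.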

- Learning rate
- Learning rate in the interval (0, 1] that shrinks the contribution of each Decision Tree in the ensemble. This parameter rarely needs tuning. If training seems unstable, decreasing this value may improve stability.
- Data sampling (rows)
- Sampling the rows is also known as bagging, a very popular ensemble learning strategy. If sampling is disabled (default), then each Decision Tree is trained on the full data set. Otherwise each tree is trained with a different data sample that contains the configured fraction of rows of the original data.
- Feature sampling
- Feature sampling is also called the random subspace method or attribute bagging. Its best-known application is Random Forests, but it can also be used for Gradient Boosted Trees. This option specifies the sample size for each split at a tree node:
  - *Auto*: If "Number of models" is one, this is the same as "All"; otherwise "Square root" is used.
  - *All (default)*: Each sample contains all features.
  - *Square root*: Sample size is sqrt(number of features).
  - *Log2*: Sample size is log2(number of features).
  - *One third*: Sample size is 1/3 of the features.
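
As a rough illustration of the feature sampling options, the sketch below computes how many features would be considered at each split. The function name and the strategy strings are hypothetical, and rounding fractional sizes up to the next integer is an assumption; the node's exact rounding is not stated here.

```python
import math


def feature_sample_size(strategy, num_features, num_models):
    """Return the number of features sampled per split for a given strategy."""
    if strategy == "auto":
        # Auto falls back to "all" for a single model, else "sqrt".
        strategy = "all" if num_models == 1 else "sqrt"
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return math.ceil(math.sqrt(num_features))
    if strategy == "log2":
        return max(1, math.ceil(math.log2(num_features)))
    if strategy == "onethird":
        return math.ceil(num_features / 3)
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("auto", "all", "sqrt", "log2", "onethird"):
    print(name, feature_sample_size(name, num_features=16, num_models=10))
```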

- Use static random seed
- Seed for generating random numbers. Randomness is used when sampling rows and features, as well as binning numeric features during splitting.
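
The effect of a static seed on row sampling can be sketched in plain Python. This is a simplified stand-in, assuming sampling without replacement and a hypothetical helper name `sample_rows_per_tree`; how Spark actually draws its samples may differ.

```python
import random


def sample_rows_per_tree(rows, fraction, num_models, seed=None):
    """Draw one row sample per tree; a fixed seed makes the draws repeatable."""
    rng = random.Random(seed)
    sample_size = max(1, round(fraction * len(rows)))
    return [rng.sample(rows, sample_size) for _ in range(num_models)]


rows = list(range(100))
first_run = sample_rows_per_tree(rows, fraction=0.3, num_models=3, seed=42)
second_run = sample_rows_per_tree(rows, fraction=0.3, num_models=3, seed=42)
print(first_run == second_run)
```

With the same seed, both runs draw identical samples, so repeated training on the same data yields the same model. Without a seed, each run draws different samples.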

- Table with estimates of the importance of each feature. The features are listed in order of decreasing importance and are normalized to sum up to 1.
- Spark ML Gradient Boosted Trees model (classification)
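
The layout of the feature importance table (normalized to sum to 1, sorted by decreasing importance) corresponds to a computation like the following sketch. The raw scores and feature names are made-up illustration values, and `normalize_importances` is a hypothetical helper, not part of KNIME or Spark.

```python
def normalize_importances(raw):
    """Scale raw importance scores to sum to 1 and sort them in descending order."""
    total = sum(raw.values())
    if total == 0:
        raise ValueError("all importances are zero")
    normalized = {name: value / total for name, value in raw.items()}
    return sorted(normalized.items(), key=lambda item: item[1], reverse=True)


# Made-up raw scores for three hypothetical feature columns.
table = normalize_importances({"age": 2.0, "income": 6.0, "city": 2.0})
for name, importance in table:
    print(name, importance)
```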


To use this node in KNIME, install KNIME Extension for Apache Spark from the following update site:

KNIME 4.3
