Spark Linear Regression Learner

This node uses the spark.ml linear regression implementation to train a linear regression model in Spark, supporting several regularization options. The target column must be numeric, whereas the feature columns can be either nominal or numeric.

Use the Spark Predictor (Regression) node to apply the learned model to unseen data.

Please refer to the Spark documentation for a full description of the underlying algorithm.

This node requires at least Apache Spark 2.4.
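Conceptually, the node configures and fits a spark.ml LinearRegression estimator using the settings described under Options. The following PySpark sketch only illustrates the underlying API; the data, column names and parameter values are illustrative assumptions and are not produced by the node.

    # Minimal PySpark sketch of the underlying spark.ml call (illustrative only).
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.getOrCreate()

    # Toy training data: two numeric feature columns and a numeric target.
    train_df = spark.createDataFrame(
        [(1.0, 2.0, 3.5), (2.0, 0.5, 4.1), (3.0, 1.5, 6.8)],
        ["x1", "x2", "target"],
    )

    # spark.ml expects all features assembled into a single vector column.
    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")

    lr = LinearRegression(
        featuresCol="features",
        labelCol="target",
        loss="squaredError",   # or "huber"
        regParam=0.0,          # 0.0 = no regularization (ordinary least squares)
        elasticNetParam=0.0,   # L1/L2 mixing; only relevant if regParam > 0
        standardization=True,
    )

    model = lr.fit(assembler.transform(train_df))
    print(model.coefficients, model.intercept)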

Options

Settings

Target column
The numeric column that contains the target values to learn, also known as the dependent variable.
Feature Columns
The feature columns to learn the model with. Both nominal and numeric columns are supported; for nominal columns, dummy variables are automatically created as described in the section Categorical variables and regression. The dialog allows you to select the columns either manually (by moving them to the right panel) or via a wildcard/regex selection (all columns whose names match the wildcard/regex are used for learning). In case of manual selection, the behavior for new columns (i.e. columns that are not available at the time you configure the node) can be specified as either Enforce exclusion (new columns are excluded and therefore not used for learning) or Enforce inclusion (new columns are included and therefore used for learning).
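As an illustration of the dummy-variable encoding, the sketch below shows a standard spark.ml way to encode a nominal column before assembling the feature vector. It approximates the idea rather than the node's actual implementation, and the column names ("color", "x1", "target") are placeholders.

    # Hypothetical encoding of a nominal column "color" into dummy variables
    # (Spark 3.x API; on Spark 2.4 use OneHotEncoderEstimator instead).
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

    spark = SparkSession.builder.getOrCreate()
    nominal_df = spark.createDataFrame(
        [("red", 1.0, 3.5), ("blue", 2.0, 4.1), ("red", 3.0, 6.8)],
        ["color", "x1", "target"],
    )

    indexer = StringIndexer(inputCol="color", outputCol="color_idx")
    encoder = OneHotEncoder(inputCols=["color_idx"], outputCols=["color_vec"])
    assembler = VectorAssembler(inputCols=["color_vec", "x1"], outputCol="features")

    encoding = Pipeline(stages=[indexer, encoder, assembler])
    encoded_df = encoding.fit(nominal_df).transform(nominal_df)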
Loss function
The supported loss functions are Squared Error (default) and Huber. Please refer to the Spark documentation for more information.
Standardize features
Whether to standardize the training features before fitting the model. Note that the model coefficients are always returned on the original scale.
Regularizer
The purpose of the regularizer is to encourage simple models and avoid overfitting. The supported types of regularization are:
  • None (a.k.a. ordinary least squares)
  • Ridge Regression (L2) using a given regularization parameter
  • Lasso (L1) using a given regularization parameter
  • Elastic Net (L1+L2) using a given regularization and Elastic Net parameter
Regularization parameter
Defines the strength of the regularization penalty; larger values penalize large coefficients more strongly.
Elastic net parameter
Defines the mixing parameter between L1 and L2 regularization. 0 corresponds to L2 regularization. 1 corresponds to L1 regularization. For values in (0,1), the penalty is a combination of L1 and L2.
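For reference, the spark.ml documentation combines the two settings into a single penalty: with regularization parameter λ (the Regularization parameter) and elastic net parameter α (this setting), the regularization term added to the loss is

    \lambda \left( \alpha \, \lVert w \rVert_1 + \frac{1 - \alpha}{2} \, \lVert w \rVert_2^2 \right),
    \qquad \alpha \in [0, 1], \; \lambda \ge 0

With α = 0 this reduces to pure L2 (Ridge), with α = 1 to pure L1 (Lasso), and λ = 0 disables regularization entirely.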
Missing Values in Input Columns
Defines how rows with missing values in the target and feature columns should be handled:
  • Ignore: Ignores the entire row during model training if any of the input columns contain a missing value.
  • Fail: Aborts the node execution with an error if any of the input columns contain a missing value.
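In DataFrame terms, the Ignore option corresponds roughly to dropping incomplete rows before fitting. A sketch, assuming the placeholder train_df and column names from the examples above:

    # "Ignore": drop every row that has a missing value in any used column.
    used_cols = ["x1", "x2", "target"]   # placeholder column names
    clean_df = train_df.dropna(how="any", subset=used_cols)

    # "Fail": detect missing values up front and abort instead of dropping rows.
    if clean_df.count() < train_df.count():
        raise ValueError("Input contains missing values in the used columns")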

Advanced

Solver
The solver algorithm used for optimization:
  • Auto (default) means that the solver algorithm is selected automatically.
  • Limited-memory BFGS (L-BFGS), a quasi-Newton optimization method that approximates the Hessian using only a limited amount of memory.
  • Normal Equation uses a normal equation solver.
Maximum iterations
The maximum number of optimization iterations, unless the algorithm terminates earlier because the Convergence tolerance is reached.
Convergence tolerance
The convergence tolerance of the iterative optimization. Smaller values lead to higher accuracy at the cost of more iterations. The number of iterations is always bounded by Maximum iterations.
Fit intercept
Whether to fit an intercept term.
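These advanced options map directly onto spark.ml parameters. The sketch below shows where they appear on the estimator; the values are examples, not necessarily the node's defaults.

    from pyspark.ml.regression import LinearRegression

    lr = LinearRegression(
        featuresCol="features",
        labelCol="target",
        solver="auto",       # "auto", "l-bfgs" or "normal"
        maxIter=100,         # Maximum iterations
        tol=1e-6,            # Convergence tolerance
        fitIntercept=True,   # Fit intercept
    )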

Input Ports

Input Spark DataFrame with training data.

Output Ports

  • Spark ML linear learner model (regression)
  • Coefficients and statistics of the linear regression model
  • Statistical measures of the learned regression model, when applied to the training dataset (R², explained variance, ...)

Views

This node has no views
