# Logistic Regression Learner

Performs a multinomial logistic regression. In the dialog, select a target column (combo box on top), i.e. the response. The solver combo box lets you select which solver is used for the problem (see below for details on the different solvers). The two lists in the center of the dialog allow you to include only certain columns, which represent the (independent) variables. Make sure the columns you want to include are in the "include" list on the right. See the Wikipedia article on logistic regression for an overview of the topic.

#### Important Note on Normalization

The SAG solver works best with z-score normalized data. That means that the columns are normalized to have zero mean and a standard deviation of one. This can be achieved by using a normalizer node before learning. If you have very sparse data (lots of zero values), this normalization will destroy the sparsity. In this case it is recommended to only normalize the dense features to exploit the sparsity during the calculations (SAG solver with lazy calculation). Note, however, that the normalization will lead to different coefficients and statistics of those (standard error, z-score, etc.). Hence if you want to use the learner for statistics (obtaining the mentioned statistics) rather than machine learning (obtaining a classifier), you should carefully consider if normalization makes sense for your task at hand. If the node outputs missing values for the parameter statistics, this is very likely caused by insufficient normalization and you will have to use the IRLS solver if you can't normalize your data.
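As an illustration of what z-score normalization does to a single column, here is a minimal sketch in plain Python. The function name `z_score` is hypothetical, and the use of the population standard deviation is an assumption; the normalizer node may use a different convention.

```python
import statistics

def z_score(column):
    """Rescale a list of numbers to zero mean and unit standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # assumption: population standard deviation
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]

normalized = z_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Note that a zero entry in the input is mapped to `-mean / std`, which is generally non-zero. This is why normalizing a sparse column destroys its sparsity.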

#### Solvers

The solver is the most important choice you make as it will dictate which algorithm is used to solve the problem.
• Iteratively reweighted least squares This solver uses an iterative optimization approach, sometimes also termed Fisher's scoring, to calculate the model. It works well for small tables with only a few columns but fails on larger tables. Note that it is the most error-prone solver because it can't calculate a model if the data is linearly separable (see Potential Errors and Error Handling for more information). This solver is also not capable of dealing with tables that have more columns than rows because it does not support regularization.
• Stochastic average gradient (SAG) This solver implements a variant of stochastic gradient descent which tends to converge considerably faster than vanilla stochastic gradient descent. For more information on the algorithm see the following paper. It works well for large tables and also for tables with more columns than rows. Note that in the latter case a regularization prior other than "uniform" must be selected. The default learning rate of 0.1 was selected because it often works well, but ultimately the optimal learning rate always depends on the data and should be treated as a hyperparameter.
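To make the idea behind the stochastic average gradient method more concrete, the following is a rough sketch for the binary case with a fixed step size and no regularization. It is not the node's implementation: the function `sag_logistic`, its parameters, and the toy data are assumptions for illustration only. The key point is that the solver remembers the last gradient seen for every row and steps along the average of those stored gradients.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sag_logistic(X, y, step_size=0.1, epochs=100, seed=0):
    """Binary logistic regression via a basic stochastic average gradient loop."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    memory = [0.0] * n    # last residual (p - y) seen for each row
    grad_sum = [0.0] * d  # sum of the stored per-row gradients
    for _ in range(epochs * n):
        i = rng.randrange(n)  # random access to a single row
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, X[i])))
        residual = p - y[i]
        delta = residual - memory[i]  # change of this row's stored gradient
        memory[i] = residual
        for j in range(d):
            grad_sum[j] += delta * X[i][j]
            w[j] -= step_size * grad_sum[j] / n  # step along the average gradient
    return w

# Toy data: first entry is a constant intercept term, second is one feature.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 0, 1, 1]  # not linearly separable
w = sag_logistic(X, y)
```

Because each update touches a randomly chosen row, the method benefits from fast random access to the data.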

#### Learning Rate/Step Size Strategy

Only relevant for the SAG solver. The learning rate strategy provides the learning rates for the gradient descent. When selecting a learning rate strategy and an initial learning rate, keep in mind that there is always a trade-off between the size of the learning rate and the number of epochs required to converge to a solution. With a smaller learning rate the solver will take longer to find a solution, but if the learning rate is too large it might skip over the optimal solution and, in the worst case, diverge.
• Fixed The provided step size is used for the complete training. This strategy works well for the SAG solver, even if relatively large learning rates are used.
• Line Search Experimental learning rate strategy that tries to find the optimal learning rate for the SAG solver.
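The trade-off between step size and convergence can be seen on a toy problem. The sketch below runs plain gradient descent on f(w) = w², which is only a stand-in for the actual logistic regression objective; the function name and the chosen step sizes are illustrative assumptions.

```python
def gradient_descent(step_size, steps=50, start=1.0):
    """Minimize f(w) = w**2 with a fixed step size; the gradient is 2*w."""
    w = start
    for _ in range(steps):
        w -= step_size * 2.0 * w
    return w

slow = gradient_descent(0.01)     # small steps: still far from the optimum at 0
good = gradient_descent(0.1)      # moderate steps: very close to 0
diverged = gradient_descent(1.1)  # too large: each step overshoots further
```

After the same number of steps, the small step size has made little progress, the moderate one has essentially converged, and the large one has moved away from the optimum.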

#### Regularization

The SAG solver optimizes the problem using maximum a posteriori estimation, which allows you to specify a prior distribution for the coefficients of the resulting model. This form of regularization is the Bayesian version of other regularization approaches such as Ridge or LASSO. Currently the following priors are supported:
• Uniform This prior corresponds to no regularization at all and is the default. It essentially means that all values are equally likely for the coefficients.
• Gauss The coefficients are assumed to be normally distributed. This prior keeps the coefficients from becoming too large but does not force them to be zero. Using this prior is equivalent to using ridge regression (L2) with a lambda of 1/prior_variance.
• Laplace The coefficients are assumed to follow a Laplace or double exponential distribution. It tends to produce sparse solutions by forcing unimportant coefficients to be zero. It is therefore related to the LASSO (also known as L1 regularization).
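The three priors can be read as penalty terms added to the negative log-likelihood. The sketch below computes the negative log-prior up to an additive constant. The function name `neg_log_prior` is hypothetical, and parameterizing the Laplace prior by its variance (scale = sqrt(variance / 2)) is an assumption about how the setting is interpreted.

```python
import math

def neg_log_prior(coefficients, prior="uniform", variance=1.0):
    """Penalty contributed by the prior, ignoring constants independent of the coefficients."""
    if prior == "uniform":
        return 0.0  # no regularization
    if prior == "gauss":
        # Equals (lambda / 2) * sum(w^2) with lambda = 1 / variance (ridge, L2).
        return sum(w * w for w in coefficients) / (2.0 * variance)
    if prior == "laplace":
        # A Laplace distribution with scale b has variance 2 * b^2 (LASSO-like, L1).
        scale = math.sqrt(variance / 2.0)
        return sum(abs(w) for w in coefficients) / scale
    raise ValueError(f"unknown prior: {prior}")

gauss_penalty = neg_log_prior([1.0, -2.0], "gauss", variance=2.0)
laplace_penalty = neg_log_prior([1.0, -2.0], "laplace", variance=2.0)
```

In both cases a larger variance shrinks the penalty, which corresponds to weaker regularization.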

#### Potential Errors and Error Handling

The computation of the model is an iterative optimization process that requires some properties of the data set. This requires a reasonable distribution of the target values and non-constant, uncorrelated columns. While some of these properties are checked during the node execution you may still run into errors during the computation. The list below gives some ideas what might go wrong and how to avoid such situations.
• Insufficient Information This is the case when the data does not provide enough information about one or more target categories. Try to get more data or remove rows for target categories that may cause the error. If you are interested in a model for one target category make sure to group the target column before. For instance, if your data contains as target categories the values "A", "B", ..., "Z" but you are only interested in getting a model for class "A" you can use a rule engine node to convert your target into "A" and "not A".
• Violation of Independence Logistic Regression is based on the assumption of statistical independence. A common preprocessing step is to use a correlation filter to remove highly correlated learning columns. Use a "Linear Correlation" node along with a "Correlation Filter" node to remove redundant columns; it is often sufficient to compute the correlation model on a subset of the data only.
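The idea of filtering highly correlated columns can be sketched as follows. This is a simple greedy variant written for illustration, not the algorithm used by the "Correlation Filter" node; the function names, the threshold, and the example columns are assumptions. Columns are assumed to be non-constant, since the correlation is undefined otherwise.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long, non-constant lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_filter(columns, threshold=0.9):
    """Keep a column only if it is not strongly correlated with a column kept earlier."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

columns = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "x_scaled": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with "x"
    "noise": [1.0, -1.0, 1.0, -1.0],
}
kept = correlation_filter(columns)
```

Here `x_scaled` is dropped because it duplicates the information in `x`, while `noise` is kept.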

## Options

#### Settings

Target column
Select the target column. Only columns with nominal data are allowed. The reference category is empty if the domain of the target column is not available. In this case the node determines the domain values right before computing the logistic regression model and chooses the last domain value as the target's reference category.
Reference category
The reference category is the category for which the probability is obtained as 1 minus the sum of all other probabilities. In a two class scenario this is usually the class for which you don't explicitly want to model the probability.
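The role of the reference category can be illustrated with the standard multinomial logit formulation, in which every non-reference category has its own linear predictor and the reference category receives the remaining probability mass. The function name and the input format below are assumptions for illustration.

```python
import math

def class_probabilities(linear_scores):
    """Map the linear predictors of the non-reference categories to probabilities.

    linear_scores: dict from non-reference category name to its linear predictor.
    """
    denom = 1.0 + sum(math.exp(s) for s in linear_scores.values())
    probs = {c: math.exp(s) / denom for c, s in linear_scores.items()}
    # The reference category gets 1 minus the sum of all other probabilities.
    probs["reference"] = 1.0 - sum(probs.values())
    return probs

probs = class_probabilities({"A": 0.0, "B": 0.0})
```

With all linear predictors equal to zero, every category, including the reference, receives the same probability.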
Use order from target column domain
By default the target domain values are sorted lexicographically in the output, but you can enforce the order of the target column domain to be preserved by checking the box.
Note that if a target reference category is selected in the dropdown, the checkbox has no influence on the coefficients of the model, although the output representation (e.g. the order of rows in the coefficient table) may vary.
Solver
Select the solver to use: either Iteratively reweighted least squares or Stochastic average gradient.
Feature selection
Specify the independent columns that should be included in the regression model. Numeric and nominal data can be included.
Use order from column domain
By default the domain values (categories) of nominal valued columns are sorted lexicographically, but you can check this box to use the order from the column domain instead. Please note that the first category is used as a reference when creating the dummy variables.
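The effect of the category order on the dummy variables can be sketched as follows. The function `dummy_encode` is a hypothetical helper, not part of the node; it uses the first category as the reference, so that category is represented by a row of zeros.

```python
def dummy_encode(values, categories=None):
    """Create dummy variables, using the first category as the reference."""
    if categories is None:
        categories = sorted(set(values))  # default: lexicographic order
    reference, *others = categories
    rows = [[1 if v == c else 0 for c in others] for v in values]
    return rows, others

values = ["red", "blue", "green", "blue"]

# Lexicographic order: "blue" is the reference category.
rows_sorted, names_sorted = dummy_encode(values)

# Order taken from a (hypothetical) column domain: "red" is the reference.
rows_domain, names_domain = dummy_encode(values, ["red", "green", "blue"])
```

Changing the order changes which category is absorbed into the intercept and therefore how the remaining coefficients are interpreted.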

Perform calculations lazily
If selected, the optimization is performed lazily, i.e. the coefficients are only updated if their corresponding feature is actually present in the current sample. This is usually faster than the normal version, especially for sparse data (that is, data where most values are zero in most rows). Currently only supported by the SAG solver.
Calculate statistics for coefficients
If selected, the node calculates the standard errors, z-score and P>|z| values for the coefficients. Note that those are affected by regularization in case of the Gauss prior. Calculating those statistics is expensive if the model is learned on many features and can be responsible for a significant part of the node runtime.
Maximal number of epochs
Here you can specify the maximal number of learning epochs you want to perform, that is, the number of times you want to iterate over the full table. This value determines to a large extent how long learning will take. The solver stops early if it reaches convergence, so it is recommended to set a relatively high value for this parameter in order to give the solver enough time to find a good solution.
Epsilon
This value is used to determine whether the model converged. If the relative change of all coefficients is smaller than epsilon, the training is stopped.
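A convergence test of this kind could look like the sketch below. The exact definition of "relative change" used by the node is not stated here, so the formula, the fallback to the absolute change for zero-valued coefficients, and the function name are all assumptions.

```python
def has_converged(old, new, epsilon=1e-5):
    """Return True if every coefficient changed by less than epsilon, relatively."""
    for o, n in zip(old, new):
        # Assumption: use the absolute change when the old coefficient is zero.
        denom = abs(o) if o != 0 else 1.0
        if abs(n - o) / denom >= epsilon:
            return False
    return True

small_change = has_converged([1.0, 2.0], [1.000001, 2.000001])
large_change = has_converged([1.0, 2.0], [1.1, 2.0])
```

A smaller epsilon makes the check stricter and typically requires more epochs before training stops.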
Learning rate strategy
The strategy provides the learning rates for the optimization process. Only important for the SAG solver. For more information see the paragraph on learning rate strategies above.
Step size
The step size (learning rate) to use in case of the fixed learning rate strategy.
Prior
The prior distribution for the coefficients. See the paragraph on regularization above for more details.
Variance
The variance of the prior distribution. A larger variance corresponds to less regularization.
Hold data in memory
If selected, the data is read into an internal data structure, which results in a tremendous speed-up. It is highly recommended to use this option if you have enough main memory available, especially if you use the SAG solver, as its convergence rate highly depends on random access to individual samples.
Chunk size
If the data is not held completely in memory, the node reads chunks of data into memory to emulate random access for the SAG solver. This parameter specifies how large those chunks should be. The chunk size directly affects the convergence rate of the SAG solver, as it works best with complete random access, and a larger chunk size approximates that better. In particular, the solver may need many epochs to converge if the chunk size is too small.
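One way to picture chunk-based emulation of random access is to read the table chunk by chunk and visit the rows of each chunk in random order. The sketch below only illustrates that idea; how the node actually processes chunks is not described here, and the function name and parameters are assumptions.

```python
import random

def chunked_sample_order(n_rows, chunk_size, seed=0):
    """Return the row visiting order when rows are shuffled only within each chunk."""
    rng = random.Random(seed)
    order = []
    for start in range(0, n_rows, chunk_size):
        chunk = list(range(start, min(start + chunk_size, n_rows)))
        rng.shuffle(chunk)  # random access is only available inside the chunk
        order.extend(chunk)
    return order

order = chunked_sample_order(10, chunk_size=4)
sequential = chunked_sample_order(10, chunk_size=1)
```

With a chunk size of 1 the rows are visited strictly in table order, while a chunk size equal to the table size yields a full shuffle. Larger chunks therefore come closer to true random access.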
Use seed
Check if you want to use a static seed. Recommended for reproducible results if you use the SAG solver.
Seed
The static seed to use. A click on the "New" button generates a new seed.

## Input Ports

Table on which to perform regression. The input must not contain missing values; you have to handle them beforehand, e.g. by using the Missing Values node.

## Output Ports

Model to connect to a predictor node.
Coefficients and statistics (if calculated) of the logistic regression model.
Global learning and model properties like the number of iterations until convergence.

## Views

This node has no views.