Performs a multinomial logistic regression. In the dialog, select a
target column (the combo box at the top), i.e. the response.
The solver combo box allows you to select which solver should be used for the problem
(see below for details on the different solvers).
The two lists in the center of the dialog allow you to include only certain
columns, which represent the (independent) variables.
Make sure the columns you want to include are in the "include" list on the right.
See the Wikipedia article on logistic regression for an overview of the topic.
Important Note on Normalization
The SAG solver works best with z-score normalized data,
that is, columns normalized to have zero mean and a standard deviation of one.
This can be achieved by using a normalizer node before learning.
If you have very sparse data (lots of zero values), this normalization will destroy the sparsity.
In this case it is recommended to normalize only the dense features, so that the sparsity can be exploited during
the calculations (SAG solver with lazy calculation).
Note, however, that normalization will lead to different coefficients and different statistics of those coefficients (standard error, z-score, etc.).
Hence, if you want to use the learner for statistics (obtaining the mentioned statistics) rather than machine learning (obtaining a classifier),
you should carefully consider whether normalization makes sense for the task at hand.
If the node outputs missing values for the parameter statistics, this is very likely caused by insufficient normalization, and you will have
to use the IRLS solver if you can't normalize your data.
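The following Python sketch illustrates the kind of selective z-score normalization described above; the function name and the dense_cols parameter are illustrative and not part of the node:

```python
import numpy as np

def z_score_normalize(X, dense_cols):
    """Z-score normalize only the selected (dense) columns of X.

    X          -- 2D array of shape (n_rows, n_cols)
    dense_cols -- indices of the columns to normalize; sparse columns
                  are left untouched so their zero entries stay zero
    """
    X = X.astype(float)  # astype returns a copy, so X is not mutated in place
    mean = X[:, dense_cols].mean(axis=0)
    std = X[:, dense_cols].std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    X[:, dense_cols] = (X[:, dense_cols] - mean) / std
    return X
```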
Solvers
The solver is the most important choice you make, as it dictates which algorithm is used to solve the problem.
Iteratively reweighted least squares This solver uses an iterative optimization approach, sometimes also
termed Fisher scoring, to calculate the model. It works well for small tables with only a few columns
but fails on larger tables. Note that it is the most error-prone solver because it can't calculate a model if the
data is linearly separable (see Potential Errors and Error Handling for more information).
This solver is also not capable of dealing with tables that have more columns than rows because it does not
support regularization.
Stochastic average gradient (SAG) This solver implements a variant of stochastic gradient descent that tends to
converge considerably faster than vanilla stochastic gradient descent. For more information on the algorithm, see
the paper Minimizing Finite Sums with the Stochastic Average Gradient by Schmidt et al. It works well for large tables and also for tables with
more columns than rows. Note that in the latter case a regularization prior other than "uniform" must be selected.
The default learning rate of 0.1 was selected because it often works well, but ultimately the optimal learning rate always
depends on the data and should be treated as a hyperparameter.
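As a rough illustration of the SAG idea (not the node's actual implementation, which handles the multinomial case, regularization, and lazy updates), here is a minimal sketch for binary logistic regression with 0/1 labels:

```python
import numpy as np

def sag_logistic(X, y, lr=0.1, epochs=50, seed=42):
    """Minimal SAG sketch for binary logistic regression (labels 0/1).

    SAG keeps one stored gradient per sample and steps along the average
    of all stored gradients, which is what lets it converge faster than
    plain stochastic gradient descent on finite data sets.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    grad_table = np.zeros((n, d))  # last seen gradient of each sample
    grad_sum = np.zeros(d)         # running sum of the table rows
    for _ in range(epochs * n):
        i = rng.integers(n)
        p = 1.0 / (1.0 + np.exp(-(X[i] @ w)))  # predicted probability
        g = (p - y[i]) * X[i]                  # gradient of sample i
        grad_sum += g - grad_table[i]          # refresh the stored average
        grad_table[i] = g
        w -= lr * grad_sum / n
    return w
```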
Learning Rate/Step Size Strategy
Only relevant for the SAG solver.
The learning rate strategy provides the learning rates for the gradient descent.
When selecting a learning rate strategy and an initial learning rate, keep in mind that there is always a trade-off
between the size of the learning rate and the number of epochs required to converge to a solution:
with a smaller learning rate the solver will take longer to find a solution, but if the learning rate is too large
it might skip over the optimal solution and, in the worst case, diverge (a small illustration follows the list of strategies below).
Fixed The provided step size is used for the complete training. This strategy works well for the SAG solver,
even if relatively large learning rates are used.
Line Search An experimental learning rate strategy that tries to find the optimal learning rate for the SAG solver.
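The trade-off mentioned above can be seen even in a toy one-dimensional problem; this illustrative snippet minimizes f(w) = w^2 (gradient 2w) with three fixed learning rates (the values are made up for the example and are not node defaults):

```python
# A too-small rate converges slowly, a well-chosen rate converges fast,
# and a too-large rate diverges.
for lr in (0.01, 0.4, 1.1):
    w = 1.0
    for _ in range(50):
        w -= lr * 2 * w  # gradient descent step on f(w) = w**2
    print(f"lr={lr}: w after 50 steps = {w:.3g}")
```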
Regularization
The SAG solver optimizes the problem using
maximum a posteriori estimation, which allows you to specify a prior distribution for the coefficients of the resulting model.
This form of regularization is the Bayesian version of other regularization approaches such as ridge regression or the LASSO.
Currently the following priors are supported:
Uniform This prior corresponds to no regularization at all and is the default. It essentially means that all values
are equally likely for the coefficients.
Gauss The coefficients are assumed to be normally distributed. This prior keeps the coefficients from becoming
too large but does not force them to be zero. Using this prior is equivalent to using ridge regression (L2) with
a lambda of 1/prior_variance.
Laplace The coefficients are assumed to follow a Laplace or double exponential distribution. It tends to produce
sparse solutions by forcing unimportant coefficients to be zero. It is therefore related to the LASSO (also known as
L1 regularization).
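A sketch of the penalized objective that maximum a posteriori estimation minimizes under these priors (the exact scaling used by the node may differ; the function name and signature are illustrative):

```python
import numpy as np

def map_objective(w, nll, prior="uniform", variance=1.0):
    """Negative log-posterior up to a constant (sketch).

    nll      -- negative log-likelihood of the data given coefficients w
    prior    -- "uniform" (no penalty), "gauss" (L2) or "laplace" (L1)
    variance -- prior variance; larger variance means weaker regularization
    """
    if prior == "gauss":
        # Gaussian prior = ridge penalty with lambda = 1 / variance
        penalty = np.sum(w ** 2) / (2.0 * variance)
    elif prior == "laplace":
        # Laplace prior with scale b (variance = 2 * b**2) = L1 penalty
        b = np.sqrt(variance / 2.0)
        penalty = np.sum(np.abs(w)) / b
    else:
        penalty = 0.0  # uniform prior: plain maximum likelihood
    return nll + penalty
```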
Potential Errors and Error Handling
The computation of the model is an iterative optimization process that requires certain properties of the data set,
namely a reasonable distribution of the target values and non-constant, uncorrelated columns. While
some of these properties are checked during node execution, you may still run into errors during the
computation. The list below gives some ideas about what might go wrong and how to avoid such situations.
Insufficient Information This is the case when the data does not provide enough information about
one or more target categories. Try to get more data or remove rows for target categories that may cause
the error. If you are only interested in a model for one target category, make sure to group the target
column beforehand. For instance, if your data contains the target categories "A", "B", ..., "Z" but
you are only interested in a model for class "A", you can use a rule engine node to convert your
target into "A" and "not A" (see the sketch after this list).
Violation of Independence Logistic regression is based on the assumption of statistical independence.
A common preprocessing step is therefore to use a correlation filter to remove highly correlated learning columns.
Use a "Linear Correlation" node together with a "Correlation Filter" node to remove redundant columns;
it is often sufficient to compute the correlation model on a subset of the data only.
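Both workarounds can be sketched outside of KNIME as well; this illustrative pandas snippet groups the target into "A" vs. "not A" and drops one column of a perfectly correlated pair (the column names and the 0.9 threshold are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with x1
    "target": ["A", "B", "C", "A"],
})

# Group every non-"A" category into "not A" (what a rule engine node does):
df["target"] = df["target"].where(df["target"] == "A", other="not A")

# Remove one column of a highly correlated pair (what a correlation filter does):
if df["x1"].corr(df["x2"]) > 0.9:
    df = df.drop(columns=["x2"])
```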
Target column
Select the target column. Only columns with nominal data are allowed. The reference category is empty
if the domain of the target column is not available. In this case the node determines the domain values right
before computing the logistic regression model and chooses the last domain value as the target's reference
category.
Reference category
The reference category is the category for which the probability is obtained as 1 minus the sum of all other probabilities.
In a two-class scenario this is usually the class for which you don't explicitly want to model the probability (see the sketch below).
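To make this concrete, here is an illustrative sketch for a three-class problem "A", "B", "C" with "A" as the reference: only the two non-reference classes get linear predictors, and the reference probability is whatever remains (the score values are made up):

```python
import numpy as np

scores = np.array([0.2, -1.0])         # linear predictors for classes "B" and "C"
exps = np.exp(scores)
p_modeled = exps / (1.0 + exps.sum())  # softmax with the reference score fixed at 0
p_reference = 1.0 - p_modeled.sum()    # probability of the reference class "A"
print(p_modeled, p_reference)
```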
Use order from target column domain
By default the target domain values are sorted lexicographically in the output, but you can enforce the
order of the target column domain to be preserved by checking this box.
Note that if a reference category is selected in the dropdown, the checkbox has no influence on the
coefficients of the model; only the output representation (e.g. the order of rows in the coefficient table)
may vary.
Solver
Select the solver to use: either Iteratively reweighted least squares or Stochastic average gradient.
Feature selection
Specify the independent columns that should be included in the regression model.
Numeric and nominal data can be included.
Use order from column domain
By default the domain values (categories) of nominal columns are sorted lexicographically,
but you can check this option to use the order from the column domain instead. Please note that the first
category is used as a reference when creating the dummy variables (see the sketch below).
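This illustrative pandas snippet shows what such a reference-based dummy encoding looks like (the column name and categories are made up; KNIME performs the encoding internally):

```python
import pandas as pd

color = pd.Categorical(["red", "green", "blue", "green"],
                       categories=["blue", "green", "red"])  # "blue" comes first

# drop_first=True drops the first category, making it the reference:
# its effect is absorbed into the intercept of the model.
dummies = pd.get_dummies(pd.Series(color), drop_first=True)
print(dummies)  # only "green" and "red" indicator columns remain
```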
Advanced
Perform calculations lazily
If selected, the optimization is performed lazily, i.e. the coefficients are only updated
if their corresponding feature is actually present in the current sample. This is usually faster than the normal version, especially for sparse
data (that is, data where for most rows most values are zero). Currently only supported by the SAG solver.
Calculate statistics for coefficients
If selected, the node calculates the standard errors, z-scores and P>|z| values for the coefficients.
Note that these are affected by regularization in case of the Gauss prior.
Calculating those statistics is expensive if the model is learned on many features and can account for a significant part of
the node's runtime.
Maximal number of epochs
Here you can specify the maximal number of learning epochs to perform, i.e. the number of times
to iterate over the full table. This value determines to a large extent how long learning will take.
The solver will stop early if it reaches convergence; it is therefore recommended to set a relatively high value for this parameter
in order to give the solver enough time to find a good solution.
Epsilon
This value is used to determine whether the model has converged: if the relative change of all coefficients is smaller than epsilon,
the training is stopped (a sketch of such a check follows below).
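A minimal sketch of a relative-change convergence check of this kind (the exact criterion used by the node may differ in details such as the handling of zero coefficients):

```python
import numpy as np

def converged(w_old, w_new, epsilon=1e-5):
    """True if every coefficient changed by less than epsilon, relatively."""
    denom = np.maximum(np.abs(w_old), 1e-12)  # avoid division by zero
    return bool(np.all(np.abs(w_new - w_old) / denom < epsilon))
```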
Learning rate strategy
The strategy provides the learning rates for the optimization process. Only relevant for the SAG solver. For more information,
see the paragraph on learning rate strategies above.
Step size
The step size (learning rate) to use in case of the fixed learning rate strategy.
Prior
The prior distribution for the coefficients. See the paragraph on regularization above for more details.
Variance
The variance of the prior distribution. A larger variance corresponds to less regularization.
Hold data in memory
If selected, the data is read into an internal data structure, which results in a tremendous speed-up.
It is highly recommended to use this option if you have enough main memory available, especially if you use the SAG solver,
as its convergence rate highly depends on random access to individual samples.
Chunk size
If the data is not held completely in memory, the node reads chunks of data into memory to emulate random access for the SAG solver.
This parameter specifies how large those chunks should be. The chunk size directly affects the convergence rate of the SAG solver,
which works best with completely random access; a larger chunk size approximates that better. In particular, the solver
may need many epochs to converge if the chunk size is chosen too small.
Use seed
Check if you want to use a static seed. Recommended for reproducible results if you use the SAG solver.
Seed
The static seed to use. A click on the "New" button generates a new seed.
Input Ports
Table on which to perform regression. The input must not contain missing values; you have to fix them first, e.g. by using the Missing Values node.
Output Ports
Model to connect to a predictor node.
Coefficients and statistics (if calculated) of the logistic regression model.
Global learning and model properties like the number of iterations until convergence.