Performs a multinomial logistic regression. In the dialog, select a target column (combo box at the top), i.e.
the response. The solver combo box lets you select which solver should be used for the problem
(see below for details on the different solvers). The two lists in the center of the dialog allow you to
include only certain columns as the (independent) variables. Make sure the columns you want to
include are in the "include" list. See the Wikipedia article on
logistic regression
for an overview of the topic.
Important Note on Normalization
The SAG solver works best with z-score normalized data.
That means the columns are normalized to have zero mean and a standard deviation of one. This can be
achieved by using a Normalizer node before learning. If you have very sparse data (lots of zero values),
this normalization will destroy the sparsity. In this case it is recommended to normalize only the dense
features, so that the sparsity can still be exploited during the calculations (SAG solver with lazy
calculation). Note, however, that the normalization will lead to different coefficients and to different
statistics for them (standard error, z-score, etc.). Hence, if you want to use the learner for statistics
(obtaining the mentioned statistics) rather than machine learning (obtaining a classifier), you should
carefully consider whether normalization makes sense for the task at hand. If the node outputs missing values
for the parameter statistics, this is very likely caused by insufficient normalization, and you will have to
use the IRLS solver if you can't normalize your data.
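To make the idea concrete, here is a minimal sketch (outside of KNIME, in Python) of z-score normalizing only the dense columns so that sparse columns keep their zeros; the file and column names are made up for illustration. Inside KNIME the same effect is achieved with a Normalizer node restricted to the dense columns.

```python
import pandas as pd

# Hypothetical data: "age" and "income" stand in for the dense columns,
# everything else is assumed to be sparse and is left untouched.
df = pd.read_csv("data.csv")
dense_cols = ["age", "income"]

# z-score normalization: zero mean, unit standard deviation per column
df[dense_cols] = (df[dense_cols] - df[dense_cols].mean()) / df[dense_cols].std()
```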
Solvers
The solver is the most important choice you make as it will dictate which algorithm is used to solve the
problem.
Iteratively reweighted least squares This solver uses an iterative optimization approach,
sometimes termed Fisher's scoring, to calculate the model. It works well for small
tables with only a few columns but fails on larger tables. Note that it is the most error-prone
solver because it can't calculate a model if the data is linearly separable (see Potential Errors
and Error Handling for more information). This solver is also not capable of dealing with tables
that have more columns than rows because it does not support regularization.
Stochastic average gradient (SAG) This solver implements a variant of stochastic gradient
descent which tends to converge considerably faster than vanilla stochastic gradient descent. For
more information on the algorithm see the following
paper. It works well for large tables and also for tables
with more columns than rows. Note that in the latter case a regularization prior other than "uniform"
must be selected. The default learning rate of 0.1 was chosen because it often works well, but
ultimately the optimal learning rate always depends on the data and should be treated as a
hyperparameter.
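As a rough illustration of the idea behind SAG, the following sketch implements a stripped-down binary version in NumPy: it keeps the last gradient seen for every row and always steps along the average of those stored gradients. This is a hypothetical illustration only, not the node's implementation (which handles multinomial targets, priors, and lazy updates).

```python
import numpy as np

def sag_logistic(X, y, lr=0.1, epochs=50, seed=0):
    """Minimal SAG sketch for binary logistic regression with y in {0, 1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    grad_memory = np.zeros((n, d))  # last gradient seen for each row
    grad_sum = np.zeros(d)          # running sum of the stored gradients

    for _ in range(epochs):
        for _ in range(n):
            i = rng.integers(n)
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))  # predicted probability
            g = (p - y[i]) * X[i]                # gradient for this row
            grad_sum += g - grad_memory[i]       # swap old gradient for new one
            grad_memory[i] = g
            w -= lr * grad_sum / n               # step along the average gradient
    return w
```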
Learning Rate/Step Size Strategy
Only relevant for the SAG solver. The learning rate strategy provides the learning rates for the
gradient descent. When selecting a learning rate strategy and initial learning rate, keep in mind that
there is always a trade-off between the size of the learning rate and the number of epochs that are
required to converge to a solution. With a smaller learning rate the solver will take longer to find a
solution; if the learning rate is too large, it might skip over the optimal solution and, in the worst
case, diverge.
Fixed The provided step size is used for the complete training. This strategy works well for
the SAG solver, even if relatively large learning rates are used.
Line Search Experimental learning rate strategy that tries to find the optimal learning rate
for the SAG solver.
Regularization
The SAG solver optimizes the problem using
maximum a posteriori estimation
which allows you to specify a prior distribution for the coefficients of the resulting model. This form of
regularization is the Bayesian version of other regularization approaches such as Ridge or LASSO. Currently
the following priors are supported:
Uniform This prior corresponds to no regularization at all and is the default. It essentially
means that all values are equally likely for the coefficients.
Gauss The coefficients are assumed to be normally distributed. This prior keeps the
coefficients from becoming too large but does not force them to be zero. Using this prior is
equivalent to using ridge regression (L2) with a lambda of 1/prior_variance.
Laplace The coefficients are assumed to follow a Laplace or double exponential distribution.
It tends to produce sparse solutions by forcing unimportant coefficients to be zero. It is therefore
related to the LASSO (also known as L1 regularization).
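A small sketch of how these priors translate into the more familiar penalty terms, written directly from the negative log-prior; the Laplace scale-from-variance conversion is an assumption for illustration, and the exact scaling used inside the node may differ.

```python
import numpy as np

def penalized_loss(neg_log_likelihood, coefs, prior="uniform", variance=1.0):
    """Sketch: add the penalty implied by the chosen prior to the data term.

    Gauss   -> L2 (ridge) penalty with lambda = 1 / variance (as stated above).
    Laplace -> L1 (LASSO-like) penalty.
    Uniform -> no penalty at all.
    """
    coefs = np.asarray(coefs, dtype=float)
    if prior == "gauss":
        lam = 1.0 / variance
        return neg_log_likelihood + 0.5 * lam * np.sum(coefs ** 2)
    if prior == "laplace":
        scale = np.sqrt(variance / 2.0)  # Laplace(b) has variance 2 * b**2 (assumption)
        return neg_log_likelihood + np.sum(np.abs(coefs)) / scale
    return neg_log_likelihood  # uniform prior: no regularization
```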
Potential Errors and Error Handling
The computation of the model is an iterative optimization process that requires certain properties of the data
set: a reasonable distribution of the target values and non-constant, uncorrelated columns.
While some of these properties are checked during node execution, you may still run into errors during
the computation. The list below gives some ideas about what might go wrong and how to avoid such situations.
Insufficient Information This is the case when the data does not provide enough information
about one or more target categories. Try to get more data or remove rows for target categories that
may cause the error. If you are interested in a model for one target category, make sure to group the
target column beforehand. For instance, if your data contains the target categories
"A", "B", ..., "Z" but you are only interested in a model for class "A", you can use a Rule
Engine node to convert your target into "A" and "not A" (see the sketch after this list).
Violation of Independence Logistic regression is based on the assumption of statistical
independence. A common preprocessing step is to use a correlation filter to remove highly correlated
learning columns. Use a "Linear Correlation" node along with a "Correlation Filter" node to remove
redundant columns; often it is sufficient to compute the correlation model on only a subset of the
data.
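Outside of KNIME, the two suggestions above could look roughly like this (file name, column names, and the 0.95 threshold are made up for illustration; inside KNIME you would use the Rule Engine, Linear Correlation, and Correlation Filter nodes instead):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical data

# Group the target into the one class of interest vs. everything else.
df["target"] = np.where(df["target"] == "A", "A", "not A")

# Drop one column of each highly correlated pair, computed on a row subset.
features = df.drop(columns=["target"]).select_dtypes("number")
sample = features.sample(n=min(len(features), 10_000), random_state=0)
corr = sample.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)
```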
Select the target column. Only columns with nominal data are allowed. The reference category is
empty if the domain of the target column is not available. In this case the node determines the
domain values right before computing the logistic regression model and chooses the last domain
value as the target's reference category.
Reference category
The reference category is the category for which the probability is obtained as 1 minus the sum
of all other probabilities. In a two class scenario this is usually the class for which you don't
explicitly want to model the probability.
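For example, with the three classes "A", "B", and "C" and "C" chosen as the reference category, only the probabilities for "A" and "B" are modeled explicitly (made-up numbers):

```python
p_a, p_b = 0.2, 0.5       # probabilities modeled explicitly
p_c = 1 - (p_a + p_b)     # reference category "C": 0.3
```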
Use order from target column domain
By default the target domain values are sorted lexicographically in the output, but you can
enforce the order of the target column domain to be preserved by checking this box. Note that if a
reference category is selected in the dropdown, the checkbox has no influence on the
coefficients of the model; only the output representation (e.g. the order of rows in the
coefficient table) may vary.
Solver
Select the solver to use. Either Iteratively reweighted least squares or Stochastic average
gradient.
Feature selection
Specify the independent columns that should be included in the regression model. Numeric and
nominal data can be included.
Use order from column domain
By default the domain values (categories) of nominal columns are sorted lexicographically,
but you can check this box to use the order from the column domain instead. Please note that the first
category is used as the reference when creating the dummy variables.
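A quick sketch of what the dummy encoding looks like for a nominal column (hypothetical values; pandas also sorts the categories lexicographically here, mirroring the default behaviour described above):

```python
import pandas as pd

color = pd.Series(["red", "green", "blue", "green"], name="color")
dummies = pd.get_dummies(color, prefix="color", drop_first=True)
# Categories sorted lexicographically: blue, green, red.
# "blue" is the first category and becomes the reference: a row with
# color_green = 0 and color_red = 0 encodes "blue".
print(dummies)
```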
Perform calculations lazily
If selected, the optimization is performed lazily, i.e. the coefficients are only updated if their
corresponding feature is actually present in the current sample. This is usually faster than the
normal version, especially for sparse data (that is, data where most values in most rows are
zero). Currently only supported by the SAG solver.
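Conceptually, a lazy update for one sparse row only touches the coefficients whose feature value is non-zero, as in this simplified gradient-descent step (a hypothetical sketch, not the node's actual SAG implementation):

```python
import numpy as np

def lazy_step(w, nz_idx, nz_val, y, lr):
    """One lazy update: only coefficients of features present in this row change."""
    margin = np.dot(w[nz_idx], nz_val)     # absent features contribute nothing
    p = 1.0 / (1.0 + np.exp(-margin))
    w[nz_idx] -= lr * (p - y) * nz_val     # gradient is zero for absent features
    return w
```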
Calculate statistics for coefficients
If selected, the node calculates the standard error, z-score, and P>|z| values for the
coefficients. Note that those are affected by regularization in case of the Gauss prior.
Calculating those statistics is expensive if the model is learned on many features and can be
responsible for a significant part of the node's runtime.
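The reported statistics relate to each other as in this small sketch (made-up numbers):

```python
from scipy import stats

coef, std_err = 0.84, 0.31
z = coef / std_err                            # z-score
p_value = 2 * (1 - stats.norm.cdf(abs(z)))    # P>|z| (two-sided)
```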
Maximal number of epochs
Here you can specify the maximal number of learning epochs to perform, that is, the number of
times the solver iterates over the full table. This value determines to a large extent
how long learning will take. The solver stops early once it reaches convergence; therefore it is
recommended to set a relatively high value for this parameter in order to give the solver enough
time to find a good solution.
Epsilon
This value is used to determine whether the model converged. If the relative change of all
coefficients is smaller than epsilon, the training is stopped.
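A sketch of such a stopping criterion (the node's exact check may differ in detail):

```python
import numpy as np

def converged(old_coefs, new_coefs, epsilon=1e-5):
    """Stop once the relative change of every coefficient is below epsilon."""
    old = np.asarray(old_coefs, dtype=float)
    new = np.asarray(new_coefs, dtype=float)
    rel_change = np.abs(new - old) / np.maximum(np.abs(old), np.finfo(float).tiny)
    return bool(np.all(rel_change < epsilon))
```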
Learning rate strategy
The strategy provides the learning rates for the optimization process. Only important for the SAG
solver. For more information see the paragraph on learning rate strategies above.
Step size
The step size (learning rate) to use in case of the fixed learning rate strategy.
Prior
The prior distribution for the coefficients. See the paragraph on regularization above for more details.
Variance
The variance of the prior distribution. A larger variance corresponds to less regularization.
Hold data in memory
If selected, the data is read into an internal data structure, which results in a tremendous
speed-up. It is highly recommended to use this option if you have enough main memory available,
especially if you use the SAG solver, as its convergence rate highly depends on random access to
individual samples.
Chunk size
If the data is not held completely in memory, the node reads chunks of data into memory to
emulate random access for the SAG solver. This parameter specifies how large those chunks should
be. The chunk size directly affects the convergence rate of the SAG solver, which works best
with completely random access; a larger chunk size approximates that better. In particular, the
solver may need many epochs to converge if the chunk size is chosen too small.
Use seed
Check if you want to use a static seed. Recommended for reproducible results if you use the SAG solver.
Random seed
The seed value for the random number generator.
New
Generate a random seed and set it in the Random seed input above for reproducible runs.
Input Ports
Table on which to perform the regression. The input must not contain missing values; fix them
beforehand, e.g. by using a Missing Value node.
Output Ports
Model to connect to a predictor node.
Coefficients and statistics (if calculated) of the logistic regression model.
Global learning and model properties like the number of iterations until convergence.