Decision Tree Learner

This node induces a classification decision tree in main memory. The target attribute must be nominal; the other attributes used for decision making can be either nominal or numerical. Numeric splits are always binary (two outcomes), dividing the domain into two partitions at a given split point. Nominal splits can be either binary (two outcomes) or have as many outcomes as there are nominal values; in the case of a binary split, the nominal values are divided into two subsets. The algorithm provides two quality measures for split calculation: the Gini index and the gain ratio. Furthermore, a post-pruning method based on the minimum description length (MDL) principle is available to reduce the tree size and increase prediction accuracy.
The algorithm can be run in multiple threads and thus exploit multiple processors or cores.
Most of the techniques used in this decision tree implementation can be found in "C4.5: Programs for Machine Learning" by J.R. Quinlan and in "SPRINT: A Scalable Parallel Classifier for Data Mining" by J. Shafer, R. Agrawal, and M. Mehta (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.152&rep=rep1&type=pdf).

Options

Class column
To select the target attribute. Only nominal attributes are allowed.
Quality measure
To select the quality measure according to which the split is calculated. Available are the "Gini Index" and the "Gain Ratio".
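For illustration only, here is a minimal sketch (not the node's implementation) of how the two measures could score a candidate split; the class counts per branch are hypothetical:

    from math import log2

    def gini(counts):
        """Gini impurity of a node with the given class counts."""
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    def entropy(counts):
        """Entropy of a node with the given class counts."""
        total = sum(counts)
        return -sum((c / total) * log2(c / total) for c in counts if c)

    def gini_index(parent, branches):
        """Weighted impurity of the child branches (lower is better)."""
        n = sum(parent)
        return sum(sum(b) / n * gini(b) for b in branches)

    def gain_ratio(parent, branches):
        """Information gain divided by the split's own entropy (higher is better)."""
        n = sum(parent)
        gain = entropy(parent) - sum(sum(b) / n * entropy(b) for b in branches)
        split_info = entropy([sum(b) for b in branches])
        return gain / split_info if split_info else 0.0

    # hypothetical candidate split: 14 rows with class counts (9, 5),
    # divided into two branches with counts (7, 1) and (2, 4)
    parent, branches = [9, 5], [[7, 1], [2, 4]]
    print(gini_index(parent, branches))  # ~0.32
    print(gain_ratio(parent, branches))  # ~0.24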
Pruning method
Pruning reduces the tree size and avoids overfitting, which increases the generalization performance and thus the prediction quality (for predictions, use the "Decision Tree Predictor" node). Available is "Minimal Description Length" (MDL) pruning; pruning can also be switched off entirely.
Reduced Error Pruning
If checked (default), a simple pruning method is used to cut the tree in a post-processing step: Starting at the leaves, each node is replaced with its most popular class, but only if the prediction accuracy doesn't decrease. Reduced error pruning has the advantage of simplicity and speed.
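A minimal, illustrative sketch of this bottom-up idea on a toy tree follows. The nested-dictionary tree representation and the use of the same rows as the pruning set are assumptions made for the example, not the node's internal implementation:

    def predict(node, row):
        while isinstance(node, dict):        # descend until a leaf (a plain class label)
            node = node["children"][row[node["attr"]]]
        return node

    def accuracy(node, rows):
        return sum(predict(node, r) == r["class"] for r in rows) / len(rows)

    def majority_class(rows):
        labels = [r["class"] for r in rows]
        return max(set(labels), key=labels.count)

    def prune(node, rows):
        """Bottom-up: replace a subtree by its majority class if accuracy does not drop."""
        if not isinstance(node, dict):
            return node                      # already a leaf
        for value, child in node["children"].items():
            subset = [r for r in rows if r[node["attr"]] == value]
            node["children"][value] = prune(child, subset) if subset else child
        leaf = majority_class(rows)
        return leaf if accuracy(leaf, rows) >= accuracy(node, rows) else node

    # toy tree and rows (hypothetical): the 'windy' subtree collapses to the leaf 'no'
    tree = {"attr": "outlook", "children": {
                "sunny": {"attr": "windy", "children": {"yes": "no", "no": "no"}},
                "rainy": "yes"}}
    rows = [{"outlook": "sunny", "windy": "yes", "class": "no"},
            {"outlook": "sunny", "windy": "no",  "class": "no"},
            {"outlook": "rainy", "windy": "no",  "class": "yes"}]
    print(prune(tree, rows))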
Min number records per node
To select the minimum number of records required in each node. If the number of records is smaller than or equal to this number, the tree is not grown any further. This corresponds to a stopping criterion (pre-pruning).
Number records to store for view
To select the number of records stored in the tree for the view. The records are necessary to enable highlighting.
Average split point
If checked (default), the split value for numeric attributes is determined according to the mean value of the two attribute values that separate the two partitions. If unchecked, the split value is set to the largest value of the lower partition (like C4.5).
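A minimal sketch of the two placements, with hypothetical attribute values at the partition boundary:

    # suppose the best partition boundary falls between 4.0 (largest value of the
    # lower partition) and 6.0 (smallest value of the upper partition)
    lower_max, upper_min = 4.0, 6.0

    split_if_checked   = (lower_max + upper_min) / 2.0   # averaged: 5.0
    split_if_unchecked = lower_max                        # largest lower value: 4.0 (as in C4.5)
    print(split_if_checked, split_if_unchecked)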
Number threads
This node can use multiple threads and thus exploit multiple processors or cores, which can improve performance. The default value is the number of processors or cores available to KNIME. If set to 1, the algorithm runs sequentially.
Skip nominal columns without domain information
If checked, nominal columns containing no domain value information are skipped. This is generally the case for nominal columns that have too many different values.
Force root split column
If checked, the first split is calculated on the chosen column without evaluating any other column for possible splits. This is sometimes useful if the user has additional information as to which column is best to split on, even if it does not have the numerically best split quality. In case the selected column does not contain valid splits (e.g. because it has a constant value in all rows), a warning message will be displayed. If uncertain, leave unselected.
Binary nominal splits
If checked, nominal attributes are split in a binary fashion. Binary splits are harder to calculate but also result in more accurate trees. The nominal values are divided into two subsets (one for each child). If unchecked, one child is created for each nominal value.
Max #nominal
The subsets for binary nominal splits are expensive to calculate: to find the best subsets for n nominal values, 2^n calculations have to be performed. In the case of many different nominal values this can be prohibitively expensive, so a maximum number of nominal values can be defined for which all possible subsets are calculated. Above this threshold, a heuristic is applied that first determines the best single nominal value for the second partition, then the second best value, and so on, until no improvement can be achieved.
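A rough sketch of the difference between the exhaustive search and the greedy heuristic, using the Gini measure and hypothetical per-value class counts (not the node's actual code):

    from itertools import combinations

    def gini(counts):
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts) if total else 0.0

    def split_quality(left, right, class_counts):
        """Weighted Gini impurity of a binary split of nominal values (lower is better)."""
        def merged(values):
            return [sum(class_counts[v][i] for v in values) for i in range(2)]
        l, r = merged(left), merged(right)
        n = sum(l) + sum(r)
        return (sum(l) * gini(l) + sum(r) * gini(r)) / n

    def exhaustive_best_split(values, class_counts):
        """Evaluate every non-trivial subset for one child (exponential in len(values))."""
        best = None
        for k in range(1, len(values)):
            for right in combinations(values, k):
                left = [v for v in values if v not in right]
                q = split_quality(left, list(right), class_counts)
                if best is None or q < best[0]:
                    best = (q, left, list(right))
        return best

    def greedy_best_split(values, class_counts):
        """Heuristic: move one value at a time into the second partition
        as long as the quality measure keeps improving."""
        left, right = list(values), []
        best_q, improved = float("inf"), True
        while improved and len(left) > 1:
            improved = False
            for v in list(left):
                q = split_quality([x for x in left if x != v], right + [v], class_counts)
                if q < best_q:
                    best_q, best_v, improved = q, v, True
            if improved:
                left.remove(best_v)
                right.append(best_v)
        return best_q, left, right

    # hypothetical two-class counts per nominal value of one attribute
    counts = {"a": [8, 1], "b": [7, 2], "c": [1, 9], "d": [2, 8]}
    print(exhaustive_best_split(list(counts), counts))  # best of all subsets
    print(greedy_best_split(list(counts), counts))      # same partition, found greedily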
Filter invalid attribute values in child nodes
Binary splits on nominal values may lead to tests for attribute values that have already been filtered out by a parent tree node. This is because the learning algorithm consistently uses the table's domain information, rather than the data in a tree node, to define the split sets. These duplicate checks do no harm (the tree is the same and will classify unknown data in exactly the same way), but they are confusing when the tree is inspected in the tree viewer. Enabling this option will post-process the tree and filter out such invalid checks.
No true child strategy
If scoring reaches a node whose attribute value is unknown, one of the following two strategies can be used (see the sketch after the strategy lists below):
returnNullPrediction: predict a missing value
returnLastPrediction: return the majority class of the last node
Missing value strategy
If there are missing values in the data to be predicted, a strategy for handling them can be chosen:
lastPrediction: use the last known class
defaultChild: use the default child and continue traversing its path
NONE: use the noTrueChildStrategy
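A minimal, illustrative sketch of how the two no-true-child strategies could behave during prediction. The tree representation and attribute values are hypothetical; the actual behaviour is implemented in the "Decision Tree Predictor" node:

    def predict(node, row, no_true_child="returnNullPrediction"):
        last_majority = None
        while isinstance(node, dict):
            last_majority = node["majority"]
            child = node["children"].get(row.get(node["attr"]))
            if child is None:                            # no child matches this value
                if no_true_child == "returnLastPrediction":
                    return last_majority                 # majority class of the last node
                return None                              # returnNullPrediction: missing value
            node = child
        return node                                      # reached a leaf: its class label

    tree = {"attr": "outlook", "majority": "yes",
            "children": {"sunny": "no", "rainy": "yes"}}
    print(predict(tree, {"outlook": "overcast"}))                                      # None
    print(predict(tree, {"outlook": "overcast"}, no_true_child="returnLastPrediction"))  # 'yes'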

Input Ports

The pre-classified data that should be used to induce the decision tree. At least one attribute must be nominal.

Output Ports

The induced decision tree. The model can be used to classify data with unknown target (class) attribute. To do so, connect the model out port to the "Decision Tree Predictor" node.

Views

Decision Tree View
Visualizes the learned decision tree. The tree can be expanded and collapsed with the plus/minus signs.
Decision Tree View (simple)
Visualizes the learned decision tree. The tree can be expanded and collapsed with the plus/minus signs. The square brackets show the split criterion: the attribute name on which the parent node was split and the numeric value or set of nominal values that led to this child. The class value in single quotes states the majority class in this node. The value in round brackets states (x of y), where x is the count of the majority class and y is the total number of examples in this node. The bar with the black border, partly filled with yellow, represents the number of records in this node relative to its parent node. The colored pie chart renders the distribution of the color attribute associated with the input data table. NOTE: the colors do not necessarily reflect the class attribute. If the color distribution and the target attribute should correspond to each other, ensure that the "Color Manager" node assigns colors based on the same attribute that is selected as the target attribute in this node.
