TD_DecisionForest

TD_DecisionForest is an ensemble algorithm widely used for classification and regression predictive modeling problems. It extends bootstrap aggregation (bagging) of decision trees. In bagging, each decision tree is built from a different bootstrap sample of the training dataset. A bootstrap sample is drawn with replacement, so the same row may appear in it more than once.

The algorithm also selects a random subset of input features (columns or variables) at each split point during tree construction. Ordinarily, constructing a decision tree involves evaluating every input variable in the data to select a split point. Restricting each split to a random subset of features makes the trees in the ensemble differ more from one another.

The TD_DecisionForest function uses a training dataset to create a predictive model. You can pass the model to the TD_DecisionForestPredict function, which uses it to make predictions. For a regression problem, the prediction is the average of the predictions across the trees in the ensemble. For a classification problem, it is the majority vote for the class label across the trees.
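
The three ideas in the description (bootstrap sampling, a random feature subset per split, and aggregation across trees) can be illustrated with a minimal Python sketch. This is not the TD_DecisionForest implementation; the function names, the toy feature names, and the use of Python's `random` module are assumptions made only for illustration.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so a row may appear more than once."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(features, mtry, rng):
    """Pick the candidate features for one split; -1 means consider all features."""
    if mtry == -1 or mtry >= len(features):
        return list(features)
    return rng.sample(features, mtry)


def aggregate(predictions, model_type):
    """Combine per-tree predictions: mean for regression, majority vote otherwise."""
    if model_type == "regression":
        return mean(predictions)
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(1)  # fixed seed for repeatable results
rows = list(range(10))  # stand-in for training rows
sample = bootstrap_sample(rows, rng)

# Hypothetical feature names, used only for this example.
candidates = random_feature_subset(["age", "income", "tenure", "region"], 2, rng)

regression_prediction = aggregate([2.0, 3.0, 4.0], "regression")
class_prediction = aggregate(["yes", "no", "yes"], "classification")
```

Here `regression_prediction` is the average of the three per-tree values and `class_prediction` is the label predicted by most trees.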

Options

CoverageFactor
Specify the level of coverage for the dataset while building trees, expressed as a multiplier (for example, 1.25 = 125% coverage). CoverageFactor can be used only if NumTrees is not supplied. When NumTrees is specified, coverage depends on the value of NumTrees; otherwise, NumTrees is chosen to achieve this level of coverage. The default coverage is 1.0 (100%) when NumTrees is not supplied. Because bootstrapping samples rows internally, some rows may be chosen multiple times and others not at all. A higher coverage level increases the probability that each input row is selected during tree building, at the cost of building more trees.
InputColumns
Specify the names of the input table columns to use for training the model (predictors, features, or independent variables).
MaxDepth
Specify a decision tree stopping criterion. If the tree reaches a depth past this value, the algorithm stops looking for splits. A decision tree can grow to at most 2^(MaxDepth+1)-1 nodes. Of the stopping criteria, this one has the greatest effect on the performance of the function.
MinImpurity
Specify the minimum impurity at which the tree stops splitting. For regression, the squared-error criterion is used; for classification, Gini impurity is used.
MinNodeSize
Specify the minimum number of observations in a tree node. The algorithm stops splitting a node if the number of observations in the node is equal to or smaller than this value. You must specify a non-negative integer value.
ModelType
Specify whether the analysis is a regression (continuous response variable) or a multiple-class classification (predicting one of a set of classes).
Mtry
Specify the number of input variables to randomly sample at each split. For example, if Mtry is 3, the function randomly samples 3 variables from the inputs at each split. Mtry must be an INTEGER. When Mtry is -1, all variables are used for each split.
MtrySeed
Specify an integer value to use in determining the random seed for Mtry. By default, MtrySeed is 1.
NumTrees
Specify the number of trees to grow in the forest model. When specified, the number of trees must be greater than or equal to the number of AMPs with data. By default, the function builds the minimum number of trees that provides the input dataset with coverage based on CoverageFactor.
ResponseColumn
Specify the name of the column that contains the class label for classification or target value (dependent variable) for regression.
Seed
Specify the random seed the algorithm uses for repeatable results. By default, seed is 1.
Output Schema
Specify the schema for the output table. If Volatile is true, the user login is used as the schema.
Output Table
Specify the name of the output table.
VAL Location
VAL Location
Volatile
Specify whether the output table should be a VOLATILE table. If true, the table is deleted automatically; otherwise, it is the user's responsibility to remove it to free space.
TreeSize
Specify the number of rows that each tree uses as its input data set. The function builds a tree using either the number of rows on an AMP, the number of rows that fit into the AMP's memory (whichever is less), or the number of rows given by the TreeSize argument. By default, this value is computed as the minimum of the number of rows on an AMP, and the number of rows that fit into the AMP's memory.
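
Several of the options above come down to simple arithmetic, which the following Python sketch illustrates. It is not the function's implementation. All function names are hypothetical. The stopping test assumes the root is at depth 0, the squared-error criterion is assumed to be the mean squared deviation from the node mean, and the coverage estimate assumes rows are drawn uniformly with replacement, which is a simplified model rather than the function's exact sampling scheme.

```python
import math


def max_nodes(max_depth):
    """Upper bound on nodes in a binary tree of the given depth: 2^(d+1) - 1."""
    return 2 ** (max_depth + 1) - 1


def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def squared_error(values):
    """Mean squared deviation from the node mean (assumed normalization)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def should_stop(depth, n_rows, impurity, max_depth, min_node_size, min_impurity):
    """Stop splitting when any of the three criteria is met (root depth assumed 0)."""
    return (
        depth >= max_depth
        or n_rows <= min_node_size
        or impurity <= min_impurity
    )


def expected_row_coverage(coverage_factor):
    """Approximate fraction of rows drawn at least once when coverage_factor * n
    rows are sampled uniformly with replacement from n rows (large n)."""
    return 1.0 - math.exp(-coverage_factor)


nodes_at_depth_3 = max_nodes(3)                       # 15
balanced_gini = gini_impurity(["a", "a", "b", "b"])   # 0.5
coverage_100 = expected_row_coverage(1.0)             # about 0.632
coverage_125 = expected_row_coverage(1.25)            # larger than coverage_100
```

Under this simplified model, raising the coverage multiplier increases the share of rows expected to be sampled at least once, which matches the trade-off described for CoverageFactor.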

Input Ports

Connection to a Teradata Database Instance
Specifies the table containing the input data.

Output Ports

Output of TD_DecisionForest
