Developer Guide and Reference

  • 2021.6
  • 04/11/2022
  • Public Content
Contents

Decision Forest Classification and Regression (DF)

Decision Forest (DF) classification and regression algorithms are based on an ensemble of tree-structured classifiers, which are known as decision trees. Decision forest is built using the general technique of bagging, a bootstrap aggregation, and a random choice of features. For more details, see [Breiman84] and [Breiman2001].

Mathematical formulation

Training
Given LaTex Math image. feature vectors LaTex Math image. of size LaTex Math image., their non-negative observation weights LaTex Math image. and LaTex Math image. responses LaTex Math image.,
Classification
  • LaTex Math image., where LaTex Math image. is the number of classes
Regression
  • LaTex Math image.
the problem is to build a decision forest classification or regression model.
The library uses the following algorithmic framework for the training stage. Let LaTex Math image. be the set of observations. Given positive integer parameters, such as the number of trees LaTex Math image., the bootstrap parameter LaTex Math image., where LaTex Math image. is a fraction of observations used for a training of each tree in the forest, and the number of features per node LaTex Math image., the algorithm does the following for LaTex Math image.:
  • Selects randomly with replacement the set LaTex Math image. of LaTex Math image. vectors from the set LaTex Math image.. The set LaTex Math image. is called a
    bootstrap
    set.
  • Trains a decision tree classifier LaTex Math image. on LaTex Math image. using parameter LaTex Math image. for each tree.
Decision tree LaTex Math image. is trained using the training set LaTex Math image. of size LaTex Math image.. Each node LaTex Math image. in the tree corresponds to the subset LaTex Math image. of the training set LaTex Math image., with the root node being LaTex Math image. itself. Each internal node LaTex Math image. represents a binary test (split) that divides the subset LaTex Math image. in two subsets, LaTex Math image. and LaTex Math image., corresponding to their children, LaTex Math image. and LaTex Math image..
Training method:
Dense
In
dense
training method, all possible splits for each feature are taken from the subset of selected features for the current node and evaluated for best split computation.
Training method:
Hist
In
hist
training method, only a selected subset of splits is considered for best split computation. This subset of splits is computed for each feature at the initialization stage of the algorithm. After computing the subset of splits, each value from the initially provided data is substituted with the value of the corresponding bin. Bins are continuous intervals between selected splits.
Split Criteria
The metric for measuring the best split is called
impurity
, LaTex Math image.. It generally reflects the homogeneity of responses within the subset LaTex Math image. in the node LaTex Math image..
Classification
Gini index
is an impurity metric for classification, calculated as follows:
LaTex Math image.
where
  • LaTex Math image. is a set of observations that reach the node;
  • LaTex Math image. is specified in the table below:
Decision Forest Split Criteria Calculation
Without sample weights
With sample weights
LaTex Math image. is the observed fraction of observations that belong to class LaTex Math image. in LaTex Math image.
LaTex Math image. is the observed weighted fraction of observations that belong to class LaTex Math image. in LaTex Math image.:
LaTex Math image.
Regression
MSE
is an impurity metric for regression, calculated as follows:
MSE Impurity Metric
Without sample weights
With sample weights
LaTex Math image.
LaTex Math image.
LaTex Math image., which is equivalent to the number of elements in LaTex Math image.
LaTex Math image.
Let the
impurity decrease
in the node LaTex Math image. be
LaTex Math image.
Termination Criteria
The library supports the following termination criteria of decision forest training:
Minimal number of observations in a leaf node
Node LaTex Math image. is not processed if LaTex Math image. is smaller than the predefined value. Splits that produce nodes with the number of observations smaller than that value are not allowed.
Minimal number of observations in a split node
Node LaTex Math image. is not processed if LaTex Math image. is smaller than the predefined value. Splits that produce nodes with the number of observations smaller than that value are not allowed.
Minimum weighted fraction of the sum total of weights of all the input observations required to be at a leaf node
Node LaTex Math image. is not processed if LaTex Math image. is smaller than the predefined value. Splits that produce nodes with the number of observations smaller than that value are not allowed.
Maximal tree depth
Node LaTex Math image. is not processed if its depth in the tree reached the predefined value.
Impurity threshold
Node LaTex Math image. is not processed if its impurity is smaller than the predefined threshold.
Maximal number of leaf nodes
Grow trees with positive maximal number of leaf nodes in a best-first fashion. Best nodes are defined by relative reduction in impurity. If maximal number of leaf nodes equals zero, then this criterion does not limit the number of leaf nodes, and trees grow in a depth-first fashion.
Tree Building Strategies
Maximal number of leaf nodes defines the strategy of tree building: depth-first or best-first.
Depth-first Strategy
If maximal number of leaf nodes equals zero, a decision tree is built using depth-first strategy. In each terminal node LaTex Math image., the following recursive procedure is applied:
  • Stop if the termination criteria are met.
  • Choose randomly without replacement LaTex Math image. feature indices LaTex Math image..
  • For each LaTex Math image., find the best split LaTex Math image. that partitions subset LaTex Math image. and maximizes impurity decrease LaTex Math image..
  • A node is a split if this split induces a decrease of the impurity greater than or equal to the predefined value. Get the best split LaTex Math image. that maximizes impurity decrease LaTex Math image. in all LaTex Math image. splits.
  • Apply this procedure recursively to LaTex Math image. and LaTex Math image..
Best-first Strategy
If maximal number of leaf nodes is positive, a decision tree is built using best-first strategy. In each terminal node LaTex Math image., the following steps are applied:
  • Stop if the termination criteria are met.
  • Choose randomly without replacement LaTex Math image. feature indices LaTex Math image..
  • For each LaTex Math image., find the best split LaTex Math image. that partitions subset LaTex Math image. and maximizes impurity decrease LaTex Math image..
  • A node is a split if this split induces a decrease of the impurity greater than or equal to the predefined value and the number of split nodes is less or equal to LaTex Math image.. Get the best split LaTex Math image. that maximizes impurity decrease LaTex Math image. in all LaTex Math image. splits.
  • Put a node into a sorted array, where sort criterion is the improvement in impurity LaTex Math image.. The node with maximal improvement is the first in the array. For a leaf node, the improvement in impurity is zero.
  • Apply this procedure to LaTex Math image. and LaTex Math image. and grow a tree one by one getting the first element from the array until the array is empty.
Inference
Given decision forest classification or regression model and vectors LaTex Math image., the problem is to calculate the responses for those vectors.
Inference methods:
Dense
and
Hist
Dense
and
hist
inference methods perform prediction in the same way. To solve the problem for each given query vector LaTex Math image., the algorithm does the following:
Classification
For each tree in the forest, it finds the leaf node that gives LaTex Math image. its label. The label LaTex Math image. that the majority of trees in the forest vote for is chosen as the predicted label for the query vector LaTex Math image..
Regression
For each tree in the forest, it finds the leaf node that gives LaTex Math image. the response as the mean of dependent variables. The mean of responses from all trees in the forest is the predicted response for the query vector LaTex Math image..
Additional Characteristics Calculated by the Decision Forest
Decision forests can produce additional characteristics, such as an estimate of generalization error and an importance measure (relative decisive power) of each of p features (variables).
Out-of-bag Error
The estimate of the generalization error based on the training data can be obtained and calculated as follows:
Classification
  • For each vector LaTex Math image. in the dataset LaTex Math image., predict its label LaTex Math image. by having the majority of votes from the trees that contain LaTex Math image. in their OOB set, and vote for that label.
  • Calculate the OOB error of the decision forest LaTex Math image. as the average of misclassifications:
    LaTex Math image.
  • If OOB error value per each observation is required, then calculate the prediction error for LaTex Math image.: LaTex Math image.
Regression
  • For each vector LaTex Math image. in the dataset LaTex Math image., predict its response LaTex Math image. as the mean of prediction from the trees that contain LaTex Math image. in their OOB set:
    LaTex Math image., where LaTex Math image. and LaTex Math image. is the result of prediction LaTex Math image. by LaTex Math image..
  • Calculate the OOB error of the decision forest LaTex Math image. as the Mean-Square Error (MSE):
    LaTex Math image.
  • If OOB error value per each observation is required, then calculate the prediction error for LaTex Math image.:
    LaTex Math image.
Variable Importance
There are two main types of variable importance measures:
  • Mean Decrease Impurity
    importance (MDI)
    Importance of the LaTex Math image.-th variable for predicting LaTex Math image. is the sum of weighted impurity decreases LaTex Math image. for all nodes LaTex Math image. that use LaTex Math image., averaged over all LaTex Math image. trees in the forest:
    LaTex Math image.
    where LaTex Math image. is the fraction of observations reaching node LaTex Math image. in the tree LaTex Math image., and LaTex Math image. is the index of the variable used in split LaTex Math image..
  • Mean Decrease Accuracy
    (MDA)
    Importance of the LaTex Math image.-th variable for predicting LaTex Math image. is the average increase in the OOB error over all trees in the forest when the values of the LaTex Math image.-th variable are randomly permuted in the OOB set. For that reason, this latter measure is also known as
    permutation importance
    .
    In more details, the library calculates MDA importance as follows:
    • Let LaTex Math image. be the set of feature vectors where the LaTex Math image.-th variable is randomly permuted over all vectors in the set.
    • Let LaTex Math image. be the OOB error calculated for LaTex Math image. on its out-of-bag dataset LaTex Math image..
    • Let LaTex Math image. be the OOB error calculated for LaTex Math image. using LaTex Math image., and its out-of-bag dataset LaTex Math image. is permuted on the LaTex Math image.-th variable. Then
      • LaTex Math image. is the OOB error increase for the tree LaTex Math image..
      • LaTex Math image. is MDA importance.
      • LaTex Math image., where LaTex Math image. is the variance of LaTex Math image.

Distributed mode

The algorithm supports distributed execution in SMPD mode (only on GPU).

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.