Bootstrap options
The bootstrap_type parameter affects the following important aspects of choosing a split for a tree when building the tree structure:
- Regularization

  To prevent overfitting, the weight of each training example is varied over steps of choosing different splits (not over scoring different candidates for one split) or over different trees.

- Speeding up

  When building a new tree, CatBoost calculates a score for each of the numerous split candidates. The computation complexity of this procedure is $O(C \cdot n)$, where:

  - $C$ is the number of numerical features, each providing many split candidates.
  - $n$ is the number of examples.

  Usually, this computation dominates over all other steps of each CatBoost iteration (see Table 1 in the paper CatBoost: unbiased boosting with categorical features). Hence, it seems appealing to speed up this procedure by using only a part of the examples for scoring all the split candidates.
Depending on the value of the bootstrap_type parameter, these ideas are implemented as described in the list below:
Bootstrap types
Bayesian
The weight of an example is set to the following value:

$w = a^t$, where

- $t$ is defined by the bagging_temperature parameter,
- $a = -\log(\psi)$, where $\psi$ is independently generated from Uniform[0,1].

This is equivalent to generating values $w$ for all the examples according to the Bayesian bootstrap procedure (see D. Rubin, "The Bayesian Bootstrap", 1981, Section 2).
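The following numpy sketch illustrates this weight generation; it is an illustration of the formula above, not CatBoost's internal code:

```python
import numpy as np

def bayesian_bootstrap_weights(n_examples, bagging_temperature, rng=None):
    """Illustration of w = a**t with a = -log(psi), psi ~ Uniform[0, 1]."""
    rng = rng or np.random.default_rng()
    a = -np.log(rng.uniform(size=n_examples))  # exponentially distributed, mean 1
    return a ** bagging_temperature

print(bayesian_bootstrap_weights(5, bagging_temperature=1.0))  # exponential weights
print(bayesian_bootstrap_weights(5, bagging_temperature=0.0))  # all weights equal 1
```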
Note
The Bayesian bootstrap serves only for the regularization, not for speeding up.
Associated parameters:
bagging_temperature
Command-line version: --bagging-temperature
Defines the settings of the Bayesian bootstrap. It is used by default in classification and regression modes.
Use the Bayesian bootstrap to assign random weights to objects.
The weights are sampled from the exponential distribution if the value of this parameter is set to 1. All weights are equal to 1 if the value of this parameter is set to 0.
Possible values are in the range $[0, +\infty)$. The higher the value, the more aggressive the bagging is.
This parameter can be used if the selected bootstrap type is Bayesian.
sampling_unit
Command-line version: --sampling-unit
The sampling scheme.
Possible values:
- Object — The weight of the i-th object is used for sampling the corresponding object.
- Group — The weight of the group $j$ is used for sampling each object from the group $j$.
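For illustration, a minimal sketch with the catboost Python package (the toy dataset is invented for the example):

```python
from catboost import CatBoostClassifier
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

# The Bayesian bootstrap regularizes by randomly reweighting examples;
# it does not subsample, so it gives no speed-up.
model = CatBoostClassifier(
    iterations=100,
    bootstrap_type='Bayesian',
    bagging_temperature=1.0,  # 1 => exponential weights, 0 => all weights equal 1
    verbose=False,
)
model.fit(X, y)
```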
Bernoulli
Corresponds to Stochastic Gradient Boosting (SGB; refer to the paper Stochastic gradient boosting for details). Each example is independently sampled for choosing the current split with the probability defined by the subsample parameter. All the sampled examples have equal weights. Though SGB was originally proposed for regularization, it speeds up calculations almost $\frac{1}{subsample}$ times.
Associated parameters:
subsample
Command-line version: --subsample
Sample rate for bagging.
This parameter can be used if one of the following bootstrap types is selected:
- Poisson
- Bernoulli
- MVS
sampling_unit
Command-line version: --sampling-unit
The sampling scheme.
Possible values:
- Object — The weight of the i-th object is used for sampling the corresponding object.
- Group — The weight of the group $j$ is used for sampling each object from the group $j$.
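A minimal usage sketch with the catboost Python package (toy data for illustration):

```python
from catboost import CatBoostRegressor
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=1000)

# Each example takes part in scoring a split with probability `subsample`,
# so split scoring is roughly 1 / subsample times faster.
model = CatBoostRegressor(
    iterations=100,
    bootstrap_type='Bernoulli',
    subsample=0.5,
    verbose=False,
)
model.fit(X, y)
```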
MVS
Implements the importance sampling algorithm called Minimum Variance Sampling (MVS).
Scoring of a split candidate is based on estimating the expected gradient in each leaf (provided by this candidate), where the gradient $g_i$ for the example $i$ is calculated as follows:

$g_i = \left. \frac{\partial L(y_i, z)}{\partial z} \right|_{z = M(x_i)}$, where

- $L$ is the loss function,
- $y_i$ is the target of the example $i$,
- $M(x_i)$ is the current model prediction for the example $i$ (see Algorithm 2 in the paper CatBoost: unbiased boosting with categorical features).

For this estimation, MVS samples $subsample \cdot n$ examples such that the largest values of $|g_i|$ are taken with probability 1, and each other example $i$ is sampled with probability $\frac{|g_i|}{\mu}$, where $\mu$ is the threshold above which a gradient is considered large.

Then, the estimate of the expected gradient is calculated as follows:

$\hat{g} = \frac{\sum_{i \in sampled} g_i / p_i}{\sum_{i \in sampled} 1 / p_i}, \quad p_i = \min\left(1, \frac{|g_i|}{\mu}\right)$

- The numerator is the unbiased estimator of the sum of gradients.
- The denominator is the unbiased estimator of the number of training examples.
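The following numpy sketch illustrates this sampling scheme and the resulting estimate; it is an illustration of the formulas above, not CatBoost's implementation (the threshold $\mu$ is found here by simple bisection):

```python
import numpy as np

def mvs_estimate(gradients, subsample, rng=None):
    """Take examples with |g_i| >= mu surely; sample the rest with p_i = |g_i| / mu."""
    rng = rng or np.random.default_rng()
    g = np.abs(gradients)
    target = subsample * len(g)  # expected number of sampled examples

    # sum(min(1, |g| / mu)) decreases in mu; solve sum == target by bisection.
    lo, hi = 1e-12, g.sum() / target
    for _ in range(100):
        mu = 0.5 * (lo + hi)
        if np.minimum(1.0, g / mu).sum() > target:
            lo = mu  # too many examples expected: raise the threshold
        else:
            hi = mu

    p = np.minimum(1.0, g / mu)           # sampling probabilities p_i
    taken = rng.uniform(size=len(g)) < p  # examples with p_i = 1 are always taken

    # Ratio of unbiased estimators: sum of gradients over number of examples.
    return (gradients[taken] / p[taken]).sum() / (1.0 / p[taken]).sum()

g = np.random.default_rng(0).normal(loc=0.3, size=10_000)
print(mvs_estimate(g, subsample=0.2), g.mean())  # estimate vs. exact mean
```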
Theoretical basis
This algorithm provides the minimum variance estimation of the L2 split score for a given expected number of sampled examples:
$Score = \sum_{leaf \in leaves} \frac{\left(\sum_{i \in leaf} g_i\right)^2}{|leaf|}$
Since the score is a fractional function, it is important to reduce the variance of both the numerator and the denominator. The mvs_reg (--mvs-reg) hyperparameter affects the weight of the denominator and can be used for balancing between the importance sampling and the Bernoulli sampling (setting it to 0 implies importance sampling, while $+\infty$ results in Bernoulli sampling). If it is not set manually, the value is determined based on the gradient distribution on the current iteration.
MVS can be considered an improved version of Gradient-based One-Side Sampling (GOSS, see details in the LightGBM paper) implemented in LightGBM, which takes a given number of top examples by $|g_i|$ with probability 1 and samples the other examples with the same fixed probability. Due to its theoretical basis, MVS provides a lower variance of the estimate than GOSS.
Note
MVS may not be the best choice for regularization, since sampled examples and their weights are similar for close iterations.
Associated parameters:
mvs_reg
Command-line version: --mvs-reg
Affects the weight of the denominator and can be used for balancing between the importance sampling and the Bernoulli sampling (setting it to 0 implies importance sampling, while $+\infty$ results in Bernoulli sampling).
sampling_unit
Command-line version: --sampling-unit
The sampling scheme.
Possible values:
- Object — The weight of the i-th object is used for sampling the corresponding object.
- Group — The weight of the group $j$ is used for sampling each object from the group $j$.
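A minimal usage sketch with the catboost Python package (toy data for illustration):

```python
from catboost import CatBoostRegressor
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=1000)

# MVS keeps the examples with large gradients and importance-samples the rest.
model = CatBoostRegressor(
    iterations=100,
    bootstrap_type='MVS',
    subsample=0.5,  # expected fraction of examples used for split scoring
    mvs_reg=0.0,    # 0 => pure importance sampling; larger => closer to Bernoulli
    verbose=False,
)
model.fit(X, y)
```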
Poisson
(Refer to the paper Estimating Uncertainty for Massive Data Streams for details; supported only on GPU.)
The weights of examples are i.i.d. sampled from the Poisson distribution with the parameter $\lambda$ chosen so that the expected fraction of examples with positive weights equals the value of the subsample parameter, i.e., $\lambda = -\log(1 - subsample)$. If subsample is equal to 0.66, this approximates the classical bootstrap (sampling $n$ examples with repetitions), since in that case $\lambda \approx 1$ and the multiplicity of each example in the classical bootstrap is approximately Poisson(1)-distributed.
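A short numpy sketch of this relation (illustrative only, not CatBoost internals):

```python
import numpy as np

def poisson_bootstrap_weights(n_examples, subsample, rng=None):
    """Poisson weights with lambda chosen so that P(weight > 0) == subsample."""
    rng = rng or np.random.default_rng()
    lam = -np.log(1.0 - subsample)  # P(Poisson(lam) > 0) = 1 - exp(-lam) = subsample
    return rng.poisson(lam, size=n_examples)

w = poisson_bootstrap_weights(100_000, subsample=0.66)
print((w > 0).mean())  # ~0.66; lambda ~= 1, close to the classical bootstrap
```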
Associated parameters:
subsample
Command-line version: --subsample
Sample rate for bagging.
This parameter can be used if one of the following bootstrap types is selected:
- Poisson
- Bernoulli
- MVS
sampling_unit
Command-line version: --sampling-unit
The sampling scheme.
Possible values:
- Object — The weight of the i-th object is used for sampling the corresponding object.
- Group — The weight of the group $j$ is used for sampling each object from the group $j$.
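A usage sketch (Poisson bootstrap requires task_type='GPU' and a CUDA-capable device; the toy dataset is invented for the example):

```python
from catboost import CatBoostClassifier
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] > 0).astype(int)

# Poisson bootstrap is available only with task_type='GPU'.
model = CatBoostClassifier(
    iterations=100,
    bootstrap_type='Poisson',
    subsample=0.66,  # approximates the classical bootstrap
    task_type='GPU',
    verbose=False,
)
model.fit(X, y)
```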
No
All training examples are used with equal weights.
Associated parameters:
sampling_unit
Command-line version: --sampling-unit
The sampling scheme.
Possible values:
- Object — The weight of the i-th object is used for sampling the corresponding object.
- Group — The weight of the group $j$ is used for sampling each object from the group $j$.
Frequency of resampling and reweighting
The frequency of resampling and reweighting is defined by the sampling_frequency parameter:
- PerTree — Before constructing each new tree.
- PerTreeLevel — Before choosing each new split of a tree.
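For example, with the Python package (sampling_frequency is a CPU training option):

```python
from catboost import CatBoostRegressor

# Resample/reweight once per tree instead of before every split level.
model = CatBoostRegressor(
    bootstrap_type='Bernoulli',
    subsample=0.8,
    sampling_frequency='PerTree',  # or 'PerTreeLevel'
)
```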
Note
It is recommended to use MVS when speeding up is an important issue and regularization is not. This is usually the case when operating on large datasets. For regularization, other options might be more appropriate.
Related papers
Estimating Uncertainty for Massive Data Streams
N. Chamandy, O. Muralidharan, A. Najmi, and S. Naidu, 2012
Stochastic gradient boosting
J. H. Friedman
Computational Statistics & Data Analysis, 38(4):367–378, 2002
Training Deep Models Faster with Robust, Approximate Importance Sampling
T. B. Johnson and C. Guestrin
In Advances in Neural Information Processing Systems, pages 7276–7286, 2018.
LightGBM: A Highly Efficient Gradient Boosting Decision Tree
G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu.
In Advances in Neural Information Processing Systems, pages 3146–3154, 2017.
CatBoost: unbiased boosting with categorical features
Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, Andrey Gulin. NeurIPS, 2018
NeurIPS 2018 paper explaining the principles of ordered boosting and ordered statistics for categorical features.