Bootstrap options

The bootstrap_type parameter affects the following important aspects of choosing splits when building the tree structure:

  • Regularization

    To prevent overfitting, the weight of each training example is varied across the steps of choosing different splits (not across scoring different candidates for one split) or across different trees.

  • Speeding up

    When building a new tree, CatBoost calculates a score for each of the numerous split candidates. The computational complexity of this procedure is $O(|C| \cdot n)$, where:

    • $|C|$ is the number of numerical features, each providing many split candidates.

    • $n$ is the number of examples.

    Usually, this computation dominates all other steps of each CatBoost iteration (see Table 1 in the paper CatBoost: unbiased boosting with categorical features). Hence, it is appealing to speed up this procedure by using only a part of the examples for scoring all the split candidates.

Depending on the value of the bootstrap_type parameter, these ideas are implemented as described in the list below:

Bootstrap types

Bayesian

The weight of an example is set to the following value:

$w = a^{t}$, where:

  • $t$ is defined by the bagging_temperature parameter.
  • $a = -\log(\psi)$, where $\psi$ is independently generated from $\mathrm{Uniform}[0, 1]$. This is equivalent to generating values $a$ for all the examples according to the Bayesian bootstrap procedure (see D. Rubin, “The Bayesian Bootstrap”, 1981, Section 2).

Note

The Bayesian bootstrap serves only for the regularization, not for speeding up.
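A minimal NumPy sketch of this weight generation (illustrative only, not CatBoost's internal code; the function name is made up for this example):

```python
import numpy as np

rng = np.random.default_rng(0)

def bayesian_bootstrap_weights(n_examples, bagging_temperature):
    # psi ~ Uniform[0, 1]; a = -log(psi) follows the Exponential(1) distribution,
    # and the weight of each example is w = a ** t with t = bagging_temperature.
    psi = rng.uniform(size=n_examples)
    a = -np.log(psi)
    return a ** bagging_temperature

print(bayesian_bootstrap_weights(5, bagging_temperature=1.0))  # exponential weights
print(bayesian_bootstrap_weights(5, bagging_temperature=0.0))  # all weights equal 1
```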

Associated parameters:

bagging_temperature

Command-line version: --bagging-temperature

Defines the settings of the Bayesian bootstrap. It is used by default in classification and regression modes.

Use the Bayesian bootstrap to assign random weights to objects.

The weights are sampled from an exponential distribution if the value of this parameter is set to 1. All weights are equal to 1 if the value of this parameter is set to 0.

Possible values are in the range $[0; +\infty)$. The higher the value, the more aggressive the bagging.

This parameter can be used if the selected bootstrap type is Bayesian.
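
A minimal Python sketch of selecting the Bayesian bootstrap in the CatBoost package (the toy data is made up for illustration):

```python
from catboost import CatBoostRegressor
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=100)

# Bayesian bootstrap with a moderate bagging temperature;
# bagging_temperature=0 would make all example weights equal to 1.
model = CatBoostRegressor(
    bootstrap_type='Bayesian',
    bagging_temperature=1.0,
    iterations=50,
    verbose=False,
)
model.fit(X, y)
```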

sampling_unit

Command-line version: --sampling-unit

The sampling scheme.

Possible values:

  • Object — The weight $w_{i}$ of the $i$-th object $o_{i}$ is used for sampling the corresponding object.
  • Group — The weight $w_{j}$ of the group $g_{j}$ is used for sampling each object $o_{i_{j}}$ from the group $g_{j}$.

Bernoulli

Corresponds to Stochastic Gradient Boosting (SGB; refer to the paper Stochastic gradient boosting for details). Each example is independently sampled for choosing the current split with the probability defined by the subsample parameter. All the sampled examples have equal weights. Though SGB was originally proposed for regularization, it also speeds up calculations almost $\frac{1}{subsample}$ times.
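
A minimal NumPy sketch of this sampling step (illustrative only; the function name is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_sample_weights(n_examples, subsample):
    # Each example is kept independently with probability `subsample`;
    # kept examples get weight 1, the rest get weight 0.
    kept = rng.uniform(size=n_examples) < subsample
    return kept.astype(float)

weights = bernoulli_sample_weights(1_000_000, subsample=0.5)
# Only about subsample * n examples take part in split scoring,
# so scoring is roughly 1 / subsample times faster.
print(weights.mean())  # close to 0.5
```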

Associated parameters:

subsample

Command-line version: --subsample

Sample rate for bagging.

This parameter can be used if one of the following bootstrap types is selected:

  • Poisson
  • Bernoulli
  • MVS

sampling_unit

Command-line version: --sampling-unit

The sampling scheme.

Possible values:

  • Object — The weight $w_{i}$ of the $i$-th object $o_{i}$ is used for sampling the corresponding object.
  • Group — The weight $w_{j}$ of the group $g_{j}$ is used for sampling each object $o_{i_{j}}$ from the group $g_{j}$.

MVS

Implements the importance sampling algorithm called Minimum Variance Sampling (MVS).

Scoring of a split candidate is based on estimating the expected gradient in each leaf (provided by this candidate), where the gradient $g_{i}$ for the example $i$ is calculated as follows:

$g_{i} = \left. \frac{\partial L(y_{i}, z)}{\partial z} \right|_{z=M(i)}$, where $L$ is the loss function, $y_{i}$ is the target of the $i$-th example, and $M(i)$ is the current model prediction for this example.

For this estimation, MVS samples examples $i$ (the expected fraction of sampled examples is defined by the subsample parameter) such that the largest values of $|g_i|$ are taken with probability $p_{i}=1$, and each other example $i$ is sampled with probability $p_{i} = \frac{|g_i|}{\mu}$, where $\mu$ is the threshold above which a gradient is considered large.

Then, the estimate of the expected gradient is calculated as follows:

$\hat{E\,g} = \dfrac{\sum_{i:\ \text{sampled examples}} \frac{g_i}{p_i}}{\sum_{i:\ \text{sampled examples}} \frac{1}{p_i}}$, where

  • The numerator is the unbiased estimator of the sum of gradients.
  • The denominator is the unbiased estimator of the number of training samples.
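
A minimal NumPy sketch of this sampling and of the resulting estimate (illustrative only; the gradients and the threshold are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

grad = rng.normal(size=10_000)  # per-example gradients g_i
mu = 1.5                        # threshold for "large" gradients

# p_i = 1 for |g_i| >= mu, otherwise p_i = |g_i| / mu.
p = np.minimum(1.0, np.abs(grad) / mu)
kept = rng.uniform(size=grad.size) < p

# Unbiased estimates of the gradient sum and of the number of examples,
# combined into the expected-gradient estimate used for split scoring.
grad_sum_hat = np.sum(grad[kept] / p[kept])
count_hat = np.sum(1.0 / p[kept])
expected_grad_hat = grad_sum_hat / count_hat
print(expected_grad_hat, grad.mean())  # the estimate is close to the true mean
```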

Theoretical basis

This algorithm provides the minimum variance estimation of the L2 split score for a given expected number of sampled examples:

$s = \sum_{i:\ \text{all training examples}} p_{i}$.

Since the score is a fractional function, it is important to reduce the variance of both the numerator and the denominator. The mvs_reg (--mvs-reg) hyperparameter affects the weight of the denominator and can be used for balancing between importance sampling and Bernoulli sampling (setting it to 0 implies importance sampling; setting it to $\infty$ implies Bernoulli sampling). If it is not set manually, the value is chosen based on the gradient distribution on the current iteration.

MVS can be considered an improved version of Gradient-based One-Side Sampling (GOSS; see details in the paper Lightgbm: A highly efficient gradient boosting decision tree) implemented in LightGBM, which samples a given number of top examples by the values $|g_i|$ with probability 1 and samples the other examples with the same fixed probability. Due to this theoretical basis, MVS provides a lower variance of the estimate $\hat{E\,g}$ than GOSS.

Note

MVS may not be the best choice for regularization, since sampled examples and their weights are similar for close iterations.

Associated parameters:

mvs_reg

Command-line version: --mvs-reg

Affects the weight of the denominator and can be used for balancing between importance sampling and Bernoulli sampling (setting it to 0 implies importance sampling; setting it to $\infty$ implies Bernoulli sampling).

sampling_unit

Command-line version: --sampling-unit

The sampling scheme.

Possible values:

  • Object — The weight $w_{i}$ of the $i$-th object $o_{i}$ is used for sampling the corresponding object.
  • Group — The weight $w_{j}$ of the group $g_{j}$ is used for sampling each object $o_{i_{j}}$ from the group $g_{j}$.
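
A minimal Python sketch of selecting MVS in the CatBoost package (the toy data is made up; the mvs_reg parameter is available in recent CatBoost versions):

```python
from catboost import CatBoostClassifier
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.1, size=200) > 0).astype(int)

# MVS keeps the examples with the largest gradients and importance-samples
# the rest; mvs_reg balances between importance sampling (0) and
# Bernoulli sampling (large values).
model = CatBoostClassifier(
    bootstrap_type='MVS',
    subsample=0.8,
    mvs_reg=1.0,
    iterations=50,
    verbose=False,
)
model.fit(X, y)
```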

Poisson

Supported only on GPU. Refer to the paper Estimating Uncertainty for Massive Data Streams for details.

The weights of examples are i.i.d. sampled from the Poisson distribution with the parameter $-\log(1-\text{subsample})$, which makes the expected fraction of examples with positive weights equal to the subsample parameter. If subsample is equal to 0.66, this approximates the classical bootstrap (sampling $n$ examples with replacement).
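
A minimal NumPy sketch of this weight generation (illustrative only; the function name is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_bootstrap_weights(n_examples, subsample):
    # lam is chosen so that P(weight > 0) = 1 - exp(-lam) = subsample.
    lam = -np.log(1.0 - subsample)
    return rng.poisson(lam, size=n_examples)

weights = poisson_bootstrap_weights(1_000_000, subsample=0.66)
print((weights > 0).mean())  # close to 0.66
```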

Associated parameters:

subsample

Command-line version: --subsample

Sample rate for bagging.

This parameter can be used if one of the following bootstrap types is selected:

  • Poisson
  • Bernoulli
  • MVS

sampling_unit

Command-line version: --sampling-unit

The sampling scheme.

Possible values:

  • Object — The weight $w_{i}$ of the $i$-th object $o_{i}$ is used for sampling the corresponding object.
  • Group — The weight $w_{j}$ of the group $g_{j}$ is used for sampling each object $o_{i_{j}}$ from the group $g_{j}$.

No

All training examples are used with equal weights.

Associated parameters:

sampling_unit

Command-line version: --sampling-unit

The sampling scheme.

Possible values:

  • Object — The weight $w_{i}$ of the $i$-th object $o_{i}$ is used for sampling the corresponding object.
  • Group — The weight $w_{j}$ of the group $g_{j}$ is used for sampling each object $o_{i_{j}}$ from the group $g_{j}$.

Frequency of resampling and reweighting

The frequency of resampling and reweighting is defined by the sampling_frequency parameter:

  • PerTree — Before constructing each new tree

  • PerTreeLevel — Before choosing each new split of a tree

    Note

    It is recommended to use MVS when speeding up is an important issue and regularization is not. This is usually the case when operating on large data. For regularization, other options might be more appropriate.
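
A minimal Python sketch of combining a bootstrap type with the sampling frequency in the CatBoost package (the toy data is made up; sampling_frequency applies to CPU training):

```python
from catboost import CatBoostRegressor
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=100)

# Resample once per tree; 'PerTreeLevel' would resample before every split choice.
model = CatBoostRegressor(
    bootstrap_type='Bernoulli',
    subsample=0.8,
    sampling_frequency='PerTree',
    iterations=50,
    verbose=False,
)
model.fit(X, y)
```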

References

Estimating Uncertainty for Massive Data Streams

N. Chamandy, O. Muralidharan, A. Najmi, and S. Naidu, 2012

Stochastic gradient boosting

J. H. Friedman

Computational Statistics & Data Analysis, 38(4):367–378, 2002

Training Deep Models Faster with Robust, Approximate Importance Sampling

T. B. Johnson and C. Guestrin

In Advances in Neural Information Processing Systems, pages 7276–7286, 2018.

Lightgbm: A highly efficient gradient boosting decision tree

G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu.

In Advances in Neural Information Processing Systems, pages 3146–3154, 2017.

CatBoost: unbiased boosting with categorical features

L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin

NeurIPS, 2018

NeurIPS 2018 paper explaining the Ordered boosting principle and ordered categorical feature statistics.