Bootstrap options

The bootstrap_type parameter affects two important aspects of choosing a split when building the tree structure:
  • Regularization

    To prevent overfitting, the weight of each training example is varied across the steps of choosing different splits (not across the scoring of different candidates for one split) or across different trees.

  • Speeding up

    When building a new tree, CatBoost calculates a score for each of the numerous split candidates. The computation complexity of this procedure is $O(m \cdot n)$, where:

    • $m$ is the number of numerical features, each providing many split candidates.

    • $n$ is the number of examples.

    Usually, this computation dominates over all other steps of each CatBoost iteration (see Table 1 in the paper). Hence, it seems appealing to speed up this procedure by using only a subset of the examples for scoring all the split candidates.

Depending on the value of the bootstrap_type parameter, these ideas are implemented as described in the table below.
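
As a quick orientation, the following minimal sketch shows how a bootstrap type and its associated parameters are selected through the Python package. The data is synthetic and the parameter values are illustrative, not recommendations.

    import numpy as np
    from catboost import CatBoostClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Bayesian bootstrap: random example weights, used for regularization.
    model_bayes = CatBoostClassifier(
        iterations=50,
        bootstrap_type="Bayesian",
        bagging_temperature=1.0,  # weights sampled from the exponential distribution
        verbose=False,
    )
    model_bayes.fit(X, y)

    # Bernoulli bootstrap: subsampling, used for speeding up.
    model_bernoulli = CatBoostClassifier(
        iterations=50,
        bootstrap_type="Bernoulli",
        subsample=0.5,  # each example is kept with probability 0.5
        verbose=False,
    )
    model_bernoulli.fit(X, y)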
The available bootstrap types, their descriptions, and the associated parameters are listed below.
Bayesian

The weight of an example is set to the following value:

$w = a^t$, where:

  • $t$ is defined by the bagging_temperature parameter.
  • $a = -\log(\psi)$, where $\psi$ is independently generated from Uniform[0, 1]. This is equivalent to generating the values $w$ for all the examples according to the Bayesian bootstrap procedure (see D. Rubin, “The Bayesian Bootstrap”, 1981, Section 2).
Note. The Bayesian bootstrap serves only for regularization, not for speeding up.
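
The following NumPy sketch is not CatBoost internals, only an illustration of the formula above: it generates Bayesian bootstrap weights for a given temperature $t$.

    import numpy as np

    rng = np.random.default_rng(0)
    t = 1.0                        # the bagging_temperature value
    psi = rng.uniform(size=10)     # psi ~ Uniform[0, 1]
    w = (-np.log(psi)) ** t        # t = 1: Exp(1) weights; t = 0: all weights equal 1
    print(w)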
bagging_temperature

Command-line version:

--bagging-temperature

Defines the settings of the Bayesian bootstrap. It is used by default in classification and regression modes.

Use the Bayesian bootstrap to assign random weights to objects.

The weights are sampled from the exponential distribution if the value of this parameter is set to “1”. All weights are equal to 1 if the value of this parameter is set to “0”.

Possible values are in the range $[0, +\infty)$. The higher the value, the more aggressive the bagging.

This parameter can be used if the selected bootstrap type is Bayesian.

sampling_unit

Command-line version:

--sampling-unit

The sampling scheme.

Possible values:
  • Object — The weight $w_i$ of the i-th object is used for sampling the corresponding object.
  • Group — The weight $w_j$ of the group $j$ is used for sampling each object from the group $j$.
Bernoulli

Corresponds to Stochastic Gradient Boosting (SGB; refer to the paper for details). Each example is independently sampled for choosing the current split with the probability defined by the subsample parameter. All the sampled examples have equal weights. Though SGB was originally proposed for regularization, it speeds up calculations almost $1/\text{subsample}$ times.
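
A brief illustration of the scheme (not CatBoost internals): each example is kept independently with probability subsample, and all kept examples get weight 1.

    import numpy as np

    rng = np.random.default_rng(0)
    n, subsample = 10, 0.5
    mask = rng.uniform(size=n) < subsample  # True for sampled examples
    print(mask.astype(int))                 # weights: 1 if sampled, 0 otherwise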

subsample

Command-line version:

--subsample

Sample rate for bagging.

This parameter can be used if one of the following bootstrap types is selected:
  • Poisson
  • Bernoulli
sampling_unit

Command-line version:

--sampling-unit

The sampling scheme.

Possible values:
  • Object — The weight $w_i$ of the i-th object is used for sampling the corresponding object.
  • Group — The weight $w_j$ of the group $j$ is used for sampling each object from the group $j$.
MVS (supported only on CPU)

Implements the importance sampling algorithm called Minimum Variance Sampling (MVS).

Scoring of a split candidate is based on estimating the expected gradient in each leaf (provided by this candidate), where the gradient $g_i$ for the example $i$ is calculated as follows:

$g_i = \left.\dfrac{\partial L(y_i, z)}{\partial z}\right|_{z = M(x_i)}$, where

  • $L$ is the loss function
  • $y_i$ is the target of the example $i$
  • $M(x_i)$ is the current model prediction for the example $i$ (see Algorithm 2 in the paper).

For this estimation, MVS takes the mvs_head_fraction fraction of examples with the highest values of $|g_i|$ with probability $1$ and samples each other example $i$ with probability $p_i = |g_i| / \mu$, where $\mu$ is the minimum of $|g_i|$ among the top mvs_head_fraction examples.

Then, the unbiased estimate of the expected gradient is calculated as follows:

$\hat{g} = \dfrac{1}{n} \sum_{i \in S} \dfrac{g_i}{p_i}$, where $n$ is the number of examples in the leaf, $S$ is the set of sampled examples, and $p_i$ is the probability with which the example $i$ was sampled.
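
The following NumPy sketch illustrates the scheme described above (not CatBoost internals): the top fraction of examples by $|g_i|$ is taken with probability 1, the rest are sampled with probability $|g_i| / \mu$, and the estimate is debiased by the factors $1/p_i$.

    import numpy as np

    rng = np.random.default_rng(0)
    g = rng.normal(size=1000)            # per-example gradients in a leaf
    head_fraction = 0.1                  # plays the role of mvs_head_fraction

    k = int(np.ceil(head_fraction * g.size))
    order = np.argsort(-np.abs(g))       # indices sorted by |g| in descending order
    mu = np.abs(g[order[k - 1]])         # minimum of |g| among the top-k examples

    p = np.minimum(1.0, np.abs(g) / mu)  # sampling probabilities
    sampled = rng.uniform(size=g.size) < p

    g_hat = np.sum(g[sampled] / p[sampled]) / g.size  # unbiased estimate
    print(g_hat, g.mean())               # the two values should be close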

Theoretical basis

This algorithm provides the minimum variance of the estimate $\hat{g}$ among all unbiased subsampling schemes with a given expected number of sampled examples $\mathbb{E}|S| = \sum_i p_i$.

When regularization is absent, this statement coincides with the case of Proposition 3.1 described in the paper, where the parameter vector is scalar (the value in the leaf under consideration) and the gradients are taken at the current model predictions.

Noteworthy, according to this statement, although examples are sampled once for all candidates for the current split, this sampling provides the minimal variance of the estimate for each leaf, with some leaf-specific value of the threshold $\mu$.

MVS can be considered an improved version of Gradient-based One-Side Sampling (GOSS; see details in the paper) implemented in LightGBM, which takes a given number of top examples by the values of $|g_i|$ with probability $1$ and samples the other examples with the same fixed probability. Due to this theoretical basis, MVS provides a lower variance of the estimate than GOSS.

Note.

MVS may not be the best choice for regularization, since the sampled examples and their weights are similar across consecutive iterations.

mvs_head_fraction

Command-line version:

--mvs-head-fraction

Controls the fraction of examples with the highest gradient absolute values that are taken for minimal variance sampling. Possible values are in the range $(0, 1]$.

This parameter can be used if the selected bootstrap type is MVS.

sampling_unit

Command-line version:

--sampling-unit

The sampling scheme.

Possible values:
  • Object — The weight $w_i$ of the i-th object is used for sampling the corresponding object.
  • Group — The weight $w_j$ of the group $j$ is used for sampling each object from the group $j$.
Poisson (refer to the paper for details; supported only on GPU)

The weights of examples are i.i.d. sampled from the Poisson distribution with the parameter $\lambda = -\log(1 - \text{subsample})$, which makes the expected fraction of examples with positive weights equal to the value of the subsample parameter. If subsample is equal to 0.66, this approximates the classical bootstrap (sampling $n$ examples with repetitions).
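
A NumPy illustration of this scheme (not CatBoost internals); the choice $\lambda = -\log(1 - \text{subsample})$ follows from requiring that the probability of a positive weight equals subsample.

    import numpy as np

    rng = np.random.default_rng(0)
    n, subsample = 100_000, 0.66
    lam = -np.log(1.0 - subsample)  # approximately 1, as in the classical bootstrap
    w = rng.poisson(lam, size=n)    # i.i.d. integer weights
    print((w > 0).mean())           # the fraction of positive weights is close to 0.66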

subsample

Command-line version:

--subsample

Sample rate for bagging.

This parameter can be used if one of the following bootstrap types is selected:
  • Poisson
  • Bernoulli
sampling_unit

Command-line version:

--sampling-unit

The sampling scheme.

Possible values:
  • Object — The weight $w_i$ of the i-th object is used for sampling the corresponding object.
  • Group — The weight $w_j$ of the group $j$ is used for sampling each object from the group $j$.
No

All training examples are used with equal weights.

sampling_unit

Command-line version:

--sampling-unit

The sampling scheme.

Possible values:
  • Object — The weight $w_i$ of the i-th object is used for sampling the corresponding object.
  • Group — The weight $w_j$ of the group $j$ is used for sampling each object from the group $j$.

The frequency of resampling and reweighting is defined by the sampling_frequency parameter:

  • PerTree — Before constructing each new tree
  • PerTreeLevel — Before choosing each new split of a tree
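
A minimal sketch of switching the resampling frequency through the Python package (illustrative values; sampling_frequency is applied together with the selected bootstrap type):

    import numpy as np
    from catboost import CatBoostRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = X @ rng.normal(size=5)

    model = CatBoostRegressor(
        iterations=20,
        bootstrap_type="Bernoulli",
        subsample=0.8,
        sampling_frequency="PerTreeLevel",  # resample before choosing each split
        verbose=False,
    )
    model.fit(X, y)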
Tip.

It is recommended to use MVS when speeding up is an important issue and regularization is not. This is usually the case when operating on large datasets. For regularization, other options might be more appropriate.

Related papers

Estimating Uncertainty for Massive Data Streams

N. Chamandy, O. Muralidharan, A. Najmi, and S. Naidu, 2012

Stochastic gradient boosting

J. H. Friedman

Computational Statistics & Data Analysis, 38(4):367–378, 2002

Training Deep Models Faster with Robust, Approximate Importance Sampling

T. B. Johnson and C. Guestrin

In Advances in Neural Information Processing Systems, pages 7276–7286, 2018.

LightGBM: A highly efficient gradient boosting decision tree

G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu

In Advances in Neural Information Processing Systems, pages 3146–3154, 2017.

CatBoost: unbiased boosting with categorical features

Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, Andrey Gulin. NeurIPS, 2018

NeurIPS 2018 paper explaining the principles of ordered boosting and ordered categorical feature statistics.