# Bootstrap options

The bootstrap_type parameter affects the following important aspects of choosing a split when building the tree structure:

• Regularization

To prevent overfitting, the weight of each training example is re-drawn between the steps of choosing different splits (not between scoring different candidates for one split) or between different trees.

• Speeding up

When building a new tree, CatBoost calculates a score for each of the numerous split candidates. The computation complexity of this procedure is $O(|C|\cdot n)$, where:

• $|C|$ is the number of split candidates (each numerical feature provides many candidates).

• $n$ is the number of examples.

Usually, this computation dominates over all other steps of each CatBoost iteration (see Table 1 in CatBoost: unbiased boosting with categorical features). Hence, it is appealing to speed up this procedure by using only a subset of the examples for scoring all the split candidates.

Depending on the value of the bootstrap_type parameter, these ideas are implemented as described in the list below:

## Bootstrap types

### Bayesian

The weight of an example is set to the following value:

$w = a^{t}$, where:

• $t$ is defined by the bagging_temperature parameter
• $a = -\log(\psi)$, where $\psi$ is independently generated from Uniform[0, 1]. This is equivalent to generating values $a$ for all the examples according to the Bayesian bootstrap procedure (see D. Rubin, "The Bayesian Bootstrap", 1981, Section 2).
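
For intuition, here is a minimal NumPy sketch of this weight-drawing procedure (the function and its arguments are illustrative, not part of the CatBoost API):

```python
import numpy as np

rng = np.random.default_rng(0)

def bayesian_bootstrap_weights(n_examples, t):
    """Draw per-example weights w = a**t with a = -log(psi), psi ~ Uniform[0, 1]."""
    psi = rng.uniform(0.0, 1.0, size=n_examples)
    a = -np.log(psi)  # a follows the Exponential(1) distribution
    return a ** t     # t = 0 gives all weights equal to 1

print(bayesian_bootstrap_weights(5, t=1.0))
```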

Note

The Bayesian bootstrap serves only for regularization, not for speeding up.

Associated parameters:

bagging_temperature

Command-line version: --bagging-temperature

Defines the settings of the Bayesian bootstrap. It is used by default in classification and regression modes.

Use the Bayesian bootstrap to assign random weights to objects.

The weights are sampled from the exponential distribution if the value of this parameter is set to 1. All weights are equal to 1 if the value of this parameter is set to 0.

Possible values are in the range $[0, \infty)$. The higher the value, the more aggressive the bagging.

This parameter can be used if the selected bootstrap type is Bayesian.
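
For example, a minimal sketch of enabling the Bayesian bootstrap through the Python package (synthetic data for illustration only):

```python
import numpy as np
from catboost import CatBoostClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

model = CatBoostClassifier(
    iterations=50,
    bootstrap_type='Bayesian',
    bagging_temperature=1.0,  # 1: exponentially distributed weights; 0: all weights are 1
    verbose=False,
)
model.fit(X, y)
```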

sampling_unit

Command-line version: --sampling-unit

The sampling scheme.

Possible values:

• Object — The weight $w_{i}$ of the i-th object $o_{i}$ is used for sampling the corresponding object.
• Group — The weight $w_{j}$ of the group $g_{j}$ is used for sampling each object $o_{i_{j}}$ from the group $g_{j}$.

### Bernoulli

Corresponds to Stochastic Gradient Boosting (SGB; refer to Stochastic gradient boosting for details). Each example is independently sampled for choosing the current split with the probability defined by the subsample parameter. All sampled examples have equal weights. Though SGB was originally proposed for regularization, it speeds up calculations almost $\frac{1}{\text{subsample}}$ times.
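
A minimal NumPy sketch of this sampling scheme (illustrative only; CatBoost performs the sampling internally when bootstrap_type is Bernoulli):

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_sample(n_examples, subsample):
    """Keep each example independently with probability `subsample`; equal weights."""
    mask = rng.uniform(size=n_examples) < subsample
    return np.flatnonzero(mask)  # indices of the examples used for scoring splits

print(bernoulli_sample(10, subsample=0.66))  # about subsample * n_examples indices on average
```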

Associated parameters:

subsample

Command-line version: --subsample

Sample rate for bagging.

This parameter can be used if one of the following bootstrap types is selected:

• Poisson
• Bernoulli
• MVS

sampling_unit

Command-line version: --sampling-unit

The sampling scheme.

Possible values:

• Object — The weight $w_{i}$ of the i-th object $o_{i}$ is used for sampling the corresponding object.
• Group — The weight $w_{j}$ of the group $g_{j}$ is used for sampling each object $o_{i_{j}}$ from the group $g_{j}$.

### MVS

Implements the importance sampling algorithm called Minimum Variance Sampling (MVS).

Scoring a split candidate is based on estimating the expected gradient in each leaf (provided by this candidate), where the gradient $g_{i}$ for example $i$ is calculated as follows:

$g_{i} = \left.\frac{\partial L(y_{i}, z)}{\partial z}\right|_{z=M(i)}$, where:

• $L$ is the loss function
• $y_{i}$ is the target of the example $i$
• $M(i)$ is the current model prediction for example $i$ (see Algorithm 2 in CatBoost: unbiased boosting with categorical features).

For this estimation, MVS samples a subsample of the examples such that the largest values of $|g_i|$ are taken with probability $p_{i}=1$ and each other example $i$ is sampled with probability $p_{i}=\frac{|g_i|}{\mu}$, where $\mu$ is the threshold above which a gradient is considered large.

Then, the estimate of the expected gradient is calculated as follows:

$\hat{E\, g} = \frac{\sum_{i:\ sampled\ examples} \frac{g_i}{p_i}}{\sum_{i:\ sampled\ examples} \frac{1}{p_i}}$, where:

• The numerator is the unbiased estimator of the sum of gradients.
• The denominator is the unbiased estimator of the number of training samples.
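
A minimal NumPy sketch of this sampling and estimation step, assuming the threshold $\mu$ is already given (CatBoost chooses it internally; the function name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def mvs_estimate(g, mu):
    """Sample examples with p_i = min(1, |g_i| / mu) and return the
    importance-weighted estimate of the expected gradient in a leaf."""
    p = np.minimum(1.0, np.abs(g) / mu)    # the largest |g_i| get p_i = 1
    sampled = rng.uniform(size=g.shape) < p
    num = (g[sampled] / p[sampled]).sum()  # unbiased estimate of the sum of gradients
    den = (1.0 / p[sampled]).sum()         # unbiased estimate of the number of examples
    return num / den

g = rng.normal(size=1000)
print(mvs_estimate(g, mu=1.0))             # compare with g.mean()
```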

#### Theoretical basis

This algorithm provides the minimum-variance estimate of the L2 split score for a given expected number of sampled examples:

$s=\sum_{i:\ all\ training\ examples} p_{i}$.

Since the score is a fractional function, it is important to reduce the variance of both the numerator and the denominator. The mvs_reg (--mvs-reg) hyperparameter affects the weight of the denominator and can be used to balance between importance sampling and Bernoulli sampling (setting it to 0 implies pure importance sampling, while $\infty$ implies Bernoulli sampling). If it is not set manually, its value is chosen based on the gradient distribution at the current iteration.
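
For intuition, a small sketch, assuming bisection is an acceptable way to find the threshold $\mu$ for a target expected sample size $s$ (this is not CatBoost's internal code):

```python
import numpy as np

def find_mu(g, s, iters=60):
    """Bisect for the threshold mu such that sum(min(1, |g_i| / mu)) == s,
    i.e. the expected number of sampled examples equals s."""
    abs_g = np.abs(g)
    lo, hi = 0.0, abs_g.sum()  # expected sample size is decreasing in mu
    for _ in range(iters):
        mu = 0.5 * (lo + hi)
        if np.minimum(1.0, abs_g / mu).sum() > s:
            lo = mu  # too many examples expected; raise the threshold
        else:
            hi = mu
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
g = rng.normal(size=1000)
mu = find_mu(g, s=500.0)
print(np.minimum(1.0, np.abs(g) / mu).sum())  # close to 500
```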

MVS can be considered an improved version of the Gradient-based One-Side Sampling (GOSS) implemented in LightGBM (see details in the LightGBM paper), which samples a given number of top examples by the value of $|g_i|$ with probability 1 and samples the other examples with the same fixed probability. Due to this theoretical basis, MVS provides a lower variance of the estimate $\hat{E\, g}$ than GOSS.

Note

MVS may not be the best choice for regularization, since the sampled examples and their weights are similar on close iterations.

Associated parameters:

mvs_reg

Command-line version: --mvs-reg

Affects the weight of the denominator and can be used to balance between importance sampling and Bernoulli sampling (setting it to 0 implies pure importance sampling, while $\infty$ implies Bernoulli sampling).

sampling_unit

Command-line version: --sampling-unit

The sampling scheme.

Possible values:

• Object — The weight $w_{i}$ of the i-th object $o_{i}$ is used for sampling the corresponding object.
• Group — The weight $w_{j}$ of the group $g_{j}$ is used for sampling each object $o_{i_{j}}$ from the group $g_{j}$.
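
Putting these parameters together, a minimal usage sketch with the Python package (synthetic data for illustration only):

```python
import numpy as np
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X[:, 0] + rng.normal(scale=0.1, size=500)

model = CatBoostRegressor(
    iterations=100,
    bootstrap_type='MVS',
    subsample=0.5,  # expected fraction of examples used at each sampling step
    # mvs_reg is chosen adaptively when omitted; 0 means pure importance sampling
    verbose=False,
)
model.fit(X, y)
```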

### Poisson

Supported only on GPU. Refer to Estimating Uncertainty for Massive Data Streams for details.

The weights of the examples are i.i.d. sampled from the Poisson distribution with parameter $-\log(1-\text{subsample})$, so that the expected fraction of examples with positive weights equals the value of the subsample parameter. If subsample is equal to 0.66, this approximates the classical bootstrap (sampling $n$ examples with repetitions).
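
A minimal NumPy sketch of this weight-drawing procedure (illustrative only; CatBoost applies it internally on GPU when bootstrap_type is Poisson):

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_weights(n_examples, subsample):
    """Draw i.i.d. Poisson(-log(1 - subsample)) weights; P(weight > 0) = subsample."""
    lam = -np.log(1.0 - subsample)
    return rng.poisson(lam, size=n_examples)

w = poisson_weights(100000, subsample=0.66)
print((w > 0).mean())  # close to 0.66 by construction
```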

Associated parameters:

subsample

Command-line version: --subsample

Sample rate for bagging.

This parameter can be used if one of the following bootstrap types is selected:

• Poisson
• Bernoulli
• MVS

sampling_unit

Command-line version: --sampling-unit

The sampling scheme.

Possible values:

• Object — The weight $w_{i}$ of the i-th object $o_{i}$ is used for sampling the corresponding object.
• Group — The weight $w_{j}$ of the group $g_{j}$ is used for sampling each object $o_{i_{j}}$ from the group $g_{j}$.

### No

All training examples are used with equal weights.

Associated parameters:

sampling_unit

Command-line version: --sampling-unit

The sampling scheme.

Possible values:

• Object — The weight $w_{i}$ of the i-th object $o_{i}$ is used for sampling the corresponding object.
• Group — The weight $w_{j}$ of the group $g_{j}$ is used for sampling each object $o_{i_{j}}$ from the group $g_{j}$.

## Frequency of resampling and reweighting

The frequency of resampling and reweighting is defined by the sampling_frequency parameter:

• PerTree — Before constructing each new tree

• PerTreeLevel — Before choosing each new split of a tree
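
For example, a minimal sketch of selecting per-level resampling through the Python package (synthetic data for illustration only):

```python
import numpy as np
from catboost import CatBoostClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 2, size=200)

model = CatBoostClassifier(
    iterations=50,
    bootstrap_type='Bernoulli',
    subsample=0.8,
    sampling_frequency='PerTreeLevel',  # resample before choosing each new split
    verbose=False,
)
model.fit(X, y)
```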

Note

It is recommended to use MVS when speeding up training is important and regularization is not, which is usually the case with large datasets. For regularization, other options might be more appropriate.

#### Estimating Uncertainty for Massive Data Streams

N. Chamandy, O. Muralidharan, A. Najmi, and S. Naidu, 2012

#### Stochastic gradient boosting

J. H. Friedman

Computational Statistics & Data Analysis, 38(4):367–378, 2002

#### Training Deep Models Faster with Robust, Approximate Importance Sampling

T. B. Johnson and C. Guestrin

In Advances in Neural Information Processing Systems, pages 7276–7286, 2018.

#### LightGBM: A Highly Efficient Gradient Boosting Decision Tree

G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu

In Advances in Neural Information Processing Systems, pages 3146–3154, 2017.

#### CatBoost: unbiased boosting with categorical features

Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, Andrey Gulin. NeurIPS, 2018

NeurIPS 2018 paper explaining the principles of ordered boosting and ordered categorical feature statistics.