How training is performed

The goal of training is to select the model $y$, which depends on a set of features $x_{i}$, that best solves the given problem (regression, classification, or multiclassification) for any input object. This model is found by using a training dataset, which is a set of objects with known features and label values. Accuracy is checked on the validation dataset, which has the same format as the training dataset but is used only to evaluate the quality of training, never for training itself.

CatBoost is based on gradient boosted decision trees. During training, a set of decision trees is built consecutively. Each successive tree is built to reduce the loss left by the previous trees.
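The consecutive construction described above can be illustrated with a minimal sketch. It is not CatBoost's implementation: it assumes a single numerical feature, squared-error loss, and depth-1 trees, and the names `fit_stump`, `predict`, `mse`, and `boost` are hypothetical helpers introduced only for this example. Each new tree is fitted to the residuals of the ensemble built so far, which is why the training loss goes down as trees are added.

```python
def fit_stump(xs, residuals):
    """Find the one-threshold split that minimises squared error on the residuals."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)      # leaf value = mean residual
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value


def predict(trees, learning_rate, x):
    """Sum the scaled contributions of all trees built so far (0 if there are none)."""
    return sum(learning_rate * (lv if x <= t else rv) for t, lv, rv in trees)


def mse(trees, learning_rate, xs, ys):
    """Mean squared error of the current ensemble on the given data."""
    return sum((y - predict(trees, learning_rate, x)) ** 2
               for x, y in zip(xs, ys)) / len(xs)


def boost(xs, ys, n_trees=5, learning_rate=0.5):
    """Build trees one after another; each one is fitted to the current residuals."""
    trees, losses = [], []
    for _ in range(n_trees):
        residuals = [y - predict(trees, learning_rate, x) for x, y in zip(xs, ys)]
        trees.append(fit_stump(xs, residuals))
        losses.append(mse(trees, learning_rate, xs, ys))
    return trees, losses


# Toy data, made up for the illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
trees, losses = boost(xs, ys)
print(losses)  # training loss after each added tree
```

The learning rate scales down each tree's contribution, so no single tree has to fit the residuals completely.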

The number of trees is controlled by the starting parameters. To prevent overfitting, use the overfitting detector. When it is triggered, no further trees are built.
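As a rough illustration of how such a detector can behave, the sketch below stops once the validation loss has gone a fixed number of iterations without improving. This is a simplified patience rule written for this example, not CatBoost's code; the function name `stopping_iteration` and the `wait` argument are hypothetical. In CatBoost itself the detector is configured through training parameters such as `od_type` and `od_wait`, and the maximum number of trees through `iterations`.

```python
def stopping_iteration(validation_losses, wait):
    """Return (stopped_at, best_iteration).

    stopped_at is the iteration index at which the detector triggers, or None
    if the loss never goes `wait` iterations in a row without improving.
    """
    best_loss = float("inf")
    best_iteration = -1
    for iteration, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_iteration = loss, iteration
        elif iteration - best_iteration >= wait:
            return iteration, best_iteration
    return None, best_iteration


# Made-up validation losses: improvement stops after iteration 3.
history = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.60, 0.65, 0.70]
print(stopping_iteration(history, wait=3))  # → (6, 3)
```

In this example the best validation loss is reached at iteration 3, and the detector triggers three iterations later. A model trimmed to the best iteration would keep only the trees built up to that point.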

Building stages for a single tree:

  1. Preliminary calculation of splits.
  2. (Optional) Transforming categorical features to numerical features.
  3. (Optional) Transforming text features to numerical features.
  4. (Optional) Transforming embedding features to numerical features.
  5. Choosing the tree structure. This stage is affected by the Bootstrap options that are set.
  6. Calculating values in leaves.
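
To make the first stage more concrete, here is a minimal sketch under the assumption that the preliminary calculation of splits means discretising each numerical feature into buckets, so that the later search for a tree structure only has to consider a small set of candidate thresholds. The equal-frequency rule used here was chosen for simplicity and is not necessarily the rule CatBoost applies by default; `compute_borders` and `quantize` are hypothetical names.

```python
from bisect import bisect_left


def compute_borders(values, border_count):
    """Pick up to border_count thresholds so buckets hold roughly equal numbers of objects."""
    ordered = sorted(values)
    n = len(ordered)
    borders = []
    for k in range(1, border_count + 1):
        position = k * n // (border_count + 1)
        # Place a border only between two distinct neighbouring values.
        if 0 < position < n and ordered[position - 1] != ordered[position]:
            border = (ordered[position - 1] + ordered[position]) / 2
            if border not in borders:
                borders.append(border)
    return borders


def quantize(value, borders):
    """Bucket index of a value = number of borders strictly below it."""
    return bisect_left(borders, value)


# Made-up feature values for the illustration.
feature = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
borders = compute_borders(feature, border_count=3)
buckets = [quantize(v, borders) for v in feature]
print(borders)  # → [1.25, 2.25, 3.25]
print(buckets)  # → [0, 0, 1, 1, 2, 2, 3, 3]
```

Once the borders are fixed, every candidate split has the form "feature value greater than border", so the tree structure can be chosen by comparing a handful of thresholds per feature instead of every distinct value.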