Cross-validation

CatBoost supports cross-validation on a given dataset.

Choose the implementation for more details.

Python package

Class

cv

Class purpose

Perform cross-validation on the dataset.

Command-line version

For the catboost fit command:

Purpose

Training can be launched in cross-validation mode. In this case, only the training dataset is required. This dataset is split, and the resulting folds are used as the learning and evaluation datasets. If the input dataset contains the GroupId column, all objects from one group are added to the same fold.
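The group constraint described above can be illustrated with a minimal sketch. It is not CatBoost's actual splitting algorithm; the function name assign_folds_by_group, the round-robin assignment, and the sample group identifiers are assumptions made for illustration only.

```python
import random


def assign_folds_by_group(group_ids, fold_count, seed=0):
    """Return one fold index per object so that each group stays in a single fold.

    Illustrative only: groups are shuffled and dealt round-robin across folds.
    """
    unique_groups = sorted(set(group_ids))
    random.Random(seed).shuffle(unique_groups)
    # Every object inherits the fold of its group, so a group is never split.
    group_to_fold = {g: i % fold_count for i, g in enumerate(unique_groups)}
    return [group_to_fold[g] for g in group_ids]


# Hypothetical GroupId values for seven objects.
group_ids = ["q1", "q1", "q2", "q3", "q3", "q3", "q4"]
folds = assign_folds_by_group(group_ids, fold_count=2)
```

Because folds are assigned per group rather than per object, fold sizes may be uneven when group sizes differ.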

Each cross-validation run from the command-line interface performs only one of the N trainings that make up N-fold cross-validation.

Use one of the following methods to get aggregated N-fold cross-validation results:

  • Run the training in cross-validation mode from the command-line interface N times with different validation folds and aggregate results by hand.
  • Use the cv function of the Python package instead of the command-line version. It returns aggregated results out-of-the-box.
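For the first method, the N runs can be prepared programmatically. The sketch below only builds the argument lists and shows one way to average per-fold scores; it does not launch CatBoost. The helper names cv_commands and aggregate and the file name train.tsv are hypothetical, and collecting the metric value from each run is left to the reader.

```python
def cv_commands(learn_set, fold_count, cv_type="Classical"):
    """Build one `catboost fit` argument list per validation fold."""
    return [
        ["catboost", "fit", "--learn-set", learn_set,
         "--cv", f"{cv_type}:{fold_index};{fold_count}"]
        for fold_index in range(fold_count)
    ]


def aggregate(scores):
    """Return the mean and population standard deviation of per-fold scores."""
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, variance ** 0.5


# "train.tsv" is a placeholder path, not a file shipped with CatBoost.
commands = cv_commands("train.tsv", fold_count=5)
```

Passing each list to a process launcher without a shell avoids quoting issues. In a POSIX shell the semicolon separates commands, so the --cv value would need to be quoted there.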

Command keys

--cv

Key description

Enable the cross-validation mode and specify the launching parameters.

Format:

<cv_type>:<fold_index>;<fold_count>

The following cross-validation types (cv_type) are supported:

Classical

Format: Classical:<fold_index>;<fold_count>

  • fold_index is the index of the fold to exclude from the learning data and use for evaluation (indexing starts from zero).

  • fold_count is the number of folds to split the input data into.

All folds except the one indexed fold_index are used as the learning dataset. The fold indexed fold_index is used as the validation dataset.

The inequality fold_index < fold_count must be true.

The data is randomly shuffled before splitting.
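The Classical scheme can be sketched in a few lines. This is an illustration of the semantics described above, not CatBoost's internal implementation; the function name classical_split, the strided fold layout, and the seed argument are assumptions.

```python
import random


def classical_split(n_objects, fold_index, fold_count, seed=0):
    """Return (learn, validation) index lists for the Classical scheme."""
    if not 0 <= fold_index < fold_count:
        raise ValueError("fold_index must satisfy 0 <= fold_index < fold_count")
    indices = list(range(n_objects))
    random.Random(seed).shuffle(indices)  # shuffle before splitting
    folds = [indices[i::fold_count] for i in range(fold_count)]
    # The selected fold is held out; all remaining folds form the learning data.
    validation = folds[fold_index]
    learn = [i for k, fold in enumerate(folds) if k != fold_index for i in fold]
    return learn, validation


learn, validation = classical_split(n_objects=10, fold_index=0, fold_count=5)
```

With 10 objects and 5 folds, 8 objects go to the learning dataset and 2 to the validation dataset.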

Inverted

Format: Inverted:<fold_index>;<fold_count>

  • fold_index is the index of the fold to use for learning (indexing starts from zero).
  • fold_count is the number of folds to split the input data into.

The fold indexed fold_index is used as the learning dataset. All other folds are used as the validation dataset.

The inequality fold_index < fold_count must be true.

The data is randomly shuffled before splitting.
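The Inverted scheme swaps the roles of the folds: a single fold is used for learning and the rest for validation. The sketch below illustrates that behavior only; the function name inverted_split, the strided fold layout, and the seed argument are assumptions, not CatBoost internals.

```python
import random


def inverted_split(n_objects, fold_index, fold_count, seed=0):
    """Return (learn, validation) index lists for the Inverted scheme."""
    if not 0 <= fold_index < fold_count:
        raise ValueError("fold_index must satisfy 0 <= fold_index < fold_count")
    indices = list(range(n_objects))
    random.Random(seed).shuffle(indices)  # shuffle before splitting
    folds = [indices[i::fold_count] for i in range(fold_count)]
    # Only the selected fold is used for learning; the others are for validation.
    learn = folds[fold_index]
    validation = [i for k, fold in enumerate(folds) if k != fold_index for i in fold]
    return learn, validation


learn_inv, validation_inv = inverted_split(n_objects=10, fold_index=0, fold_count=5)
```

With 10 objects and 5 folds, 2 objects go to the learning dataset and 8 to the validation dataset.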

Example

Split the input dataset into 5 folds, use the one indexed 0 for validation and all others for training:

--cv Classical:0;5
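When such values are generated by scripts, it can help to check them before launching a training. The following is a minimal sketch of a validator for the <cv_type>:<fold_index>;<fold_count> format; parse_cv_value is a hypothetical helper and not part of CatBoost.

```python
def parse_cv_value(value):
    """Split a --cv value into (cv_type, fold_index, fold_count) and validate it."""
    cv_type, _, rest = value.partition(":")
    fold_index_text, _, fold_count_text = rest.partition(";")
    if cv_type not in ("Classical", "Inverted"):
        raise ValueError(f"unsupported cv_type: {cv_type!r}")
    fold_index = int(fold_index_text)  # raises ValueError if not an integer
    fold_count = int(fold_count_text)
    if not 0 <= fold_index < fold_count:
        raise ValueError("fold_index must satisfy 0 <= fold_index < fold_count")
    return cv_type, fold_index, fold_count


parsed = parse_cv_value("Classical:0;5")
```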

--cv-rand

Purpose

The seed value for the random permutation of the data.

The permutation is performed before splitting the data for cross-validation.

Each seed produces its own data split, and reusing a seed reproduces the same split.

This key must be used together with the --cv key, with cv_type set to Classical or Inverted.
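The effect of the seed can be illustrated with Python's standard random module. This is only an analogy: CatBoost uses its own random number generator, so the permutations below will not match the ones produced by --cv-rand. The function name shuffled_order is hypothetical.

```python
import random


def shuffled_order(n_objects, seed):
    """Return a seeded permutation of object indices, applied before splitting."""
    order = list(range(n_objects))
    random.Random(seed).shuffle(order)
    return order


same_a = shuffled_order(20, seed=1)
same_b = shuffled_order(20, seed=1)  # same seed, same permutation
other = shuffled_order(20, seed=2)   # different seed, different permutation
```

Keeping the seed fixed across all N runs matters when aggregating by hand: the folds only partition the dataset consistently if every run uses the same permutation.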