Cross-validation

Purpose

Training can be launched in cross-validation mode. In this case, only the training dataset is required. This dataset is split, and the resulting folds are used as the learning and evaluation datasets. If the input dataset contains the GroupId column, all objects from one group are added to the same fold.
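The group constraint can be illustrated with a small sketch. This is not CatBoost's internal splitting procedure; it is a hypothetical `assign_group_folds` helper that only shows what "all objects from one group are added to the same fold" means in practice. The round-robin assignment and the `seed` argument are assumptions made for illustration.

```python
import random


def assign_group_folds(group_ids, fold_count, seed=0):
    """Return one fold index per row so that rows sharing a group ID share a fold.

    Illustrative only: the real splitting logic inside CatBoost may differ.
    """
    # Work on unique groups rather than rows, so a group is never split.
    unique_groups = sorted(set(group_ids))
    random.Random(seed).shuffle(unique_groups)
    # Deal the shuffled groups out round-robin across the folds.
    fold_of_group = {g: i % fold_count for i, g in enumerate(unique_groups)}
    return [fold_of_group[g] for g in group_ids]


group_ids = ["q1", "q1", "q2", "q3", "q3", "q3", "q4"]
folds = assign_group_folds(group_ids, fold_count=2, seed=17)
```

Every row with the same group ID receives the same fold index, so no group is divided between the learning and evaluation datasets.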

Each cross-validation run from the command-line interface performs only one of the N trainings that make up N-fold cross-validation.

Use one of the following methods to get aggregated N-fold cross-validation results:
  • Run the training in cross-validation mode from the command-line interface N times with different validation folds and aggregate results by hand.
  • Use the cv function of the Python package instead of the command-line version. It returns aggregated results out of the box.
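The second method might look like the sketch below. The `cv` function and the `Pool` class come from the `catboost` Python package; the helper names `cli_to_cv_kwargs` and `run_cv` are hypothetical. The mapping from command-line options to the `fold_count`, `inverted`, `partition_random_seed` and `shuffle` arguments is an assumed correspondence, not something this page states. The sketch also assumes that `train.tsv` can be loaded by `Pool` without a column description file.

```python
def cli_to_cv_kwargs(cv_type, fold_count, cv_rand=0, no_shuffle=False):
    """Translate the command-line options into keyword arguments for catboost.cv.

    The correspondence is an assumption made for illustration.
    """
    if cv_type not in ("Classical", "Inverted"):
        raise ValueError(f"unsupported cv_type: {cv_type}")
    return {
        "fold_count": fold_count,
        "inverted": cv_type == "Inverted",
        "partition_random_seed": cv_rand,
        "shuffle": not no_shuffle,
    }


def run_cv(train_path, params, **cv_kwargs):
    """Run all folds in one call and return the aggregated results."""
    # Imported lazily so the helper above works without catboost installed.
    from catboost import Pool, cv

    pool = Pool(train_path)
    return cv(pool=pool, params=params, **cv_kwargs)


kwargs = cli_to_cv_kwargs("Classical", fold_count=5, cv_rand=17)
# Uncomment to run on a real dataset:
# results = run_cv("train.tsv", {"loss_function": "Logloss", "iterations": 100}, **kwargs)
```

Unlike the command-line mode, a single call trains on every fold, so no manual aggregation step is needed.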

Execution format

catboost fit -f <file path> --cv <cv_type>:<fold_index>;<fold_count> [--cv-rand <value>] [other parameters]
For example:
catboost fit -f train.tsv --cv Classical:0;5

Options

Option Description Default value
-f

The path to the dataset to cross-validate.

Default value: Required parameter (the path must be specified).

--cv

Enable the cross-validation mode and specify the launching parameters.

Format:
<cv_type>:<fold_index>;<fold_count>
The following cross-validation types (cv_type) are supported:
Classical

Format: Classical:<fold_index>;<fold_count>

  • fold_index is the index of the fold to exclude from the learning data and use for evaluation (indexing starts from zero).

  • fold_count is the number of folds to split the input data into.

All folds except the one indexed fold_index are used as the learning dataset. The fold indexed fold_index is used as the validation dataset.

The inequality fold_index < fold_count must be true.

The data is randomly shuffled before splitting.

Inverted
Format: Inverted:<fold_index>;<fold_count>
  • fold_index is the index of the fold to use for learning (indexing starts from zero).
  • fold_count is the number of folds to split the input data into.

The fold indexed fold_index is used as the learning dataset. All other folds are used as the validation dataset.

The inequality fold_index < fold_count must be true.

The data is randomly shuffled before splitting.

Example
Split the input dataset into 5 folds, use the one indexed 0 for validation and all others for training:
--cv Classical:0;5
Default value: Required parameter for cross-validation

--cv-rand

Use this as the seed value for random permutation of the data.

The permutation is performed before splitting the data for cross-validation.

Different seeds produce different data splits. Use the same seed across runs to obtain the same split.

This option applies only when the --cv type is set to Classical or Inverted.

Default value: 0
--cv-no-shuffle

Do not shuffle the dataset before cross-validation.

Default value: Omitted

other parameters

Any combination of the training parameters.

Default value: See the full list of default values in the Train a model section.
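The difference between the Classical and Inverted types can be sketched in a few lines. This is a minimal illustration, not the algorithm CatBoost uses internally: the hypothetical `split_for_cv` function, its strided fold assignment and its use of Python's `random` module are all assumptions. It only mirrors the roles of fold_index, fold_count, the seed and the shuffle switch described in the options.

```python
import random


def split_for_cv(n_rows, cv_type, fold_index, fold_count, seed=0, shuffle=True):
    """Return (learn_rows, validation_rows) as lists of row indices."""
    if not 0 <= fold_index < fold_count:
        raise ValueError("fold_index must satisfy 0 <= fold_index < fold_count")
    rows = list(range(n_rows))
    if shuffle:
        # The same seed always yields the same permutation, hence the same folds.
        random.Random(seed).shuffle(rows)
    # Fold i takes every fold_count-th row, starting at position i.
    folds = [rows[i::fold_count] for i in range(fold_count)]
    selected = folds[fold_index]
    rest = [r for i, fold in enumerate(folds) if i != fold_index for r in fold]
    if cv_type == "Classical":
        return rest, selected  # learn on the other folds, validate on the selected one
    if cv_type == "Inverted":
        return selected, rest  # learn on the selected fold, validate on the others
    raise ValueError(f"unsupported cv_type: {cv_type}")


learn, validation = split_for_cv(10, "Classical", 0, 5, seed=17)
inv_learn, inv_validation = split_for_cv(10, "Inverted", 0, 5, seed=17)
```

With the same seed, the Inverted learning rows are exactly the Classical validation rows, which is why a fixed seed matters when several runs are meant to cover the same partition.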

Usage examples

Launch the training three times with the same partition random seed and different validation folds to run a three-fold cross-validation:
catboost fit -f train.tsv --cv Classical:0;3 --cv-rand 17 --test-err-log fold_0_error.tsv
catboost fit -f train.tsv --cv Classical:1;3 --cv-rand 17 --test-err-log fold_1_error.tsv
catboost fit -f train.tsv --cv Classical:2;3 --cv-rand 17 --test-err-log fold_2_error.tsv

These trainings generate files with metric values, which can be aggregated manually.
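The manual aggregation could be scripted along the following lines. The sketch assumes that each error log is a tab-separated file with a header row, the iteration number in the first column and one metric per remaining column. That layout is an assumption, so check it against the files actually produced. The function names `read_error_log` and `aggregate_folds` are hypothetical.

```python
import csv
from statistics import mean, pstdev


def read_error_log(path):
    """Return {metric_name: [value per iteration]} for one fold's log file."""
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        # Assumed layout: the first column is the iteration, the rest are metrics.
        columns = {name: [] for name in header[1:]}
        for row in reader:
            for name, value in zip(header[1:], row[1:]):
                columns[name].append(float(value))
    return columns


def aggregate_folds(paths):
    """Return {metric_name: [(mean, std) per iteration]} across all folds."""
    logs = [read_error_log(p) for p in paths]
    result = {}
    for name in logs[0]:
        per_fold = [log[name] for log in logs]
        # zip stops at the shortest log if the folds ran for different lengths.
        result[name] = [(mean(vals), pstdev(vals)) for vals in zip(*per_fold)]
    return result
```

For the three runs above, the call would be `aggregate_folds(["fold_0_error.tsv", "fold_1_error.tsv", "fold_2_error.tsv"])`. The population standard deviation is used here as one reasonable choice; substitute `statistics.stdev` if the sample deviation is preferred.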