Cross-validation

Purpose

Training can be launched in cross-validation mode. In this case, only the training dataset is required. This dataset is split, and the resulting folds are used as the learning and evaluation datasets. If the input dataset contains the GroupId column, all objects from one group are added to the same fold.
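The group rule above can be pictured with a small sketch. It is an illustration only, not CatBoost's actual partitioning code: the function name assign_group_folds and the round-robin assignment are assumptions. It shows one way to guarantee that objects sharing a GroupId end up in the same fold.

```python
import random


def assign_group_folds(group_ids, fold_count, seed=0):
    """Return a fold number for every object so that groups stay together.

    Hypothetical helper: folds are assigned per group, not per object.
    """
    unique_groups = sorted(set(group_ids))
    # Shuffle the groups (not the objects) with a fixed seed.
    random.Random(seed).shuffle(unique_groups)
    # Deal whole groups to folds in round-robin order.
    fold_of_group = {
        group: position % fold_count
        for position, group in enumerate(unique_groups)
    }
    return [fold_of_group[group] for group in group_ids]


if __name__ == "__main__":
    groups = ["a", "b", "a", "c", "b", "d"]
    print(assign_group_folds(groups, fold_count=2, seed=1))
```

Because the fold is looked up by group, two objects with the same GroupId can never be separated between the learning and validation datasets.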

Each run in cross-validation mode from the command-line interface performs only one of the N trainings that make up an N-fold cross-validation.

Use one of the following methods to get aggregated N-fold cross-validation results:

  • Run the training in cross-validation mode from the command-line interface N times with different validation folds and aggregate results by hand.
  • Use the cv function of the Python package instead of the command-line version. It returns aggregated results out-of-the-box.
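As a rough sketch of the first method, the N command lines can be generated instead of typed by hand. The helper name build_cv_commands and the per-fold log file names are assumptions made for illustration; only flags that appear in this document are used. Passing the arguments as a list also keeps a shell from treating the ; in the --cv value as a command separator.

```python
def build_cv_commands(train_path, fold_count, seed=0, cv_type="Classical"):
    """Return one `catboost fit` argument list per validation fold."""
    commands = []
    for fold_index in range(fold_count):
        commands.append([
            "catboost", "fit",
            "-f", train_path,
            "--cv", f"{cv_type}:{fold_index};{fold_count}",
            # The same seed for every run keeps the partition identical.
            "--cv-rand", str(seed),
            # Hypothetical naming scheme: one error log per fold.
            "--test-err-log", f"fold_{fold_index}_error.tsv",
        ])
    return commands


if __name__ == "__main__":
    for command in build_cv_commands("train.tsv", fold_count=3, seed=17):
        print(" ".join(command))
        # To launch the training, the catboost binary must be on PATH:
        # import subprocess; subprocess.run(command, check=True)
```

The results written by each run still have to be aggregated separately.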

Execution format

catboost fit -f <file path> --cv <cv_type>:<fold_index>;<fold_count> [--cv-rand <value>] [other parameters]

For example:

catboost fit -f train.tsv --cv Classical:0;5

Options

-f

Description

The path to the dataset to cross-validate.

Default value

Required parameter (the path must be specified).

--cv

Description

Enable the cross-validation mode and specify the launching parameters.

Format:

<cv_type>:<fold_index>;<fold_count>

The following cross-validation types (cv_type) are supported:

Format: Classical:<fold_index>;<fold_count>

  • fold_index is the index of the fold to exclude from the learning data and use for evaluation (indexing starts from zero).

  • fold_count is the number of folds to split the input data into.

All folds except the one indexed fold_index are used as the learning dataset. The fold indexed fold_index is used as the validation dataset.

The inequality fold_index < fold_count must be true.

The data is randomly shuffled before splitting.

Format: Inverted:<fold_index>;<fold_count>

  • fold_index is the index of the fold to use for learning (indexing starts from zero).
  • fold_count is the number of folds to split the input data into.

The fold indexed fold_index is used as the learning dataset. All other folds are used as the validation dataset.

The inequality fold_index < fold_count must be true.

The data is randomly shuffled before splitting.
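The difference between the Classical and Inverted types can be sketched as follows. This is a minimal illustration under stated assumptions, not CatBoost's implementation: the function names split_folds and select_datasets are hypothetical, and the way objects are dealt into folds is chosen for brevity.

```python
import random


def split_folds(object_ids, fold_count, seed=0, shuffle=True):
    """Shuffle the objects with a seed, then deal them into fold_count folds."""
    ids = list(object_ids)
    if shuffle:
        random.Random(seed).shuffle(ids)
    # Round-robin dealing; an assumption, chosen only to keep the sketch short.
    return [ids[start::fold_count] for start in range(fold_count)]


def select_datasets(folds, cv_type, fold_index):
    """Return (learning, validation) object lists for the given cv_type."""
    if not 0 <= fold_index < len(folds):
        raise ValueError("fold_index must be less than fold_count")
    chosen = folds[fold_index]
    rest = [
        obj
        for index, fold in enumerate(folds)
        if index != fold_index
        for obj in fold
    ]
    if cv_type == "Classical":
        return rest, chosen  # learn on the other folds, validate on fold_index
    if cv_type == "Inverted":
        return chosen, rest  # learn on fold_index, validate on the other folds
    raise ValueError(f"unsupported cv_type: {cv_type}")


if __name__ == "__main__":
    folds = split_folds(range(10), fold_count=5, seed=17)
    print(select_datasets(folds, "Classical", 0))
    print(select_datasets(folds, "Inverted", 0))
```

With the same folds and the same fold_index, the two types return the same pair of datasets with the learning and validation roles swapped.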

Example

Split the input dataset into 5 folds, use the one indexed 0 for validation and all others for training:

--cv Classical:0;5
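To make the structure of this value explicit, here is a small sketch that splits a --cv string into its three parts and checks the fold_index < fold_count constraint. The function parse_cv_spec is hypothetical and is not part of CatBoost; it only mirrors the format described above.

```python
def parse_cv_spec(spec):
    """Split '<cv_type>:<fold_index>;<fold_count>' into a typed tuple."""
    cv_type, _, rest = spec.partition(":")
    index_text, _, count_text = rest.partition(";")
    if cv_type not in ("Classical", "Inverted"):
        raise ValueError(f"unsupported cv_type: {cv_type!r}")
    # int() raises ValueError when a part is missing or not a number.
    fold_index, fold_count = int(index_text), int(count_text)
    if not 0 <= fold_index < fold_count:
        raise ValueError("fold_index must be less than fold_count")
    return cv_type, fold_index, fold_count


if __name__ == "__main__":
    print(parse_cv_spec("Classical:0;5"))
```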

Default value

Required parameter for cross-validation.

--cv-rand

Description

The seed value for the random permutation of the data.

The permutation is performed before splitting the data for cross-validation.

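Different seeds produce different data splits. Use the same seed in every run of one N-fold cross-validation so that all runs share the same partition.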

It must be used with the --cv parameter type set to Classical or Inverted.

Default value

0

--cv-no-shuffle

Description

Do not shuffle the dataset before cross-validation.

Default value

Omitted

other parameters

Description

Any combination of the training parameters.

Default value

See the full list of default values in the Train a model section.

Usage examples

Launch the training three times with the same partition random seed and different validation folds to run a three-fold cross-validation:

catboost fit -f train.tsv --cv Classical:0;3 --cv-rand 17 --test-err-log fold_0_error.tsv
catboost fit -f train.tsv --cv Classical:1;3 --cv-rand 17 --test-err-log fold_1_error.tsv
catboost fit -f train.tsv --cv Classical:2;3 --cv-rand 17 --test-err-log fold_2_error.tsv

These trainings generate files with metric values, which can be aggregated manually.
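One possible way to do the aggregation is sketched below. It assumes that each error log is a tab-separated file with a header row, the iteration number in the first column and one metric per remaining column; check the actual files before relying on this. The function names read_metric_table and aggregate_folds are hypothetical.

```python
import csv
import io
import statistics


def read_metric_table(text):
    """Parse one error log (assumed TSV with a header) into {iteration: values}."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    values = {int(row[0]): [float(cell) for cell in row[1:]] for row in body}
    return header, values


def aggregate_folds(fold_texts):
    """Return metric names and {iteration: [(mean, std) per metric]}."""
    tables = [read_metric_table(text) for text in fold_texts]
    metric_names = tables[0][0][1:]
    # Keep only the iterations that are present in every fold.
    shared = set.intersection(*(set(values) for _, values in tables))
    aggregated = {}
    for iteration in sorted(shared):
        per_metric = zip(*(values[iteration] for _, values in tables))
        aggregated[iteration] = [
            (statistics.mean(column), statistics.pstdev(column))
            for column in per_metric
        ]
    return metric_names, aggregated


if __name__ == "__main__":
    # Made-up sample contents, used only to show the call.
    fold_a = "iter\tLogloss\n0\t0.6\n1\t0.4\n"
    fold_b = "iter\tLogloss\n0\t0.8\n1\t0.6\n"
    print(aggregate_folds([fold_a, fold_b]))
```

For the three runs above, the file contents could be read with pathlib.Path("fold_0_error.tsv").read_text() and so on, then passed to aggregate_folds as a list.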