Cross-validation
Purpose
Training can be launched in cross-validation mode. In this mode, only the training dataset is required: it is split into folds, and the resulting folds are used as the learning and evaluation datasets. If the input dataset contains a GroupId column, all objects from the same group are placed in the same fold.
Each cross-validation run from the command-line interface performs one of the N trainings that make up an N-fold cross-validation.
Use one of the following methods to get aggregated N-fold cross-validation results:
- Run the training in cross-validation mode from the command-line interface N times with different validation folds and aggregate the results manually.
- Use the cv function of the Python package instead of the command-line version. It returns aggregated results out of the box (see the sketch below).
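For reference, a minimal sketch of the Python route. The dataset path, column description file, and training parameters below are placeholders; the exact set of columns in the returned table depends on the chosen metrics.

from catboost import Pool, cv

# Hypothetical paths and parameters; substitute your own.
train_pool = Pool("train.tsv", column_description="train.cd")

params = {
    "iterations": 100,
    "loss_function": "Logloss",
}

# cv splits the pool into fold_count folds, trains one model per fold,
# and returns per-iteration metric values aggregated across all folds.
results = cv(pool=train_pool, params=params, fold_count=5,
             partition_random_seed=17)

# results is a pandas DataFrame with columns such as
# test-Logloss-mean and test-Logloss-std.
print(results.tail(1))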
Execution format
catboost fit -f <file path> --cv <cv_type>:<fold_index>;<fold_count> [--cv-rand <value>] [other parameters]
For example:
catboost fit -f train.tsv --cv Classical:0;5
Options
-f
Description
The path to the dataset to cross-validate.
Default value
Required parameter (the path must be specified).
--cv
Description
Enable the cross-validation mode and specify the launching parameters.
Format:
<cv_type>:<fold_index>;<fold_count>
The following cross-validation types (cv_type) are supported:
Format: Classical:<fold_index>;<fold_count>
- fold_index is the index of the fold to exclude from the learning data and use for evaluation (indexing starts from zero).
- fold_count is the number of folds to split the input data into.
All folds except the one indexed fold_index are used as the learning dataset. The fold indexed fold_index is used as the validation dataset.
The inequality 0 ≤ fold_index < fold_count must be true.
The data is randomly shuffled before splitting.
Format: Inverted:<fold_index>;<fold_count>
- fold_index is the index of the fold to use for learning (indexing starts from zero).
- fold_count is the number of folds to split the input data into.
The fold indexed fold_index is used as the learning dataset. All other folds are used as the validation dataset.
The inequality 0 ≤ fold_index < fold_count must be true.
The data is randomly shuffled before splitting.
Example
Split the input dataset into 5 folds, use the one indexed 0 for validation and all others for training:
--cv Classical:0;5
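For comparison, an Inverted split over the same 5 folds uses the fold indexed 0 for training and all others for validation:
--cv Inverted:0;5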
Default value
Required parameter for cross-validation
--cv-rand
Description
Use this as the seed value for random permutation of the data.
The permutation is performed before splitting the data for cross-validation.
Each seed generates unique data splits.
It must be used with the --cv parameter type set to Classical or Inverted.
Default value
0
--cv-no-shuffle
Description
Do not shuffle the dataset before cross-validation.
Default value
Omitted
other parameters
Description
Any combination of the training parameters.
Default value
See the full list of default values in the Train a model section.
Usage examples
Launch the training three times with the same partition random seed and different validation folds to run a three-fold cross-validation:
catboost fit -f train.tsv --cv Classical:0;3 --cv-rand 17 --test-err-log fold_0_error.tsv
catboost fit -f train.tsv --cv Classical:1;3 --cv-rand 17 --test-err-log fold_1_error.tsv
catboost fit -f train.tsv --cv Classical:2;3 --cv-rand 17 --test-err-log fold_2_error.tsv
These trainings generate files with metric values, which can then be aggregated manually (see the sketch below).
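One possible way to aggregate them, assuming each --test-err-log file is a tab-separated table with a header row, an iteration column, and one column per metric (a sketch, not part of the CatBoost API):

import pandas as pd

# Hypothetical file names, matching the commands above.
fold_files = ["fold_0_error.tsv", "fold_1_error.tsv", "fold_2_error.tsv"]

# Assumption: each file is a TSV with a header row and one row per iteration.
frames = [pd.read_csv(path, sep="\t") for path in fold_files]

# Element-wise mean across folds; rows are aligned by iteration,
# so each metric cell becomes its average over the three folds.
aggregated = sum(frames) / len(frames)

# The last row holds the cross-validated metric values for the final iteration.
print(aggregated.tail(1))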