Cross-validation
CatBoost supports cross-validation on a given dataset.
Choose the implementation for more details.
Python package
Class: cv
Class purpose: Perform cross-validation on the dataset.
Command-line version
For the catboost fit command:
Purpose
Training can be launched in cross-validation mode. In this case, only the training dataset is required. This dataset is split, and the resulting folds are used as the learning and evaluation datasets. If the input dataset contains the GroupId column, all objects from one group are added to the same fold.
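The GroupId rule above can be illustrated with a minimal sketch. It is not CatBoost's actual splitting code: the function name `assign_group_folds`, the round-robin assignment, and the sample group identifiers are all assumptions made for illustration. It only shows what it means for a group never to span two folds.

```python
import random
from collections import defaultdict

def assign_group_folds(group_ids, fold_count, seed=0):
    """Map every object to a fold so that a group never spans two folds."""
    # Unique groups in order of first appearance, then shuffled reproducibly.
    groups = list(dict.fromkeys(group_ids))
    random.Random(seed).shuffle(groups)
    # Deal the shuffled groups round-robin across the folds (illustrative choice).
    fold_of_group = {g: i % fold_count for i, g in enumerate(groups)}
    return [fold_of_group[g] for g in group_ids]

# Hypothetical GroupId column: one entry per object.
group_ids = ["q1", "q1", "q2", "q3", "q3", "q3", "q4"]
folds = assign_group_folds(group_ids, fold_count=2)

# Collect the folds seen for each group; every set should hold exactly one fold.
folds_per_group = defaultdict(set)
for group, fold in zip(group_ids, folds):
    folds_per_group[group].add(fold)
```

Because the fold is chosen per group rather than per object, fold sizes can differ when group sizes differ.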
Each run in cross-validation mode from the command-line interface performs only one of the N trainings that make up N-fold cross-validation.
Use one of the following methods to get aggregated N-fold cross-validation results:
- Run the training in cross-validation mode from the command-line interface N times with different validation folds and aggregate results by hand.
- Use the cv function of the Python package instead of the command-line version. It returns aggregated results out of the box.
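The first method can be sketched as follows. This is a hedged outline, not an official recipe: the helper names `build_cv_commands` and `aggregate`, the file name `train.tsv`, and the per-fold directory naming are hypothetical. Reading the metric value out of each run's output is left out, because it depends on the chosen metric and output files.

```python
from statistics import mean, stdev

def build_cv_commands(learn_set, fold_count, seed=0, cv_type="Classical"):
    """Return one `catboost fit` argument list per validation fold."""
    commands = []
    for fold_index in range(fold_count):
        commands.append([
            "catboost", "fit",
            "--learn-set", learn_set,
            "--cv", f"{cv_type}:{fold_index};{fold_count}",
            # Keep the seed fixed so that every run uses the same shuffle.
            "--cv-rand", str(seed),
            # Separate output directory per fold (hypothetical naming scheme).
            "--train-dir", f"cv_fold_{fold_index}",
        ])
    return commands

def aggregate(fold_scores):
    """Mean and sample standard deviation of the per-fold metric values."""
    return mean(fold_scores), stdev(fold_scores)

commands = build_cv_commands("train.tsv", fold_count=5)
```

Each argument list can be passed to `subprocess.run` as is. Passing a list rather than a single shell string keeps the `;` in the `--cv` value from being interpreted by the shell.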
Command keys
--cv
Key description
Enable the cross-validation mode and specify the launching parameters.
Format:
<cv_type>:<fold_index>;<fold_count>
The following cross-validation types (cv_type) are supported:
Classical
Format: Classical:<fold_index>;<fold_count>
- fold_index is the index of the fold to exclude from the learning data and use for evaluation (indexing starts from zero).
- fold_count is the number of folds to split the input data into.
All folds, except the one indexed fold_index, are used as the learning dataset. The fold indexed fold_index is used as the validation dataset.
The inequality fold_index < fold_count must be true.
The data is randomly shuffled before splitting.
Inverted
Format: Inverted:<fold_index>;<fold_count>
- fold_index is the index of the fold to use for learning (indexing starts from zero).
- fold_count is the number of folds to split the input data into.
The fold indexed fold_index is used as the learning dataset. All other folds are used as the validation dataset.
The inequality fold_index < fold_count must be true.
The data is randomly shuffled before splitting.
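The difference between the Classical and Inverted types can be contrasted in a short sketch. It is an illustration under stated assumptions, not CatBoost's implementation: the function `cv_split`, the strided fold construction, and the use of Python's `random` module are choices made here for clarity.

```python
import random

def cv_split(n_objects, fold_index, fold_count, cv_type="Classical", seed=0):
    """Return (learn_indices, validation_indices) for one cross-validation run."""
    if not 0 <= fold_index < fold_count:
        raise ValueError("fold_index must satisfy 0 <= fold_index < fold_count")
    indices = list(range(n_objects))
    random.Random(seed).shuffle(indices)  # shuffle before splitting
    # Strided slices give fold_count folds of near-equal size.
    folds = [indices[i::fold_count] for i in range(fold_count)]
    selected = folds[fold_index]
    rest = [i for k, fold in enumerate(folds) if k != fold_index for i in fold]
    if cv_type == "Classical":
        return rest, selected  # learn on the rest, validate on the selected fold
    if cv_type == "Inverted":
        return selected, rest  # learn on the selected fold, validate on the rest
    raise ValueError(f"unknown cv_type: {cv_type}")

learn, valid = cv_split(10, fold_index=0, fold_count=5)
inv_learn, inv_valid = cv_split(10, fold_index=0, fold_count=5, cv_type="Inverted")
```

With 10 objects and 5 folds, the Classical split learns on 8 objects and validates on 2, while the Inverted split swaps the two roles.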
Example
Split the input dataset into 5 folds, use the one indexed 0 for validation and all others for training:
--cv Classical:0;5
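To make the value format concrete, the sketch below parses a `--cv` value into its three parts and checks it before a run is launched. The helper `parse_cv_value` is hypothetical and is not part of CatBoost. The range check assumes that fold_index must be smaller than fold_count, which follows from zero-based fold indexing.

```python
def parse_cv_value(value):
    """Parse '<cv_type>:<fold_index>;<fold_count>' into its three parts."""
    cv_type, _, rest = value.partition(":")
    index_text, _, count_text = rest.partition(";")
    if cv_type not in ("Classical", "Inverted"):
        raise ValueError(f"unsupported cv_type: {cv_type!r}")
    # int() raises ValueError if either number is missing or malformed.
    fold_index, fold_count = int(index_text), int(count_text)
    if not 0 <= fold_index < fold_count:
        raise ValueError("fold_index must be in the range [0, fold_count)")
    return cv_type, fold_index, fold_count

parsed = parse_cv_value("Classical:0;5")
```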
--cv-rand
Purpose
Use this as the seed value for random permutation of the data.
The permutation is performed before splitting the data for cross-validation.
Each seed generates unique data splits.
It must be used with the --cv parameter, with the type set to Classical or Inverted.
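The seed matters when folds from separate runs are combined. The sketch below is an illustration and not CatBoost's internal shuffling: the function `validation_fold` and the strided fold construction are assumptions. It shows that reusing one seed across all runs makes the validation folds disjoint and lets them cover the whole dataset.

```python
import random

def validation_fold(n_objects, fold_index, fold_count, seed):
    """Indices of the validation fold for one run with the given seed."""
    order = list(range(n_objects))
    random.Random(seed).shuffle(order)  # the same seed gives the same permutation
    return order[fold_index::fold_count]

# Four runs with one shared seed: together the folds contain every object once.
covered = sorted(
    i for k in range(4) for i in validation_fold(12, k, 4, seed=7)
)
repeat_a = validation_fold(12, 0, 4, seed=7)
repeat_b = validation_fold(12, 0, 4, seed=7)
```

If the seed changed between runs, the folds would come from different permutations and could overlap, so per-fold results would no longer form a proper N-fold estimate.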