FAQ

Why is the metric value on the validation dataset sometimes better than the one on the training dataset?

This happens because auto-generated numerical features that are based on categorical features are calculated differently for the training and validation datasets:

  • Training dataset: the feature is calculated differently for every object in the dataset. For each i-th object the feature is calculated based on data from the first i-1 objects (the first i-1 objects in some random permutation).
  • Validation dataset: the feature is calculated in the same way for every object in the dataset. For each object the feature is calculated using data from all objects of the training dataset.

A feature calculated on data from all objects of the training dataset uses more information than a feature calculated on only a part of the dataset. For this reason it is more powerful, and a more powerful feature results in a better loss value.

Thus, the loss value on the validation dataset might be better than the loss value on the training dataset, because the validation dataset has more powerful features.
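
The difference between the two calculation modes can be illustrated with a minimal sketch. It is not CatBoost's implementation: the smoothed-mean formula, the PRIOR value and the helper names ordered_statistic and full_statistic are assumptions chosen for illustration, and the objects are processed in their given order rather than in a random permutation.

```python
from collections import defaultdict

PRIOR = 0.5  # assumed smoothing prior, for illustration only


def ordered_statistic(categories, labels):
    """Training-style feature: object i only sees objects 0..i-1."""
    sums, counts = defaultdict(float), defaultdict(int)
    values = []
    for category, label in zip(categories, labels):
        # Use only the statistics accumulated from earlier objects.
        values.append((sums[category] + PRIOR) / (counts[category] + 1))
        sums[category] += label
        counts[category] += 1
    return values


def full_statistic(train_categories, train_labels, categories):
    """Validation-style feature: every object sees the whole training set."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, label in zip(train_categories, train_labels):
        sums[category] += label
        counts[category] += 1
    return [(sums[c] + PRIOR) / (counts[c] + 1) for c in categories]


train_categories = ["a", "a", "a", "b", "b"]
train_labels = [1, 1, 1, 0, 0]

train_feature = ordered_statistic(train_categories, train_labels)
valid_feature = full_statistic(train_categories, train_labels, ["a", "b"])
```

In this toy data the first object of each category receives only the prior, while the validation-style feature separates the two categories more sharply, which is the effect described above.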

Details of the algorithm and the rationale behind this solution

CatBoost: unbiased boosting with categorical features

Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, Andrey Gulin. NeurIPS, 2018

NeurIPS 2018 paper with explanation of Ordered boosting principles and ordered categorical features statistics.

CatBoost: gradient boosting with categorical features support

Anna Veronika Dorogush, Vasily Ershov, Andrey Gulin. Workshop on ML Systems at NIPS 2017

A paper explaining the CatBoost working principles: how it handles categorical features, how it fights overfitting, how GPU training and fast formula applier are implemented.

Ordered boosting and categorical features processing in CatBoost short overview

Why can metric values on the training dataset that are output during training differ from the ones obtained from model predictions?

This happens because auto-generated numerical features that are based on categorical features are calculated differently when training and applying the model.

During training the feature is calculated differently for every object in the training dataset. For each i-th object the feature is calculated based on data from the first i-1 objects (the first i-1 objects in some random permutation). During prediction the same feature is calculated using data from all objects of the training dataset.

A feature calculated on data from all objects of the training dataset uses more information than a feature calculated on only a part of the dataset. For this reason it is more powerful, and a more powerful feature results in a better loss value.

Thus, the loss value calculated for the prediction might be better than the one printed out during training, even though the same dataset is used.

How should weights or baseline be specified for the validation dataset?

Use the Pool class.

An example of specifying weights:

from catboost import CatBoostClassifier, Pool

train_data = Pool(
    data=[[1, 4, 5, 6],
          [4, 5, 6, 7],
          [30, 40, 50, 60]],
    label=[1, 1, -1],
    weight=[0.1, 0.2, 0.3]
)

eval_data = Pool(
    data=[[1, 4, 5, 6],
          [4, 5, 6, 7],
          [30, 40, 50, 60]],
    label=[1, 0, -1],
    weight=[0.7, 0.1, 0.3]
)

model = CatBoostClassifier(iterations=10)

model.fit(X=train_data, eval_set=eval_data)

Why is it forbidden to use float values and NaN values for categorical features?

The algorithm should work identically regardless of the input data format (file or matrix). If the dataset is read from a file, all values of categorical features are treated as strings. To treat them the same way when training from a matrix, a unique string representation of each feature value is required. There is no unique string representation for floating point values or for NaN values.

Floating point values

If floating point categorical features are allowed the following problem arises.

The feature f is categorical and takes values 1 and 2.

A matrix is used for the training. The column that corresponds to the feature f contains values 1.0 and 2.0.

Each categorical feature value is converted to a string during the training to calculate the corresponding hash value. 1.0 is converted to the string 1.0, and 2.0 is converted to the string 2.0.

After the training the prediction is performed on a file.

The column with the feature f contains values 1 and 2.

During the prediction, the hash value of the string 1 is calculated. This value is not equal to the hash value of the string 1.0.

Thus, the model doesn't match this value with the one in the training dataset, and therefore the prediction is incorrect.

None values

The feature f is categorical and takes the value None for some object Obj.

A matrix is used for the training. The column that contains the value of the feature f for the object Obj contains the value None.

Each categorical feature value is converted to a string during the training to calculate the corresponding hash value. The None value is converted to the string None.

After the training the prediction is performed on a file. The column with the feature f contains the value N/A, which would be parsed as None if it were read into a pandas.DataFrame before the training.

The hash value of the string N/A is calculated during the prediction. This value is not equal to the hash value of the string None.

Thus, the model doesn't match this value with the one in the training dataset, and therefore the prediction is incorrect.

Since it is not possible to guarantee that the string representation of floating point and None values is the same when reading data from a file and when converting the value to a string in Python or any other language, it is required to use strings instead of floating point and None values.
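
The mismatch can be reproduced with plain Python string conversion. The snippet below is only an illustration: different strings produce different hash values, so comparing the strings themselves is enough to show the problem. The helper to_category is hypothetical and shows one possible way to convert values to strings explicitly before training; the N/A placeholder is an assumption.

```python
def to_category(value):
    """Hypothetical helper: map a raw value to an explicit category string."""
    if value is None:
        # Assumed placeholder; any fixed string works if used consistently.
        return "N/A"
    if isinstance(value, float) and value.is_integer():
        # 1.0 -> "1", the form the value would have in a text file.
        return str(int(value))
    return str(value)


# A matrix holds floats, a file holds text: the naive strings differ.
matrix_value, file_value = 1.0, "1"
naive_match = str(matrix_value) == file_value            # "1.0" vs "1"
explicit_match = to_category(matrix_value) == file_value

# The same problem for missing values.
none_naive_match = str(None) == "N/A"                    # "None" vs "N/A"
none_explicit_match = to_category(None) == "N/A"
```

Converting the values once, with a rule the user controls, keeps the string form identical for training and prediction regardless of where the data comes from.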

How to use GridSearchCV and RandomizedSearchCV from sklearn with categorical features?

Use the cat_features parameter when constructing the model (CatBoost, CatBoostRegressor or CatBoostClassifier).

Example:

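import catboost
import sklearn.model_selection

# param_grid is a user-defined dict that maps parameter names to candidate values
model = catboost.CatBoostRegressor(cat_features=[0, 1, 2])
grid_search = sklearn.model_selection.GridSearchCV(model, param_grid)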

How to understand which categorical feature combinations have been selected during the training?

Use the InternalFeatureImportance feature importance type to inspect the resulting combinations. Generate the corresponding file from the command line by setting the --fstr-type parameter to InternalFeatureImportance.

The format of the resulting file is described here.

The default feature importances are calculated in accordance with the following principles:

  1. Importances of all numerical features are calculated. Some of the numerical features are auto-generated based on categorical features and feature combinations.
  2. These importances are shared between initial features. If a numerical feature is auto-generated based on a feature combination, then the importance value is shared equally between the combination participants.

The file that is generated in the InternalFeatureImportance mode contains the description of initial numerical features and their importances.
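
The sharing step in the second principle can be sketched as follows. The importance numbers and feature names are made up, and share_importances is a hypothetical helper, not part of the CatBoost API.

```python
from collections import defaultdict


def share_importances(internal_importances):
    """Split each internal feature's importance equally among its participants."""
    shared = defaultdict(float)
    for participants, importance in internal_importances.items():
        for feature in participants:
            shared[feature] += importance / len(participants)
    return dict(shared)


# Keys list the initial features that each internal feature is built from.
internal_importances = {
    ("age",): 40.0,            # plain numerical feature
    ("city",): 20.0,           # generated from one categorical feature
    ("city", "device"): 30.0,  # generated from a feature combination
    ("device",): 10.0,
}

default_importances = share_importances(internal_importances)
```

Here the importance of the city and device combination is split in half, so each participant receives 15.0 on top of its own importance.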

How to overcome the Out of memory error when training on GPU?

  • Set the --boosting-type command-line parameter to Plain. It is set to Ordered by default for datasets with fewer than 50 thousand objects. The Ordered scheme requires a lot of memory.
  • Set the --max-ctr-complexity command-line parameter to either 1 or 2 if the dataset has categorical features.
  • Decrease the value of the --gpu-ram-part command-line parameter.
  • Set the --gpu-cat-features-storage command-line parameter to CpuPinnedMemory.
  • Check that the dataset fits in GPU memory. The quantized version of the dataset is loaded into GPU memory. This version is much smaller than the initial dataset, but it can still exceed the available memory if the dataset is large enough.
  • Decrease the depth value if it is greater than 10. Each tree contains $2^{n}$ leaves if the depth is set to $n$, because CatBoost builds full symmetric trees by default. The recommended depth is 6, which works well in most cases. In rare cases it's useful to increase the depth value up to 10.
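
To see why the depth matters, the leaf count per tree can be tabulated directly from the relationship between depth and number of leaves stated above. This is plain arithmetic, not a measurement of actual GPU memory usage, and leaves_per_tree is a name chosen for this illustration.

```python
def leaves_per_tree(depth):
    """Number of leaves in a full symmetric tree of the given depth."""
    return 2 ** depth


for depth in (6, 10, 16):
    print(f"depth={depth}: {leaves_per_tree(depth)} leaves per tree")
```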

How to reduce the size of the final model?

If the dataset contains categorical features with many different values, the size of the resulting model may be huge. Try the following approaches to reduce the size of the resulting model:

  • Decrease the value of the --max-ctr-complexity command-line parameter to either 1 or 2.

  • For training on CPU:

    • Increase the value of the --model-size-reg command-line parameter.
    • Set the value of the --ctr-leaf-count-limit command-line parameter. The number of different category values is not limited by default.
  • Decrease the value of the --iterations command-line parameter and increase the value of the --learning-rate command-line parameter.

  • Remove categorical features that have a small feature importance from the training dataset.

How to get the model with the best parameters from the Python cv function?

It is not possible. The CatBoost cv function is intended for cross-validation only; it cannot be used for parameter tuning.

The dataset is split into N folds. N–1 folds are used for training and one fold is used for model performance estimation. At each iteration, the model is evaluated on all N folds independently. The average score with standard deviation is computed for each iteration.

The only parameter that can be selected based on cross-validation is the number of iterations. Select the best iteration based on the cv results and train the final model with this number of iterations.
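
Selecting the number of iterations from cross-validation results amounts to averaging the per-fold scores at each iteration and picking the best one. The sketch below uses made-up per-fold loss values and a hypothetical helper best_iteration. It does not call CatBoost and assumes a metric where lower is better.

```python
from statistics import mean, stdev


def best_iteration(fold_scores):
    """Return (iteration index, mean, std) with the lowest mean loss across folds."""
    summary = []
    # zip(*fold_scores) groups the scores of all folds for each iteration.
    for iteration, scores in enumerate(zip(*fold_scores)):
        summary.append((iteration, mean(scores), stdev(scores)))
    return min(summary, key=lambda row: row[1])


# One list per fold, one loss value per iteration (illustrative numbers).
fold_scores = [
    [0.70, 0.55, 0.48, 0.50],
    [0.68, 0.57, 0.46, 0.49],
    [0.72, 0.56, 0.50, 0.51],
]

iteration, mean_loss, std_loss = best_iteration(fold_scores)
final_iterations = iteration + 1  # iterations are counted from zero here
```

The final model would then be trained on the whole training dataset with the number of iterations set to final_iterations.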

What are the differences between training on CPU and GPU?

  • The default value of the --border-count command-line parameter depends on the processing unit type and other parameters:

    • CPU: 254
    • GPU in PairLogitPairwise and YetiRankPairwise modes: 32
    • GPU in all other modes: 128
  • Training on CPU has the model_size_reg set by default. It decreases the size of models that have categorical features. This option is turned off for training on GPU.

  • Training on GPU is non-deterministic, because the order of floating point summations is non-deterministic in this implementation.

  • The following command-line parameters are not supported if training is performed on GPU: --ctr-leaf-count-limit, --monotone-constraints.

  • The default value of the --leaf-estimation-method for the Quantile and MAE loss functions is Exact on CPU and GPU.

  • Combinations of categorical features are not supported for the following modes if training is performed on GPU: MultiClass and MultiClassOneVsAll. The default value of the --max-ctr-complexity command-line parameter for such cases is set to 1.

  • The default values for the following command-line parameters depend on the processing unit type:

    • --bootstrap-type:

      • The objective parameter is QueryCrossEntropy, YetiRankPairwise or PairLogitPairwise and the bagging_temperature parameter is not set: Bernoulli with the subsample parameter set to 0.5.
      • The objective is not MultiClass or MultiClassOneVsAll, task_type = CPU and sampling_unit = Object: MVS with the subsample parameter set to 0.8.
      • Otherwise: Bayesian.

    • --boosting-type:

      • CPU: Plain
      • GPU:

        • Any number of objects, MultiClass or MultiClassOneVsAll mode: Plain
        • More than 50 thousand objects, any mode: Plain
        • Less than or equal to 50 thousand objects, any mode but MultiClass or MultiClassOneVsAll: Ordered

    • --model-size-reg:

    Feature combinations are regularized more aggressively on GPU.

      • CPU: The cost of a combination is equal to the number of different feature values in this combination that are present in the training dataset.
      • GPU: The cost of a combination is equal to the number of all possible different values of this combination. For example, if the combination contains two categorical features (c1 and c2), the cost is calculated as $number\_of\_categories\_in\_c1 \cdot number\_of\_categories\_in\_c2$, even though many of the values from this combination might not be present in the dataset.

    Refer to the Model size regularization coefficient section for details on the calculation principles.
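
The two cost definitions can be compared on a toy dataset. This sketch only mirrors the description above; the column values and the helper names cpu_cost and gpu_cost are illustrative assumptions, not CatBoost internals.

```python
def cpu_cost(c1, c2):
    """Count the value pairs that actually occur in the dataset."""
    return len(set(zip(c1, c2)))


def gpu_cost(c1, c2):
    """Count all possible pairs: product of the per-feature cardinalities."""
    return len(set(c1)) * len(set(c2))


# Two categorical columns of a made-up training dataset.
c1 = ["red", "red", "green", "blue", "blue"]
c2 = ["s", "m", "m", "l", "l"]

observed = cpu_cost(c1, c2)  # pairs present in the data
possible = gpu_cost(c1, c2)  # 3 colours x 3 sizes
```

In this example only 4 of the 9 possible pairs occur, so the GPU-style cost is more than twice the CPU-style cost for the same combination.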

Does CatBoost require preprocessing of missing values?

CatBoost can handle missing values internally. None values should be used for missing value representation.

If the dataset is read from a file, missing values can be represented as strings like N/A, NAN, None, empty string and the like.

Refer to the Missing values processing section for details.