Categorical features

Warning

Do not apply one-hot encoding to categorical features during preprocessing. Doing so affects both the training speed and the resulting quality.

CatBoost supports numerical, categorical, text, and embeddings features.

CatBoost uses categorical features, individually and in combination, to build new numeric features. See the Transforming categorical features to numerical features section for details.
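As an illustration of passing categorical columns untouched, the following minimal sketch hands raw string categories to CatBoost and marks them through `cat_features` instead of one-hot encoding them beforehand. The toy data, the helper name `fit_on_raw_categories`, and the small `iterations` value are assumptions made for this example. The import sits inside the function only so that the data definitions can be inspected without CatBoost installed.

```python
# Toy dataset (illustrative): columns 0 and 1 are categorical, the rest numeric.
train_data = [
    ["a", "b", 1, 4, 5, 6],
    ["a", "b", 4, 5, 6, 7],
    ["c", "d", 30, 40, 50, 60],
]
train_labels = [1, 1, 0]

# Indices of the columns CatBoost should treat as categorical.
cat_features = [0, 1]


def fit_on_raw_categories(data, labels, cat_feature_indices, iterations=10):
    """Train a classifier on raw categorical values (hypothetical helper)."""
    # Lazy import: the `catboost` package is required only at call time.
    from catboost import CatBoostClassifier

    model = CatBoostClassifier(iterations=iterations, verbose=False)
    # No manual one-hot encoding: the string columns are passed as-is.
    model.fit(data, labels, cat_features=cat_feature_indices)
    return model
```

Calling `fit_on_raw_categories(train_data, train_labels, cat_features)` should return a fitted model, leaving the choice between one-hot encoding and Ctrs for each categorical column to CatBoost.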

By default, CatBoost uses one-hot encoding for categorical features with a small number of different values in most modes. It is not available if training is performed on CPU in Pairwise scoring mode. The default threshold for the number of unique values below which a feature is one-hot encoded depends on various conditions, which are described in the table below.

Pairwise scoring

The following loss functions use Pairwise scoring:

  • YetiRankPairwise
  • PairLogitPairwise
  • QueryCrossEntropy

Pairwise scoring is slightly different from regular training on pairs, since pairs are generated only internally during the training for the corresponding metrics. One-hot encoding is not available for these loss functions.

Ctrs are not calculated for features that are used with one-hot encoding.
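The split between one-hot encoded features and features that receive Ctrs can be pictured with a small sketch. It applies the rule from the parameter description (one-hot when the number of distinct values is less than or equal to the threshold) to plain Python lists. The function name `split_by_cardinality` and the sample columns are hypothetical, and the code only illustrates the rule; it is not CatBoost's internal implementation.

```python
def split_by_cardinality(columns, one_hot_max_size):
    """Partition categorical columns into one-hot and Ctr groups.

    columns: dict mapping a feature name to its list of values.
    A feature goes to the one-hot group when its number of distinct
    values is less than or equal to one_hot_max_size.
    """
    one_hot, ctr = [], []
    for name, values in columns.items():
        n_unique = len(set(values))
        (one_hot if n_unique <= one_hot_max_size else ctr).append(name)
    return one_hot, ctr


columns = {
    "is_member": ["yes", "no", "yes", "no"],   # 2 distinct values
    "city": ["Oslo", "Rome", "Lima", "Oslo"],  # 3 distinct values
}
print(split_by_cardinality(columns, one_hot_max_size=2))
# (['is_member'], ['city'])
```

With a threshold of 2, only `is_member` falls into the one-hot group; raising the threshold to 3 would move `city` there as well, and no Ctrs would be calculated for it.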

Some types of Ctrs require target data in the training dataset. Such Ctrs are not calculated if this data is not available. In this case, the behavior depends on the processing unit. If training is performed on GPU, only one-hot encoded categorical features are used, and the default unique-values threshold for treating a categorical feature as one-hot is increased accordingly. If training is performed on CPU, all categorical features are ignored.

Use the following parameters to change the maximum number of unique values of categorical features for applying one-hot encoding:

Command-line version parameters: --one-hot-max-size

Python parameters: one_hot_max_size

R parameters: one_hot_max_size

Description

Use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value. Ctrs are not calculated for such features.

Default value

The default value depends on various conditions:

  • N/A if training is performed on CPU in Pairwise scoring mode


  • 255 if training is performed on GPU and the selected Ctr types require target data that is not available during the training

  • 10 if training is performed in Ranking mode

  • 2 if none of the conditions above is met
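The conditions above can be read as a small decision procedure. The sketch below encodes them in the order listed. That precedence is an assumption, since the list does not state which condition wins when several apply. The function name and its boolean arguments are hypothetical and are not part of the CatBoost API.

```python
def default_one_hot_max_size(task_type, pairwise_scoring, ranking_mode,
                             target_missing_for_ctrs):
    """Return the default threshold from the list above, or None for N/A.

    Conditions are checked in the order they are listed (an assumption).
    """
    if task_type == "CPU" and pairwise_scoring:
        return None  # N/A: one-hot encoding is not available
    if task_type == "GPU" and target_missing_for_ctrs:
        return 255   # selected Ctr types need target data that is absent
    if ranking_mode:
        return 10
    return 2         # none of the conditions above is met


print(default_one_hot_max_size("CPU", False, False, False))  # 2
print(default_one_hot_max_size("GPU", False, False, True))   # 255
```

An explicit `one_hot_max_size` value overrides whichever default applies.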