Categorical features

Attention. Do not one-hot encode categorical features during preprocessing. Doing so affects both the training speed and the resulting quality.

CatBoost supports numerical, categorical and text features.

CatBoost uses categorical features and their combinations to build new numeric features. See the Transforming categorical features to numerical features section for details.

By default, CatBoost uses one-hot encoding in most modes for categorical features with a small number of distinct values. One-hot encoding is not available if training is performed on CPU in Pairwise scoring mode. The default threshold on the number of unique values up to which a feature is one-hot encoded depends on several conditions, which are described in the table below.

Ctrs are not calculated for features that are used with one-hot encoding.
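The split between one-hot encoded features and features that get Ctrs can be illustrated with a minimal sketch. It assumes the "less than or equal to" rule from the parameter description on this page. The helper name split_by_one_hot_threshold and the sample columns are hypothetical and are not part of the CatBoost API; the sketch only counts distinct values and does not reproduce CatBoost's internal processing.

```python
from typing import Dict, List, Sequence, Tuple


def split_by_one_hot_threshold(
    columns: Dict[str, Sequence[str]],
    one_hot_max_size: int,
) -> Tuple[List[str], List[str]]:
    """Return (one_hot_features, ctr_features) for a given threshold.

    Hypothetical helper: a feature goes to the one-hot group when its
    number of distinct values is less than or equal to the threshold.
    """
    one_hot: List[str] = []
    ctr: List[str] = []
    for name, values in columns.items():
        n_unique = len(set(values))
        if n_unique <= one_hot_max_size:
            one_hot.append(name)  # one-hot encoded, no Ctrs calculated
        else:
            ctr.append(name)      # candidate for Ctr-based encoding
    return one_hot, ctr


# Illustrative data only.
columns = {
    "is_member": ["yes", "no", "yes", "no"],   # 2 distinct values
    "color": ["red", "green", "blue", "red"],  # 3 distinct values
    "city": ["Oslo", "Lima", "Kyiv", "Pune"],  # 4 distinct values
}
one_hot, ctr = split_by_one_hot_threshold(columns, one_hot_max_size=3)
print(one_hot)  # ['is_member', 'color']
print(ctr)      # ['city']
```

With a threshold of 3, the feature with exactly 3 distinct values still falls into the one-hot group, because the comparison is inclusive.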

Some types of Ctrs require target data in the training dataset. Such Ctrs are not calculated if this data is not available. In this case, if training is performed on GPU, only one-hot encoded categorical features are used (and the default unique-value threshold for one-hot encoding is increased accordingly); if training is performed on CPU, all categorical features are ignored.

Use the following parameters to change the maximum number of unique values of categorical features for applying one-hot encoding:

Command-line version parameter: --one-hot-max-size
Python parameter: one_hot_max_size
R parameter: one_hot_max_size

Description: Use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value. Ctrs are not calculated for such features.

Default value: depends on various conditions:

  • N/A if training is performed on CPU in Pairwise scoring mode
  • 255 if training is performed on GPU and the selected Ctr types require target data that is not available during the training
  • 10 if training is performed in Ranking mode
  • 2 if none of the conditions above is met
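
The default-value rules above can be read as a short decision function. The following is a sketch only: the function name default_one_hot_max_size and its boolean arguments are hypothetical, not part of CatBoost, and it assumes the conditions are checked in the order in which they are listed, which the list itself does not state explicitly.

```python
from typing import Optional


def default_one_hot_max_size(
    on_gpu: bool,
    pairwise_scoring: bool,
    ranking_mode: bool,
    ctr_target_missing: bool,
) -> Optional[int]:
    """Mirror the documented default-value list (hypothetical helper).

    Returns None where the documentation says N/A.
    Assumption: conditions are evaluated top to bottom as listed.
    """
    if not on_gpu and pairwise_scoring:
        return None  # N/A: CPU training in Pairwise scoring mode
    if on_gpu and ctr_target_missing:
        return 255   # GPU, selected Ctr types need target data that is unavailable
    if ranking_mode:
        return 10    # Ranking mode
    return 2         # none of the conditions above is met


print(default_one_hot_max_size(False, False, False, False))  # 2
print(default_one_hot_max_size(True, False, False, True))    # 255
print(default_one_hot_max_size(False, False, True, False))   # 10
print(default_one_hot_max_size(False, True, False, False))   # None
```

To override the default rather than rely on these rules, set the parameter explicitly: one_hot_max_size in the Python and R packages, or --one-hot-max-size on the command line.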