Quantization
Before training, the possible values of each numerical feature are divided into disjoint ranges (buckets) delimited by threshold values (splits). This applies both to numerical features specified directly in the input data and to those obtained as results of internal processing (see the documentation about converting categorical, text and embedding features for details). The size of the quantization (the number of splits) is determined by the starting parameters, which are set separately for features specified directly in the input data and for features obtained as results of internal processing.
Quantization is also used to split the label values when working with categorical features. On large datasets, a random subset of the dataset is used for this purpose.
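To make the bucket and split terminology concrete, the following minimal sketch maps a feature value to a bucket index given a sorted list of splits. The function name `bucket_index` is hypothetical, and the tie rule (a value equal to a split goes to the lower bucket) is an assumption made for this illustration, not a statement about CatBoost internals.

```python
from bisect import bisect_left


def bucket_index(value, splits):
    """Return the index of the bucket that `value` falls into.

    `splits` must be sorted in ascending order. With k splits there are
    k + 1 buckets, numbered 0..k. Assumption for this sketch: a value
    equal to a split is placed in the lower bucket.
    """
    # bisect_left counts how many splits are strictly less than `value`.
    return bisect_left(splits, value)


splits = [0.5, 2.0, 7.5]  # 3 splits -> 4 buckets
values = [0.1, 0.5, 1.9, 2.1, 100.0]
buckets = [bucket_index(v, splits) for v in values]
print(buckets)  # [0, 0, 1, 2, 3]
```

Once every value is replaced by its bucket index, only the bucket boundaries matter for training, which is why the number of splits controls the resolution of the quantized feature.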
The table below shows the quantization modes provided in CatBoost.
| Mode | How splits are chosen |
|---|---|
| Median | Include an approximately equal number of objects in every bucket. |
| Uniform | Generate splits by dividing the [min_feature_value, max_feature_value] segment into subsegments of equal length. Absolute values of the feature are used in this case. |
| UniformAndQuantiles | Combine the splits obtained in the Median and Uniform modes, after first halving the quantization size provided by the starting parameters for each of them. |
| MaxLogSum | Maximize the value of the following expression inside each bucket: $\sum_{i=1}^{n} \log(weight_i)$, where $n$ is the number of distinct objects in the bucket and $weight_i$ is the number of times the $i$-th distinct object in the bucket is repeated. |
| MinEntropy | Minimize the value of the following expression inside each bucket: $\sum_{i=1}^{n} weight_i \cdot \log(weight_i)$, where $n$ is the number of distinct objects in the bucket and $weight_i$ is the number of times the $i$-th distinct object in the bucket is repeated. |
| GreedyLogSum | Maximize the greedy approximation of the following expression inside every bucket: $\sum_{i=1}^{n} \log(weight_i)$, where $n$ is the number of distinct objects in the bucket and $weight_i$ is the number of times the $i$-th distinct object in the bucket is repeated. |
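The modes in the table can be illustrated with a small pure-Python sketch. It is not CatBoost's implementation: all function names are hypothetical, placing Median splits at midpoints between neighbouring values is an assumption, and the per-bucket scores assume the expressions have the form "sum of log(weight)" (MaxLogSum, GreedyLogSum) and "sum of weight * log(weight)" (MinEntropy), where weight is the repetition count of each distinct value in a bucket.

```python
import math
from collections import Counter


def uniform_splits(values, n_splits):
    """Split [min, max] into n_splits + 1 subsegments of equal length."""
    lo, hi = min(values), max(values)
    if lo == hi or n_splits <= 0:
        return []
    step = (hi - lo) / (n_splits + 1)
    return [lo + step * k for k in range(1, n_splits + 1)]


def median_splits(values, n_splits):
    """Place splits so that buckets hold roughly equal numbers of objects."""
    ordered = sorted(values)
    n = len(ordered)
    splits = set()
    for k in range(1, n_splits + 1):
        pos = k * n // (n_splits + 1)
        # Skip positions that would fall between two equal values.
        if 0 < pos < n and ordered[pos - 1] != ordered[pos]:
            splits.add((ordered[pos - 1] + ordered[pos]) / 2)
    return sorted(splits)


def uniform_and_quantiles_splits(values, n_splits):
    """Halve the quantization size, then merge Median and Uniform splits."""
    half = n_splits // 2
    merged = set(median_splits(values, half)) | set(uniform_splits(values, half))
    return sorted(merged)


def bucket_scores(bucket_values):
    """Return (log_sum, entropy_like) for the values inside one bucket."""
    weights = Counter(bucket_values).values()  # repetitions per distinct value
    log_sum = sum(math.log(w) for w in weights)
    entropy_like = sum(w * math.log(w) for w in weights)
    return log_sum, entropy_like


data = [1, 2, 3, 4, 5, 6, 7, 8]
print(uniform_splits(data, 3))  # [2.75, 4.5, 6.25]
print(median_splits(data, 3))   # [2.5, 4.5, 6.5]
print(uniform_and_quantiles_splits(data, 4))
print(bucket_scores([1, 1, 2, 2, 2, 3]))
```

On evenly spread data the Uniform and Median splits nearly coincide; they diverge on skewed data, where Uniform leaves some buckets almost empty while Median keeps bucket populations balanced.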