Quantization

Before training, the possible feature values of objects are divided into disjoint ranges (buckets) delimited by threshold values (splits). The quantization size (the number of splits) is set by the starting parameters, separately for numerical features and for the numbers obtained by converting categorical features into numerical features.
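
As a rough illustration of how splits delimit buckets, the sketch below maps a feature value to a bucket index with a binary search. The `bucket_index` helper and the sample splits are hypothetical. The convention that a value equal to a split stays in the lower bucket is an assumption made for this example, not a statement about CatBoost's internals.

```python
from bisect import bisect_left


def bucket_index(value, splits):
    """Return the index of the bucket containing `value`.

    `splits` must be sorted in ascending order; N splits delimit N + 1
    buckets. Assumption for this sketch: a value equal to a split is
    placed in the lower bucket.
    """
    # bisect_left counts the splits that are strictly less than `value`.
    return bisect_left(splits, value)


splits = [1.5, 3.0, 7.25]            # three splits -> four buckets
values = [0.2, 1.5, 2.0, 3.1, 9.0]
buckets = [bucket_index(v, splits) for v in values]
print(buckets)  # [0, 0, 1, 2, 3]
```

After this step every value is represented only by its bucket index, so the number of distinct values per feature is at most the number of splits plus one.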

Quantization is also used to split the label values when working with categorical features. On large datasets, a random subset of the dataset is used for this purpose.

The table below shows the quantization modes provided in CatBoost.

| Mode | How splits are chosen |
|------|-----------------------|
| Median | Include an approximately equal number of objects in every bucket. |
| Uniform | Generate splits by dividing the [min_feature_value, max_feature_value] segment into subsegments of equal length. Absolute values of the feature are used in this case. |
| UniformAndQuantiles | Combine the splits obtained in the Median and Uniform modes, after first halving the quantization size provided by the starting parameters for each of them. |
| MaxLogSum | Maximize the value of the following expression inside each bucket: $\sum\limits_{i=1}^{n}\log(weight)$, where $n$ is the number of distinct objects in the bucket and $weight$ is the number of times an object in the bucket is repeated. |
| MinEntropy | Minimize the value of the following expression inside each bucket: $\sum\limits_{i=1}^{n} weight \cdot \log(weight)$, where $n$ is the number of distinct objects in the bucket and $weight$ is the number of times an object in the bucket is repeated. |
| GreedyLogSum | Maximize the greedy approximation of the following expression inside every bucket: $\sum\limits_{i=1}^{n}\log(weight)$, where $n$ is the number of distinct objects in the bucket and $weight$ is the number of times an object in the bucket is repeated. |
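
To make the descriptions above more concrete, the following sketch implements simplified versions of the Median, Uniform and UniformAndQuantiles modes and evaluates the two bucket expressions literally as they are written. All function names are hypothetical. The midpoint placement of Median splits and the reading of "objects" as feature values are assumptions made for illustration, and none of this reproduces CatBoost's actual implementation.

```python
from collections import Counter
from math import log


def uniform_splits(values, n_splits):
    """Cut [min, max] into n_splits + 1 subsegments of equal length."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (n_splits + 1)
    return [lo + step * k for k in range(1, n_splits + 1)]


def median_splits(values, n_splits):
    """Place splits so that buckets hold roughly the same number of objects."""
    ordered = sorted(values)
    n = len(ordered)
    if n <= n_splits:
        raise ValueError("need more objects than splits")
    splits = set()
    for k in range(1, n_splits + 1):
        pos = k * n // (n_splits + 1)  # first object of the next bucket
        # Assumption: put the split halfway between the neighbouring objects.
        splits.add((ordered[pos - 1] + ordered[pos]) / 2)
    return sorted(splits)


def uniform_and_quantiles_splits(values, n_splits):
    """Merge Median and Uniform splits, each built with half the size."""
    half = n_splits // 2
    merged = set(median_splits(values, half)) | set(uniform_splits(values, half))
    return sorted(merged)


def bucket_expressions(bucket_values):
    """Evaluate both expressions from the table for the objects of one bucket."""
    weights = Counter(bucket_values).values()  # repetitions per distinct object
    log_sum = sum(log(w) for w in weights)
    weighted_log_sum = sum(w * log(w) for w in weights)
    return log_sum, weighted_log_sum


values = [1, 2, 3, 4, 100, 200, 300, 400]
print(uniform_splits(values, 3))                # [100.75, 200.5, 300.25]
print(median_splits(values, 3))                 # [2.5, 52.0, 250.0]
print(uniform_and_quantiles_splits(values, 4))  # [2.5, 134.0, 150.0, 267.0]
```

On this skewed sample the Uniform splits leave five of the eight objects in the first bucket, while the Median splits put two objects in each bucket. This is the kind of difference the choice of mode controls.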