Quantization
Before learning, the possible feature values of objects are divided into disjoint ranges (buckets) delimited by threshold values (splits). The quantization size (the number of splits) is set by the starting parameters, separately for numerical features and for the numbers obtained by converting categorical features into numerical features.
Quantization is also used to split the label values when working with categorical features. On large datasets, a random subset of the dataset is used for this purpose.
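To make the bucket and split terminology concrete, here is a minimal sketch of mapping feature values to bucket indices once the splits are known. It is not CatBoost's implementation: the helper name `assign_buckets` is hypothetical, and placing a value that equals a split into the lower bucket is an assumed convention.

```python
import numpy as np


def assign_buckets(values, splits):
    """Return, for each value, the index of the bucket it falls into."""
    splits = np.sort(np.asarray(splits, dtype=float))
    # side="left" counts the splits strictly below each value, so a value
    # equal to a split lands in the lower bucket (an assumed convention).
    return np.searchsorted(splits, np.asarray(values, dtype=float), side="left")


feature = [0.1, 0.5, 2.0, 3.7]
splits = [0.5, 2.5]  # 2 splits delimit 3 buckets
buckets = assign_buckets(feature, splits)
```

With `k` splits there are at most `k + 1` distinct bucket indices, which is why the quantization size bounds how finely a feature can be resolved.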
The table below shows the quantization modes provided in CatBoost.
Mode | How splits are chosen |
---|---|
Median | Include an approximately equal number of objects in every bucket. |
Uniform | Generate splits by dividing the [min_feature_value, max_feature_value] segment into subsegments of equal length. Absolute values of the feature are used in this case. |
UniformAndQuantiles | Combine the splits obtained in the Median and Uniform modes, after first halving the quantization size provided by the starting parameters for each of them. |
MaxLogSum | Maximize the value of the expression $\sum_{i=1}^{n} \log(\mathrm{weight}_i)$ inside each bucket, where $n$ is the number of distinct objects in the bucket and $\mathrm{weight}_i$ is the number of times the $i$-th distinct object in the bucket is repeated. |
MinEntropy | Minimize the value of the expression $\sum_{i=1}^{n} \mathrm{weight}_i \cdot \log(\mathrm{weight}_i)$ inside each bucket, where $n$ is the number of distinct objects in the bucket and $\mathrm{weight}_i$ is the number of times the $i$-th distinct object in the bucket is repeated. |
GreedyLogSum | Maximize the greedy approximation of the expression $\sum_{i=1}^{n} \log(\mathrm{weight}_i)$ inside every bucket, where $n$ is the number of distinct objects in the bucket and $\mathrm{weight}_i$ is the number of times the $i$-th distinct object in the bucket is repeated. |
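
The Median, Uniform, and UniformAndQuantiles rows can be illustrated with a short NumPy sketch. This is an approximation under stated assumptions, not the code CatBoost runs: the function names are hypothetical, Median is approximated with evenly spaced quantiles, and duplicate splits are simply dropped.

```python
import numpy as np


def uniform_splits(values, n_splits):
    """Splits that cut [min, max] into n_splits + 1 subsegments of equal length."""
    lo, hi = float(np.min(values)), float(np.max(values))
    # linspace includes both endpoints; only the interior points act as splits.
    return np.linspace(lo, hi, n_splits + 2)[1:-1]


def median_splits(values, n_splits):
    """Splits at evenly spaced quantiles, so buckets hold roughly equal counts."""
    probs = np.linspace(0.0, 1.0, n_splits + 2)[1:-1]
    # np.unique sorts and removes duplicates caused by heavily repeated values.
    return np.unique(np.quantile(values, probs))


def uniform_and_quantiles_splits(values, n_splits):
    """Union of both modes, each computed with half of the quantization size."""
    half = n_splits // 2
    merged = np.concatenate(
        [median_splits(values, half), uniform_splits(values, half)]
    )
    return np.unique(merged)


skewed = [0, 1, 2, 3, 100]  # one outlier stretches the value range
by_length = uniform_splits(skewed, 2)
by_count = median_splits(skewed, 2)
combined = uniform_and_quantiles_splits(skewed, 4)
```

On skewed data such as the sample above, the equal-length splits fall mostly in the empty stretch between the bulk of the values and the outlier, while the quantile-based splits stay where the objects are. Combining the two modes covers both the range and the dense region.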