Quantization
Before training, the possible feature values of objects are divided into disjoint ranges (buckets) delimited by threshold values (splits). The quantization size (the number of splits) is set by the starting parameters, separately for numerical features and for the numbers obtained by converting categorical features into numerical features.
Quantization is also used to split the label values when working with categorical features. On large datasets, a random subset of the dataset is used for this purpose.
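As a minimal sketch of how splits delimit buckets, the snippet below maps raw values to bucket indices with Python's standard `bisect` module. The split values and the `bucket_index` helper are hypothetical and only illustrate the idea. The rule that a value equal to a split falls into the lower bucket is an assumption made for this example, not something stated above.

```python
from bisect import bisect_left


def bucket_index(value, splits):
    """Return the index of the bucket that `value` falls into.

    `splits` must be sorted in ascending order. With k splits there are
    k + 1 buckets, numbered 0..k. Assumption for this sketch: a value
    equal to a split goes to the lower bucket, so the index is the
    number of splits strictly below the value.
    """
    return bisect_left(splits, value)


# Hypothetical thresholds for a single numerical feature.
splits = [1.5, 4.0, 7.25]
values = [0.3, 1.5, 2.0, 5.1, 9.9]

# Each raw value is replaced by the index of its bucket.
buckets = [bucket_index(v, splits) for v in values]
```

After this step a model only needs the small integer bucket index instead of the raw floating-point value, which is why the number of splits controls the granularity of the quantized feature.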
The table below shows the quantization modes provided in CatBoost.
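Mode | How splits are chosen |
---|---|
Median | Include an approximately equal number of objects in every bucket. |
Uniform | Generate splits by dividing the `[min_feature_value, max_feature_value]` segment into subsegments of equal length. |
UniformAndQuantiles | Combine the splits obtained in the Median and Uniform modes, after first halving the quantization size provided by the starting parameters for each of them. |
MaxLogSum | Maximize the value of the following expression inside each bucket: $\sum_{i=1}^{n} \log(weight_i)$, where $n$ is the number of distinct objects in the bucket and $weight_i$ is the number of times the $i$-th distinct object is repeated. |
MinEntropy | Minimize the value of the following expression inside each bucket: $\sum_{i=1}^{n} weight_i \cdot \log(weight_i)$, with $n$ and $weight_i$ defined as for MaxLogSum. |
GreedyLogSum | Maximize the greedy approximation of the following expression inside every bucket: $\sum_{i=1}^{n} \log(weight_i)$, with $n$ and $weight_i$ defined as for MaxLogSum. |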
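The two simplest modes in the table can be sketched in plain Python. This is an illustrative approximation, not CatBoost's implementation. The function names are hypothetical. The sketch assumes that Uniform divides the range between the smallest and largest feature value into equal-length segments, that Median places splits halfway between neighbouring sorted values, and that UniformAndQuantiles combines these two modes.

```python
def uniform_splits(values, n_splits):
    """Splits that cut [min, max] into n_splits + 1 equal-length segments."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (n_splits + 1)
    return [lo + step * k for k in range(1, n_splits + 1)]


def median_splits(values, n_splits):
    """Splits that put roughly the same number of objects in each bucket."""
    ordered = sorted(values)
    n = len(ordered)
    splits = []
    for k in range(1, n_splits + 1):
        pos = k * n // (n_splits + 1)  # index of the first object of the next bucket
        if pos == 0:
            continue  # not enough objects to place this split
        # Assumption: put the split halfway between the two neighbours.
        split = (ordered[pos - 1] + ordered[pos]) / 2
        if not splits or split > splits[-1]:
            splits.append(split)  # skip repeated splits caused by ties
    return splits


def uniform_and_quantiles_splits(values, n_splits):
    """Halve the quantization size, then merge the splits of both modes."""
    half = n_splits // 2
    merged = set(median_splits(values, half)) | set(uniform_splits(values, half))
    return sorted(merged)


values = [1, 2, 3, 5, 8, 13, 21, 34]  # made-up feature values
median = median_splits(values, 3)
uniform = uniform_splits(values, 3)
combined = uniform_and_quantiles_splits(values, 4)
```

On skewed data such as this sample, the Median splits cluster where the objects are dense, while the Uniform splits are spread evenly over the value range regardless of how many objects fall into each bucket.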