Transforming categorical features to numerical features
CatBoost supports the following types of features:
- Numerical. Examples are the height (182, 173) or any binary feature (0, 1).
- Categorical (cat). Such features can take one of a limited number of possible values. These values are usually fixed. Examples are the musical genre (rock, indie, pop) and the musical style (dance, classical).
- Text. Such features contain regular text (for example, Music to hear, why hear'st thou music sadly?).
- Embedding. Such features contain fixed-size arrays of numeric values.
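For reference, here is a minimal sketch of how these feature types can be declared with the Python package; the column data below is made up purely for illustration:

```python
from catboost import Pool

# Columns: height (numerical), genre (categorical), lyrics (text),
# and a fixed-size numeric vector (embedding).
data = [
    [182, "rock",  "Music to hear, why hear'st thou music sadly?", [0.1, 0.9]],
    [173, "indie", "Sweets with sweets war not, joy delights in joy.", [0.4, 0.2]],
]
labels = [1, 0]

pool = Pool(
    data,
    label=labels,
    cat_features=[1],        # genre
    text_features=[2],       # lyrics
    embedding_features=[3],  # the numeric vector
)
```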
Before each split is selected in the tree (see Choosing the tree structure), categorical features are transformed to numerical. This is done using various statistics on combinations of categorical features and combinations of categorical and numerical features.
The method of transforming categorical features to numerical generally includes the following stages:
- Permuting the set of input objects in a random order.
- Converting the label value from a floating point to an integer. The method depends on the machine learning problem being solved (which is determined by the selected loss function).

  | Problem | How transformation is performed |
  |---|---|
  | Regression | Quantization is performed on the label value. The mode and number of buckets ($k + 1$) are set in the starting parameters. All values located inside a single bucket are assigned a label value class – an integer in the range $[0; k]$ defined by the formula <bucket ID – 1>. See the sketch after this list. |
  | Classification | Possible values for the label value are 0 (the object doesn't belong to the specified target class) and 1 (it belongs to the specified target class). |
  | Multiclassification | The label values are integer identifiers of target classes (starting from 0). |

- Transforming categorical features to numerical features. The method is determined by the starting parameters.
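To make the regression case of stage 2 concrete, here is a hedged sketch. The quantile-based border selection is an assumption made for illustration; only the bucket-to-class mapping follows the description above:

```python
import numpy as np

# Quantize float labels into k + 1 buckets so that every label becomes
# an integer class in [0; k]. CatBoost picks the borders according to
# the starting parameters; uniform quantiles stand in for them here.
labels = np.array([3.1, 0.5, 7.8, 2.2, 5.0, 6.4])
k = 3                                                  # k borders -> k + 1 buckets
borders = np.quantile(labels, np.linspace(0, 1, k + 2)[1:-1])
label_classes = np.searchsorted(borders, labels)       # integers in [0; k]
```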
Type: Borders

Formula: $ctr_i = \frac{countInClass + prior}{totalCount + 1}$

Calculating ctr for the i-th bucket ($i \in [0; k - 1]$):
- countInClass is how many times the label value exceeded $i$ for objects with the current categorical feature value. It only counts objects that already have this value calculated (calculations are made in the order of the objects after shuffling).
- totalCount is the total number of objects (up to the current one) that have a feature value matching the current one.
- prior is a number (constant) defined by the starting parameters.
Type: Buckets

Formula: $ctr_i = \frac{countInClass + prior}{totalCount + 1}$

Calculating ctr for the i-th bucket ($i \in [0; k]$, creates $k + 1$ features):
- countInClass is how many times the label value was equal to $i$ for objects with the current categorical feature value. It only counts objects that already have this value calculated (calculations are made in the order of the objects after shuffling).
- totalCount is the total number of objects (up to the current one) that have a feature value matching the current one.
- prior is a number (constant) defined by the starting parameters.
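The Borders and Buckets types differ only in how countInClass is accumulated. A minimal Python sketch of both, assuming the labels have already been converted to integer classes in [0; k] and the objects are already in shuffled order:

```python
from collections import defaultdict

def ordered_ctrs(values, label_classes, k, prior, ctr_type="Borders"):
    """Illustrative sketch of the Borders/Buckets ctrs, not CatBoost's code."""
    count_in_class = defaultdict(lambda: [0] * (k + 1))  # per value, per i
    total_count = defaultdict(int)                       # per value
    n_ctrs = k if ctr_type == "Borders" else k + 1
    result = []
    for value, label in zip(values, label_classes):
        # ctr_i = (countInClass + prior) / (totalCount + 1), computed from
        # the objects that precede the current one in the shuffled order.
        result.append([
            (count_in_class[value][i] + prior) / (total_count[value] + 1)
            for i in range(n_ctrs)
        ])
        # Update the running statistics with the current object.
        for i in range(n_ctrs):
            if (ctr_type == "Borders" and label > i) or \
               (ctr_type == "Buckets" and label == i):
                count_in_class[value][i] += 1
        total_count[value] += 1
    return result
```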
Type: BinarizedTargetMeanValue

Formula: $ctr = \frac{countInClass + prior}{totalCount + 1}$

How ctr is calculated:
- countInClass is the sum of the label values divided by the maximum label value integer ($k$).
- totalCount is the total number of objects that have a feature value matching the current one.
- prior is a number (constant) defined by the starting parameters.
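As a sketch, the same calculation for a single categorical feature value, assuming integer label classes in [0; k]:

```python
def binarized_target_mean_value(labels, k, prior):
    # `labels` holds the integer label classes of every object that
    # shares the current categorical feature value.
    count_in_class = sum(labels) / k      # sum of labels over the max class
    total_count = len(labels)             # all objects with this value
    return (count_in_class + prior) / (total_count + 1)
```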
Type: Counter

Formula: $ctr = \frac{curCount + prior}{maxCount + 1}$

How ctr is calculated for the training dataset:
- curCount is the total number of objects in the training dataset with the current categorical feature value.
- maxCount is the number of objects in the training dataset with the most frequent feature value.
- prior is a number (constant) defined by the starting parameters.
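A minimal sketch of the training-time Counter ctr:

```python
from collections import Counter

def counter_ctr_train(values, prior):
    # Every occurrence of a value contributes to curCount; the most
    # frequent value in the training dataset sets maxCount.
    counts = Counter(values)
    max_count = max(counts.values())
    return {v: (c + prior) / (max_count + 1) for v, c in counts.items()}
```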
How ctr is calculated for the validation dataset:
- How curCount is computed depends on the chosen calculation method:
  - Full — the sum of the total number of objects in the training dataset with the current categorical feature value and the number of objects in the validation dataset with the current categorical feature value.
  - SkipTest — the total number of objects in the training dataset with the current categorical feature value.
- maxCount is the number of objects with the most frequent feature value in one of the following sets, depending on the chosen calculation method:
  - Full — the training and the validation datasets.
  - SkipTest — the training dataset.
- prior is a number (constant) defined by the starting parameters.

Note: This ctr does not depend on the label value.
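A sketch of the validation-time Counter under the two methods (in the Python package the method is selected with the counter_calc_method parameter):

```python
from collections import Counter

def counter_ctr_eval(train_values, valid_values, prior, method="SkipTest"):
    # Mirrors the Full / SkipTest calculation methods described above.
    counts = Counter(train_values)
    if method == "Full":
        counts.update(valid_values)        # training + validation objects
    max_count = max(counts.values())
    return [(counts.get(v, 0) + prior) / (max_count + 1)
            for v in valid_values]
```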
As a result, each categorical feature value or feature combination value is assigned a numerical feature.
Example of aggregating multiple features
Assume that the objects in the training set have two categorical features: the musical genre (rock, indie) and the musical style (dance, classical). These features can occur in different combinations. CatBoost can create a new feature that is a combination of those listed (dance rock, classic rock, dance indie, or indie classical). Any number of features can be combined.
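As an illustration only (CatBoost builds such combinations internally), a combined categorical feature can be thought of as the concatenation of the original values, which then receives its own ctr statistics:

```python
genres = ["rock", "indie", "rock", "indie"]
styles = ["dance", "classical", "classical", "dance"]

# Each pair of values forms a new categorical value for the combination.
combined = [f"{s} {g}" for g, s in zip(genres, styles)]
# ['dance rock', 'classical indie', 'classical rock', 'dance indie']
```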
Transforming categorical features to numerical features in classification
- CatBoost accepts a set of object properties and label values as input. The table below shows what the results of this stage look like.

  | Object # | Feature 1 | Feature 2 | ... | Genre | Function value |
  |---|---|---|---|---|---|
  | 1 | 2 | 40 | ... | rock | 1 |
  | 2 | 3 | 55 | ... | indie | 0 |
  | 3 | 5 | 34 | ... | pop | 1 |
  | 4 | 2 | 45 | ... | rock | 0 |
  | 5 | 4 | 53 | ... | rock | 0 |
  | 6 | 2 | 48 | ... | indie | 1 |
  | 7 | 5 | 42 | ... | rock | 1 |
  | ... | | | | | |
- The rows in the input file are randomly shuffled several times. Multiple random permutations are generated. The table below shows what the results of this stage look like.

  | Object # | Feature 1 | Feature 2 | ... | Genre | Function value |
  |---|---|---|---|---|---|
  | 1 | 4 | 53 | ... | rock | 0 |
  | 2 | 3 | 55 | ... | indie | 0 |
  | 3 | 2 | 40 | ... | rock | 1 |
  | 4 | 5 | 42 | ... | rock | 1 |
  | 5 | 5 | 34 | ... | pop | 1 |
  | 6 | 2 | 48 | ... | indie | 1 |
  | 7 | 2 | 45 | ... | rock | 0 |
  | ... | | | | | |
- All categorical feature values are transformed to numerical using the following formula:

  $ctr = \frac{countInClass + prior}{totalCount + 1}$

  - countInClass is how many times the label value was equal to 1 for objects with the current categorical feature value.
  - prior is the preliminary value for the numerator. It is determined by the starting parameters.
  - totalCount is the total number of objects (up to the current one) that have a categorical feature value matching the current one.

  Note: These values are calculated individually for each object using data from previous objects.

  In the example with musical genres, the categorical feature takes the values rock, pop, and indie, and prior is set to 0.05. The table below shows what the results of this stage look like (the sketch after this list reproduces the ctr column).

  | Object # | Feature 1 | Feature 2 | ... | Genre (ctr) | Function value |
  |---|---|---|---|---|---|
  | 1 | 4 | 53 | ... | 0.05 | 0 |
  | 2 | 3 | 55 | ... | 0.05 | 0 |
  | 3 | 2 | 40 | ... | 0.025 | 1 |
  | 4 | 5 | 42 | ... | 0.35 | 1 |
  | 5 | 5 | 34 | ... | 0.05 | 1 |
  | 6 | 2 | 48 | ... | 0.025 | 1 |
  | 7 | 2 | 45 | ... | 0.5125 | 0 |
  | ... | | | | | |
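A short sketch that reproduces the ctr column of the table above with prior = 0.05:

```python
from collections import defaultdict

genres = ["rock", "indie", "rock", "rock", "pop", "indie", "rock"]
labels = [0, 0, 1, 1, 1, 1, 0]
prior = 0.05

count_in_class = defaultdict(int)  # per genre: label == 1 among previous objects
total_count = defaultdict(int)     # per genre: previous objects seen
ctrs = []
for genre, label in zip(genres, labels):
    ctrs.append((count_in_class[genre] + prior) / (total_count[genre] + 1))
    count_in_class[genre] += label
    total_count[genre] += 1

print([round(c, 4) for c in ctrs])
# [0.05, 0.05, 0.025, 0.35, 0.05, 0.025, 0.5125]
```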
One-hot encoding is also supported. Use one of the following training parameters to enable it.
| Command-line version parameter | Python parameter | R parameter | Description |
|---|---|---|---|
| --one-hot-max-size | one_hot_max_size | one_hot_max_size | Use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value. Ctrs are not calculated for such features. See details. |
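For example, in the Python package:

```python
from catboost import CatBoostClassifier

# Categorical features with at most 2 distinct values (e.g. binary flags)
# are one-hot encoded; ctrs are calculated for the remaining ones.
model = CatBoostClassifier(one_hot_max_size=2)
```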