Transforming embedding features to numerical features
CatBoost supports the following types of features:
- Numerical. Values of such features can be real numbers, positive and negative infinity, and NaN (the latter represents missing values). Examples are the height (182, 173) or any binary feature (0, 1).
- Categorical (cat). Such features can take one of a limited number of possible values. These values are usually fixed. Examples are the musical genre (rock, indie, pop) and the musical style (dance, classical).
- Text. Such features contain regular text (for example, Music to hear, why hear'st thou music sadly?).
- Embedding. Such features contain fixed-size arrays of numeric values.
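As a rough illustration of the four feature types, the sketch below represents two dataset objects as plain Python dictionaries and checks that every embedding has the same length. The key names, the values, and the embedding_size helper are hypothetical and are not part of any CatBoost API.

```python
import math

# Two hypothetical dataset objects; every key name here is illustrative only.
rows = [
    {
        "height": 182.0,                          # numerical
        "genre": "rock",                          # categorical
        "lyrics": "Music to hear, why hear'st thou music sadly?",  # text
        "track_embedding": [0.12, -0.40, 0.93],   # embedding: fixed-size numeric array
    },
    {
        "height": math.nan,                       # NaN marks a missing numerical value
        "genre": "pop",
        "lyrics": "Some other lyrics",
        "track_embedding": [0.05, 0.31, -0.77],
    },
]


def embedding_size(rows, key):
    """Return the common embedding length, or raise if the lengths differ."""
    sizes = {len(row[key]) for row in rows}
    if len(sizes) != 1:
        raise ValueError(f"embeddings in {key!r} have differing sizes: {sorted(sizes)}")
    return sizes.pop()


size = embedding_size(rows, "track_embedding")
missing_heights = sum(math.isnan(row["height"]) for row in rows)
```

Here size is 3 and missing_heights is 1. A length check of this kind is a cheap way to catch malformed embeddings before loading the data.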
Embedding features are transformed to numerical features. The transformation generally includes the following stages:
- Loading and storing embedding features
  The embedding feature is loaded as a column. Every element in this column is a fixed-size array of numerical values.
  To load embedding features to CatBoost:
  - Specify the NumVector column type in the column descriptions file if the dataset is loaded from a file.
  - Use the embedding_features parameter in the Python package.
- Estimating numerical features
  Each embedding is transformed to one or more numerical features.
  Supported methods for calculating numerical features:
  - Linear Discriminant Analysis (LDA)
    - For classification, the features are calculated as Gaussian likelihood values for each class.
  - Nearest neighbor search (KNN)
    - For classification, the features are the counts of target classes among the neighbors found in the training set.
    - For regression, the single feature is the average target value among the neighbors found in the training set.
- Training
  The computed numerical features are passed to the regular CatBoost training algorithm.
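The neighbor-based estimation described above can be illustrated with a short, library-free sketch. It is not CatBoost's implementation: the function names, the brute-force search, the squared Euclidean distance, and the choice of k = 3 are all assumptions made for illustration.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_targets(query, train_embeddings, train_targets, k):
    """Return the targets of the k training embeddings closest to `query`."""
    order = sorted(
        range(len(train_embeddings)),
        key=lambda i: squared_distance(query, train_embeddings[i]),
    )
    return [train_targets[i] for i in order[:k]]


def knn_classification_features(query, train_embeddings, train_targets, classes, k=3):
    """One feature per class: how many of the k neighbors carry that class."""
    counts = Counter(nearest_targets(query, train_embeddings, train_targets, k))
    return [counts[c] for c in classes]


def knn_regression_feature(query, train_embeddings, train_targets, k=3):
    """A single feature: the mean target value of the k neighbors."""
    neighbors = nearest_targets(query, train_embeddings, train_targets, k)
    return sum(neighbors) / len(neighbors)


# Toy training set: four 2-dimensional embeddings (hypothetical values).
train_embeddings = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
class_targets = ["rock", "rock", "pop", "pop"]
numeric_targets = [1.0, 2.0, 10.0, 12.0]

query = [0.05, 0.1]
class_features = knn_classification_features(
    query, train_embeddings, class_targets, classes=["rock", "pop"], k=3
)
regression_feature = knn_regression_feature(
    query, train_embeddings, numeric_targets, k=3
)
```

For this query, the three nearest training embeddings carry the targets rock, rock, and pop, so class_features is [2, 1]. In the regression case the neighbor targets are 1.0, 2.0, and 12.0, so regression_feature is 5.0. Values like these would then be passed to training as ordinary numerical features.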