Transforming embedding features to numerical features
CatBoost supports the following types of features:
- Numerical. Examples are the height (182, 173), or any binary feature (0, 1).
- Categorical (cat). Such features can take one of a limited number of possible values. These values are usually fixed. Examples are the musical genre (rock, indie, pop) and the musical style (dance, classical).
- Text. Such features contain regular text (for example, Music to hear, why hear'st thou music sadly?).
- Embedding. Such features contain arrays of fixed size of numeric values.
Embedding features are transformed to numerical features before training. The transformation generally includes the following stages:
-
Loading and storing embedding features
The embedding feature is loaded as a column. Every element in this column is a fixed-size array of numerical values.
To load embedding features to CatBoost:
- Specify the NumVector column type in the column descriptions file if the dataset is loaded from a file.
- Use the embedding_features parameter in the Python package.
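The Python route can be sketched as follows. This is a minimal illustration, not a reference implementation: the data is synthetic, the column names height and emb and the helper build_pool are hypothetical, and the sketch assumes a pandas DataFrame whose embedding column holds one numeric array per row.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 6 objects, one regular
# numerical feature, and one 4-dimensional embedding per object.
rng = np.random.default_rng(0)
n_objects, emb_dim = 6, 4
heights = rng.normal(175.0, 8.0, size=n_objects)
embeddings = rng.normal(size=(n_objects, emb_dim))
labels = np.array([0, 1, 0, 1, 0, 1])


def build_pool(heights, embeddings, labels):
    """Hypothetical helper: wrap the arrays in a catboost.Pool."""
    # Imported here so the data preparation above runs without CatBoost.
    import pandas as pd
    from catboost import Pool

    frame = pd.DataFrame({
        "height": heights,
        # Each cell holds one fixed-size numeric array.
        "emb": list(embeddings),
    })
    # embedding_features names the columns to be treated as embeddings.
    return Pool(data=frame, label=labels, embedding_features=["emb"])
```

Calling build_pool(heights, embeddings, labels) returns a Pool that can be passed to training. Every embedding in the column must have the same length.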
-
Estimating numerical features
Each embedding is transformed into one or more numerical features.
Supported methods for calculating numerical features:
- Linear Discriminant Analysis (LDA):
  - For classification, the features are calculated as Gaussian likelihood values for each class.
- Nearest neighbor search (KNN):
  - For classification, the features are the counts of target classes among the neighbors found in the training set.
  - For regression, the single feature is the average target value among the neighbors found in the training set.
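The estimators described above can be illustrated with a small NumPy sketch. It is not CatBoost's internal implementation: the function names, the brute-force neighbor search, the shared-covariance Gaussian model, and the reg regularization term are all assumptions chosen to keep the example short.

```python
import numpy as np


def nearest_indices(train_emb, query_emb, k):
    """Indices of the k nearest training embeddings for each query."""
    diff = query_emb[:, None, :] - train_emb[None, :, :]
    dist2 = np.einsum("qtd,qtd->qt", diff, diff)  # squared Euclidean
    return np.argsort(dist2, axis=1)[:, :k]


def knn_class_counts(train_emb, train_y, query_emb, k, n_classes):
    """Classification: one feature per class, the count among neighbors."""
    neighbor_y = train_y[nearest_indices(train_emb, query_emb, k)]
    return np.stack(
        [(neighbor_y == c).sum(axis=1) for c in range(n_classes)], axis=1
    )


def knn_mean_target(train_emb, train_target, query_emb, k):
    """Regression: a single feature, the mean target among neighbors."""
    return train_target[nearest_indices(train_emb, query_emb, k)].mean(axis=1)


def gaussian_class_likelihoods(train_emb, train_y, query_emb, n_classes,
                               reg=1e-3):
    """Gaussian density of each query under each class (shared covariance)."""
    dim = train_emb.shape[1]
    means = np.stack(
        [train_emb[train_y == c].mean(axis=0) for c in range(n_classes)]
    )
    centered = train_emb - means[train_y]
    # Pooled within-class covariance, regularized to stay invertible.
    cov = centered.T @ centered / len(train_emb) + reg * np.eye(dim)
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    diff = query_emb[:, None, :] - means[None, :, :]
    maha = np.einsum("qcd,de,qce->qc", diff, inv, diff)
    return np.exp(-0.5 * (maha + logdet + dim * np.log(2 * np.pi)))


# Two well-separated synthetic clusters, one per class.
rng = np.random.default_rng(0)
train_emb = np.vstack([rng.normal(-5, 1, (20, 3)), rng.normal(5, 1, (20, 3))])
train_y = np.array([0] * 20 + [1] * 20)
query = np.array([[-5.0, -5.0, -5.0], [5.0, 5.0, 5.0]])

counts = knn_class_counts(train_emb, train_y, query, k=5, n_classes=2)
likes = gaussian_class_likelihoods(train_emb, train_y, query, n_classes=2)
means_t = knn_mean_target(train_emb, train_y * 10.0, query, k=5)
```

In each case a variable-length notion ("what is near this embedding") is reduced to a fixed number of numerical columns: one per class for classification, and a single column for regression.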
-
Training
Computed numerical features are passed to the regular CatBoost training algorithm.
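From the user's side, all three stages happen inside a single training call. The sketch below assumes the fit method of CatBoostClassifier accepts an embedding_features argument; the synthetic data, the column names noise and emb, and the helper train_model are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_objects, emb_dim = 200, 8
labels = rng.integers(0, 2, size=n_objects)
# Embeddings are shifted by the class label so there is signal to learn.
embeddings = rng.normal(size=(n_objects, emb_dim)) + labels[:, None]
noise = rng.normal(size=n_objects)  # an uninformative numerical feature


def train_model(embeddings, noise, labels, iterations=50):
    """Hypothetical helper: fit a classifier on both column types."""
    # Imported here so the data preparation above runs without CatBoost.
    import pandas as pd
    from catboost import CatBoostClassifier

    frame = pd.DataFrame({"noise": noise, "emb": list(embeddings)})
    model = CatBoostClassifier(iterations=iterations, verbose=False)
    # Numerical features are estimated from "emb" during fitting and are
    # then used for training together with "noise".
    model.fit(frame, labels, embedding_features=["emb"])
    return model
```

The returned model is used like any other CatBoost model, for example train_model(embeddings, noise, labels).predict(...) on a DataFrame with the same columns.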