Embeddings features

CatBoost supports numerical, categorical, text, and embeddings features.

Embedding features are not fed to the model directly. Instead, they are used to build new numerical features.
At the moment, we support two types of such derived numerical features. The first uses Linear Discriminant Analysis to project the embedding into a lower-dimensional space. The second uses nearest neighbor search to count the nearby embeddings in each class.
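
To make the second type of derived feature more concrete, here is a minimal NumPy sketch of counting nearby embeddings per class. It is a toy illustration only, not CatBoost's internal implementation: the function name `knn_class_counts`, the use of Euclidean distance, and the brute-force search are all assumptions made for the example.

```python
import numpy as np

def knn_class_counts(train_emb, train_labels, query_emb, k, n_classes):
    """For each query vector, count how many of its k nearest training
    vectors (Euclidean distance, brute force) belong to each class."""
    # Pairwise squared distances, shape (n_query, n_train).
    diffs = query_emb[:, None, :] - train_emb[None, :, :]
    dists = np.einsum("qtd,qtd->qt", diffs, diffs)
    # Indices of the k closest training vectors for every query.
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]
    counts = np.zeros((len(query_emb), n_classes), dtype=int)
    for row, neighbor_ids in enumerate(nearest):
        counts[row] = np.bincount(train_labels[neighbor_ids], minlength=n_classes)
    return counts

# Toy data: three class-0 vectors near the origin, two class-1 vectors near (5, 5).
train_emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 4.9]])
train_labels = np.array([0, 0, 0, 1, 1])
query_emb = np.array([[0.05, 0.05], [4.0, 4.0]])

counts = knn_class_counts(train_emb, train_labels, query_emb, k=3, n_classes=2)
print(counts)  # one row per query, one column per class
```

Each row of `counts` becomes a small numerical feature vector for the corresponding object: one count per class.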

We do not use the raw coordinates of embedding features in our models. If you think they could improve the quality of a model, add them as numerical features alongside the embedding features.
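
One possible way to add the coordinates as numerical features is to expand the embedding column into one numeric column per dimension while keeping the original column. The sketch below uses pandas. The column names `emb`, `price`, and `emb_0`, `emb_1`, ... are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

# Toy frame: one embedding column (a vector per row) and one ordinary numeric column.
df = pd.DataFrame({
    "emb": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])],
    "price": [10.0, 12.5],
})

# Stack the per-row vectors into an (n_rows, dim) matrix.
coords = np.vstack(df["emb"].tolist())
coord_names = [f"emb_{i}" for i in range(coords.shape[1])]
coords_df = pd.DataFrame(coords, columns=coord_names, index=df.index)

# Keep the original embedding column and add its coordinates next to it.
df_with_coords = pd.concat([df, coords_df], axis=1)
print(df_with_coords.columns.tolist())
```

The `emb` column can then still be declared as an embedding feature, while the `emb_*` columns are treated as ordinary numerical features.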

Even though every vector feature can be used in a model, we optimized performance for:

  • Vectors with dimensions on the order of several hundred.
  • Datasets with normally distributed classes.

Choose the implementation for details on the methods and parameters required to start using embeddings features.

Python package

Class / method

Parameters

embedding_features

A one-dimensional array of embedding column indices (specified as integers) or names (specified as strings).

Use only if the data parameter is a two-dimensional feature matrix (has one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series).

If any elements in this array are specified as names instead of indices, names for all columns must be provided. To do this, either use the feature_names parameter of this constructor to explicitly specify them or pass a pandas.DataFrame with column names specified in the data parameter.
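
As an illustration, the sketch below passes a pandas.DataFrame with named columns and refers to the embedding column by name. It assumes the constructor in question is `catboost.Pool`, which the text above does not name explicitly. The column names and the random data are hypothetical.

```python
import numpy as np
import pandas as pd

def make_frame(n_rows=20, dim=8, seed=0):
    """Build a toy dataset with one embedding column and one numeric column."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "emb": [rng.normal(size=dim) for _ in range(n_rows)],
        "price": rng.uniform(1.0, 100.0, size=n_rows),
        "label": [i % 2 for i in range(n_rows)],
    })

def build_pool(frame):
    """Wrap the frame in a Pool, declaring 'emb' as an embedding feature by name."""
    # Imported here so the data preparation above runs even without catboost installed.
    from catboost import Pool
    return Pool(
        data=frame[["emb", "price"]],  # column names come from the DataFrame
        label=frame["label"],
        embedding_features=["emb"],    # name, not index
    )

frame = make_frame()
# pool = build_pool(frame)  # uncomment when the catboost package is available
```

Because the DataFrame carries column names, `embedding_features` can use the name `emb`. With a plain list or numpy.ndarray, the index of the column would be used instead, or names would have to be supplied through `feature_names`.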

Command-line version binary

If the input datasets contain embedding features, specify the NumVector type for their columns in the column description file.
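
The following sketch writes a small column description file and a matching dataset from Python to show what such a declaration might look like. The file names, the column names, and in particular the `;` separator between vector elements are assumptions made for this example. Check the column description and dataset format references for the exact syntax.

```python
from pathlib import Path
import tempfile

# Column description: <column index> TAB <type> TAB <optional name>.
cd_lines = [
    "0\tLabel",
    "1\tNumVector\temb",   # the embedding column
    "2\tNum\tprice",
]

# Assumption: elements of the vector are separated by ';' inside the column.
data_lines = [
    "1\t0.1;0.2;0.3\t10.0",
    "0\t0.4;0.5;0.6\t12.5",
]

workdir = Path(tempfile.mkdtemp())
cd_path = workdir / "train.cd"
data_path = workdir / "train.tsv"
cd_path.write_text("\n".join(cd_lines) + "\n")
data_path.write_text("\n".join(data_lines) + "\n")

print(cd_path.read_text())
```

The two files could then be passed to the command-line binary, for example with `catboost fit --learn-set train.tsv --column-description train.cd`.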