Transforming embedding features to numerical features

CatBoost supports the following types of features:

  • Numerical. Examples are the height (182, 173), or any binary feature (0, 1).

  • Categorical (cat). Such features can take one of a limited number of possible values. These values are usually fixed. Examples are the musical genre (rock, indie, pop) and the musical style (dance, classical).

  • Text. Such features contain regular text (for example, "Music to hear, why hear'st thou music sadly?").

  • Embedding. Such features contain fixed-size arrays of numeric values.

Embedding features are transformed into numerical features. The transformation generally includes the following stages:

  1. Loading and storing embedding features

    The embedding feature is loaded as a column. Every element in this column is a fixed-size array of numerical values.

    To load embedding features into CatBoost:

    • Specify the NumVector column type in the column descriptions file if the dataset is loaded from a file.
    • Use the embedding_features parameter in the Python package.
  2. Estimating numerical features

    Each embedding is transformed into one or more numerical features.

    Supported methods for calculating numerical features:

    • Linear Discriminant Analysis.

      • For classification, the features are the Gaussian likelihood values calculated for each class.
    • K Nearest Neighbors.

      • For classification, the features are the counts of each target class among the neighbors found in the training set.
      • For regression, the single feature is the average target value among the neighbors found in the training set.
  3. Training

    The computed numerical features are passed to the regular CatBoost training algorithm.
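
To make the K Nearest Neighbors method from stage 2 concrete, here is a conceptual NumPy sketch. It is not CatBoost's internal implementation: the brute-force search, the Euclidean distance, and the function names knn_class_counts and knn_mean_target are assumptions chosen for readability.

```python
import numpy as np

def nearest_indices(train_emb, query_emb, k):
    """Indices of the k closest training embeddings for each query (brute force)."""
    diffs = query_emb[:, None, :] - train_emb[None, :, :]
    sq_dists = np.einsum("qtd,qtd->qt", diffs, diffs)  # squared Euclidean distances
    return np.argsort(sq_dists, axis=1, kind="stable")[:, :k]

def knn_class_counts(train_emb, train_labels, query_emb, k, n_classes):
    """Classification: one feature per class, the count of that class among the neighbors."""
    neighbor_labels = train_labels[nearest_indices(train_emb, query_emb, k)]
    counts = np.zeros((len(query_emb), n_classes), dtype=int)
    for c in range(n_classes):
        counts[:, c] = (neighbor_labels == c).sum(axis=1)
    return counts

def knn_mean_target(train_emb, train_targets, query_emb, k):
    """Regression: a single feature, the average target value of the neighbors."""
    return train_targets[nearest_indices(train_emb, query_emb, k)].mean(axis=1)

# Toy training set: three embeddings near the origin, three near (10, 10).
train_emb = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
train_labels = np.array([0, 0, 0, 1, 1, 1])
train_targets = np.array([1.0, 2.0, 3.0, 10.0, 20.0, 30.0])
query_emb = np.array([[0.2, 0.2], [10.5, 10.5]])

class_counts = knn_class_counts(train_emb, train_labels, query_emb, k=3, n_classes=2)
mean_targets = knn_mean_target(train_emb, train_targets, query_emb, k=3)
print(class_counts)   # [[3 0], [0 3]]
print(mean_targets)   # [ 2. 20.]
```

Each query embedding becomes n_classes numbers in the classification case and a single number in the regression case, which matches "one or multiple" numerical features per embedding.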
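
The Linear Discriminant Analysis method from stage 2 can be illustrated in the same spirit. The sketch below fits one mean per class and a covariance matrix shared by all classes, then evaluates the Gaussian density of each query embedding under every class. The function name lda_gaussian_likelihoods and the small regularization term reg are assumptions. CatBoost's own estimator may differ in details such as regularization or projection.

```python
import numpy as np

def lda_gaussian_likelihoods(train_emb, train_labels, query_emb, n_classes, reg=1e-6):
    """Gaussian likelihood of each query embedding under each class (shared covariance)."""
    dim = train_emb.shape[1]
    # Per-class means, shape (n_classes, dim).
    means = np.stack([train_emb[train_labels == c].mean(axis=0) for c in range(n_classes)])
    # Pooled within-class covariance; reg keeps it invertible (an assumption).
    centered = train_emb - means[train_labels]
    cov = centered.T @ centered / (len(train_emb) - n_classes) + reg * np.eye(dim)
    cov_inv = np.linalg.inv(cov)
    _, log_det = np.linalg.slogdet(cov)
    # Squared Mahalanobis distance from every query to every class mean.
    diffs = query_emb[:, None, :] - means[None, :, :]
    maha = np.einsum("qcd,de,qce->qc", diffs, cov_inv, diffs)
    log_lik = -0.5 * (maha + log_det + dim * np.log(2.0 * np.pi))
    return np.exp(log_lik)  # shape (n_queries, n_classes)

# Toy training set: class 0 near the origin, class 1 near (10, 10).
lda_train = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
lda_labels = np.array([0, 0, 0, 1, 1, 1])
lda_queries = np.array([[0.2, 0.2], [10.5, 10.5]])

likelihoods = lda_gaussian_likelihoods(lda_train, lda_labels, lda_queries, n_classes=2)
# Each row holds one numerical feature per class; the larger value marks the closer class.
print(likelihoods.argmax(axis=1))
```

For well-separated classes the likelihoods of the far class underflow toward zero, so a production implementation would more likely work with log-likelihoods.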