Transforming categorical features to numerical features

CatBoost supports the following types of features:

  • Numerical. Examples are height (182, 173) or any binary feature (0, 1).

  • Categorical (cat). Such features can take one of a limited number of possible values. These values are usually fixed. Examples are the musical genre (rock, indie, pop) and the musical style (dance, classical).

  • Text. Such features contain regular text (for example, Music to hear, why hear'st thou music sadly?).

  • Embedding. Such features contain fixed-size arrays of numeric values.

Before each split is selected in the tree (see Choosing the tree structure), categorical features are transformed to numerical. This is done using various statistics on combinations of categorical features and combinations of categorical and numerical features.

The method of transforming categorical features to numerical generally includes the following stages:

  1. Permuting the set of input objects in a random order.

  2. Converting the label value from floating point to an integer.

    The method depends on the machine learning problem being solved (which is determined by the selected loss function).

    | Problem | How the transformation is performed |
    |---|---|
    | Regression | Quantization is performed on the label value. The mode and number of buckets ($k+1$) are set in the starting parameters. All values located inside a single bucket are assigned a label value class, an integer in the range $[0; k]$ defined by the formula: <bucket ID – 1>. |
    | Classification | Possible label values are "0" (the object does not belong to the specified target class) and "1" (the object belongs to the specified target class). |
    | Multiclassification | The label values are integer identifiers of target classes (starting from "0"). |
  3. Transforming categorical features to numerical features.

    The method is determined by the starting parameters.

    Type: Borders

    Formula:

    Calculating ctr for the $i$-th bucket ($i \in [0; k-1]$):

    $$ctr_i = \frac{countInClass + prior}{totalCount + 1}, \text{ where}$$

    • countInClass is how many times the label value exceeded $i$ for objects with the current categorical feature value. Only objects that appear before the current one in the shuffled order are counted.
    • totalCount is the total number of objects (up to the current one) that have a feature value matching the current one.
    • prior is a number (constant) defined by the starting parameters.

    Type: Buckets

    Formula:

    Calculating ctr for the $i$-th bucket ($i \in [0; k]$; this creates $k+1$ buckets):

    $$ctr_i = \frac{countInClass + prior}{totalCount + 1}, \text{ where}$$

    • countInClass is how many times the label value was equal to $i$ for objects with the current categorical feature value. Only objects that appear before the current one in the shuffled order are counted.
    • totalCount is the total number of objects (up to the current one) that have a feature value matching the current one.
    • prior is a number (constant) defined by the starting parameters.
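    The Borders and Buckets formulas above differ only in the range of $i$ and in the rule for counting countInClass. A minimal Python sketch of this ordered statistic, assuming the labels have already been converted to integer classes in $[0; k]$ as described in stage 2 (the function name and signature are illustrative, not CatBoost's internals):

```python
from collections import defaultdict

def ordered_ctr(cat_values, int_labels, k, prior, ctr_type="Borders"):
    """Ordered ctr statistics for one categorical feature.

    cat_values -- feature values in the shuffled object order
    int_labels -- integer label classes in [0, k], in the same order
    Returns one list of ctr_i values per object; the running counters
    only include objects that appear before the current one.
    """
    indices = range(k) if ctr_type == "Borders" else range(k + 1)
    count_in_class = defaultdict(lambda: [0] * len(indices))
    total_count = defaultdict(int)
    ctrs = []
    for value, label in zip(cat_values, int_labels):
        ctrs.append([(count_in_class[value][i] + prior) / (total_count[value] + 1)
                     for i in indices])
        for i in indices:
            # Borders counts labels exceeding i; Buckets counts labels equal to i.
            if (label > i) if ctr_type == "Borders" else (label == i):
                count_in_class[value][i] += 1
        total_count[value] += 1
    return ctrs
```

    For a binary label ($k = 1$), Borders produces a single value per object, which is exactly the avg_target statistic used in the classification walkthrough below.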

    Type: BinarizedTargetMeanValue

    Formula:

    How ctr is calculated:

    $$ctr = \frac{countInClass + prior}{totalCount + 1}, \text{ where}$$

    • countInClass is the ratio of the sum of the label value integers for this categorical feature value to the maximum label value integer ($k$).
    • totalCount is the total number of objects that have a feature value matching the current one.
    • prior is a number (constant) defined by the starting parameters.
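    A short sketch of this statistic under the definitions above; unlike Borders and Buckets it is not ordered, so it can be computed over all objects with the given feature value (the helper name is illustrative):

```python
def binarized_target_mean_ctr(cat_values, int_labels, k, prior):
    """One ctr per feature value; countInClass is sum(label integers) / k."""
    label_sum, count = {}, {}
    for value, label in zip(cat_values, int_labels):
        label_sum[value] = label_sum.get(value, 0) + label
        count[value] = count.get(value, 0) + 1
    return {value: (label_sum[value] / k + prior) / (count[value] + 1)
            for value in count}
```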

    Type: Counter

    Formula:

    How ctr is calculated for the training dataset:

    $$ctr = \frac{curCount + prior}{maxCount + 1}, \text{ where}$$

    • curCount is the total number of objects in the training dataset with the current categorical feature value.
    • maxCount is the number of objects in the training dataset with the most frequent feature value.
    • prior is a number (constant) defined by the starting parameters.

    How ctr is calculated for the validation dataset:

    $$ctr = \frac{curCount + prior}{maxCount + 1}, \text{ where}$$

    • curCount computing principles depend on the chosen calculation method:

      • Full — The sum of the total number of objects in the training dataset with the current categorical feature value and the number of objects in the validation dataset with the current categorical feature value.
      • SkipTest — The total number of objects in the training dataset with the current categorical feature value
    • maxCount is the number of objects with the most frequent feature value in one of the combinations of the following sets depending on the chosen calculation method:

      • Full — The training and the validation datasets.
      • SkipTest — The training dataset.
    • prior is a number (constant) defined by the starting parameters.

    Note

    This ctr does not depend on the label value.
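    A sketch covering both calculation methods, assuming the training and validation feature columns are available as plain lists (the helper and its signature are illustrative):

```python
from collections import Counter

def counter_ctr(train_values, prior, validation_values=(), method="SkipTest"):
    """Counter ctr values for the validation dataset.

    Full     -- counts include training and validation objects.
    SkipTest -- counts include training objects only.
    """
    counts = Counter(train_values)
    if method == "Full":
        counts.update(validation_values)
    max_count = max(counts.values())  # count of the most frequent feature value
    return {value: (counts[value] + prior) / (max_count + 1)
            for value in counts}
```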

As a result, each categorical feature value or feature combination value is assigned a numerical feature.

Example of aggregating multiple features

Assume that the objects in the training set have two categorical features: the musical genre (rock, indie) and the musical style (dance, classical). These features can occur in different combinations. CatBoost can create a new feature that is a combination of those listed (dance rock, classical rock, dance indie, or classical indie), and any number of features can be combined, as the sketch below illustrates.
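As an illustration of the idea (not CatBoost's internal representation), a combination can be treated as a single categorical feature whose values are formed by concatenating the values of the individual features:

```python
genres = ["rock", "indie", "rock"]
styles = ["dance", "classical", "classical"]

# Each pair of values becomes one combined categorical value,
# which then receives its own ctr statistics.
combined = [f"{style} {genre}" for genre, style in zip(genres, styles)]
print(combined)  # ['dance rock', 'classical indie', 'classical rock']
```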

Transforming categorical features to numerical features in classification

  1. CatBoost accepts a set of object properties and function values as input.

    The table below shows what the results of this stage look like.

    | Object # | $f_1$ | $f_2$ | ... | $f_n$ | Function value |
    |---|---|---|---|---|---|
    | 1 | 2 | 40 | ... | rock | 1 |
    | 2 | 3 | 55 | ... | indie | 0 |
    | 3 | 5 | 34 | ... | pop | 1 |
    | 4 | 2 | 45 | ... | rock | 0 |
    | 5 | 4 | 53 | ... | rock | 0 |
    | 6 | 2 | 48 | ... | indie | 1 |
    | 7 | 5 | 42 | ... | rock | 1 |
    ...
  2. The rows in the input file are randomly shuffled several times. Multiple random permutations are generated.

    The table below shows what the results of this stage look like.

    | Object # | $f_1$ | $f_2$ | ... | $f_n$ | Function value |
    |---|---|---|---|---|---|
    | 1 | 4 | 53 | ... | rock | 0 |
    | 2 | 3 | 55 | ... | indie | 0 |
    | 3 | 2 | 40 | ... | rock | 1 |
    | 4 | 5 | 42 | ... | rock | 1 |
    | 5 | 5 | 34 | ... | pop | 1 |
    | 6 | 2 | 48 | ... | indie | 1 |
    | 7 | 2 | 45 | ... | rock | 0 |
    ...
  3. All categorical feature values are transformed to numerical using the following formula:
    $$avg\_target = \frac{countInClass + prior}{totalCount + 1}, \text{ where}$$

    • countInClass is how many times the label value was equal to 1 for objects with the current categorical feature value.
    • prior is the preliminary value for the numerator. It is determined by the starting parameters.
    • totalCount is the total number of objects (up to the current one) that have a categorical feature value matching the current one.

    Note

    These values are calculated individually for each object using data from previous objects.

    In the example with musical genres, $j \in [1; 3]$ accepts the values rock, pop, and indie, and prior is set to 0.05.

    The table below shows what the results of this stage look like.

    | Object # | $f_1$ | $f_2$ | ... | $f_n$ | Function value |
    |---|---|---|---|---|---|
    | 1 | 4 | 53 | ... | 0.05 | 0 |
    | 2 | 3 | 55 | ... | 0.05 | 0 |
    | 3 | 2 | 40 | ... | 0.025 | 1 |
    | 4 | 5 | 42 | ... | 0.35 | 1 |
    | 5 | 5 | 34 | ... | 0.05 | 1 |
    | 6 | 2 | 48 | ... | 0.025 | 1 |
    | 7 | 2 | 45 | ... | 0.5125 | 0 |
    ...
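    The avg_target column in the table above can be reproduced with a few lines of Python, using the shuffled order and prior = 0.05 from this example:

```python
prior = 0.05
genres = ["rock", "indie", "rock", "rock", "pop", "indie", "rock"]
labels = [0, 0, 1, 1, 1, 1, 0]  # function values in the shuffled order

count_in_class, total_count = {}, {}
for genre, label in zip(genres, labels):
    avg_target = (count_in_class.get(genre, 0) + prior) / (total_count.get(genre, 0) + 1)
    print(genre, avg_target)  # prints 0.05, 0.05, 0.025, 0.35, 0.05, 0.025, 0.5125
    # Update the running counters only after the current object is processed.
    count_in_class[genre] = count_in_class.get(genre, 0) + label
    total_count[genre] = total_count.get(genre, 0) + 1
```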

One-hot encoding is also supported. Use one of the following training parameters to enable it.

| Command-line version parameter | Python parameter | R parameter | Description |
|---|---|---|---|
| --one-hot-max-size | one_hot_max_size | one_hot_max_size | Use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value. Ctrs are not calculated for such features. |

See details.
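For example, in the Python package (a minimal sketch; the toy dataset is illustrative):

```python
from catboost import CatBoostClassifier

X = [[40, "rock"], [55, "indie"], [34, "pop"],
     [45, "rock"], [53, "rock"], [48, "indie"]]
y = [1, 0, 1, 0, 0, 1]

# The genre feature has 3 distinct values, which is <= one_hot_max_size,
# so it is one-hot encoded instead of receiving ctr statistics.
model = CatBoostClassifier(one_hot_max_size=3, iterations=10, verbose=False)
model.fit(X, y, cat_features=[1])
```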