fit

Train a model.

Note. Set the task_type parameter in the class constructor to GPU to train the model on GPU. Training on GPU requires NVIDIA Driver of version 390.xx or higher.

Method call format

fit(X, 
    y=None,
    cat_features=None,
    sample_weight=None,
    baseline=None,
    use_best_model=None,
    eval_set=None,
    verbose=None, 
    logging_level=None,
    plot=False,
    column_description=None,
    verbose_eval=None, 
    metric_period=None, 
    silent=None, 
    early_stopping_rounds=None,
    save_snapshot=None, 
    snapshot_file=None, 
    snapshot_interval=None,
    init_model=None)

Parameters

Some parameters duplicate the ones specified in the constructor of the CatBoostClassifier class. In these cases the values specified for the fit method take precedence. The rest of the training parameters must be set in the constructor of the CatBoostClassifier class.

Parameter Possible types Description Default value Supported processing units
X catboost.Pool

The input training dataset.

Note.

If a nontrivial value of the cat_features parameter is specified in the constructor of this class, CatBoost checks that the categorical feature indices given in the constructor match the ones defined in this Pool object.

Required parameter

CPU and GPU

  • list
  • numpy.array
  • pandas.DataFrame
  • pandas.Series
  • string

The input training dataset in the form of a two-dimensional feature matrix.

catboost.FeaturesData

The input training dataset.

Note.

If a nontrivial value of the cat_features parameter is specified in the constructor of this class, it is prohibited to pass objects of this type.

y
  • list
  • numpy.array
  • pandas.DataFrame
  • pandas.Series

The target variables (in other words, the objects' label values) for the training dataset.

Must be in the form of a one-dimensional array. The type of data in the array depends on the machine learning task being solved:

  • Binary classification — Numeric values.

    The interpretation of numeric values depends on the selected loss function:

    • Logloss — The value is considered a positive class if it is strictly greater than the value of the border parameter of the loss function. Otherwise, it is considered a negative class.
    • CrossEntropy — The value is interpreted as the probability that the dataset object belongs to the positive class. Possible values are in the range [0; 1].
  • Multiclassification — Integers or strings that represent the labels of the classes.
Note.
  • Do not use this parameter if the input training dataset (specified in the X parameter) type is catboost.Pool.

  • Only integers in the [0; classes_count – 1] range can be passed as the target variables in this parameter if the classes_count parameter is set in the constructor of the CatBoostClassifier class.
None

CPU and GPU
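
The Logloss rule for numeric targets can be sketched in plain Python. This is an illustration only, not CatBoost's implementation: the helper name to_binary_classes is hypothetical, and the default border of 0.5 is an assumption about the loss function's default.

```python
def to_binary_classes(targets, border=0.5):
    """Map numeric targets to classes: 1 if strictly greater than border, else 0."""
    return [1 if value > border else 0 for value in targets]


targets = [0.0, 0.5, 0.51, 1.0, 3.0]
print(to_binary_classes(targets))  # [0, 0, 1, 1, 1]
```

Note that a target exactly equal to the border falls into the negative class, because the comparison is strict.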

cat_features
  • list
  • numpy.array

A one-dimensional array of categorical column indices.

Use it only if the X parameter is a two-dimensional feature matrix (has one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series).

Note.

The cat_features parameter can also be specified in the constructor of the class. If it is, CatBoost checks the equivalence of the cat_features parameter specified in this method and in the constructor of the class.

None (all features are considered numerical)

CPU and GPU

sample_weight
  • list
  • numpy.array
  • pandas.DataFrame
  • pandas.Series

The weight of each object in the input data, given as a one-dimensional array-like object.

By default, it is set to 1 for all objects.

None

CPU and GPU

baseline
  • list
  • numpy.array

Array of formula values for all input objects. The training starts from these values for all input objects instead of starting from zero.

Note. Do not use this parameter if the input training dataset (specified in the X parameter) type is catboost.Pool.
None

CPU and GPU
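
The effect of a baseline can be pictured with a small sketch. It assumes a Logloss-style model whose raw formula value is the starting value plus the sum of the tree outputs, passed through a sigmoid to obtain a probability. The helper names and the tree outputs are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def raw_formula_value(tree_outputs, baseline=0.0):
    """Add the per-tree contributions on top of the starting value."""
    return baseline + sum(tree_outputs)


tree_outputs = [0.2, -0.1, 0.3]  # hypothetical contributions of three trees for one object

# Before any tree is built, the starting point alone defines the prediction.
print(sigmoid(raw_formula_value([])))                # 0.5 when starting from zero
print(sigmoid(raw_formula_value([], baseline=2.0)))  # about 0.88 with a baseline of 2.0

# With a baseline, the trees only need to model the remaining difference.
print(raw_formula_value(tree_outputs, baseline=2.0))
```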

use_best_model bool
If this parameter is set, the number of trees that are saved in the resulting model is defined as follows:
  1. Build the number of trees defined by the training parameters.
  2. Use the validation dataset to identify the iteration with the optimal value of the metric specified in --eval-metric (eval_metric).

No trees are saved after this iteration.

This option requires a validation dataset to be provided.

True if a validation set is input (the eval_set parameter is defined) and at least one of the label values of objects in this set differs from the others. False otherwise.

CPU and GPU
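
The two steps above amount to truncating the model at the best validation iteration. A minimal sketch of that selection follows; the helper trees_to_keep and the metric values are hypothetical, and the sketch assumes the first best iteration wins in the case of a tie.

```python
def trees_to_keep(validation_metric_by_iteration, lower_is_better=True):
    """Return the number of trees kept: all trees up to and including the best iteration."""
    choose = min if lower_is_better else max
    best_iteration = choose(
        range(len(validation_metric_by_iteration)),
        key=validation_metric_by_iteration.__getitem__,
    )
    return best_iteration + 1


logloss_on_validation = [0.69, 0.60, 0.55, 0.57, 0.58]  # hypothetical values
print(trees_to_keep(logloss_on_validation))  # 3
```

For a metric where higher is better (for example, accuracy), pass lower_is_better=False.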

eval_set
  • catboost.Pool
  • list of catboost.Pool
  • tuple (x, y)
  • list of tuples (x, y)
  • string (path to the dataset file)
  • list of strings (paths to dataset files)
The validation dataset or datasets used for the following processes:
None

CPU and GPU

Note. Only a single validation dataset can be input if the training is performed on GPU.

verbose

Alias: verbose_eval

  • bool
  • int

The purpose of this parameter depends on the type of the given value:

  • bool — Defines the logging level:
    • “True” corresponds to the Verbose logging level
    • “False” corresponds to the Silent logging level
  • int — Use the Verbose logging level and set the logging period to the value of this parameter.
Restriction. Do not use this parameter with the logging_level parameter.
1

CPU and GPU
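
The type-dependent rules for verbose can be sketched as follows. The function resolve_verbose is hypothetical and only mirrors the description above; it does not reproduce CatBoost's internal handling, and edge cases that the description does not cover (such as 0) are passed through unchanged.

```python
def resolve_verbose(verbose, logging_level=None):
    """Return (logging_level, logging_period) for a given verbose value."""
    if logging_level is not None:
        raise ValueError("verbose cannot be combined with logging_level")
    # bool must be tested before int, because bool is a subclass of int in Python.
    if isinstance(verbose, bool):
        return ("Verbose" if verbose else "Silent", None)
    if isinstance(verbose, int):
        return ("Verbose", verbose)
    raise TypeError("verbose must be a bool or an int")


print(resolve_verbose(True))   # ('Verbose', None)
print(resolve_verbose(False))  # ('Silent', None)
print(resolve_verbose(100))    # ('Verbose', 100)
```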

logging_level string

The logging level to output to stdout.

Possible values:
  • Silent — Do not output any logging information to stdout.

  • Verbose — Output the following data to stdout:

    • optimized metric
    • elapsed time of training
    • remaining time of training
  • Info — Output additional information and the number of trees.

  • Debug — Output debugging information.
None (corresponds to the Verbose logging level)

CPU and GPU

plot bool
Plot the following information during training:
  • the metric values;
  • the custom loss values;
  • the time that has passed since training started;
  • the remaining time until the end of training.
This option can be used if training is performed in a Jupyter notebook.
False

CPU

column_description string

The path to the input file that contains the columns description.

The given file is used to build pools from the train and/or validation datasets, which are input from files.

None

CPU and GPU

silent bool
Defines the logging level:
  • “True” — corresponds to the Silent logging level
  • “False” — corresponds to the Verbose logging level
False

CPU and GPU

early_stopping_rounds int