fit

Train a model.

Note

Set the task_type parameter in the class constructor to GPU to train the model on GPU. Training on GPU requires an NVIDIA driver of version 450.xx or higher.

Method call format

fit(X,
    y=None,
    cat_features=None,
    text_features=None,
    embedding_features=None,
    pairs=None,
    graph=None,
    sample_weight=None,
    group_id=None,
    group_weight=None,
    subgroup_id=None,
    pairs_weight=None,
    baseline=None,
    use_best_model=None,
    eval_set=None,
    verbose=None,
    logging_level=None,
    plot=False,
    plot_file=None,
    column_description=None,
    verbose_eval=None,
    metric_period=None,
    silent=None,
    early_stopping_rounds=None,
    save_snapshot=None,
    snapshot_file=None,
    snapshot_interval=None,
    init_model=None,
    log_cout=sys.stdout,
    log_cerr=sys.stderr)

Parameters

Some parameters duplicate the ones specified in the constructor of the CatBoost class. In these cases the values specified for the fit method take precedence. The rest of the training parameters must be set in the constructor of the CatBoost class.
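A minimal sketch of this split between constructor and fit parameters. The toy data, the iterations value and the use of allow_writing_files are illustrative assumptions, not part of this reference; the CatBoost call is skipped when the package is not installed.

```python
try:
    from catboost import CatBoostClassifier
except ImportError:  # CatBoost is optional for this sketch
    CatBoostClassifier = None

# Hypothetical toy data: two numeric features, binary labels.
X = [[1, 4], [2, 5], [3, 6], [4, 7], [10, 40], [20, 50], [30, 60], [40, 70]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = None
if CatBoostClassifier is not None:
    # Training parameters such as iterations are set in the constructor.
    model = CatBoostClassifier(iterations=20, verbose=True,
                               allow_writing_files=False)
    # verbose is accepted in both places; the value given to fit() wins,
    # so this run is expected to be silent.
    model.fit(X, y, verbose=False)
```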

X

Description

The description is different for each group of possible types.

Possible types

catboost.Pool

The input training dataset.

Note

If a nontrivial value of the cat_features parameter is specified in the constructor of this class, CatBoost checks that the categorical feature indices specified in the constructor match the ones specified in this Pool.

list, numpy.ndarray, pandas.DataFrame, pandas.Series

The input training dataset in the form of a two-dimensional feature matrix.

pandas.SparseDataFrame, scipy.sparse.spmatrix (all subclasses except dia_matrix)

The input training dataset in the form of a two-dimensional sparse feature matrix.

Default value

Required parameter

Supported processing units

CPU and GPU

y

Description

The target variables (in other words, the objects' label values) for the training dataset.

Must be in the form of a one- or two-dimensional array. The type of data in the array depends on the machine learning task being solved:

  • Regression and ranking — One-dimensional array of numeric values.

  • Multiregression - Two-dimensional array of numeric values. The first index is for a dimension, the second index is for an object.

  • Binary classification
    One-dimensional array containing one of:

    • Booleans, integers or strings that represent the labels of the classes (only two unique values).

    • Numeric values.
      The interpretation of numeric values depends on the selected loss function:

      • Logloss — The value is considered a positive class if it is strictly greater than the value of the target_border training parameter. Otherwise, it is considered a negative class.
      • CrossEntropy — The value is interpreted as the probability that the dataset object belongs to the positive class. Possible values are in the range [0; 1].
  • Multiclassification — One-dimensional array of integers or strings that represent the labels of the classes.

  • Multi label classification
    Two-dimensional array. The first index is for a label/class, the second index is for an object.

    Possible values depend on the selected loss function:

    • MultiLogloss — Only {0, 1} or {False, True} values are allowed that specify whether an object belongs to the class corresponding to the first index.
    • MultiCrossEntropy — Numerical values in the range [0; 1] that are interpreted as the probability that the dataset object belongs to the class corresponding to the first index.
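The two interpretations of numeric binary targets can be sketched in plain Python. The helper name logloss_class and the border value 0.5 are assumptions made for illustration only.

```python
def logloss_class(value, target_border):
    """Logloss reading: positive (1) only if strictly greater than the border."""
    return 1 if value > target_border else 0

raw_targets = [0.0, 0.5, 0.7, 1.0]

# Logloss: each value is compared with target_border (0.5 is a hypothetical choice).
logloss_labels = [logloss_class(v, target_border=0.5) for v in raw_targets]

# CrossEntropy: values are kept as probabilities and must lie in [0; 1].
cross_entropy_ok = all(0.0 <= v <= 1.0 for v in raw_targets)
```

Note that a value equal to the border (0.5 here) falls into the negative class because the comparison is strict.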

Note

Do not use this parameter if the input training dataset (specified in the X parameter) type is catboost.Pool.

Possible types

  • list
  • numpy.ndarray
  • pandas.DataFrame
  • pandas.Series

Default value

None

Supported processing units

CPU and GPU

cat_features

Description

A one-dimensional array of categorical column indices.

Use it only if the X parameter is a two-dimensional feature matrix (has one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series).

Note

The cat_features parameter can also be specified in the constructor of the class. If it is, CatBoost checks the equivalence of the cat_features parameter specified in this method and in the constructor of the class.
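When the feature matrix has named columns, the index array can be derived from the names. This is a small sketch; column_indices and the column names are hypothetical.

```python
def column_indices(columns, selected):
    """Map column names to their zero-based positions in the feature matrix."""
    missing = [name for name in selected if name not in columns]
    if missing:
        raise ValueError(f"unknown columns: {missing}")
    return [columns.index(name) for name in selected]

columns = ["age", "city", "income", "device"]  # hypothetical feature names
cat_features = column_indices(columns, ["city", "device"])
```

The resulting list ([1, 3] for these names) has the shape expected by cat_features.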

Possible types

  • list
  • numpy.ndarray

Default value

None (all features are either considered numerical or of other types if specified precisely)

Supported processing units

CPU and GPU

text_features

Description

A one-dimensional array of text columns indices (specified as integers) or names (specified as strings).

Use only if the X parameter is a two-dimensional feature matrix (has one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series).

If any elements in this array are specified as names instead of indices, names for all columns must be provided. To do this, pass a pandas.DataFrame with column names in the X parameter.

Possible types

  • list
  • numpy.ndarray

Default value

None (all features are either considered numerical or of other types if specified precisely)

Supported processing units

CPU and GPU

embedding_features

Description

A one-dimensional array of embedding columns indices (specified as integers) or names (specified as strings).

Use only if the X parameter is a two-dimensional feature matrix (has one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series).

If any elements in this array are specified as names instead of indices, names for all columns must be provided. To do this, pass a pandas.DataFrame with column names in the X parameter.

Possible types

  • list
  • numpy.ndarray

Default value

None (all features are either considered numerical or of other types if specified precisely)

Supported processing units

CPU and GPU

pairs

Description

The pairs description in the form of a two-dimensional matrix of shape N by 2:

  • N is the number of pairs.
  • The first element of the pair is the zero-based index of the winner object from the input dataset for pairwise comparison.
  • The second element of the pair is the zero-based index of the loser object from the input dataset for pairwise comparison.

This information is used for calculation and optimization of Pairwise metrics.

Possible types

  • list
  • numpy.ndarray
  • pandas.DataFrame

Default value

None

Pairwise metrics require pairs data. If this data is not provided explicitly by specifying this parameter, pairs are generated automatically in each group using object label values.
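To illustrate the N by 2 layout, the sketch below builds (winner, loser) pairs from label values inside each group. It assumes that the object with the higher label wins; the procedure CatBoost uses internally may differ, and generate_pairs is a hypothetical helper.

```python
from itertools import combinations

def generate_pairs(labels, group_ids):
    """Return [winner_index, loser_index] rows for objects of the same group."""
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        # Skip objects from different groups and ties in the label value.
        if group_ids[i] != group_ids[j] or labels[i] == labels[j]:
            continue
        winner, loser = (i, j) if labels[i] > labels[j] else (j, i)
        pairs.append([winner, loser])
    return pairs

labels = [3, 1, 2, 0, 5]
group_ids = [0, 0, 0, 1, 1]
pairs = generate_pairs(labels, group_ids)
```

Each row holds zero-based object indices, so a list like this can be passed as the pairs parameter.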

Supported processing units

CPU and GPU

graph

Description

The graph description in the form of a two-dimensional matrix of shape N by 2:

  • N is the number of edges.
  • The first element of the edge is the zero-based index of start vertex (object) from the input dataset.
  • The second element of the edge is the zero-based index of end vertex (object) from the input dataset.

Graph information is used to calculate the graph aggregated features.

Possible types

  • list
  • numpy.ndarray
  • pandas.DataFrame

Default value

None

Supported processing units

CPU and GPU

sample_weight

Description

The weight of each object in the input data in the form of one-dimensional array-like data.

By default, it is set to 1 for all objects.

Possible types

  • list
  • numpy.ndarray
  • pandas.DataFrame
  • pandas.Series

Default value

None

Supported processing units

CPU and GPU

group_id

Description

Group identifiers for all input objects. Supported identifier types are:

  • int
  • string types (string or unicode for Python 2 and bytes or string for Python 3).

Possible types

  • list
  • numpy.ndarray

Default value

None

Supported processing units

CPU

group_weight

Description

The weights of all objects within the defined groups from the input data in the form of one-dimensional array-like data.

Used for calculating the final values of trees. By default, it is set to 1 for all objects in all groups.

Alert

Only one of the following parameters can be used at a time:

  • sample_weight
  • group_weight

Possible types

  • list
  • numpy.ndarray

Default value

None

Supported processing units

CPU

subgroup_id

Description

Subgroup identifiers for all input objects. Supported identifier types are:

  • int
  • string types (string or unicode for Python 2 and bytes or string for Python 3).

Possible types

  • list
  • numpy.ndarray

Default value

None

Supported processing units

CPU

pairs_weight

Description

The weight of each input pair of objects in the form of one-dimensional array-like data. The number of given values must match the number of specified pairs.

This information is used for calculation and optimization of Pairwise metrics.

By default, it is set to 1 for all pairs.

Possible types

  • list
  • numpy.ndarray

Default value

None

Supported processing units

CPU and GPU

baseline

Description

Array of formula values for all input objects. The training starts from these values for all input objects instead of starting from zero.

Note

Do not use this parameter if the input training dataset (specified in the X parameter) type is catboost.Pool.

Possible types

  • list
  • numpy.ndarray

Default value

None

Supported processing units

CPU and GPU

use_best_model

Description

If this parameter is set, the number of trees that are saved in the resulting model is defined as follows:

  1. Build the number of trees defined by the training parameters.
  2. Use the validation dataset to identify the iteration with the optimal value of the metric specified in the eval_metric parameter.

No trees are saved after this iteration.

This option requires a validation dataset to be provided.
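The selection described above can be sketched as follows. The metric values and tree placeholders are made up for illustration, and the sketch assumes a metric where lower is better.

```python
def best_iteration(metric_values, lower_is_better=True):
    """Index of the iteration with the optimal validation metric value."""
    pick = min if lower_is_better else max
    return pick(range(len(metric_values)), key=metric_values.__getitem__)

# Hypothetical validation metric per iteration (one tree per iteration).
validation_metric = [0.69, 0.52, 0.41, 0.38, 0.40, 0.43]
trees = [f"tree_{i}" for i in range(len(validation_metric))]

best = best_iteration(validation_metric)
kept_trees = trees[: best + 1]  # trees built after the best iteration are dropped
```

Here the best value is reached at iteration 3, so four of the six trees are kept.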

Possible types

bool

Default value

True if validation sets are specified (the eval_set parameter is defined) and at least one of the label values of objects in the last validation dataset differs from the others. False otherwise.

Supported processing units

CPU and GPU

eval_set

Description

The validation dataset or datasets used for the following processes:

  • overfitting detector
  • best iteration selection
  • monitoring metrics' changes

Possible types

  • catboost.Pool
  • list of catboost.Pool
  • tuple (X, y)
  • list of tuples (X, y)
  • string (path to the dataset file)
  • list of strings (paths to dataset files)

Default value

None

Supported processing units

CPU and GPU

Note

GPU training does not currently support multiple validation datasets.

verbose

Alias: verbose_eval

Description

The purpose of this parameter depends on the type of the given value:

  • bool — Defines the logging level:

    • True corresponds to the Verbose logging level
    • False corresponds to the Silent logging level
  • int — Use the Verbose logging level and set the logging period to the value of this parameter.

Alert

Do not use this parameter with the logging_level parameter.
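The type-dependent behaviour can be expressed as a small dispatcher. This is only a sketch of the rules listed above, not the library's code; resolve_verbose is a hypothetical name and the sketch handles only bool and positive int values.

```python
def resolve_verbose(verbose):
    """Return (logging_level, logging_period); period is None when unchanged."""
    # bool must be tested first because bool is a subclass of int in Python.
    if isinstance(verbose, bool):
        return ("Verbose" if verbose else "Silent", None)
    if isinstance(verbose, int) and verbose > 0:
        return ("Verbose", verbose)
    raise ValueError("this sketch only handles bool and positive int values")

levels = [resolve_verbose(v) for v in (True, False, 100)]
```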

Possible types

  • bool
  • int

Default value

1

Supported processing units

CPU and GPU

logging_level

Description

The logging level to output to stdout.

Possible values:

  • Silent — Do not output any logging information to stdout.

  • Verbose — Output the following data to stdout:

    • optimized metric
    • elapsed time of training
    • remaining time of training
  • Info — Output additional information and the number of trees.

  • Debug — Output debugging information.

Possible types

string

Default value

None (corresponds to the Verbose logging level)

Supported processing units

CPU and GPU

plot

Description

Plot the following information during training:

  • the metric values;
  • the custom loss values;
  • the loss function change during feature selection;
  • the time that has passed since training started;
  • the remaining time until the end of training.

This option can be used if training is performed in a Jupyter notebook.

Possible types

bool

Default value

False

Supported processing units

CPU

plot_file

Description

Save a plot with the training progress information (metric values, custom loss values) to the file specified by this parameter.

Possible types

File-like object or string

Default value

None

Supported processing units

CPU and GPU

column_description

Description

The path to the input file that contains the columns description.

The given file is used to build pools from the train and/or validation datasets, which are input from files.

Possible types

string

Default value

None

Supported processing units

CPU and GPU

metric_period

Description

The frequency of iterations to calculate the values of objectives and metrics. The value should be a positive integer.

Setting a value greater than 1 speeds up training because objectives and metrics are calculated less often.

Note

It is recommended to increase the value of this parameter to maintain training speed if a GPU processing unit type is used.
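As a rough illustration of the effect, the sketch below lists the iterations at which metrics would be calculated for a given period. It assumes that the final iteration is always evaluated, which is an assumption of this sketch rather than something stated here; metric_iterations is a hypothetical helper.

```python
def metric_iterations(iterations, metric_period):
    """Zero-based iterations at which metrics are calculated."""
    if metric_period < 1:
        raise ValueError("metric_period must be a positive integer")
    evaluated = list(range(0, iterations, metric_period))
    # Assumption: the last iteration is evaluated even if it is off-period.
    if iterations and evaluated[-1] != iterations - 1:
        evaluated.append(iterations - 1)
    return evaluated

every_iteration = metric_iterations(10, 1)
every_fourth = metric_iterations(10, 4)
```

With a period of 4, only four of the ten iterations trigger a metric calculation.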

Possible types

int

Default value

1

Supported processing units

CPU and GPU

silent

Description

Defines the logging level:

  • True — corresponds to the Silent logging level
  • False — corresponds to the Verbose logging level

Possible types

bool

Default value

False

Supported processing units

CPU and GPU

early_stopping_rounds

Description

Sets the overfitting detector type to Iter and stops the training after the specified number of iterations since the iteration with the optimal metric value.
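The stopping rule can be sketched in a few lines. The metric values are invented, the metric is assumed to be one where lower is better, and stop_iteration is a hypothetical helper rather than a CatBoost function.

```python
def stop_iteration(metric_values, early_stopping_rounds):
    """Index of the iteration at which training stops (lower metric is better)."""
    best_value, best_index = None, -1
    for i, value in enumerate(metric_values):
        if best_value is None or value < best_value:
            best_value, best_index = value, i
        elif i - best_index >= early_stopping_rounds:
            return i  # no improvement for the given number of iterations
    return len(metric_values) - 1  # ran through all iterations

validation_metric = [0.60, 0.50, 0.45, 0.46, 0.47, 0.48, 0.44]
stopped_at = stop_iteration(validation_metric, early_stopping_rounds=3)
ran_to_end = stop_iteration(validation_metric, early_stopping_rounds=5)
```

With 3 rounds, training stops at iteration 5 and never reaches the later improvement at iteration 6; with 5 rounds it runs to the end.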

Possible types

int

Default value

False

Supported processing units

CPU and GPU

save_snapshot

Description

Enable snapshotting for restoring the training progress after an interruption. If enabled, the default period for making snapshots is 600 seconds. Use the snapshot_interval parameter to change this period.

Note

This parameter is not supported in the params parameter of the cv function.

Possible types

bool

Default value

None

Supported processing units

CPU and GPU

snapshot_file

Description

The name of the file to save the training progress information in. This file is used for recovering training after an interruption.

Depending on whether the specified file exists in the file system:

  • Missing — Write information about training progress to the specified file.
  • Exists — Load data from the specified file and continue training from where it left off.
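The two cases amount to a file-existence check, sketched here with a temporary directory. snapshot_action is a hypothetical helper and the file content is a placeholder, not a real snapshot.

```python
import os
import tempfile

def snapshot_action(snapshot_file):
    """Decide what happens for a snapshot path: start fresh or resume."""
    return "resume" if os.path.exists(snapshot_file) else "start"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "experiment.cbsnapshot")
    first_run = snapshot_action(path)   # file is missing: progress is written
    with open(path, "w") as f:
        f.write("placeholder")          # stands in for real snapshot data
    second_run = snapshot_action(path)  # file exists: training continues
```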

Note

This parameter is not supported in the params parameter of the cv function.

Possible types

string

Default value

experiment.cbsnapshot

Supported processing units

CPU and GPU

snapshot_interval

Description

The interval between saving snapshots in seconds.

The first snapshot is taken after the specified number of seconds since the start of training. Every subsequent snapshot is taken after the specified number of seconds since the previous one. The last snapshot is taken at the end of the training.
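A short sketch of the resulting schedule, assuming the training duration is known in advance (which is only for illustration); snapshot_times is a hypothetical helper.

```python
def snapshot_times(training_seconds, snapshot_interval=600):
    """Seconds since the start of training at which snapshots are taken."""
    if snapshot_interval < 1:
        raise ValueError("snapshot_interval must be a positive integer")
    # Periodic snapshots strictly before the end of training.
    times = list(range(snapshot_interval, training_seconds, snapshot_interval))
    times.append(training_seconds)  # the last snapshot is taken at the end
    return times

long_run = snapshot_times(1500)
short_run = snapshot_times(300)
```

A 1500-second run with the default interval yields snapshots at 600, 1200 and 1500 seconds, while a 300-second run only gets the final one.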

Note

This parameter is not supported in the params parameter of the cv function.

Possible types

int

Default value

600

Supported processing units

CPU and GPU

init_model

Description

The description is different for each group of possible types.

The model to continue learning from.

Note

The initial model must have the same problem type as the one being solved in the current training (binary classification, multiclassification or regression/ranking).

Possible types

catboost.CatBoost, catboost.CatBoostClassifier, catboost.CatBoostRegressor

The initial model object.

string

The path to the input file that contains the initial model.

Default value

None (incremental learning is not used)

Supported processing units

CPU

log_cout

Description

Output stream or callback for logging.

Possible types

  • callable Python object
  • Python object providing the write() method

Default value

sys.stdout

log_cerr

Description

Error stream or callback for logging.

Possible types

  • callable Python object
  • Python object providing the write() method

Default value

sys.stderr
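Both accepted kinds of value for log_cout and log_cerr can be illustrated with a small dispatcher. This shows how a sink of either kind could be driven, not how CatBoost does it internally; emit and the message text are hypothetical.

```python
import io

def emit(sink, message):
    """Send a message to a callable sink or to an object with write()."""
    if callable(sink):
        sink(message)
    else:
        sink.write(message)

collected = []           # callable sink: list.append used as a callback
buffer = io.StringIO()   # stream sink: provides write()

emit(collected.append, "iteration 0 done\n")
emit(buffer, "iteration 0 done\n")
```

A callback such as collected.append is convenient for capturing log lines in tests, while a stream object suits redirection to a file.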
