fit

Train a model.

Note

Set the task_type parameter in the class constructor to GPU to train the model on GPU. Training on GPU requires NVIDIA Driver of version 450.xx or higher.

Method call format

fit(X,
    y=None,
    cat_features=None,
    text_features=None,
    embedding_features=None,
    sample_weight=None,
    baseline=None,
    use_best_model=None,
    eval_set=None,
    verbose=None,
    logging_level=None,
    plot=False,
    plot_file=None,
    column_description=None,
    verbose_eval=None,
    metric_period=None,
    silent=None,
    early_stopping_rounds=None,
    save_snapshot=None,
    snapshot_file=None,
    snapshot_interval=None,
    init_model=None,
    log_cout=sys.stdout,
    log_cerr=sys.stderr)

Parameters

Some parameters duplicate the ones specified in the constructor of the CatBoostClassifier class. In these cases the values specified for the fit method take precedence. The rest of the training parameters must be set in the constructor of the CatBoostClassifier class.

X

Description

The description is different for each group of possible types.

Possible types

catboost.Pool

The input training dataset.

Note

If a nontrivial value of the cat_features parameter is specified in the constructor of this class, CatBoost checks that the categorical feature indices specified in the constructor match the ones specified in this Pool object.

list, numpy.ndarray, pandas.DataFrame, pandas.Series

The input training dataset in the form of a two-dimensional feature matrix.

pandas.SparseDataFrame, scipy.sparse.spmatrix (all subclasses except dia_matrix)

The input training dataset in the form of a two-dimensional sparse feature matrix.

Default value

Required parameter

Supported processing units

CPU and GPU

y

Description

The target variables (in other words, the objects' label values) for the training dataset.

Must be in the form of a one- or two-dimensional array. The type of data in the array depends on the machine learning task being solved:

Binary classification
One-dimensional array containing one of:
- Booleans, integers or strings that represent the labels of the classes (only two unique values).
- Numeric values.
  The interpretation of numeric values depends on the selected loss function:
  - Logloss — The value is considered a positive class if it is strictly greater than the value of the target_border training parameter. Otherwise, it is considered a negative class.
  - CrossEntropy — The value is interpreted as the probability that the dataset object belongs to the positive class. Possible values are in the range [0; 1].
Multiclassification
One-dimensional array of integers or strings that represent the labels of the classes.
Multi label classification
Two-dimensional array. The first index is for a label/class, the second index is for an object.

Possible values depend on the selected loss function:
- MultiLogloss — Only {0, 1} or {False, True} values are allowed that specify whether an object belongs to the class corresponding to the first index.
- MultiCrossEntropy — Numerical values in the range [0; 1] that are interpreted as the probability that the dataset object belongs to the class corresponding to the first index.

Note

Do not use this parameter if the input training dataset (specified in the X parameter) type is catboost.Pool.

Possible types

list
numpy.ndarray
pandas.DataFrame
pandas.Series

Default value

None

Supported processing units

CPU and GPU

cat_features

Description

A one-dimensional array of categorical column indices.

Use it only if the X parameter is a two-dimensional feature matrix (has one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series).

Note

The cat_features parameter can also be specified in the constructor of the class. If it is, CatBoost checks that the cat_features value specified in this method matches the one specified in the constructor.

Possible types

list
numpy.ndarray

Default value

None (all features are either considered numerical or of other types if specified precisely)

Supported processing units

CPU and GPU

text_features

Description

A one-dimensional array of text column indices (specified as integers) or names (specified as strings).

Use only if the X parameter is a two-dimensional feature matrix (has one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series).

If any elements in this array are specified as names instead of indices, names for all columns must be provided. To do this, either use the feature_names parameter to explicitly specify them or pass a pandas.DataFrame with column names specified in the X parameter.

Possible types

list
numpy.ndarray

Default value

None (all features are either considered numerical or of other types if specified precisely)

Supported processing units

CPU and GPU

embedding_features

Description

A one-dimensional array of embedding column indices (specified as integers) or names (specified as strings).

Use only if the X parameter is a two-dimensional feature matrix (has one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series).

Possible types

list
numpy.ndarray

Default value

None (all features are either considered numerical or of other types if specified precisely)

Supported processing units

CPU and GPU

sample_weight

Description

The weight of each object in the input data, given as one-dimensional array-like data.

By default, it is set to 1 for all objects.

Possible types

list
numpy.ndarray
pandas.DataFrame
pandas.Series

Default value

None

Supported processing units

CPU and GPU

baseline

Description

Array of formula values for all input objects. The training starts from these values for all input objects instead of starting from zero.

Note

Do not use this parameter if the input training dataset (specified in the X parameter) type is catboost.Pool.

Possible types

list
numpy.ndarray

Default value

None

Supported processing units

CPU and GPU

use_best_model

Description

If this parameter is set, the number of trees that are saved in the resulting model is defined as follows:

Build the number of trees defined by the training parameters.
Use the validation dataset to identify the iteration with the optimal value of the metric specified in the eval_metric parameter.

No trees are saved after this iteration.

This option requires a validation dataset to be provided.

Possible types

bool

Default value

True if validation sets are specified (the eval_set parameter is defined) and at least one of the label values of objects in the last validation dataset differs from the others. False otherwise.

Supported processing units

CPU and GPU

eval_set

Description

The validation dataset or datasets used for the following processes:

overfitting detector
the best iteration selection
monitoring metrics' changes

Possible types

catboost.Pool
list of catboost.Pool
tuple (X, y)
list of tuples (X, y)
string (path to the dataset file)
list of strings (paths to dataset files)

Default value

None

Supported processing units

CPU and GPU

Note

GPU training does not support multiple validation datasets for now.

verbose

Alias: verbose_eval

Description

The purpose of this parameter depends on the type of the given value:

bool — Defines the logging level:
- True corresponds to the Verbose logging level
- False corresponds to the Silent logging level
int — Use the Verbose logging level and set the logging period to the value of this parameter.

Alert

Do not use this parameter with the logging_level parameter.

Possible types

bool
int

Default value

Supported processing units

CPU and GPU

logging_level

Description

The logging level to output to stdout.

Possible values:

Silent — Do not output any logging information to stdout.
Verbose — Output the following data to stdout:
- optimized metric
- elapsed time of training
- remaining time of training
Info — Output additional information and the number of trees.
Debug — Output debugging information.

Possible types

string

Default value

None (corresponds to the Verbose logging level)

Supported processing units

CPU and GPU

plot

Description

Plot the following information during training:

the metric values;
the custom loss values;
the loss function change during feature selection;
the time elapsed since training started;
the remaining time until the end of training.

This option can be used if training is performed in Jupyter notebook.

Possible types

bool

Default value

False

Supported processing units

CPU

plot_file

Description

Save a plot with the training progress information (metric values, custom loss values) to the file specified by this parameter.

Possible types

File-like object or string

Default value

None

Supported processing units

CPU and GPU

column_description

Description

The path to the input file that contains the columns description.

The given file is used to build pools from the train and/or validation datasets, which are input from files.

Default value

None

Supported processing units

CPU and GPU

silent

Description

Defines the logging level:

True — corresponds to the Silent logging level
False — corresponds to the Verbose logging level

Possible types

bool

Default value

False

Supported processing units

CPU and GPU

early_stopping_rounds

Description

Sets the overfitting detector type to Iter and stops the training after the specified number of iterations since the iteration with the optimal metric value.

Possible types

int

Default value

False

Supported processing units

CPU and GPU

save_snapshot

Description

Enable snapshotting for restoring the training progress after an interruption. If enabled, the default period for making snapshots is 600 seconds. Use the snapshot_interval parameter to change this period.

Note

This parameter is not supported in the params parameter of the cv function.

Possible types

bool

Default value

None

Supported processing units

CPU and GPU

snapshot_file

Description

The name of the file to save the training progress information in. This file is used for recovering training after an interruption.

Depending on whether the specified file exists in the file system:

Missing — Write information about training progress to the specified file.
Exists — Load data from the specified file and continue training from where it left off.

Note

This parameter is not supported in the params parameter of the cv function.

Possible types

string

Default value

experiment.cbsnapshot

Supported processing units

CPU and GPU

snapshot_interval

Description

The interval between saving snapshots in seconds.
The first snapshot is taken after the specified number of seconds since the start of training. Every subsequent snapshot is taken after the specified number of seconds since the previous one. The last snapshot is taken at the end of the training.

Note

This parameter is not supported in the params parameter of the cv function.

Possible types

int

Default value

600

Supported processing units

CPU and GPU

init_model

Description

The model to continue learning from. The description is different for each group of possible types.

Note

The initial model must have the same problem type as the one being solved in the current training (binary classification, multiclassification or regression/ranking).

Possible types

catboost.CatBoost, catboost.CatBoostClassifier

The initial model object.

string

The path to the input file that contains the initial model.

Default value

None (incremental learning is not used)

Supported processing units

CPU

log_cout

Output stream or callback for logging.

Possible types

callable Python object
Python object providing the write() method

Default value

sys.stdout

log_cerr

Error stream or callback for logging.

Possible types

callable Python object
Python object providing the write() method

Default value

sys.stderr

Usage examples

Train a model using a matrix

from catboost import CatBoostClassifier

# Indices of the columns to treat as categorical
cat_features = [0, 1, 2]

train_data = [["a", "b", 1, 4, 5, 6],
              ["a", "b", 4, 5, 6, 7],
              ["c", "d", 30, 40, 50, 60]]

train_labels = [1, 1, 0]

model = CatBoostClassifier(iterations=20)

# Train on the feature matrix, then predict class labels for the same objects
model.fit(train_data, train_labels, cat_features=cat_features)
predictions = model.predict(train_data)

Load the dataset using Pool, train a CatBoostClassifier model on it, and make a prediction

from catboost import CatBoostClassifier, Pool

# The Pool holds the features, labels and per-object weights together
train_data = Pool(data=[[1, 4, 5, 6],
                        [4, 5, 6, 7],
                        [30, 40, 50, 60]],
                  label=[1, 1, -1],
                  weight=[0.1, 0.2, 0.3])

model = CatBoostClassifier(iterations=10)

# The labels come from the Pool, so y is not passed to fit
model.fit(train_data)
preds_class = model.predict(train_data)
