fit
- Method call format
- Parameters
- X
- y
- cat_features
- text_features
- embedding_features
- pairs
- graph
- sample_weight
- group_id
- group_weight
- subgroup_id
- pairs_weight
- baseline
- use_best_model
- eval_set
- verbose
- logging_level
- plot
- plot_file
- column_description
- metric_period
- silent
- early_stopping_rounds
- save_snapshot
- snapshot_file
- snapshot_interval
- init_model
- log_cout
- log_cerr
Train a model.
Note
Set the task_type parameter in the class constructor to GPU to train the model on GPU. Training on GPU requires NVIDIA Driver of version 450.xx or higher.
Method call format
fit(X,
y=None,
cat_features=None,
text_features=None,
embedding_features=None,
pairs=None,
graph=None,
sample_weight=None,
group_id=None,
group_weight=None,
subgroup_id=None,
pairs_weight=None,
baseline=None,
use_best_model=None,
eval_set=None,
verbose=None,
logging_level=None,
plot=False,
plot_file=None,
column_description=None,
verbose_eval=None,
metric_period=None,
silent=None,
early_stopping_rounds=None,
save_snapshot=None,
snapshot_file=None,
snapshot_interval=None,
init_model=None,
log_cout=sys.stdout,
log_cerr=sys.stderr)
Parameters
Some parameters duplicate the ones specified in the constructor of the CatBoost class. In these cases the values specified for the fit method take precedence. The rest of the training parameters must be set in the constructor of the CatBoost class.
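The precedence rule above can be sketched as follows. This is a minimal illustration, assuming the catboost package is installed; the toy data and the helper name train_toy_model are made up for the example.

```python
# Toy data: four objects with two numeric features each (illustrative values).
X = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0], [4.0, 7.0]]
y = [0, 0, 1, 1]


def train_toy_model():
    """Train a small classifier (hypothetical helper; requires catboost)."""
    from catboost import CatBoostClassifier

    # Training parameters such as iterations are set in the constructor.
    model = CatBoostClassifier(iterations=10, verbose=True)
    # verbose is duplicated in fit; per the text above, the value passed
    # to fit takes precedence over the constructor value.
    model.fit(X, y, verbose=False)
    return model
```

Calling train_toy_model() should train silently, because the fit value of verbose is the one that applies.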
X
Description
The description is different for each group of possible types.
Possible types
catboost.Pool
The input training dataset.
Note
If a nontrivial value of the cat_features parameter is specified in the constructor of this class, CatBoost checks that the categorical feature indices given in the constructor match those defined in this Pool.
list, numpy.ndarray, pandas.DataFrame, pandas.Series
The input training dataset in the form of a two-dimensional feature matrix.
pandas.SparseDataFrame, scipy.sparse.spmatrix (all subclasses except dia_matrix)
The input training dataset in the form of a two-dimensional sparse feature matrix.
Default value
Required parameter
Supported processing units
CPU and GPU
y
Description
The target variables (in other words, the objects' label values) for the training dataset.
Must be in the form of a one- or two-dimensional array. The type of data in the array depends on the machine learning task being solved:
- Regression and ranking — One-dimensional array of numeric values.
- Multiregression — Two-dimensional array of numeric values. The first index is for a dimension, the second index is for an object.
- Binary classification. One-dimensional array containing one of:
  - Booleans, integers or strings that represent the labels of the classes (only two unique values).
  - Numeric values. The interpretation of numeric values depends on the selected loss function:
    - Logloss — The value is considered a positive class if it is strictly greater than the value of the target_border training parameter. Otherwise, it is considered a negative class.
    - CrossEntropy — The value is interpreted as the probability that the dataset object belongs to the positive class. Possible values are in the range [0; 1].
- Multiclassification — One-dimensional array of integers or strings that represent the labels of the classes.
- Multi label classification. Two-dimensional array. The first index is for a label/class, the second index is for an object. Possible values depend on the selected loss function:
  - MultiLogloss — Only {0, 1} or {False, True} values are allowed that specify whether an object belongs to the class corresponding to the first index.
  - MultiCrossEntropy — Numerical values in the range [0; 1] that are interpreted as the probability that the dataset object belongs to the class corresponding to the first index.
Note
Do not use this parameter if the input training dataset (specified in the X
parameter) type is catboost.Pool.
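The Logloss rule for numeric labels can be illustrated in plain Python. This is only a sketch of the comparison described above, not CatBoost's internal code; the function name binarize_for_logloss and the border value 0.5 are assumptions chosen for the example.

```python
def binarize_for_logloss(labels, target_border):
    """Map numeric labels to classes (hypothetical helper).

    A value is positive (1) only if it is strictly greater than the border.
    """
    return [1 if value > target_border else 0 for value in labels]


# 0.5 is not strictly greater than the border, so it maps to class 0.
classes = binarize_for_logloss([0.2, 0.5, 0.7, 1.0], target_border=0.5)
```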
Possible types
- list
- numpy.ndarray
- pandas.DataFrame
- pandas.Series
Default value
None
Supported processing units
CPU and GPU
cat_features
Description
A one-dimensional array of categorical columns indices.
Use it only if the X
parameter is a two-dimensional feature matrix (has one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series).
Note
The cat_features
parameter can also be specified in the constructor of the class. If it is, CatBoost checks the equivalence of the cat_features
parameter specified in this method and in the constructor of the class.
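A short sketch of passing categorical column indices to fit, assuming the catboost package is installed. The data values and the helper name train_with_categories are illustrative.

```python
# Column 0 holds categorical values (strings), column 1 is numeric.
X = [["red", 1.0], ["blue", 2.0], ["red", 3.0], ["green", 4.0]]
y = [1.0, 2.0, 3.0, 4.0]
cat_features = [0]  # zero-based indices of the categorical columns


def train_with_categories():
    """Fit a regressor on a list-based feature matrix (requires catboost)."""
    from catboost import CatBoostRegressor

    model = CatBoostRegressor(iterations=10, verbose=False)
    # X is a plain two-dimensional list, so cat_features is given here
    # rather than through a catboost.Pool.
    model.fit(X, y, cat_features=cat_features)
    return model
```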
Possible types
- list
- numpy.ndarray
Default value
None (all features are either considered numerical or of other types if specified precisely)
Supported processing units
CPU and GPU
text_features
Description
A one-dimensional array of text columns indices (specified as integers) or names (specified as strings).
Use only if the X parameter is a two-dimensional feature matrix (has one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series).
If any elements in this array are specified as names instead of indices, names for all columns must be provided. To do this, either use the feature_names parameter of this constructor to explicitly specify them or pass a pandas.DataFrame with column names specified in the X parameter.
Possible types
- list
- numpy.ndarray
Default value
None (all features are either considered numerical or of other types if specified precisely)
Supported processing units
CPU and GPU
embedding_features
Description
A one-dimensional array of embedding columns indices (specified as integers) or names (specified as strings).
Use only if the X parameter is a two-dimensional feature matrix (has one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series).
If any elements in this array are specified as names instead of indices, names for all columns must be provided. To do this, either use the feature_names parameter of this constructor to explicitly specify them or pass a pandas.DataFrame with column names specified in the X parameter.
Possible types
- list
- numpy.ndarray
Default value
None (all features are either considered numerical or of other types if specified precisely)
Supported processing units
CPU and GPU
pairs
Description
The pairs description in the form of a two-dimensional matrix of shape N by 2, where N is the number of pairs:
- The first element of the pair is the zero-based index of the winner object from the input dataset for pairwise comparison.
- The second element of the pair is the zero-based index of the loser object from the input dataset for pairwise comparison.
This information is used for calculation and optimization of Pairwise metrics.
Possible types
- list
- numpy.ndarray
- pandas.DataFrame
Default value
None
Pairwise metrics require pairs data. If this data is not provided explicitly by specifying this parameter, pairs are generated automatically in each group using object label values.
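To make the N by 2 layout concrete, the sketch below builds [winner, loser] pairs inside each group from label values. The rule used here (the object with the higher label wins, ties produce no pair) is an assumption for illustration; the source does not spell out how CatBoost generates pairs automatically, and generate_pairs is a hypothetical helper.

```python
from itertools import combinations


def generate_pairs(labels, group_ids):
    """Return [winner, loser] index pairs for objects in the same group."""
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        # Skip objects from different groups and objects with equal labels.
        if group_ids[i] != group_ids[j] or labels[i] == labels[j]:
            continue
        # Assumed rule: the object with the higher label is the winner.
        winner, loser = (i, j) if labels[i] > labels[j] else (j, i)
        pairs.append([winner, loser])
    return pairs


labels = [3, 1, 2, 5, 4]
group_ids = [0, 0, 0, 1, 1]
pairs = generate_pairs(labels, group_ids)
```

The resulting list has one row per pair and two zero-based indices per row, which is the shape the pairs parameter expects.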
Supported processing units
CPU and GPU
graph
Description
The graph description in the form of a two-dimensional matrix of shape N by 2, where N is the number of edges:
- The first element of the edge is the zero-based index of the start vertex (object) from the input dataset.
- The second element of the edge is the zero-based index of the end vertex (object) from the input dataset.
Graph information is used to calculate the graph aggregated features.
Possible types
- list
- numpy.ndarray
- pandas.DataFrame
Default value
None
Supported processing units
CPU and GPU
sample_weight
Description
The weight of each object in the input data in the form of a one-dimensional array-like data.
By default, it is set to 1 for all objects.
Possible types
- list
- numpy.ndarray
- pandas.DataFrame
- pandas.Series
Default value
None
Supported processing units
CPU and GPU
group_id
Description
Group identifiers for all input objects. Supported identifier types are:
- int
- string types (string or unicode for Python 2 and bytes or string for Python 3).
Possible types
- list
- numpy.ndarray
Default value
None
Supported processing units
CPU
group_weight
Description
The weights of all objects within the defined groups from the input data in the form of one-dimensional array-like data.
Used for calculating the final values of trees. By default, it is set to 1 for all objects in all groups.
Alert
Only one of the following parameters can be used at a time:
weight
group_weight
Possible types
- list
- numpy.ndarray
Default value
None
Supported processing units
CPU
subgroup_id
Description
Subgroup identifiers for all input objects. Supported identifier types are:
- int
- string types (string or unicode for Python 2 and bytes or string for Python 3).
Possible types
- list
- numpy.ndarray
Default value
None
Supported processing units
CPU
pairs_weight
Description
The weight of each input pair of objects in the form of one-dimensional array-like data. The number of given values must match the number of specified pairs.
This information is used for calculation and optimization of Pairwise metrics.
By default, it is set to 1 for all pairs.
Possible types
- list
- numpy.ndarray
Default value
None
Supported processing units
CPU and GPU
baseline
Description
Array of formula values for all input objects. The training starts from these values for all input objects instead of starting from zero.
Note
Do not use this parameter if the input training dataset (specified in the X
parameter) type is catboost.Pool.
Possible types
- list
- numpy.ndarray
Default value
None
Supported processing units
CPU and GPU
use_best_model
Description
If this parameter is set, the number of trees that are saved in the resulting model is defined as follows:
- Build the number of trees defined by the training parameters.
- Use the validation dataset to identify the iteration with the optimal value of the metric specified in the eval_metric parameter.
No trees are saved after this iteration.
This option requires a validation dataset to be provided.
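The selection step can be sketched in plain Python. The sketch assumes one tree per iteration, a metric where lower is better by default, and that ties resolve to the earliest iteration; these are simplifications for illustration, and trees_to_keep is a hypothetical helper.

```python
def trees_to_keep(validation_metric, lower_is_better=True):
    """Return how many trees remain when trimming to the best iteration."""
    best = min(validation_metric) if lower_is_better else max(validation_metric)
    # list.index returns the first occurrence, so ties pick the earliest one.
    best_iteration = validation_metric.index(best)
    # Iterations are zero-based, so the tree count is the index plus one.
    return best_iteration + 1


metric_by_iteration = [0.9, 0.7, 0.6, 0.65, 0.7]
kept = trees_to_keep(metric_by_iteration)
```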
Possible types
bool
Default value
True if validation sets are specified (the eval_set
parameter is defined) and at least one of the label values of objects in the last validation dataset differs from the others. False otherwise.
Supported processing units
CPU and GPU
eval_set
Description
The validation dataset or datasets used for the following processes:
- overfitting detector
- best iteration selection
- monitoring metrics' changes
Possible types
- catboost.Pool
- list of catboost.Pool
- tuple (X, y)
- list of tuples (X, y)
- string (path to the dataset file)
- list of strings (paths to dataset files)
Default value
None
Supported processing units
CPU and GPU
Note
GPU training does not support multiple validation datasets for now
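A minimal sketch of passing a single validation dataset as an (X, y) tuple, assuming the catboost package is installed. The toy data and the helper name train_with_validation are illustrative.

```python
X_train = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0], [4.0, 7.0]]
y_train = [0, 0, 1, 1]
X_val = [[1.5, 4.5], [3.5, 6.5]]
y_val = [0, 1]


def train_with_validation():
    """Fit with one validation set given as a tuple (requires catboost)."""
    from catboost import CatBoostClassifier

    model = CatBoostClassifier(iterations=20, verbose=False)
    # The validation labels contain two different values, which
    # use_best_model needs in order to be meaningful.
    model.fit(X_train, y_train, eval_set=(X_val, y_val), use_best_model=True)
    return model
```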
verbose
Alias: verbose_eval
Description
The purpose of this parameter depends on the type of the given value:
- bool — Defines the logging level:
  - True corresponds to the Verbose logging level
  - False corresponds to the Silent logging level
- int — Use the Verbose logging level and set the logging period to the value of this parameter.
Alert
Do not use this parameter with the logging_level
parameter.
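The type-dependent behavior can be sketched as a small dispatcher. It follows only the description above and is not CatBoost's own code; interpret_verbose is a hypothetical helper, and returning None for the period means the period is not set by a bool value.

```python
def interpret_verbose(verbose):
    """Return (logging_level, logging_period) for a bool or int value."""
    # bool must be checked first because bool is a subclass of int in Python.
    if isinstance(verbose, bool):
        return ("Verbose" if verbose else "Silent", None)
    if isinstance(verbose, int):
        return ("Verbose", verbose)
    raise TypeError("verbose must be a bool or an int")


every_50 = interpret_verbose(50)
silent = interpret_verbose(False)
```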
Possible types
- bool
- int
Default value
1
Supported processing units
CPU and GPU
logging_level
Description
The logging level to output to stdout.
Possible values:
- Silent — Do not output any logging information to stdout.
- Verbose — Output the following data to stdout:
  - optimized metric
  - elapsed time of training
  - remaining time of training
- Info — Output additional information and the number of trees.
- Debug — Output debugging information.
Possible types
string
Default value
None (corresponds to the Verbose logging level)
Supported processing units
CPU and GPU
plot
Description
Plot the following information during training:
- the metric values;
- the custom loss values;
- the loss function change during feature selection;
- the time that has passed since training started;
- the remaining time until the end of training.
This option can be used if training is performed in Jupyter notebook.
Possible types
bool
Default value
False
Supported processing units
CPU
plot_file
Description
Save a plot with the training progress information (metric values, custom loss values) to the file specified by this parameter.
Possible types
File-like object or string
Default value
None
Supported processing units
CPU and GPU
column_description
Description
The path to the input file that contains the columns description.
The given file is used to build pools from the train and/or validation datasets, which are input from files.
Possible types
string
Default value
None
Supported processing units
CPU and GPU
metric_period
Description
The frequency of iterations to calculate the values of objectives and metrics. The value should be a positive integer.
The usage of this parameter speeds up the training.
Note
It is recommended to increase the value of this parameter to maintain training speed if a GPU processing unit type is used.
Possible types
int
Default value
1
Supported processing units
CPU and GPU
silent
Description
Defines the logging level:
- True corresponds to the Silent logging level
- False corresponds to the Verbose logging level
Possible types
bool
Default value
False
Supported processing units
CPU and GPU
early_stopping_rounds
Description
Sets the overfitting detector type to Iter and stops the training after the specified number of iterations since the iteration with the optimal metric value.
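The stopping rule can be sketched in plain Python. The sketch assumes a metric where lower is better and stops once the given number of iterations has passed since the best one; the exact boundary condition inside CatBoost may differ, and stopping_iteration is a hypothetical helper.

```python
def stopping_iteration(validation_metric, early_stopping_rounds):
    """Return the zero-based iteration at which training would stop."""
    best_value = float("inf")
    best_iteration = -1
    for iteration, value in enumerate(validation_metric):
        if value < best_value:
            # New optimum: remember it and keep training.
            best_value, best_iteration = value, iteration
        elif iteration - best_iteration >= early_stopping_rounds:
            # No improvement for early_stopping_rounds iterations.
            return iteration
    # The detector never fired, so training runs to the last iteration.
    return len(validation_metric) - 1


metric = [0.9, 0.7, 0.6, 0.65, 0.7, 0.8, 0.9]
stopped_at = stopping_iteration(metric, early_stopping_rounds=2)
```

Here the best value is at iteration 2, so with two rounds of patience the sketch stops at iteration 4.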
Possible types
int
Default value
False
Supported processing units
CPU and GPU
save_snapshot
Description
Enable snapshotting for restoring the training progress after an interruption. If enabled, the default period for making snapshots is 600 seconds. Use the snapshot_interval
parameter to change this period.
Note
This parameter is not supported in the params
parameter of the cv function.
Possible types
bool
Default value
None
Supported processing units
CPU and GPU
snapshot_file
Description
The name of the file to save the training progress information in. This file is used for recovering training after an interruption.
Depending on whether the specified file exists in the file system:
- Missing — Write information about training progress to the specified file.
- Exists — Load data from the specified file and continue training from where it left off.
Note
This parameter is not supported in the params
parameter of the cv function.
Possible types
string
Default value
experiment.cbsnapshot
Supported processing units
CPU and GPU
snapshot_interval
Description
The interval between saving snapshots in seconds.
The first snapshot is taken after the specified number of seconds since the start of training. Every subsequent snapshot is taken after the specified number of seconds since the previous one. The last snapshot is taken at the end of the training.
Note
This parameter is not supported in the params
parameter of the cv function.
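The schedule described above can be sketched as a list of timestamps. This is an idealized illustration that ignores when individual iterations finish; snapshot_times is a hypothetical helper, and the default of 600 seconds is taken from this page.

```python
def snapshot_times(training_seconds, snapshot_interval=600):
    """Return the seconds since training start at which snapshots are taken."""
    # One snapshot every snapshot_interval seconds while training is running.
    times = list(range(snapshot_interval, training_seconds, snapshot_interval))
    # The last snapshot is taken at the end of the training.
    times.append(training_seconds)
    return times


schedule = snapshot_times(1500)
```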
Possible types
int
Default value
600
Supported processing units
CPU and GPU
init_model
Description
The description is different for each group of possible types.
The model to continue learning from.
Note
The initial model must have the same problem type as the one being solved in the current training (binary classification, multiclassification or regression/ranking).
Possible types
catboost.CatBoost, catboost.CatBoostClassifier, catboost.CatBoostRegressor
The initial model object.
string
The path to the input file that contains the initial model.
Default value
None (incremental learning is not used)
Supported processing units
CPU
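A minimal sketch of continuing training from an existing model object, assuming the catboost package is installed. Both models solve the same problem type (regression), as the note for init_model requires; the toy data and the helper name continue_training are illustrative.

```python
X = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0], [4.0, 7.0]]
y = [1.0, 2.0, 3.0, 4.0]


def continue_training():
    """Train a model, then continue from it (requires catboost)."""
    from catboost import CatBoostRegressor

    first = CatBoostRegressor(iterations=10, verbose=False)
    first.fit(X, y)

    # The second model starts from the first one instead of from scratch.
    second = CatBoostRegressor(iterations=5, verbose=False)
    second.fit(X, y, init_model=first)
    return second
```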
log_cout
Output stream or callback for logging.
Possible types
- callable Python object
- Python object providing the write() method
Default value
sys.stdout
log_cerr
Error stream or callback for logging.
Possible types
- callable Python object
- Python object providing the write() method
Default value
sys.stderr
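As an illustration of the second possible type, the sketch below defines an object with a write() method that collects log text in memory. ListLogger and train_with_logger are hypothetical names, and the training helper assumes the catboost package is installed.

```python
import sys


class ListLogger:
    """Collects log text in a list through a file-like write() method."""

    def __init__(self):
        self.messages = []

    def write(self, text):
        self.messages.append(text)


def train_with_logger(X, y):
    """Fit a small model and return it with the captured log text."""
    from catboost import CatBoostRegressor

    logger = ListLogger()
    model = CatBoostRegressor(iterations=5)
    # Regular log output goes to the logger; errors still go to stderr.
    model.fit(X, y, log_cout=logger, log_cerr=sys.stderr)
    return model, logger.messages
```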