grid_search
A simple grid search over specified parameter values for a model.
Note
After searching, the model is trained and ready to use.
Method call format
grid_search(param_grid,
X,
y=None,
cv=3,
partition_random_seed=0,
calc_cv_statistics=True,
search_by_train_test_split=True,
refit=True,
shuffle=True,
stratified=None,
train_size=0.8,
verbose=True,
plot=False,
log_cout=sys.stdout,
log_cerr=sys.stderr)
Parameters
param_grid
Description
TDictionary with parameters names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored.
This enables searching over any sequence of parameter settings.
Possible types
- dict
- list
Default value
Required parameter
X
Description
The description is different for each group of possible types.
Possible types
catboost.Pool
The input training dataset.
Note
If a nontrivial value of the cat_features parameter is specified in the constructor of this class, CatBoost checks the equivalence of categorical features indices specification from the constructor parameters and in this Pool class.
numpy.ndarray, pandas.DataFrame
The input training dataset in the form of a two-dimensional feature matrix.
pandas.SparseDataFrame, scipy.sparse.spmatrix (all subclasses except dia_matrix)
The input training dataset in the form of a two-dimensional sparse feature matrix.
Default value Required parameter
y
Description
The target variables (in other words, the objects' label values) for the training dataset.
Must be in the form of a one- or two- dimensional array. The type of data in the array depends on the machine learning task being solved:
-
Regression and ranking — One-dimensional array of numeric values.
-
Multiregression - Two-dimensional array of numeric values. The first index is for a dimension, the second index is for an object.
-
Binary classification
One-dimensional array containing one of:-
Booleans, integers or strings that represent the labels of the classes (only two unique values).
-
Numeric values.
The interpretation of numeric values depends on the selected loss function:- Logloss — The value is considered a positive class if it is strictly greater than the value of the
target_border
training parameter. Otherwise, it is considered a negative class. - CrossEntropy — The value is interpreted as the probability that the dataset object belongs to the positive class. Possible values are in the range
[0; 1]
.
- Logloss — The value is considered a positive class if it is strictly greater than the value of the
-
-
Multiclassification — One-dimensional array of integers or strings that represent the labels of the classes.
-
Multi label classification
Two-dimensional array. The first index is for a label/class, the second index is for an object.Possible values depend on the selected loss function:
- MultiLogloss — Only {0, 1} or {False, True} values are allowed that specify whether an object belongs to the class corresponding to the first index.
- MultiCrossEntropy — Numerical values in the range
[0; 1]
that are interpreted as the probability that the dataset object belongs to the class corresponding to the first index.
Note
Do not use this parameter if the input training dataset (specified in the X parameter) type is catboost.Pool.
Possible types
- list
- numpy.ndarray
- pandas.DataFrame
- pandas.Series
Default value
None
Supported processing units
CPU and GPU
cv
Description
The cross-validation splitting strategy.
The interpretation of this parameter depends on the input data type:
- None — Use the default three-fold cross-validation.
- int — The number of folds in a (Stratified)KFold
- object — One of the scikit-learn Splitter Classes with the split method.
- An iterable yielding train and test splits as arrays of indices.
Possible types
- int
- scikit-learn splitter object
- cross-validation generator
- iterable
Default value
None
partition_random_seed
Description
Use this as the seed value for random permutation of the data.
The permutation is performed before splitting the data for cross-validation.
Each seed generates unique data splits.
Possible types
int
Default value
0
calc_cv_statistics
Description
Estimate the quality by using cross-validation with the best of the found parameters. The model is fitted using these parameters.
This option can be enabled if the search_by_train_test_split
parameter is set to True.
Possible types
bool
Default value
True
search_by_train_test_split
Description
Split the source dataset into train and test parts. Models are trained on the train part, while parameters are compared by the loss function score on the test dataset.
It is recommended to enable this option for large datasets and disable it for the small ones.
Possible types
bool
Default value True
refit
Description
Refit an estimator using the best-found parameters on the whole dataset.
Possible types
bool
Default value
True
shuffle
Description
Shuffle the dataset objects before splitting into folds.
Possible types
bool
Default value
True
stratified
Description
Perform stratified sampling. True for classification and False otherwise.
Possible types
bool
Default value
None
train_size
Description
The proportion of the dataset to include in the train split.
Possible values are in the range [0;1].
Possible types
float
Default value
0.8
verbose
Description
The purpose of this parameter depends on the type of the given value:
- int — The frequency of iterations to print the information to stdout.
- bool — Print the information to stdout on every iteration (if set to “True”) or disable any logging (if set to “False”).
Possible types
- bool
- int
Default value
True
plot
Description
Draw train and evaluation metrics for every set of parameters in Jupyter Jupyter Notebook.
Possible types
bool
Default value
False
log_cout
Output stream or callback for logging.
Possible types
- callable Python object
- python object providing the
write()
method
Default value
sys.stdout
log_cerr
Error stream or callback for logging.
Possible types
- callable Python object
- python object providing the
write()
method
Default value
sys.stderr
Return value
Dict with two fields:
params
—dict
of best-found parameters.cv_results
—dict
or pandas.core.frame.DataFrame with cross-validation results. Сolumns are:test-error-mean
,test-error-std
,train-error-mean
,train-error-std
.
Examples
from catboost import CatBoost
import numpy as np
train_data = np.random.randint(1, 100, size=(100, 10))
train_labels = np.random.randint(2, size=(100))
model = CatBoost()
grid = {'learning_rate': [0.03, 0.1],
'depth': [4, 6, 10],
'l2_leaf_reg': [1, 3, 5, 7, 9]}
grid_search_result = model.grid_search(grid,
X=train_data,
y=train_labels,
plot=True)
The following is a chart plotted with Jupyter Notebook for the given example.