grid_search

A simple grid search over specified parameter values for a model.

Note

After searching, the model is trained and ready to use.

Method call format

grid_search(param_grid,
            X,
            y=None,
            cv=3,
            partition_random_seed=0,
            calc_cv_statistics=True,
            search_by_train_test_split=True,
            refit=True,
            shuffle=True,
            stratified=None,
            train_size=0.8,
            verbose=True,
            plot=False,
            log_cout=sys.stdout,
            log_cerr=sys.stderr)

Parameters

param_grid

Description

Dictionary with parameters names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored.

This enables searching over any sequence of parameter settings.

Possible types

  • dict
  • list

Default value

Required parameter

X

Description

The description is different for each group of possible types.

Possible types

catboost.Pool

The input training dataset.

Note

If a nontrivial value of the cat_features parameter is specified in the constructor of this class, CatBoost checks the equivalence of categorical features indices specification from the constructor parameters and in this Pool class.

numpy.array, pandas.DataFrame

The input training dataset in the form of a two-dimensional feature matrix.

pandas.SparseDataFrame, scipy.sparse.spmatrix (all subclasses except dia_matrix)

The input training dataset in the form of a two-dimensional sparse feature matrix.

Possible types

The input training dataset in the form of a two-dimensional sparse feature matrix.

Default value

Required parameter

y

Description

The target variables (in other words, the objects' label values) for the training dataset.

Must be in the form of a one-dimensional array. The type of data in the array depends on the machine learning task being solved:

  • Regression , multiregression and ranking  — Numeric values.

  • Binary classification — Numeric values.

    The interpretation of numeric values depends on the selected loss function:

    • Logloss — The value is considered a positive class if it is strictly greater than the value of the `` parameter of the loss function. Otherwise, it is considered a negative class.
    • CrossEntropy — The value is interpreted as the probability that the dataset object belongs to the positive class. Possible values are in the range [0; 1].
  • Multiclassification — Integers or strings that represents the labels of the classes.

Note

Do not use this parameter if the input training dataset (specified in the X parameter) type is catboost.Pool.

Possible types

  • numpy.array
  • pandas.Series

Default value

None

cv

Description

The cross-validation splitting strategy.

The interpretation of this parameter depends on the input data type:

  • None¬†вАФ Use the default three-fold cross-validation.

  • int¬†вАФ The number of folds in a (Stratified)KFold

  • object — One of the scikit-learn Splitter Classes with the split method.

  • An iterable yielding train and test splits as arrays of indices.

Possible types

  • int
  • scikit-learn splitter object
  • cross-validation generator
  • iterable

Default value

None

partition_random_seed

Description

Use this as the seed value for random permutation of the data.

The permutation is performed before splitting the data for cross-validation.

Each seed generates unique data splits.

Possible types

int

Default value

0

calc_cv_statistics

Description

Estimate the quality by using cross-validation with the best of the found parameters. The model is fitted using these parameters.

This option can be enabled if the search_by_train_test_split parameter is set to True.

Possible types

bool

Default value

True

search_by_train_test_split

Description

Split the source dataset into train and test parts. Models are trained on the train part, while parameters are compared by the loss function score on the test dataset.

It is recommended to enable this option for large datasets and disable it for the small ones.

Possible types

bool

Default value

True

refit

Description

Refit an estimator using the best-found parameters on the whole dataset.

Possible types

bool

Default value

True

shuffle

Description

Shuffle the dataset objects before splitting into folds.

Possible types

bool

Default value

True

stratified

Description

Perform stratified sampling. True for classification and False otherwise.

Possible types

bool

Default value

None

train_size

Description

The proportion of the dataset to include in the train split.

Possible values are in the range [0;1].

Possible types

float

Default value

0.8

verbose

Description

The purpose of this parameter depends on the type of the given value:

  • int¬†вАФ The frequency of iterations to print the information to stdout.
  • bool¬†вАФ Print the information to stdout on every iteration (if set to True) or disable any logging (if set to False).

Possible types

  • bool
  • int

Default value

True

plot

Description

Draw train and evaluation metrics for every set of parameters in Jupyter Jupyter Notebook.

Possible types

bool

Default value

False

log_cout

Output stream or callback for logging.

Possible types

  • callable Python object
  • python object providing the write() method

Default value

sys.stdout

log_cerr

Error stream or callback for logging.

Possible types

  • callable Python object
  • python object providing the write() method

Default value

sys.stderr

Return value

Dict with two fields:

  • paramsdict of best-found parameters.
  • cv_resultsdict or pandas.core.frame.DataFrame with cross-validation results. Сolumns are: test-error-mean, test-error-std, train-error-mean, train-error-std.

Examples

from catboost import CatBoostRegressor
import numpy as np

train_data = np.random.randint(1, 100, size=(100, 10))
train_labels = np.random.randint(2, size=(100))

model = CatBoostRegressor()

grid = {'learning_rate': [0.03, 0.1],
        'depth': [4, 6, 10],
        'l2_leaf_reg': [1, 3, 5, 7, 9]}

grid_search_result = model.grid_search(grid,
                                       X=train_data,
                                       y=train_labels,
                                       plot=True)

The following is a chart plotted with Jupyter Notebook for the given example.