randomized_search

A simple randomized search on hyperparameters.

In contrast to grid search, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is specified in the n_iter parameter.
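The difference in cost can be illustrated with a small standard-library sketch. It is not CatBoost's sampling code: the seed, the use of `random.choice`, and sampling with replacement are assumptions made only for illustration.

```python
import itertools
import random

# Candidate values per parameter, in the same shape as a dict of lists.
grid = {
    "learning_rate": [0.03, 0.1],
    "depth": [4, 6, 10],
    "l2_leaf_reg": [1, 3, 5, 7, 9],
}

# Grid search would evaluate every combination of values.
all_settings = [
    dict(zip(grid, values)) for values in itertools.product(*grid.values())
]

# Randomized search evaluates only n_iter sampled settings.
n_iter = 10
rng = random.Random(0)  # fixed seed (assumption) so the sketch is repeatable
sampled_settings = [
    {name: rng.choice(values) for name, values in grid.items()}
    for _ in range(n_iter)
]

print(len(all_settings))      # 30 combinations in the full grid
print(len(sampled_settings))  # 10 sampled settings
```

With these lists the full grid has 2 x 3 x 5 = 30 settings, while the randomized search trains only n_iter = 10 of them.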

Note

After searching, the model is trained and ready to use.

Method call format

randomized_search(param_distributions,
                  X,
                  y=None,
                  cv=3,
                  n_iter=10,
                  partition_random_seed=0,
                  calc_cv_statistics=True,
                  search_by_train_test_split=True,
                  refit=True,
                  shuffle=True,
                  stratified=None,
                  train_size=0.8,
                  verbose=True,
                  plot=False,
                  log_cout=sys.stdout,
                  log_cerr=sys.stderr)

Parameters

param_distributions

Description

Dictionary with parameter names (strings) as keys and distributions or lists of parameter settings to try as values. Distributions must provide an rvs method for sampling (such as those from scipy.stats.distributions).

If a list is given, it is sampled uniformly.
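A minimal sketch of the two kinds of values follows. The `LogUniform` class and the `sample_setting` helper are hypothetical and exist only for this illustration; in practice a frozen distribution such as `scipy.stats.uniform(0.01, 0.3)` is the usual choice. How CatBoost calls `rvs` internally is not stated here, so the class is not meant to be passed to the library as is.

```python
import random


class LogUniform:
    """Hypothetical distribution: returns 10**u with u uniform in [low_exp, high_exp]."""

    def __init__(self, low_exp, high_exp, seed=None):
        self._low = low_exp
        self._high = high_exp
        self._rng = random.Random(seed)

    def rvs(self):
        # Draw one sample; the rvs name mirrors the scipy.stats convention.
        return 10 ** self._rng.uniform(self._low, self._high)


param_distributions = {
    "learning_rate": LogUniform(-3, -1, seed=0),  # object with an rvs method
    "depth": [4, 6, 10],                          # list, sampled uniformly
}


def sample_setting(distributions, rng):
    """Draw one parameter setting: rvs() for distributions, uniform choice for lists."""
    setting = {}
    for name, spec in distributions.items():
        if hasattr(spec, "rvs"):
            setting[name] = spec.rvs()
        else:
            setting[name] = rng.choice(list(spec))
    return setting


rng = random.Random(0)
settings = [sample_setting(param_distributions, rng) for _ in range(5)]
for setting in settings:
    print(setting)
```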

Possible types

dict

Default value

Required parameter

X

Description

The description is different for each group of possible types.

Possible types

catboost.Pool

The input training dataset.

Note

If a nontrivial value of the cat_features parameter is specified in the constructor of this class, CatBoost checks that the categorical feature indices specified in the constructor match those specified in this Pool object.

numpy.ndarray, pandas.DataFrame

The input training dataset in the form of a two-dimensional feature matrix.

pandas.SparseDataFrame, scipy.sparse.spmatrix (all subclasses except dia_matrix)

The input training dataset in the form of a two-dimensional sparse feature matrix.

Default value

Required parameter

y

Description

The target variables (in other words, the objects' label values) for the training dataset.

Must be in the form of a one- or two-dimensional array. The type of data in the array depends on the machine learning task being solved:

Regression and ranking: One-dimensional array of numeric values.
Multiregression: Two-dimensional array of numeric values. The first index is for a dimension, the second index is for an object.
Binary classification: One-dimensional array containing one of:
- Booleans, integers or strings that represent the labels of the classes (only two unique values).
- Numeric values.
  The interpretation of numeric values depends on the selected loss function:
  - Logloss: The value is considered a positive class if it is strictly greater than the value of the target_border training parameter. Otherwise, it is considered a negative class.
  - CrossEntropy: The value is interpreted as the probability that the dataset object belongs to the positive class. Possible values are in the range [0; 1].
Multiclassification: One-dimensional array of integers or strings that represent the labels of the classes.
Multi-label classification: Two-dimensional array. The first index is for a label/class, the second index is for an object.
  Possible values depend on the selected loss function:
  - MultiLogloss: Only the values {0, 1} or {False, True} are allowed. They specify whether an object belongs to the class corresponding to the first index.
  - MultiCrossEntropy: Numerical values in the range [0; 1] that are interpreted as the probability that the dataset object belongs to the class corresponding to the first index.
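The Logloss rule for numeric targets described above can be written out as a short sketch. The `binarize_for_logloss` helper is hypothetical and only restates the rule; CatBoost applies it internally, and the border value used here is an arbitrary example.

```python
def binarize_for_logloss(values, target_border):
    """Map numeric targets to classes: positive only if strictly greater than the border."""
    return [1 if value > target_border else 0 for value in values]


labels = binarize_for_logloss([0.2, 0.5, 0.7, 1.0], target_border=0.5)
print(labels)  # [0, 0, 1, 1]
```

Note that a value equal to the border (0.5 here) falls into the negative class, because the comparison is strict.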

Note

Do not use this parameter if the input training dataset (specified in the X parameter) type is catboost.Pool.

Possible types

list
numpy.ndarray
pandas.DataFrame
pandas.Series

Default value

None

Supported processing units

CPU and GPU

cv

Description

The cross-validation splitting strategy.
The interpretation of this parameter depends on the input data type:

None: Use the default three-fold cross-validation.
int: The number of folds in a (Stratified)KFold.
object: One of the scikit-learn Splitter Classes with the split method.
iterable: An iterable yielding train and test splits as arrays of indices.
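The iterable form can be sketched with a generator that yields one (train indices, test indices) pair per fold. The `contiguous_folds` helper is hypothetical, and plain Python lists stand in for index arrays; a scikit-learn splitter object would normally produce such pairs through its split method.

```python
def contiguous_folds(n_objects, n_folds):
    """Yield (train_indices, test_indices) pairs, one per contiguous fold."""
    indices = list(range(n_objects))
    fold_size, remainder = divmod(n_objects, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `remainder` folds receive one extra object.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx
        start = stop


splits = list(contiguous_folds(n_objects=6, n_folds=3))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```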

Possible types

int
scikit-learn splitter object
cross-validation generator
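iterable

Default value

3

n_iter

Description

The number of parameter settings sampled from param_distributions.

Possible types

int

Default value

10

partition_random_seed

Description

The seed for the random permutation of the data, which is performed before the data is split for cross-validation.

Possible types

int

Default value

0

calc_cv_statistics

Description

Estimate the quality by using cross-validation with the best of the found parameters.

Possible types

bool

Default value

True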

search_by_train_test_split

Description

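Split the source dataset into train and test parts. Models are trained on the train part, while parameters are compared by the loss function score on the test part.

Possible types

bool

Default value

True

refit

Description

Refit an estimator using the best-found parameters on the whole dataset.

Possible types

bool

Default value

True

shuffle

Description

Shuffle the dataset objects before splitting into folds.

Possible types

bool

Default value

True

stratified

Description

Perform stratified sampling.

Possible types

bool

Default value

None

train_size

Description

The proportion of the dataset to include in the train split. Should be between 0 and 1.

Possible types

float

Default value

0.8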
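As a rough illustration of the split controlled by search_by_train_test_split, train_size and shuffle, the sketch below divides object indices into two parts. The `train_test_indices` helper, its rounding rule, and its seed handling are assumptions and do not describe CatBoost's internal procedure.

```python
import random


def train_test_indices(n_objects, train_size=0.8, shuffle=True, seed=0):
    """Split object indices into a train part and a test part."""
    indices = list(range(n_objects))
    if shuffle:
        # Permute the indices before cutting them into two parts.
        random.Random(seed).shuffle(indices)
    n_train = int(n_objects * train_size)  # rounding down is an assumption
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_indices(10)
print(len(train_idx), len(test_idx))  # 8 2
```

Each sampled parameter setting would be trained on the train part and scored on the test part, so all settings are compared on the same held-out objects.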

verbose

Description

The purpose of this parameter depends on the type of the given value:

int — The frequency of iterations to print the information to stdout.
bool — Print the information to stdout on every iteration (if set to “True”) or disable any logging (if set to “False”).

Possible types

bool
int

Default value

True

plot

Description

Draw train and evaluation metrics for every set of parameters in Jupyter Notebook.

Possible types

bool

Default value

False

log_cout

Output stream or callback for logging.

Possible types

callable Python object
Python object providing the write() method

Default value

sys.stdout

log_cerr

Error stream or callback for logging.

Possible types

callable Python object
Python object providing the write() method

Default value

sys.stderr
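The two accepted kinds of logging targets can be sketched as follows. The `collect` function and the `buffer` object are hypothetical names, the messages are written by hand here rather than by CatBoost, and the commented call assumes the variables from the Examples section.

```python
import io

# Option 1: a callable that receives each log message.
captured_lines = []


def collect(message):
    captured_lines.append(message)


# Option 2: any object providing a write() method, such as an in-memory text buffer.
buffer = io.StringIO()

# Either object could then be passed to the search, for example:
# model.randomized_search(grid, X=train_data, y=train_labels,
#                         log_cout=collect, log_cerr=buffer)

# Simulate one message arriving at each target.
collect("sample message\n")
buffer.write("sample message\n")

print(captured_lines)     # ['sample message\n']
print(buffer.getvalue())  # sample message
```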

Return value

Dict with two fields:

params: dict of best-found parameters.
cv_results: dict or pandas.core.frame.DataFrame with cross-validation results. Columns are: test-error-mean, test-error-std, train-error-mean, train-error-std.
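A short sketch of reading the returned dictionary is shown below. The `result` value is a made-up placeholder with the documented shape, not real CatBoost output, and it assumes that each entry in a column corresponds to one boosting iteration.

```python
# Made-up placeholder with the documented structure; the numbers are not real results.
result = {
    "params": {"depth": 6, "learning_rate": 0.1, "l2_leaf_reg": 3},
    "cv_results": {
        "test-error-mean": [0.90, 0.70, 0.75],
        "test-error-std": [0.05, 0.04, 0.06],
        "train-error-mean": [0.80, 0.55, 0.40],
        "train-error-std": [0.03, 0.02, 0.02],
    },
}

best_params = result["params"]
test_means = result["cv_results"]["test-error-mean"]

# Index of the entry with the lowest mean test error
# (assumption: one entry per boosting iteration).
best_index = min(range(len(test_means)), key=test_means.__getitem__)

print(best_params["depth"])  # 6
print(best_index)            # 1
```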

Examples

from catboost import CatBoost

train_data = [[1, 4, 5, 6],
              [4, 5, 6, 7],
              [30, 40, 50, 60],
              [20, 30, 70, 60],
              [10, 80, 40, 30],
              [10, 10, 20, 30]]
train_labels = [10, 20, 30, 15, 10, 25]
model = CatBoost()

grid = {'learning_rate': [0.03, 0.1],
        'depth': [4, 6, 10],
        'l2_leaf_reg': [1, 3, 5, 7, 9]}

randomized_search_result = model.randomized_search(grid,
                                                   X=train_data,
                                                   y=train_labels,
                                                   plot=True)

The following is a chart plotted with Jupyter Notebook for the given example.
