sample_gaussian_process

Purpose

Implements Gaussian process posterior sampling (Kernel Gradient Boosting, Algorithm 4) from the paper "Gradient Boosting Performs Gaussian Process Inference".

Produces samples from the GP posterior under the prior assumption $f \sim \mathcal{GP}(0, \sigma^2 \mathcal{K} + \delta^2 I)$.
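
To make the idea of sampling from a GP posterior concrete, here is a minimal NumPy sketch of exact posterior sampling for a tiny one-dimensional problem. It is an illustration only and not the CatBoost algorithm: the squared-exponential kernel stands in for $\mathcal{K}$ (which in CatBoost is induced by the tree ensemble), the function names are hypothetical, and it assumes that $\delta^2 I$ acts as observation noise on the training targets.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel; a stand-in for K, not CatBoost's kernel."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2)


def sample_exact_posterior(x_train, y_train, x_test, sigma=1.0, delta=0.1,
                           n_samples=5, seed=0):
    """Draw function values from the exact GP posterior at the points x_test."""
    rng = np.random.default_rng(seed)
    # Covariance of the noisy training targets: sigma^2 K + delta^2 I (assumed reading).
    k_train = sigma**2 * rbf_kernel(x_train, x_train) + delta**2 * np.eye(len(x_train))
    k_cross = sigma**2 * rbf_kernel(x_test, x_train)
    k_test = sigma**2 * rbf_kernel(x_test, x_test)

    # Standard GP conditioning formulas for the posterior mean and covariance.
    mean = k_cross @ np.linalg.solve(k_train, y_train)
    cov = k_test - k_cross @ np.linalg.solve(k_train, k_cross.T)
    # Symmetrize and add jitter so the sampler receives a valid covariance matrix.
    cov = 0.5 * (cov + cov.T) + 1e-9 * np.eye(len(x_test))
    draws = rng.multivariate_normal(mean, cov, size=n_samples)
    return mean, cov, draws


x_train = np.array([0.0, 1.0, 2.0])
y_train = np.array([0.0, 1.0, 0.0])
mean, cov, draws = sample_exact_posterior(x_train, y_train, x_train, delta=1e-3)
```

In this sketch, a smaller sigma shrinks the posterior covariance and a larger delta lets the posterior mean deviate further from noisy targets, which matches the roles the sigma and delta parameters are described as playing below.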

Method call format

sample_gaussian_process(X,
                        y,
                        eval_set=None,
                        cat_features=None,
                        text_features=None,
                        embedding_features=None,
                        random_seed=None,
                        samples=10,
                        posterior_iterations=900,
                        prior_iterations=100,
                        learning_rate=0.1,
                        depth=6,
                        sigma=0.1,
                        delta=0,
                        random_strength=0.1,
                        random_score_type='Gumbel',
                        eps=1e-4,
                        verbose=False)
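
A minimal call might look like the sketch below. It assumes that sample_gaussian_process can be imported from the top level of the catboost package; the helper names and the synthetic data are hypothetical, and the iteration counts are deliberately far below the defaults so the example runs quickly.

```python
import numpy as np


def make_toy_regression(n_objects=200, n_features=4, seed=0):
    """Build a small synthetic regression dataset (illustrative only)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_objects, n_features))
    # Noisy nonlinear target: a one-dimensional array of numerical labels.
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=n_objects)
    return X, y


def draw_posterior_models(X, y, n_models=3):
    """Call sample_gaussian_process with small, fast settings."""
    # Imported lazily so the data helper above works without CatBoost installed.
    # Assumption: the function is exposed at the top level of the package.
    from catboost import sample_gaussian_process

    return sample_gaussian_process(
        X,
        y,
        samples=n_models,          # number of models returned
        posterior_iterations=50,   # far below the default of 900, for speed
        prior_iterations=10,       # far below the default of 100, for speed
        random_seed=0,
        verbose=False,
    )


X, y = make_toy_regression()
```

Calling `models = draw_posterior_models(X, y)` should then return a list of n_models trained models, as described under Return value.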

Parameters

X

Description

Training data with features.
Must be non-empty (contain at least one object).

Possible types

list, numpy.ndarray, pandas.DataFrame, pandas.Series

Two-dimensional feature matrix.

pandas.SparseDataFrame, scipy.sparse.spmatrix (all subclasses except dia_matrix)

Two-dimensional sparse feature matrix.

catboost.FeaturesData

Special class for features data. See FeaturesData.

Default value

Required parameter

y

Description

Labels of the training data.
Must be a single-dimensional array with numerical values.

Possible types

list
numpy.ndarray
pandas.Series

Default value

Required parameter

eval_set

Description

The validation dataset or datasets used for the following processes in the posterior fitting:

overfitting detector
monitoring metrics' changes

Possible types

catboost.Pool
list of catboost.Pool
tuple (X, y)
list of tuples (X, y)
string (path to the dataset file)
list of strings (paths to dataset files)

Default value

None

cat_features

Description

A one-dimensional array of categorical feature column indices.

Possible types

list
numpy.ndarray

Default value

None (all features are treated as numerical unless another type is explicitly specified)

text_features

Description

A one-dimensional array of text feature column indices.

Possible types

list
numpy.ndarray

Default value

None (all features are treated as numerical unless another type is explicitly specified)

embedding_features

Description

A one-dimensional array of embedding feature column indices.

Possible types

list
numpy.ndarray

Default value

None (all features are treated as numerical unless another type is explicitly specified)

random_seed

Description

The random seed used for training.

Possible types

int

Default value

None

samples

Description

Number of Monte Carlo samples drawn from the GP posterior. Controls how many models the function returns.

Possible range is [1, +inf)

Possible types

int

Default value

10

posterior_iterations

Description

Maximum number of trees for the posterior sampling step.

Possible range is [1, +inf)

Possible types

int

Default value

900

prior_iterations

Description

Maximum number of trees for the prior sampling step.

Possible range is [1, +inf)

Possible types

int

Default value

100

learning_rate

Description

Step size shrinkage applied to each update to prevent overfitting.

Possible range is (0, 1]

Possible types

float

Default value

0.1

depth

Description

Depth of the trees in the models.

Possible range is [1, 16]

Possible types

int

Default value

6

sigma

Description

Scale of GP kernel (lower values lead to lower posterior variance).

Possible range is (0, +inf)

Possible types

float

Default value

0.1

delta

Description

Scale of the homogeneous noise of the GP kernel (adjust if the target is noisy).

Possible range is [0, +inf)

Possible types

float

Default value

0.0

random_strength

Description

Corresponds to the parameter beta in the paper. Higher values lead to faster convergence to the GP posterior.

Possible range is (0, +inf)

Possible types

float

Default value

0.1

random_score_type

Description

Type of random noise added to scores.
Possible values:

Gumbel - Gumbel-distributed (as in the paper)
NormalWithModelSizeDecrease - Normally-distributed with deviation decreasing with model iteration count (default in CatBoost)

Possible types

string

Default value

Gumbel

eps

Description

Technical parameter that controls the precision of prior estimation.

Possible range is (0, 1]

Possible types

float

Default value

1e-4

verbose

Description

Verbosity of the posterior model training output.
If verbose is a bool, True sets logging_level to Verbose and False sets it to Silent.
If verbose is an int, it sets the frequency of writing metrics to the output, and logging_level is set to Verbose.

Possible types

bool
int

Default value

False

Return value

List of trained CatBoostRegressor models. The length of the list equals the value of the samples parameter.
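
Because each returned model corresponds to one posterior sample, the spread of their predictions can serve as a Monte Carlo estimate of predictive uncertainty. The helper below is a hypothetical sketch of that idea: it relies only on each model exposing a predict method, and the constant stand-in models are there purely so the snippet runs without training anything.

```python
import numpy as np


def summarize_posterior(models, X):
    """Return per-object mean and standard deviation across sampled models."""
    # Shape (n_models, n_objects): one row of predictions per sampled model.
    preds = np.stack([np.asarray(m.predict(X), dtype=float) for m in models])
    return preds.mean(axis=0), preds.std(axis=0)


class _ConstantModel:
    """Stand-in for a trained model; only used to demonstrate the helper."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value)


X_demo = np.zeros((4, 2))
mean, std = summarize_posterior([_ConstantModel(1.0), _ConstantModel(3.0)], X_demo)
```

With the list returned by sample_gaussian_process passed in place of the stand-in models, mean would approximate the posterior mean prediction and std the posterior standard deviation for each object.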
