sample_gaussian_process

Purpose

Implements Gaussian process sampling (Kernel Gradient Boosting, Algorithm 4) from the paper "Gradient Boosting Performs Gaussian Process Inference".

Produces samples from the GP posterior under the prior assumption $f \sim \mathcal{GP}(0, \sigma^2 \mathcal{K} + \delta^2 I)$.

Method call format

sample_gaussian_process(X,
                        y,
                        eval_set=None,
                        cat_features=None,
                        text_features=None,
                        embedding_features=None,
                        random_seed=None,
                        samples=10,
                        posterior_iterations=900,
                        prior_iterations=100,
                        learning_rate=0.1,
                        depth=6,
                        sigma=0.1,
                        delta=0,
                        random_strength=0.1,
                        random_score_type='Gumbel',
                        eps=1e-4,
                        verbose=False)

Parameters

X

Description

Training data with features.
Must be non-empty (contain at least one object).

Possible types

list, numpy.ndarray, pandas.DataFrame, pandas.Series

Two-dimensional feature matrix.

pandas.SparseDataFrame, scipy.sparse.spmatrix (all subclasses except dia_matrix)

Two-dimensional sparse feature matrix.

catboost.FeaturesData

Special class for features data. See FeaturesData.

Default value

Required parameter

y

Description

Labels of the training data.
Must be a single-dimensional array with numerical values.

Possible types

  • list
  • numpy.ndarray
  • pandas.Series

Default value

Required parameter

eval_set

Description

The validation dataset or datasets used while fitting the posterior models.

Possible types

  • catboost.Pool
  • list of catboost.Pool
  • tuple (X, y)
  • list of tuples (X, y)
  • string (path to the dataset file)
  • list of strings (paths to dataset files)

Default value

None

cat_features

Description

A one-dimensional array of categorical feature column indices.

Possible types

  • list
  • numpy.ndarray

Default value

None (all features are treated as numerical unless another type is specified explicitly)

text_features

Description

A one-dimensional array of text feature column indices.

Possible types

  • list
  • numpy.ndarray

Default value

None (all features are treated as numerical unless another type is specified explicitly)

embedding_features

Description

A one-dimensional array of embedding feature column indices.

Possible types

  • list
  • numpy.ndarray

Default value

None (all features are treated as numerical unless another type is specified explicitly)

random_seed

Description

The random seed used for training.

Possible types

  • int

Default value

None

samples

Description

Number of Monte Carlo samples drawn from the GP posterior. Determines how many models the function returns.

Possible range is [1, +inf)

Possible types

  • int

Default value

10

posterior_iterations

Description

Maximum number of trees for the posterior sampling step.

Possible range is [1, +inf)

Possible types

  • int

Default value

900

prior_iterations

Description

Maximum number of trees for the prior sampling step.

Possible range is [1, +inf)

Possible types

  • int

Default value

100

learning_rate

Description

Step size shrinkage applied to each update to prevent overfitting.

Possible range is (0, 1]

Possible types

  • float

Default value

0.1

depth

Description

Depth of the trees in the models.

Possible range is [1, 16]

Possible types

  • int

Default value

6

sigma

Description

Scale of GP kernel (lower values lead to lower posterior variance).

Possible range is (0, +inf)

Possible types

  • float

Default value

0.1

delta

Description

Scale of the homogeneous noise of the GP kernel (adjust if the target is noisy).

Possible range is [0, +inf)

Possible types

  • float

Default value

0.0

random_strength

Description

Corresponds to parameter beta in the paper. Higher values lead to faster convergence to GP posterior.

Possible range is (0, +inf)

Possible types

  • float

Default value

0.1

random_score_type

Description

Type of random noise added to scores.
Possible values:

  • Gumbel - Gumbel-distributed (as in paper)
  • NormalWithModelSizeDecrease - Normally-distributed with deviation decreasing with model iteration count (default in CatBoost)

Possible types

  • string

Default value

Gumbel

eps

Description

Technical parameter that controls the precision of prior estimation.

Possible range is (0, 1]

Possible types

  • float

Default value

1e-4

verbose

Description

Verbosity of the posterior model training output.
If verbose is a bool, True sets logging_level to Verbose and False sets it to Silent.
If verbose is an int, it determines how often metrics are written to the output, and logging_level is set to Verbose.

Possible types

  • bool
  • int

Default value

False

Return value

A list of trained CatBoostRegressor models. Its length equals the value of the samples parameter.
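
Because the function returns one model per Monte Carlo sample, one way to use the result is to average the per-model predictions for a point estimate and take their spread as an uncertainty estimate. The sketch below illustrates that aggregation with NumPy. The helper `summarize_samples` and the stand-in class `_ShiftedLinear` are hypothetical names introduced for this example and are not part of CatBoost. The stand-in only mimics a `predict(X)` method so the block runs without training anything.

```python
import numpy as np


def summarize_samples(models, X):
    """Return the per-object mean and standard deviation of the predictions.

    `models` is any sequence of objects exposing predict(X), such as the
    list returned by sample_gaussian_process.
    """
    preds = np.stack([np.asarray(m.predict(X), dtype=float) for m in models])
    return preds.mean(axis=0), preds.std(axis=0)


class _ShiftedLinear:
    """Hypothetical stand-in for a trained model (not a CatBoost class)."""

    def __init__(self, shift):
        self.shift = shift

    def predict(self, X):
        # Sum of the features plus a constant offset per "sample".
        return np.asarray(X, dtype=float).sum(axis=1) + self.shift


models = [_ShiftedLinear(s) for s in (-1.0, 0.0, 1.0)]
X_new = np.array([[1.0, 2.0], [3.0, 4.0]])
mean, std = summarize_samples(models, X_new)
```

With real output, `models` would be replaced by the list returned from sample_gaussian_process. The standard deviation here is the population form (NumPy's default, `ddof=0`).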