sample_gaussian_process
Purpose
Implementation of Gaussian process sampling (Kernel Gradient Boosting, Algorithm 4) from the paper "Gradient Boosting Performs Gaussian Process Inference".
Produces samples from posterior GP with prior assumption
Method call format
sample_gaussian_process(X,
                        y,
                        eval_set=None,
                        cat_features=None,
                        text_features=None,
                        embedding_features=None,
                        random_seed=None,
                        samples=10,
                        posterior_iterations=900,
                        prior_iterations=100,
                        learning_rate=0.1,
                        depth=6,
                        sigma=0.1,
                        delta=0,
                        random_strength=0.1,
                        random_score_type='Gumbel',
                        eps=1e-4,
                        verbose=False)
Parameters
X
Description
Training data with features.
Must be non-empty (contain at least one object).
Possible types
- list, numpy.ndarray, pandas.DataFrame, pandas.Series: two-dimensional feature matrix.
- pandas.SparseDataFrame, scipy.sparse.spmatrix (all subclasses except dia_matrix): two-dimensional sparse feature matrix.
- catboost.FeaturesData: special class for features data. See FeaturesData.
Default value
Required parameter
y
Description
Labels of the training data.
Must be a single-dimensional array with numerical values.
Possible types
- list
- numpy.ndarray
- pandas.Series
Default value
Required parameter
eval_set
Description
The validation dataset or datasets used for the following processes in the posterior fitting:
- overfitting detector
- monitoring metrics' changes
Possible types
- catboost.Pool
- list of catboost.Pool
- tuple (X, y)
- list of tuples (X, y)
- string (path to the dataset file)
- list of strings (paths to dataset files)
Default value
None
cat_features
Description
A one-dimensional array of categorical feature column indices.
Possible types
- list
- numpy.ndarray
Default value
None (all features are either considered numerical or of other types if specified precisely)
text_features
Description
A one-dimensional array of text feature column indices.
Possible types
- list
- numpy.ndarray
Default value
None (all features are either considered numerical or of other types if specified precisely)
embedding_features
Description
A one-dimensional array of embedding feature column indices.
Possible types
- list
- numpy.ndarray
Default value
None (all features are either considered numerical or of other types if specified precisely)
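The three parameters above take column indices. When columns are known by name, the indices can be derived from the column order. The sketch below is only an illustration: the helper `column_indices` and the column names are hypothetical and are not part of CatBoost.

```python
def column_indices(all_columns, selected):
    """Return the positions of `selected` column names within `all_columns`.

    Hypothetical helper (not a CatBoost function). Raises ValueError if a
    requested name is not present.
    """
    position = {name: i for i, name in enumerate(all_columns)}
    missing = [name for name in selected if name not in position]
    if missing:
        raise ValueError(f"unknown columns: {missing}")
    return [position[name] for name in selected]


# Illustrative column layout; the names are made up for this example.
columns = ["age", "city", "review", "income"]
cat_features = column_indices(columns, ["city"])
text_features = column_indices(columns, ["review"])
```

The resulting lists can then be passed as the `cat_features` and `text_features` arguments.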
random_seed
Description
The random seed used for training.
Possible types
- int
Default value
None
samples
Description
Number of Monte Carlo samples from the GP posterior. Controls how many models this function returns.
Possible range is [1, +inf)
Possible types
- int
Default value
10
posterior_iterations
Description
Maximum number of trees for the posterior sampling step.
Possible range is [1, +inf)
Possible types
- int
Default value
900
prior_iterations
Description
Maximum number of trees for the prior sampling step.
Possible range is [1, +inf)
Possible types
- int
Default value
100
learning_rate
Description
Step size shrinkage applied to each update to prevent overfitting.
Possible range is (0, 1]
Possible types
- float
Default value
0.1
depth
Description
Depth of the trees in the models.
Possible range is [1, 16]
Possible types
- int
Default value
6
sigma
Description
Scale of GP kernel (lower values lead to lower posterior variance).
Possible range is (0, +inf)
Possible types
- float
Default value
0.1
delta
Description
Scale of the homogeneous noise of the GP kernel (adjust if the target is noisy).
Possible range is [0, +inf)
Possible types
- float
Default value
0.0
random_strength
Description
Corresponds to the parameter beta in the paper. Higher values lead to faster convergence to the GP posterior.
Possible range is (0, +inf)
Possible types
- float
Default value
0.1
random_score_type
Description
Type of random noise added to scores.
Possible values:
- Gumbel: Gumbel-distributed (as in the paper)
- NormalWithModelSizeDecrease: normally distributed with deviation decreasing with model iteration count (default in CatBoost)
Possible types
- string
Default value
Gumbel
eps
Description
Technical parameter that controls the precision of prior estimation.
Possible range is (0, 1]
Possible types
- float
Default value
1e-4
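The ranges documented for the parameters from samples through eps can be checked before the call. The following is a minimal sketch, not CatBoost's own validation: `check_gp_params` and `SCORE_TYPES` are hypothetical names, and the bounds are copied from the descriptions in this section.

```python
# Values of random_score_type listed in this section.
SCORE_TYPES = {"Gumbel", "NormalWithModelSizeDecrease"}


def check_gp_params(samples=10, posterior_iterations=900, prior_iterations=100,
                    learning_rate=0.1, depth=6, sigma=0.1, delta=0.0,
                    random_strength=0.1, random_score_type="Gumbel", eps=1e-4):
    """Return the arguments as a dict, or raise ValueError if any is out of range.

    Hypothetical helper; the bounds mirror the documented ranges.
    """
    params = dict(
        samples=samples, posterior_iterations=posterior_iterations,
        prior_iterations=prior_iterations, learning_rate=learning_rate,
        depth=depth, sigma=sigma, delta=delta,
        random_strength=random_strength,
        random_score_type=random_score_type, eps=eps,
    )
    checks = {
        "samples": samples >= 1,                      # [1, +inf)
        "posterior_iterations": posterior_iterations >= 1,
        "prior_iterations": prior_iterations >= 1,
        "learning_rate": 0 < learning_rate <= 1,      # (0, 1]
        "depth": 1 <= depth <= 16,                    # [1, 16]
        "sigma": sigma > 0,                           # (0, +inf)
        "delta": delta >= 0,                          # [0, +inf)
        "random_strength": random_strength > 0,       # (0, +inf)
        "random_score_type": random_score_type in SCORE_TYPES,
        "eps": 0 < eps <= 1,                          # (0, 1]
    }
    bad = [name for name, ok in checks.items() if not ok]
    if bad:
        raise ValueError("out-of-range parameters: " + ", ".join(bad))
    return params


params = check_gp_params(samples=5, sigma=0.2)
```

The returned dict can be unpacked into the call, for example `sample_gaussian_process(X, y, **params)`.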
verbose
Description
Verbosity of the posterior model training output.
If verbose is bool: True sets logging_level to Verbose, and False sets logging_level to Silent.
If verbose is int, it determines the frequency of writing metrics to output, and logging_level is set to Verbose.
Possible types
- bool
- int
Default value
False
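The rule for verbose can be written out as a small function. This is a sketch of the description only, not CatBoost's internal code: the name `resolve_verbose` and the layout of the returned tuple are assumptions made for illustration.

```python
def resolve_verbose(verbose):
    """Map `verbose` to (logging_level, output_frequency) as described above.

    output_frequency is None when `verbose` is a bool, because the
    description gives no frequency for that case.
    """
    # bool must be tested first: in Python, bool is a subclass of int.
    if isinstance(verbose, bool):
        return ("Verbose" if verbose else "Silent", None)
    if isinstance(verbose, int):
        return ("Verbose", verbose)
    raise TypeError("verbose must be bool or int")
```

Checking for bool before int matters here, since `isinstance(True, int)` is true and the two cases are documented differently.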
Return value
List of trained CatBoostRegressor models (size = samples parameter value).
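Because the function returns one model per posterior sample, a predictive mean and an uncertainty estimate can be obtained by aggregating the models' predictions. The sketch below is an illustration under assumptions: `predictive_mean_and_std` and `run_demo` are hypothetical helpers, the synthetic data is made up, and `run_demo` needs `catboost` and `numpy` to be installed.

```python
from statistics import fmean, pstdev


def predictive_mean_and_std(models, X):
    """Aggregate predictions of sampled models into per-object mean and std.

    `models` is any sequence of objects with a `predict(X)` method, such as
    the list returned by sample_gaussian_process. The spread is the
    population standard deviation across models.
    """
    # One list of predictions per model, converted to plain floats.
    per_model = [[float(p) for p in model.predict(X)] for model in models]
    # Transpose: one tuple of sampled predictions per object.
    per_object = list(zip(*per_model))
    means = [fmean(values) for values in per_object]
    stds = [pstdev(values) for values in per_object]
    return means, stds


def run_demo():
    """Fit a few posterior samples on synthetic data (needs catboost, numpy)."""
    import numpy as np
    from catboost import sample_gaussian_process

    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(200, 3))
    y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=200)

    # Small iteration counts keep the illustration fast; the defaults are larger.
    models = sample_gaussian_process(
        X, y, samples=5, posterior_iterations=100, prior_iterations=20,
        random_seed=0,
    )
    return predictive_mean_and_std(models, X[:10])
```

Calling `run_demo()` returns two lists of length 10: the mean prediction and the spread across the five sampled models for the first ten objects.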