Pool
class Pool(data,
label=None,
cat_features=None,
text_features=None,
embedding_features=None,
column_description=None,
pairs=None,
graph=None,
delimiter='\t',
has_header=False,
weight=None,
group_id=None,
group_weight=None,
subgroup_id=None,
pairs_weight=None,
baseline=None,
timestamp=None,
feature_names=None,
thread_count=-1,
log_cout=sys.stdout,
log_cerr=sys.stderr)
Purpose
Dataset processing.
If most or all of your features are numerical, the fastest way to pass feature data to the Pool constructor (and to the CatBoost, CatBoostClassifier and CatBoostRegressor methods that accept it) is the FeaturesData class. For datasets that contain only numerical features, passing the data as a numpy.ndarray with numpy.float32 dtype gives similar performance.
Parameters
data
Description
The description is different for each group of possible types.
Possible types
list, numpy.ndarray, pandas.DataFrame, pandas.Series
Dataset in the form of a two-dimensional feature matrix.
pandas.SparseDataFrame, scipy.sparse.spmatrix (all subclasses except dia_matrix)
The input training dataset in the form of a two-dimensional sparse feature matrix.
catboost.FeaturesData
Dataset in the form of catboost.FeaturesData. The fastest way to create a Pool from Python objects.
string
The path to the input file that contains the dataset description.
Format:
[scheme://]<path>
- scheme (optional) defines the type of the input dataset. Possible values:
  - quantized:// for a catboost.Pool quantized pool.
  - libsvm:// for a dataset in the extended libsvm format.
  If omitted, a dataset in the Native CatBoost Delimiter-separated values format is expected.
- path defines the path to the dataset description.
Default value
Required parameter
label
Description
The target variables (in other words, the objects' label values).
Must be in the form of a one- or two-dimensional array. The type of data in the array depends on the machine learning task being solved:
- Regression and ranking: One-dimensional array of numeric values.
- Multiregression: Two-dimensional array of numeric values. The first index is for a dimension, the second index is for an object.
- Binary classification: One-dimensional array containing one of:
  - Booleans, integers or strings that represent the labels of the classes (only two unique values).
  - Numeric values. The interpretation of numeric values depends on the selected loss function:
    - Logloss: The value is considered a positive class if it is strictly greater than the value of the target_border training parameter. Otherwise, it is considered a negative class.
    - CrossEntropy: The value is interpreted as the probability that the dataset object belongs to the positive class. Possible values are in the range [0; 1].
- Multiclassification: One-dimensional array of integers or strings that represent the labels of the classes.
- Multi label classification: Two-dimensional array. The first index is for a label/class, the second index is for an object. Possible values depend on the selected loss function:
  - MultiLogloss: Only {0, 1} or {False, True} values are allowed that specify whether an object belongs to the class corresponding to the first index.
  - MultiCrossEntropy: Numerical values in the range [0; 1] that are interpreted as the probability that the dataset object belongs to the class corresponding to the first index.
Note
If the data parameter points to a file, label data is loaded from it as well. This parameter must be None in this case.
Possible types
- list
- numpy.ndarray
- pandas.Series
- pandas.DataFrame
Default value
None
cat_features
Description
A one-dimensional array of categorical column indices (specified as integers) or names (specified as strings).
Use only if the data parameter is a two-dimensional feature matrix (has one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series).
If any elements in this array are specified as names instead of indices, names for all columns must be provided. To do this, either use the feature_names parameter of this constructor to specify them explicitly, or pass a pandas.DataFrame with column names in the data parameter.
Possible types
- list
- numpy.ndarray
Default value
None (it is assumed that all columns are the values of numerical features)
text_features
Description
A one-dimensional array of text column indices (specified as integers) or names (specified as strings).
Use only if the data parameter is a two-dimensional feature matrix (has one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series).
If any elements in this array are specified as names instead of indices, names for all columns must be provided. To do this, either use the feature_names parameter of this constructor to specify them explicitly, or pass a pandas.DataFrame with column names in the data parameter.
Possible types
- list
- numpy.ndarray
Default value
None (all features are either considered numerical or of other types if specified precisely)
embedding_features
Description
A one-dimensional array of embedding column indices (specified as integers) or names (specified as strings).
Use only if the data parameter is a two-dimensional feature matrix (has one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series).
If any elements in this array are specified as names instead of indices, names for all columns must be provided. To do this, either use the feature_names parameter of this constructor to specify them explicitly, or pass a pandas.DataFrame with column names in the data parameter.
Possible types
- list
- numpy.ndarray
Default value
None (all features are either considered numerical or of other types if specified precisely)
column_description
Description
The path to the input file that contains the columns description.
Possible types
string
Default value
None
pairs
Description
The description is different for each group of possible types.
Possible types
list, numpy.ndarray, pandas.DataFrame
The pairs description in the form of a two-dimensional matrix of shape N by 2, where N is the number of pairs:
- The first element of the pair is the zero-based index of the winner object from the input dataset for pairwise comparison.
- The second element of the pair is the zero-based index of the loser object from the input dataset for pairwise comparison.
This information is used for calculation and optimization of Pairwise metrics.
string
The path to the input file that contains the pairs description.
This information is used for calculation and optimization of Pairwise metrics.
graph
Description
The description is different for each group of possible types.
Possible types
list, numpy.ndarray, pandas.DataFrame
The graph description in the form of a two-dimensional matrix of shape N by 2, where N is the number of edges:
- The first element of the edge is the zero-based index of the start vertex (object) from the input dataset.
- The second element of the edge is the zero-based index of the end vertex (object) from the input dataset.
string
The path to the input file that contains the graph information.
Default value
None
delimiter
Description
The delimiter character used to separate the data in the dataset description input file.
Only single char delimiters are supported. If the specified value contains more than one character, only the first one is used.
Note
Used only if the dataset is given in the Delimiter-separated values format.
Possible types
string
Default value
'\t' (the input data is assumed to be tab-separated)
has_header
Description
Read the column names from the first line of the dataset description file if this parameter is set.
Note
Used only if the dataset is given in the Delimiter-separated values format.
Possible types
bool
Default value
False
weight
Description
The weight of each object in the input data, given as one-dimensional array-like data.
By default, it is set to 1 for all objects.
Alert
Only one of the following parameters can be used at a time:
- weight
- group_weight
Possible types
- list
- numpy.ndarray
Default value
None
group_weight
Description
The weights of all objects within the defined groups from the input data, given as one-dimensional array-like data.
Used for calculating the final values of trees. By default, it is set to 1 for all objects in all groups.
Alert
Only one of the following parameters can be used at a time:
- weight
- group_weight
Possible types
- list
- numpy.ndarray
Default value
None
group_id
Description
Group identifiers for all input objects. Supported identifier types are:
- int
- string types (string or unicode for Python 2 and bytes or string for Python 3).
Warning
All objects in the dataset must be grouped by group identifiers if they are present. I.e., the objects with the same group identifier should follow each other in the dataset.
Example
For example, assume that the dataset consists of several documents, each with a group identifier and a feature vector. All documents that share a group identifier must occupy consecutive lines in the dataset.
The grouped blocks of lines can be input in any order.
Possible types
- list
- numpy.ndarray
Default value
None
subgroup_id
Description
Subgroup identifiers for all input objects. Supported identifier types are:
- int
- string types (string or unicode for Python 2 and bytes or string for Python 3).
Possible types
- list
- numpy.ndarray
Default value
None
pairs_weight
Description
The weight of each input pair of objects, given as one-dimensional array-like data. The number of given values must match the number of specified pairs.
This information is used for calculation and optimization of Pairwise metrics.
By default, it is set to 1 for all pairs.
Possible types
- list
- numpy.ndarray
Default value
None
baseline
Description
Array of formula values for all input objects. The training starts from these values for all input objects instead of starting from zero.
Possible types
- list
- numpy.ndarray
Default value
None
timestamp
Description
Timestamps for all input objects.
Should contain non-negative integer values.
Useful for sorting a learning dataset by this field during training.
Possible types
- list
- numpy.ndarray
Default value
None
feature_names
Description
A list of names for each feature in the dataset.
Possible types
list
Default value
None
thread_count
Description
The number of threads to use when reading data from file.
Use only when the dataset is read from an input file.
Possible types
int
Default value
-1 (the number of threads is equal to the number of processor cores)
log_cout
Description
Output stream or callback for logging.
Possible types
- callable Python object
- Python object providing the write() method
Default value
sys.stdout
log_cerr
Description
Error stream or callback for logging.
Possible types
- callable Python object
- Python object providing the write() method
Default value
sys.stderr
Attributes
Attribute: shape
Description: Return the shape of the dataset.
Attribute: is_empty_
Description: Indicates that an empty array was input.
Methods
Method: get_baseline
Description
Return an array of baselines from the dataset.
Method: get_cat_feature_indices
Description
Return the indices of categorical features found in the input data.
Method: get_embedding_feature_indices
Description
Return the indices of embedding features found in the input data.
Method: get_features
Description
Return an array of the dataset features.
Method: get_group_id
Description
Return an array of group identifiers for all objects.
Method: get_label
Description
Return the value of the label assigned to the input data.
Method: get_text_feature_indices
Description
Return the indices of text features found in the input data.
Method: get_weight
Description
Return the list of weights for each object of the dataset.
Method: is_quantized
Description
Check whether the pool is quantized.
Method: num_col
Description
Return the number of columns that contain feature data.
Method: num_row
Description
Return the number of objects contained in the dataset.
Method: quantize
Description
Quantize the given pool.
Method: save
Description
Save the quantized pool to a file.
Method: save_quantization_borders
Description
Save borders used in the numeric features' quantization to a file.
Refer to the Custom quantization borders and missing value modes section for details on the output file's format.
Method: set_baseline
Description
Set initial formula values for all input objects. The training starts from these values for all input objects instead of starting from zero.
Method: set_feature_names
Description
Set names for all features in the dataset.
Method: set_group_id
Description
Set identifiers for all input objects.
Method: set_group_weight
Description
Set weights for all objects within the defined group.
Method: set_pairs
Description
Set the list of pairs for Pairwise metrics.
Method: set_pairs_weight
Description
Set weights for each pair of objects.
Method: set_subgroup_id
Description
Set subgroup identifiers for all input objects.
Method: set_weight
Description
Set weights for all input objects.
Method: slice
Description
Form a slice of the input dataset from the given list of object indices.
Usage examples
Load the dataset using Pool, train it with CatBoostClassifier and make a prediction
from catboost import CatBoostClassifier, Pool

# Three objects with four numerical features each, plus per-object weights.
train_data = Pool(data=[[1, 4, 5, 6],
                        [4, 5, 6, 7],
                        [30, 40, 50, 60]],
                  label=[1, 1, -1],
                  weight=[0.1, 0.2, 0.3])
# Train a small classifier directly on the Pool.
model = CatBoostClassifier(iterations=10)
model.fit(train_data)
# Predict class labels for the same objects.
preds_class = model.predict(train_data)