# Pool

```python
class Pool(data,
           label=None,
           cat_features=None,
           text_features=None,
           embedding_features=None,
           column_description=None,
           pairs=None,
           delimiter='\t',
           has_header=False,
           weight=None,
           group_id=None,
           group_weight=None,
           subgroup_id=None,
           pairs_weight=None,
           baseline=None,
           timestamp=None,
           feature_names=None,
           thread_count=-1,
           log_cout=sys.stdout,
           log_cerr=sys.stderr)
```


## Purpose

Dataset processing.

If most or all of your features are numerical, the fastest way to pass feature data to the Pool constructor (and to other CatBoost, CatBoostClassifier and CatBoostRegressor methods that accept it) is to use the FeaturesData class. For datasets that contain only numerical features, passing the data as a numpy.ndarray with the numpy.float32 dtype gives similar performance.

## Parameters

### data

#### Description

The description is different for each group of possible types.

Possible types

list, numpy.array, pandas.DataFrame, pandas.Series

Dataset in the form of a two-dimensional feature matrix.

pandas.SparseDataFrame, scipy.sparse.spmatrix (all subclasses except dia_matrix)

The input training dataset in the form of a two-dimensional sparse feature matrix.

catboost.FeaturesData

Dataset in the form of catboost.FeaturesData. The fastest way to create a Pool from Python objects.

string

The path to the input file that contains the dataset description.

Format:

[scheme://]<path>

Default value

Required parameter

### label

#### Description

The target variables (in other words, the objects' label values) for the training dataset.

Must be in the form of a one-dimensional array. The type of data in the array depends on the machine learning task being solved:

• Regression, multiregression and ranking: numeric values.
• Binary classification: numeric values. The interpretation of the numeric values depends on the selected loss function:
    • Logloss: the value is considered a positive class if it is strictly greater than the threshold parameter of the loss function. Otherwise, it is considered a negative class.
    • CrossEntropy: the value is interpreted as the probability that the dataset object belongs to the positive class. Possible values are in the range [0; 1].
• Multiclassification: integers or strings that represent the labels of the classes.

Possible types

• list
• numpy.array
• pandas.Series
• pandas.DataFrame

Default value

None
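The two binary classification interpretations described above can be illustrated without CatBoost itself. The helper names below and the 0.5 threshold are hypothetical and chosen only for this sketch; they are not part of the library.

```python
import numpy as np

def logloss_classes(labels, threshold=0.5):
    """Hypothetical helper: 1 if the label is strictly greater than the threshold, else 0."""
    return (np.asarray(labels, dtype=float) > threshold).astype(int)

def valid_cross_entropy_labels(labels):
    """Hypothetical helper: CrossEntropy labels are probabilities in [0; 1]."""
    arr = np.asarray(labels, dtype=float)
    return bool(np.all((arr >= 0.0) & (arr <= 1.0)))

labels = [0.0, 0.5, 0.7, 1.0]
classes = logloss_classes(labels)             # 0.5 is not strictly greater than 0.5
print(classes.tolist())
print(valid_cross_entropy_labels(labels))
print(valid_cross_entropy_labels([1, -1]))    # -1 is outside [0; 1]
```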

### cat_features

#### Description

A one-dimensional array of categorical columns indices (specified as integers) or names (specified as strings).

Use only if the data parameter is a two-dimensional feature matrix (has one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series).

If any elements in this array are specified as names instead of indices, names for all columns must be provided. To do this, either use the feature_names parameter of this constructor to explicitly specify them or pass a pandas.DataFrame with column names specified in the data parameter.

Possible types

• list
• numpy.array

Default value

None (it is assumed that all columns are the values of numerical features)

### text_features

#### Description

A one-dimensional array of text columns indices (specified as integers) or names (specified as strings).

Use only if the data parameter is a two-dimensional feature matrix (has one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series).

If any elements in this array are specified as names instead of indices, names for all columns must be provided. To do this, either use the feature_names parameter of this constructor to explicitly specify them or pass a pandas.DataFrame with column names specified in the data parameter.

Possible types

• list
• numpy.array

Default value

None (all features are either considered numerical or of other types if specified precisely)

### embedding_features

#### Description

A one-dimensional array of embedding columns indices (specified as integers) or names (specified as strings).

Use only if the data parameter is a two-dimensional feature matrix (has one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series).

If any elements in this array are specified as names instead of indices, names for all columns must be provided. To do this, either use the feature_names parameter of this constructor to explicitly specify them or pass a pandas.DataFrame with column names specified in the data parameter.

Possible types

• list
• numpy.array

Default value

None (all features are either considered numerical or of other types if specified precisely)

### column_description

#### Description

The path to the input file that contains the columns description.

Possible types

string

Default value

None

### pairs

#### Description

The description is different for each group of possible types.

Possible types

list, numpy.array, pandas.DataFrame

The pairs description in the form of a two-dimensional matrix of shape N by 2:

• N is the number of pairs.
• The first element of the pair is the zero-based index of the winner object from the input dataset for pairwise comparison.
• The second element of the pair is the zero-based index of the loser object from the input dataset for pairwise comparison.

This information is used for calculation and optimization of Pairwise metrics.

string

The path to the input file that contains the pairs description.

This information is used for calculation and optimization of Pairwise metrics.

Default value

None

### delimiter

#### Description

The delimiter character used to separate the data in the dataset description input file.

Only single-character delimiters are supported. If the specified value contains more than one character, only the first one is used.

Note

Used only if the dataset is given in the Delimiter-separated values format.

Possible types

string

Default value

'\t' (the input data is assumed to be tab-separated)

### has_header

#### Description

Read the column names from the first line of the dataset description file if this parameter is set.

Note

Used only if the dataset is given in the Delimiter-separated values format.

Possible types

bool

Default value

False

### weight

#### Description

The weight of each object in the input data in the form of a one-dimensional array-like data.

By default, it is set to 1 for all objects.

Only one of the following parameters can be used at a time:

• weight
• group_weight

Possible types

• list
• numpy.array

Default value

None

### group_weight

#### Description

The weights of all objects within the defined groups from the input data in the form of one-dimensional array-like data.

Used for calculating the final values of trees. By default, it is set to 1 for all objects in all groups.

Only one of the following parameters can be used at a time:

• weight
• group_weight

Possible types

• list
• numpy.array

Default value

None

### group_id

#### Description

Group identifiers for all input objects. Supported identifier types are:

• int
• string types (string or unicode for Python 2 and bytes or string for Python 3).

Warning

All objects in the dataset must be grouped by group identifiers if they are present. That is, objects with the same group identifier must follow each other in the dataset.

Example

For example, let's assume that the dataset consists of documents $d_{1}, d_{2}, d_{3}, d_{4}, d_{5}$. The corresponding groups are $g_{1}, g_{2}, g_{3}, g_{2}, g_{2}$, respectively. The feature vectors for the given documents are $f_{1}, f_{2}, f_{3}, f_{4}, f_{5}$ respectively. Then the dataset can take the following form:

$\begin{pmatrix} d_{2}&g_{2}&f_{2}\\ d_{4}&g_{2}&f_{4}\\ d_{5}&g_{2}&f_{5}\\ d_{3}&g_{3}&f_{3}\\ d_{1}&g_{1}&f_{1} \end{pmatrix}$

The grouped blocks of lines can be input in any order. For example, the following order is equivalent to the previous one:

$\begin{pmatrix} d_{1}&g_{1}&f_{1}\\ d_{3}&g_{3}&f_{3}\\ d_{2}&g_{2}&f_{2}\\ d_{4}&g_{2}&f_{4}\\ d_{5}&g_{2}&f_{5} \end{pmatrix}$

Possible types

• list
• numpy.array

Default value

None
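The grouping requirement from the warning above can be checked before the data is passed to Pool. The sketch below is plain Python and independent of CatBoost; is_grouped is a hypothetical helper name. It uses the group sequence from the example and then reorders the objects so that each group forms one contiguous block.

```python
from itertools import groupby

def is_grouped(group_ids):
    """Hypothetical helper: True if every group identifier forms one contiguous block."""
    seen = set()
    for gid, _ in groupby(group_ids):
        if gid in seen:
            return False  # the identifier reappears after another group
        seen.add(gid)
    return True

docs = ["d1", "d2", "d3", "d4", "d5"]
groups = ["g1", "g2", "g3", "g2", "g2"]   # g2 is split by g3, so this order is invalid

# A stable sort by group identifier makes each group contiguous.
order = sorted(range(len(groups)), key=lambda i: groups[i])
docs_sorted = [docs[i] for i in order]
groups_sorted = [groups[i] for i in order]

print(is_grouped(groups), is_grouped(groups_sorted))
print(docs_sorted)
```

The same permutation has to be applied to the features, labels and any other per-object arrays before they are passed to the constructor.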

### subgroup_id

#### Description

Subgroup identifiers for all input objects. Supported identifier types are:

• int
• string types (string or unicode for Python 2 and bytes or string for Python 3).

Possible types

• list
• numpy.array

Default value

None

### pairs_weight

#### Description

The weight of each input pair of objects in the form of one-dimensional array-like data. The number of given values must match the number of specified pairs.

This information is used for calculation and optimization of Pairwise metrics.

By default, it is set to 1 for all pairs.

Possible types

• list
• numpy.array

Default value

None

### baseline

#### Description

Array of formula values for all input objects. The training starts from these values for all input objects instead of starting from zero.

Possible types

• list
• numpy.array

Default value

None

### timestamp

#### Description

Timestamps for all input objects.
Should contain non-negative integer values.
Useful for sorting a learning dataset by this field during training.

Possible types

• list
• numpy.array

Default value

None

### feature_names

#### Description

A list of names for each feature in the dataset.

Possible types

list

Default value

None

### thread_count

#### Description

The number of threads to use when reading data from a file.

Use only when the dataset is read from an input file.

Possible types

int

Default value

-1 (the number of threads is equal to the number of processor cores)

### log_cout

Output stream or callback for logging.

Possible types

• callable Python object
• Python object providing the write() method

Default value

sys.stdout

### log_cerr

Error stream or callback for logging.

Possible types

• callable Python object
• Python object providing the write() method

Default value

sys.stderr

## Attributes

Attribute: shape

Description: Return the shape of the dataset.

Attribute: is_empty_

Description: Indicates that an empty array was input.

## Methods

Method: get_baseline

#### Description

Return an array of baselines from the dataset.

Method: get_cat_feature_indices

#### Description

Return the indices of categorical features found in the input data.

Method: get_embedding_feature_indices

#### Description

Return the indices of embedding features found in the input data.

Method: get_features

#### Description

Return an array of the dataset features.

Method: get_group_id

#### Description

Return an array of group identifiers for all objects.

Method: get_label

#### Description

Return the value of the label assigned to the input data.

Method: get_text_feature_indices

#### Description

Return the indices of text features found in the input data.

Method: get_weight

#### Description

Return the list of weights for each object of the dataset.

Method: is_quantized

#### Description

Check whether the pool is quantized.

Method: num_col

#### Description

Return the number of columns that contain feature data.

Method: num_row

#### Description

Return the number of objects contained in the dataset.

Method: quantize

#### Description

Quantize the given pool.

Method: save

#### Description

Save the quantized pool to a file.

Method: save_quantization_borders

#### Description

Save borders used in the numeric features' quantization to a file.

Refer to the Custom quantization borders and missing value modes section for details on the output file's format.

Method: set_baseline

#### Description

Set initial formula values for all input objects. The training starts from these values for all input objects instead of starting from zero.

Method: set_feature_names

#### Description

Set names for all features in the dataset.

Method: set_group_id

#### Description

Set group identifiers for all input objects.

Method: set_group_weight

#### Description

Set weights for all objects within the defined group.

Method: set_pairs

#### Description

Set the list of pairs for Pairwise metrics.

Method: set_pairs_weight

#### Description

Set weights for each pair of objects.

Method: set_subgroup_id

#### Description

Set subgroup identifiers for all input objects.

Method: set_weight

#### Description

Set weights for all input objects.

Method: slice

#### Description

Form a slice of the input dataset from the given list of object indices.

## Usage examples

#### Load the dataset using Pool, train a model with CatBoostClassifier and make a prediction

```python
from catboost import CatBoostClassifier, Pool

# Build the training dataset: three objects with four numerical features each
train_data = Pool(data=[[1, 4, 5, 6],
                        [4, 5, 6, 7],
                        [30, 40, 50, 60]],
                  label=[1, 1, -1],
                  weight=[0.1, 0.2, 0.3])

model = CatBoostClassifier(iterations=10)

# Train on the pool, then predict classes for the same objects
model.fit(train_data)
preds_class = model.predict(train_data)
```