quantize

Create a pool from a file and quantize it while loading the data. This reduces the size of the initial dataset and makes it possible to load huge datasets that would otherwise not fit in RAM.

The input data should contain only numerical features (other types are not currently supported).

This method gives the same result as the following code but uses less RAM:

pool = Pool(filename, **some_pool_load_params)
pool.quantize(**some_quantization_params)
return pool

Method call format

quantize(data_path,
         column_description=None,
         pairs=None,
         delimiter='\t',
         has_header=False,
         feature_names=None,
         thread_count=-1,
         ignored_features=None,
         per_float_feature_quantization=None,
         border_count=None,
         max_bin=None,
         feature_border_type=None,
         nan_mode=None,
         input_borders=None,
         task_type=None,
         used_ram_limit=None,
         random_seed=None)

Parameters

data_path

Description

The path to the input file that contains the dataset description.

Format:

[scheme://]<path>

scheme (optional) defines the type of the input dataset. Possible values:
- quantized:// — catboost.Pool quantized pool.
- libsvm:// — dataset in the extended libsvm format.
If omitted, a dataset in the Native CatBoost Delimiter-separated values format is expected.
path defines the path to the dataset description.
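
As a rough illustration of the format above, the following sketch splits a data_path value into its optional scheme and its path. The helper name split_data_path and the "dsv" label returned for the scheme-less case are assumptions made for this example; CatBoost parses the value internally and is not known to expose such a function.

```python
def split_data_path(data_path):
    """Split '[scheme://]<path>' into (scheme, path).

    Hypothetical helper for illustration only, not part of the CatBoost API.
    """
    scheme, sep, rest = data_path.partition("://")
    if sep:
        return scheme, rest
    # No scheme given: the delimiter-separated values format is expected.
    # The "dsv" label is an assumption of this sketch.
    return "dsv", data_path


print(split_data_path("quantized://train.bin"))  # ('quantized', 'train.bin')
print(split_data_path("libsvm://train.libsvm"))  # ('libsvm', 'train.libsvm')
print(split_data_path("train.tsv"))              # ('dsv', 'train.tsv')
```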

Possible types

string

Default value

Obligatory parameter

column_description

Description

The path to the input file that contains the columns description.

Possible types

string

Default value

None

pairs

Description

The path to the input file that contains the pairs description.

This information is used for calculation and optimization of Pairwise metrics.

Possible types

string

Default value

None

delimiter

Description

The delimiter character used to separate the data in the dataset description input file.

Only single-character delimiters are supported. If the specified value contains more than one character, only the first one is used.

Note

Used only if the dataset is given in the Delimiter-separated values format.

Possible types

string

Default value

'\t' (the input data is assumed to be tab-separated)

has_header

Description

Read the column names from the first line of the dataset description file if this parameter is set.

Note

Used only if the dataset is given in the Delimiter-separated values format.
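
The documented behaviour of delimiter and has_header can be sketched with the standard csv module. The read_dsv helper below is hypothetical and only mimics the two rules stated in this section: a multi-character delimiter is truncated to its first character, and the first line is treated as column names when has_header is set.

```python
import csv
import io


def read_dsv(text, delimiter="\t", has_header=False):
    """Return (column_names, rows) for delimiter-separated text.

    Hypothetical helper; it only mirrors the documented rules for the
    delimiter and has_header parameters.
    """
    # Only the first character of a multi-character delimiter is used.
    reader = csv.reader(io.StringIO(text), delimiter=delimiter[:1])
    rows = list(reader)
    # With has_header set, the first line holds the column names.
    names = rows.pop(0) if has_header and rows else None
    return names, rows


sample = "f0,f1\n1.5,2.0\n3.0,4.5\n"
names, rows = read_dsv(sample, delimiter=",;", has_header=True)
print(names)  # ['f0', 'f1']
print(rows)   # [['1.5', '2.0'], ['3.0', '4.5']]
```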

Possible types

bool

Default value

False

feature_names

Description

A list of names for each feature in the dataset.

Possible types

list

Default value

None

thread_count

Description

The number of threads to use.

Optimizes the speed of execution. This parameter doesn't affect results.

Possible types

int

Default value

-1 (the number of threads is equal to the number of processor cores)

ignored_features

Description

Feature indices or names to exclude from the training. If at least one of the passed values cannot be converted to a number or a range of numbers, all passed values are assumed to be feature names. Otherwise, all passed values are assumed to be feature indices.

Specifics:

- Non-negative indices that do not match any features are successfully ignored. For example, if five features are defined for the objects in the dataset and this parameter is set to 42, the corresponding non-existing feature is successfully ignored.
- The identifier corresponds to the feature's index. Feature indices used in train and feature importance are numbered from 0 to featureCount – 1. If a file is used as input data, any non-feature column types are ignored when calculating these indices. For example, each row in the input file contains data in the following order: cat feature<\t>label value<\t>num feature. So for the row rock<\t>0<\t>42, the identifier for the rock feature is 0, and for the 42 feature it's 1.
- The addition of a non-existing feature name raises an error.

For example, use the following construction if features indexed 1, 2, 7, 42, 43, 44, 45, should be ignored:

[1,2,7,42,43,44,45]
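
The index rules above can be sketched in a few lines. Both helpers are hypothetical, and the column type labels ("Categ", "Num", "Label") are chosen for illustration: the first helper numbers features while skipping non-feature columns, and the second shows that ignored indices with no matching feature have no effect.

```python
def feature_indices(column_types):
    """Map column positions to feature indices, skipping non-feature columns.

    Hypothetical helper; the type labels are illustrative assumptions.
    """
    feature_types = {"Categ", "Num"}
    mapping = {}
    next_index = 0
    for position, col_type in enumerate(column_types):
        if col_type in feature_types:
            mapping[position] = next_index
            next_index += 1
    return mapping


def kept_features(feature_count, ignored):
    """Return the feature indices that remain after applying ignored indices."""
    ignored = set(ignored)
    return [i for i in range(feature_count) if i not in ignored]


# Row layout: cat feature <\t> label value <\t> num feature
print(feature_indices(["Categ", "Label", "Num"]))  # {0: 0, 2: 1}

# Five features, index 42 does not exist and is silently ignored.
print(kept_features(5, [42]))    # [0, 1, 2, 3, 4]
print(kept_features(5, [1, 2]))  # [0, 3, 4]
```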

Possible types

list

Default value

None

per_float_feature_quantization

Description

The quantization description for the specified feature or list of features.

Description format for a single feature:

FeatureId[:border_count=BorderCount][:nan_mode=NanMode][:border_type=border_selection_method]

Example:

per_float_feature_quantization=['0:border_count=1024', '1:border_count=1024']

In this example, features indexed 0 and 1 have 1024 borders.
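
When many features need individual settings, the description strings can be generated instead of typed by hand. The quantization_spec helper below is a hypothetical convenience function written for this example; it only assembles strings in the single-feature format shown above.

```python
def quantization_spec(feature_id, border_count=None, nan_mode=None, border_type=None):
    """Build one per-feature quantization description string.

    Hypothetical helper; it follows the documented single-feature format.
    """
    parts = [str(feature_id)]
    if border_count is not None:
        parts.append(f"border_count={border_count}")
    if nan_mode is not None:
        parts.append(f"nan_mode={nan_mode}")
    if border_type is not None:
        parts.append(f"border_type={border_type}")
    return ":".join(parts)


specs = [quantization_spec(i, border_count=1024) for i in (0, 1)]
print(specs)  # ['0:border_count=1024', '1:border_count=1024']
print(quantization_spec(3, border_count=32, nan_mode="Max", border_type="Uniform"))
# 3:border_count=32:nan_mode=Max:border_type=Uniform
```

The resulting list can be passed as the per_float_feature_quantization argument.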

Possible types

list of strings

Default value

None

border_count

Alias: max_bin

Description

The number of splits for numerical features. Allowed values are integers from 1 to 65535 inclusive.
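
To make the role of borders concrete, the sketch below maps numerical values to bin indices given a sorted list of borders, so that N borders produce N + 1 bins. This is a simplified illustration, not CatBoost's internal code; in particular, the choice to place a value equal to a border into the lower bin is an assumption of this example.

```python
from bisect import bisect_left


def to_bin(value, borders):
    """Return the bin index of value for a sorted list of borders.

    Illustrative only: a value equal to a border goes to the lower bin.
    """
    return bisect_left(borders, value)


borders = [0.5, 1.5, 2.5]  # 3 borders give 4 bins (0..3)
print([to_bin(v, borders) for v in [0.1, 0.5, 1.0, 2.0, 9.0]])  # [0, 0, 1, 2, 3]
```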

Possible types

int

Default value

The default value depends on the processing unit type:

CPU: 254
GPU: 128

feature_border_type

Description

The quantization mode for numerical features.

Possible values:

Median
Uniform
UniformAndQuantiles
MaxLogSum
MinEntropy
GreedyLogSum
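
The modes differ in how border positions are chosen. The sketch below contrasts two simplified strategies: evenly spaced borders between the minimum and maximum, and borders taken at evenly spaced ranks of the sorted values. Both functions are hypothetical stand-ins meant to convey the general idea behind uniform and quantile-style selection; they are not the algorithms CatBoost implements for the modes listed above.

```python
def uniform_borders(values, border_count):
    """Evenly spaced borders between min and max (simplified sketch)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (border_count + 1)
    return [lo + step * (i + 1) for i in range(border_count)]


def quantile_borders(values, border_count):
    """Borders at evenly spaced ranks of the sorted values (simplified sketch)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i + 1) * n // (border_count + 1)] for i in range(border_count)]


values = [0.0, 1.0, 2.0, 3.0, 100.0]
print(uniform_borders(values, 3))   # [25.0, 50.0, 75.0]
print(quantile_borders(values, 3))  # [1.0, 2.0, 3.0]
```

With a single outlier (100.0), the evenly spaced borders leave most values in one bin, while the rank-based borders still separate them.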

Possible types

string

Default value

GreedyLogSum

nan_mode

Description

The method for processing missing values in the input dataset.

Possible values:

"Forbidden" — Missing values are not supported, their presence is interpreted as an error.
"Min" — Missing values are processed as the minimum value (less than all other values) for the feature. It is guaranteed that a split that separates missing values from all other values is considered when selecting trees.
"Max" — Missing values are processed as the maximum value (greater than all other values) for the feature. It is guaranteed that a split that separates missing values from all other values is considered when selecting trees.

Using the Min or Max value of this parameter guarantees that a split between missing values and other values is considered when selecting a new split in the tree.
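
The three modes can be illustrated with a small binning sketch. The function below is hypothetical and simplified: it reserves a dedicated bin below all regular bins for Min and above them for Max, which is one way to picture why a split separating missing values from all other values is always available. How CatBoost represents this internally is not described here.

```python
import math
from bisect import bisect_left


def to_bin_with_nan(value, borders, nan_mode="Min"):
    """Map a value to a bin index, handling NaN according to nan_mode.

    Illustrative sketch: regular bins are 1..len(borders)+1, bin 0 is
    reserved for NaN in "Min" mode, and the top bin for NaN in "Max" mode.
    """
    if nan_mode not in ("Forbidden", "Min", "Max"):
        raise ValueError(f"unknown nan_mode: {nan_mode}")
    if math.isnan(value):
        if nan_mode == "Forbidden":
            raise ValueError("missing values are not supported")
        return 0 if nan_mode == "Min" else len(borders) + 2
    return bisect_left(borders, value) + 1


borders = [0.5, 1.5]
print(to_bin_with_nan(float("nan"), borders, "Min"))  # 0
print(to_bin_with_nan(1.0, borders, "Min"))           # 2
print(to_bin_with_nan(float("nan"), borders, "Max"))  # 4
```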

Note

The method for processing missing values can be set individually for each feature in the Custom quantization borders and missing value modes input file. Such values override the ones specified in this parameter.

Possible types

string

Default value

Min

input_borders

Description

Load Custom quantization borders and missing value modes from a file (do not generate them).

Borders are automatically generated before training if this parameter is not set.

Possible types

string

Default value

None

task_type

Description

The processing unit type to use for training.

Possible values:

CPU
GPU

Possible types

string

Default value

CPU

used_ram_limit

Description

Attempt to limit the amount of used CPU RAM.

Alert

This option affects only the CTR calculation memory usage.
In some cases it is impossible to limit the amount of CPU RAM used in accordance with the specified value.

Format:

<size><measure of information>

Supported measures of information (case-insensitive):

MB
KB
GB

For example:

2gb
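
A value in this format can be converted to a byte count as sketched below. The parse_ram_limit helper is hypothetical, and two points are assumptions of this example rather than documented facts: that the units are KB, MB, and GB, and that they are binary multiples (1 KB = 1024 bytes).

```python
import re

# Assumed units and multipliers; binary multiples are an assumption.
_UNITS = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_ram_limit(text):
    """Convert '<size><measure of information>' to bytes (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)\s*([a-zA-Z]+)", text.strip())
    if not match:
        raise ValueError(f"cannot parse RAM limit: {text!r}")
    size, unit = match.groups()
    factor = _UNITS.get(unit.lower())  # units are matched case-insensitively
    if factor is None:
        raise ValueError(f"unsupported unit: {unit!r}")
    return int(size) * factor


print(parse_ram_limit("2gb"))    # 2147483648
print(parse_ram_limit("512MB"))  # 536870912
```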

Possible types

int

Default value

None (memory usage is not limited)

random_seed

Description

The random seed used for training.

Possible types

int

Default value

None (0)

Type of return value

catboost.Pool (a quantized pool)

Usage examples

The following is the input file with the dataset description:

The pool is created as follows:

from catboost.utils import quantize

quantized_pool = quantize(data_path="pool__utils__quantize_data")
print(type(quantized_pool))

The output of this example:

<class 'catboost.core.Pool'>
