quantize

Create a pool from a file and quantize it while loading the data. This reduces the size of the initial dataset and makes it possible to load huge datasets that cannot otherwise fit in RAM.

The input data should contain only numerical features (other types are not currently supported).

This method gives the same result as the following code but consumes less RAM:

pool = Pool(filename, **some_pool_load_params)
pool.quantize(**some_quantization_params)
return pool

Method call format

quantize(data_path,
         column_description=None,
         pairs=None,
         delimiter='\t',
         has_header=False,
         feature_names=None,
         thread_count=-1,
         ignored_features=None,
         per_float_feature_quantization=None,
         border_count=None,
         max_bin=None,
         feature_border_type=None,
         nan_mode=None,
         input_borders=None,
         task_type=None,
         used_ram_limit=None,
         random_seed=None)

Parameters

data_path

Description

The path to the input file that contains the dataset description.

Format:

[scheme://]<path>

Possible types

string

Default value

Obligatory parameter

column_description

Description

The path to the input file that contains the columns description.

Possible types

string

Default value

None

pairs

Description

The path to the input file that contains the pairs description.

This information is used for calculation and optimization of Pairwise metrics.

Possible types

string

Default value

None

delimiter

Description

The delimiter character used to separate the data in the dataset description input file.

Only single-character delimiters are supported. If the specified value contains more than one character, only the first one is used.

Note

Used only if the dataset is given in the Delimiter-separated values format.

Possible types

string

Default value

'\t' (the input data is assumed to be tab-separated)

has_header

Description

Read the column names from the first line of the dataset description file if this parameter is set.

Note

Used only if the dataset is given in the Delimiter-separated values format.

Possible types

bool

Default value

False

feature_names

Description

A list of names for each feature in the dataset.

Possible types

list

Default value

None

thread_count

Description

The number of threads to use.

Optimizes the speed of execution. This parameter doesn't affect results.

Possible types

int

Default value

-1 (the number of threads is equal to the number of processor cores)

ignored_features

Description

Feature indices or names to exclude from training. If at least one of the passed values cannot be converted to a number or a range of numbers, all passed values are assumed to be feature names. Otherwise, all passed values are assumed to be feature indices.

Specifics:

  • Non-negative indices that do not match any features are successfully ignored. For example, if five features are defined for the objects in the dataset and this parameter is set to 42, the corresponding non-existing feature is successfully ignored.

  • The identifier corresponds to the feature's index. Feature indices used in training and in feature importance are numbered from 0 to featureCount – 1. If a file is used as input data, any non-feature column types are ignored when calculating these indices. For example, suppose each row in the input file contains data in the following order: cat feature<\t>label value<\t>num feature. Then for the row rock<\t>0<\t>42, the identifier for the rock feature is 0, and for the 42 feature it is 1.

  • The addition of a non-existing feature name raises an error.

For example, use the following construction if features indexed 1, 2, 7, 42, 43, 44, 45, should be ignored:

[1,2,7,42,43,44,45]
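The index rule described above (non-feature columns are skipped when numbering) can be sketched as follows. The helper `feature_indices` and the `FEATURE_TYPES` set are hypothetical names introduced only for illustration; they are not part of the CatBoost API.

```python
# Hypothetical helper (not part of CatBoost): number the feature columns of a
# file while skipping non-feature columns such as the label.
FEATURE_TYPES = {"Num", "Categ"}  # assumed set of column types that are features


def feature_indices(column_types):
    """Map each feature column's position in the file to its feature index."""
    indices = {}
    next_index = 0
    for position, column_type in enumerate(column_types):
        if column_type in FEATURE_TYPES:
            indices[position] = next_index
            next_index += 1
    return indices


# Row layout from the text: cat feature, label value, num feature.
layout = ["Categ", "Label", "Num"]
print(feature_indices(layout))  # {0: 0, 2: 1}
```

The label column at position 1 receives no index, so the numerical feature in the third column gets index 1, matching the rock/0/42 example.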

Possible types

list

Default value

None

per_float_feature_quantization

Description

The quantization description for the specified feature or list of features.

Description format for a single feature:

FeatureId[:border_count=BorderCount][:nan_mode=NanMode][:border_type=BorderType]

Example:

per_float_feature_quantization=['0:border_count=1024', '1:border_count=1024']

In this example, features indexed 0 and 1 have 1024 borders.
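When many features need individual settings, the description strings can be generated rather than typed by hand. The following is a minimal sketch; `describe_quantization` is a hypothetical helper, not a CatBoost function, and it only assembles strings in the colon-separated format shown above.

```python
# Hypothetical helper (not part of CatBoost): build per-feature quantization
# description strings in the colon-separated FeatureId[:key=value] format.
def describe_quantization(settings):
    """settings maps a feature id to a dict of option name -> value."""
    descriptions = []
    for feature_id, options in settings.items():
        parts = [str(feature_id)]
        parts.extend(f"{name}={value}" for name, value in options.items())
        descriptions.append(":".join(parts))
    return descriptions


per_float_feature_quantization = describe_quantization({
    0: {"border_count": 1024},
    1: {"border_count": 1024},
})
print(per_float_feature_quantization)
# ['0:border_count=1024', '1:border_count=1024']
```

The resulting list has the same shape as the value in the example and could be passed as the per_float_feature_quantization argument.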

Possible types

list of strings

Default value

None

border_count

Alias: max_bin

Description

The number of splits for numerical features. Allowed values are integers from 1 to 65535 inclusive.
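To make the relation between splits and bins concrete: a feature with N borders is mapped to N + 1 discrete bins. The sketch below is illustrative only. The borders are made up, `bin_index` is a hypothetical helper, and placing a value that equals a border in the lower bin is an assumption rather than a statement about CatBoost internals.

```python
from bisect import bisect_left


def bin_index(value, borders):
    """Return the bin of value: the number of borders strictly below it.

    borders must be sorted in ascending order.
    """
    return bisect_left(borders, value)


borders = [10.0, 20.0, 30.0]  # made-up borders, so the border count is 3
bin_count = len(borders) + 1
print(bin_count)  # 4
print([bin_index(v, borders) for v in (5.0, 15.0, 25.0, 35.0)])  # [0, 1, 2, 3]
```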

Possible types

int

Default value

The default value depends on the processing unit type:

  • CPU: 254
  • GPU: 128

feature_border_type

Description

The quantization mode for numerical features.

Possible values:

  • Median
  • Uniform
  • UniformAndQuantiles
  • MaxLogSum
  • MinEntropy
  • GreedyLogSum

Possible types

string

Default value

GreedyLogSum

nan_mode

Description

The method for processing missing values in the input dataset.

Possible values:

  • "Forbidden" — Missing values are not supported, their presence is interpreted as an error.
  • "Min" — Missing values are processed as the minimum value (less than all other values) for the feature. It is guaranteed that a split that separates missing values from all other values is considered when selecting trees.
  • "Max" — Missing values are processed as the maximum value (greater than all other values) for the feature. It is guaranteed that a split that separates missing values from all other values is considered when selecting trees.

Using the Min or Max value of this parameter guarantees that a split between missing values and other values is considered when selecting a new split in the tree.
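The three modes can be pictured with a small sketch. It is not how CatBoost handles missing values internally; `resolve_missing` is a hypothetical helper, and infinities are used here only as stand-ins for "less than all other values" and "greater than all other values".

```python
import math


def resolve_missing(values, nan_mode="Min"):
    """Replace NaN entries according to the given mode (illustration only)."""
    if not any(math.isnan(v) for v in values):
        return list(values)
    if nan_mode == "Forbidden":
        raise ValueError("missing values are not supported in Forbidden mode")
    if nan_mode == "Min":
        fill = -math.inf  # stand-in for "less than all other values"
    elif nan_mode == "Max":
        fill = math.inf  # stand-in for "greater than all other values"
    else:
        raise ValueError(f"unknown nan_mode: {nan_mode!r}")
    return [fill if math.isnan(v) else v for v in values]


print(resolve_missing([4.0, float("nan"), 9.0], "Min"))  # [4.0, -inf, 9.0]
```

Because the substituted value lies strictly outside the range of the observed values, a border always exists that separates the missing values from everything else.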

Note

The method for processing missing values can be set individually for each feature in the Custom quantization borders and missing value modes input file. Such values override the ones specified in this parameter.

Possible types

string

Default value

Min

input_borders

Description

Load Custom quantization borders and missing value modes from a file (do not generate them).

Borders are automatically generated before training if this parameter is not set.

Possible types

string

Default value

None

task_type

Description

The processing unit type to use for training.

Possible values:

  • CPU
  • GPU

Possible types

string

Default value

CPU

used_ram_limit

Description

Attempt to limit the amount of used CPU RAM.

Alert

  • This option affects only the CTR calculation memory usage.
  • In some cases it is impossible to limit the amount of CPU RAM used in accordance with the specified value.

Format:

<size><measure of information>

Supported measures of information (case-insensitive):

  • MB
  • KB
  • GB

For example:

2gb
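The format above can be read as a number followed directly by a unit. The following sketch converts such a string to bytes. `parse_ram_limit` is a hypothetical helper, the 1024-based multipliers are an assumption, and CatBoost's own parser may accept or reject inputs differently.

```python
import re

# Assumed binary multipliers; the documentation does not state the base.
UNITS = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_ram_limit(text):
    """Convert a '<size><measure>' string such as '2gb' to a byte count."""
    match = re.fullmatch(r"(\d+)(kb|mb|gb)", text.strip().lower())
    if match is None:
        raise ValueError(f"unsupported RAM limit format: {text!r}")
    size, unit = match.groups()
    return int(size) * UNITS[unit]


print(parse_ram_limit("2gb"))    # 2147483648
print(parse_ram_limit("512MB"))  # 536870912
```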

Possible types

int

Default value

None (memory usage is not limited)

random_seed

Description

The random seed used for training.

Possible types

int

Default value

None (0)

Type of return value

catboost.Pool (a quantized pool)

Usage examples

The following is the input file with the dataset description:

4	52	64	73
3	87	32	54
9	34	35	45
8	9	83	32

The pool is created as follows:

from catboost.utils import quantize

quantized_pool = quantize(data_path="pool__utils__quantize_data")
print(type(quantized_pool))

The output of this example:

<class 'catboost.core.Pool'>
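A slightly fuller sketch is given below. It writes the same four rows to a temporary file and wraps the call with a few explicit quantization settings. The helper names `write_dataset` and `quantize_file` are hypothetical, and the values chosen for border_count and feature_border_type are arbitrary. The catboost import is placed inside the function so that the file-writing part runs even where the package is not installed.

```python
import os
import tempfile

ROWS = [
    (4, 52, 64, 73),
    (3, 87, 32, 54),
    (9, 34, 35, 45),
    (8, 9, 83, 32),
]


def write_dataset(path, rows, delimiter="\t"):
    """Write rows as delimiter-separated values, one object per line."""
    with open(path, "w") as dataset:
        for row in rows:
            dataset.write(delimiter.join(str(value) for value in row) + "\n")


def quantize_file(path):
    """Quantize the file while loading it; requires the catboost package."""
    from catboost.utils import quantize  # imported lazily on purpose

    return quantize(
        data_path=path,
        delimiter="\t",
        border_count=32,               # arbitrary illustrative value
        feature_border_type="Median",  # arbitrary illustrative value
        nan_mode="Min",
    )


data_path = os.path.join(tempfile.mkdtemp(), "pool__utils__quantize_data")
write_dataset(data_path, ROWS)
# With catboost installed: quantized_pool = quantize_file(data_path)
```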