quantize
Create a pool from a file and quantize it while the data is being loaded. This reduces the size of the initial dataset and makes it possible to load huge datasets that cannot otherwise fit into RAM.
The input data should contain only numerical features (other types are not currently supported).
This method gives a result identical to running the following code, but consumes less RAM:
pool = Pool(filename, **some_pool_load_params)
pool.quantize(**some_quantization_params)
return pool
Method call format
quantize(data_path,
column_description=None,
pairs=None,
graph=None,
delimiter='\t',
has_header=False,
feature_names=None,
thread_count=-1,
ignored_features=None,
per_float_feature_quantization=None,
border_count=None,
max_bin=None,
feature_border_type=None,
nan_mode=None,
input_borders=None,
task_type=None,
used_ram_limit=None,
random_seed=None)
Parameters
data_path
Description
The path to the input file that contains the dataset description.
Format:
[scheme://]<path>
-
scheme
(optional) defines the type of the input dataset. Possible values:
- quantized:// — catboost.Pool quantized pool.
- libsvm:// — dataset in the extended libsvm format.
If omitted, a dataset in the Native CatBoost Delimiter-separated values format is expected.
-
path
defines the path to the dataset description.
Possible types
string
Default value
Required parameter
column_description
Description
The path to the input file that contains the columns description.
Possible types
string
Default value
None
pairs
Description
The path to the input file that contains the pairs description.
This information is used for calculation and optimization of Pairwise metrics.
Possible types
string
Default value
None
graph
Description
The path to the input file that contains the graph information for the dataset.
Graph information is used to calculate the graph aggregated features.
Possible types
string
Default value
None
delimiter
Description
The delimiter character used to separate the data in the dataset description input file.
Only single-character delimiters are supported. If the specified value contains more than one character, only the first one is used.
Note
Used only if the dataset is given in the Delimiter-separated values format.
Possible types
string
Default value
\t (the input data is assumed to be tab-separated)
has_header
Description
Read the column names from the first line of the dataset description file if this parameter is set.
Note
Used only if the dataset is given in the Delimiter-separated values format.
Possible types
bool
Default value
False
feature_names
Description
A list of names for each feature in the dataset.
Possible types
list
Default value
None
thread_count
Description
The number of threads to use.
Optimizes the speed of execution. This parameter doesn't affect results.
Possible types
int
Default values
-1 (the number of threads is equal to the number of processor cores)
ignored_features
Description
Feature indices or names to exclude from the training. It is assumed that all passed values are feature names if at least one of the passed values can not be converted to a number or a range of numbers. Otherwise, it is assumed that all passed values are feature indices.
Specifics:
- Non-negative indices that do not match any features are successfully ignored. For example, if five features are defined for the objects in the dataset and this parameter is set to 42, the corresponding non-existing feature is successfully ignored.
- The identifier corresponds to the feature's index. Feature indices used in train and feature importance are numbered from 0 to featureCount – 1. If a file is used as input data, then any non-feature column types are ignored when calculating these indices. For example, suppose each row in the input file contains data in the following order: cat feature<\t>label value<\t>num feature. Then for the row rock<\t>0<\t>42, the identifier for the rock feature is 0, and for the 42 feature it is 1.
- The addition of a non-existing feature name raises an error.
For example, use the following construction if features indexed 1, 2, 7, 42, 43, 44, and 45 should be ignored:
[1,2,7,42,43,44,45]
Possible types
list
Default value
None
per_float_feature_quantization
Description
The quantization description for the specified feature or list of features.
Description format for a single feature:
FeatureId[:border_count=BorderCount][:nan_mode=BorderType][:border_type=border_selection_method]
Example:
per_float_feature_quantization=['0:border_count=1024', '1:border_count=1024']
In this example, features indexed 0 and 1 have 1024 borders.
Possible types
list of strings
Default value
None
border_count
Alias:max_bin
Description
The number of splits for numerical features. Allowed values are integers from 1 to 65535 inclusive.
Possible types
int
Default value
The default value depends on the processing unit type:
- CPU: 254
- GPU: 128
feature_border_type
Description
The quantization mode for numerical features.
Possible values:
- Median
- Uniform
- UniformAndQuantiles
- MaxLogSum
- MinEntropy
- GreedyLogSum
Possible types
string
Default value
GreedyLogSum
nan_mode
Description
The method for processing missing values in the input dataset.
Possible values:
- "Forbidden" — Missing values are not supported, their presence is interpreted as an error.
- "Min" — Missing values are processed as the minimum value (less than all other values) for the feature. It is guaranteed that a split that separates missing values from all other values is considered when selecting trees.
- "Max" — Missing values are processed as the maximum value (greater than all other values) for the feature. It is guaranteed that a split that separates missing values from all other values is considered when selecting trees.
Using the Min or Max value of this parameter guarantees that a split between missing values and other values is considered when selecting a new split in the tree.
Note
The method for processing missing values can be set individually for each feature in the Custom quantization borders and missing value modes input file. Such values override the ones specified in this parameter.
Possible types
string
Default value
Min
input_borders
Description
Load Custom quantization borders and missing value modes from a file (do not generate them).
Borders are automatically generated before training if this parameter is not set.
Possible types
string
Default value
None
task_type
Description
The processing unit type to use for training.
Possible values:
- CPU
- GPU
Possible types
string
Default value
CPU
used_ram_limit
Description
Attempt to limit the amount of used CPU RAM.
Alert
- This option affects only the CTR calculation memory usage.
- In some cases it is impossible to limit the amount of CPU RAM used in accordance with the specified value.
Format:
<size><measure of information>
Supported measures of information (case-insensitive):
- MB
- KB
- GB
For example:
2gb
Possible types
int
Default value
None (memory usage is not limited)
random_seed
Description
The random seed used for training.
Possible types
int
Default value
None (0)
Type of return value
catboost.Pool (a quantized pool)
Usage examples
The following is the input file with the dataset description:
4 52 64 73
3 87 32 54
9 34 35 45
8 9 83 32
The pool is created as follows:
from catboost.utils import quantize
quantized_pool = quantize(data_path="pool__utils__quantize_data")
print(type(quantized_pool))
The output of this example:
<class 'catboost.core.Pool'>