quantize
Quantize the given pool.
Method call format
quantize(ignored_features=None,
per_float_feature_quantization=None,
border_count=None,
max_bin=None,
feature_border_type=None,
dev_efb_max_buckets=None,
nan_mode=None,
input_borders=None,
simple_ctr=None,
combinations_ctr=None,
per_feature_ctr=None,
ctr_target_border_count=None,
task_type=None,
used_ram_limit=None)
Parameters
ignored_features
Description
Feature indices or names to exclude from training. If at least one of the passed values cannot be converted to a number or a range of numbers, all passed values are treated as feature names. Otherwise, all passed values are treated as feature indices.
Specifics:
- Non-negative indices that do not match any features are successfully ignored. For example, if five features are defined for the objects in the dataset and this parameter is set to 42, the corresponding non-existing feature is successfully ignored.
- The identifier corresponds to the feature's index. Feature indices used in train and feature importance are numbered from 0 to featureCount – 1. If a file is used as input data, any non-feature column types are ignored when calculating these indices. For example, each row in the input file contains data in the following order: cat feature<\t>label value<\t>num feature. So for the row rock<\t>0<\t>42, the identifier for the rock feature is 0, and for the 42 feature it's 1.
- The addition of a non-existing feature name raises an error.
For example, use the following construction if the features indexed 1, 2, 7, 42, 43, 44, and 45 should be ignored:
[1,2,7,42,43,44,45]
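The index rule for file input can be illustrated with a small sketch. This is not CatBoost code: the function name and the column type labels ("Categ", "Label", "Num") are hypothetical stand-ins chosen for the example, which only mirrors the rule that non-feature columns are skipped when feature indices are assigned.

```python
def feature_indices(column_types):
    """Map column position -> feature index, skipping non-feature columns.

    The type labels used here are illustrative assumptions,
    not values read from a real dataset.
    """
    mapping = {}
    next_index = 0
    for position, kind in enumerate(column_types):
        if kind in ("Categ", "Num"):  # only feature columns receive an index
            mapping[position] = next_index
            next_index += 1
    return mapping


# Row layout from the text: cat feature <\t> label value <\t> num feature
columns = ["Categ", "Label", "Num"]
mapping = feature_indices(columns)
```

Here the categorical column (position 0) gets feature index 0 and the numerical column (position 2) gets feature index 1, which matches the rock / 42 example above.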
Possible types
list
Default value
None
Supported processing units
CPU and GPU
per_float_feature_quantization
Description
The quantization settings for the given list of features (one or more).
Examples:
- per_float_feature_quantization='0:border_count=1024'
  In this example, the feature indexed 0 has 1024 borders.
- per_float_feature_quantization=['0:1024']
  In this example, the feature indexed 0 has 1024 borders.
  The following example is equivalent to the one given above:
  per_float_feature_quantization=['0:border_count=1024']
- per_float_feature_quantization=['0:border_count=1024','1:border_count=1024']
  In this example, features indexed 0 and 1 have 1024 borders.
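When many features need their own border count, the list of strings can be generated rather than typed by hand. The helper below is a hypothetical convenience function (it is not part of CatBoost) that produces strings in the 'index:border_count=N' form shown in the examples above.

```python
def build_per_float_quantization(border_counts):
    """Build 'index:border_count=N' strings from a {feature_index: count} dict.

    Hypothetical helper; only the output string format comes from the examples.
    """
    return [
        f"{index}:border_count={count}"
        for index, count in sorted(border_counts.items())
    ]


settings = build_per_float_quantization({0: 1024, 1: 1024})
```

The resulting list could then be passed as the per_float_feature_quantization argument of quantize.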
Possible types
list of strings
Default value
None
Supported processing units
CPU and GPU
border_count
Alias: max_bin
Description
The number of splits for numerical features. Allowed values are integers from 1 to 65535 inclusively.
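As a rough illustration of what the borders do, the sketch below maps numerical values to bucket indices given a sorted list of borders. It is a simplified model, assuming that a value's bucket is the number of borders strictly below it; the border values are made up, and CatBoost's internal handling may differ.

```python
import bisect


def bucket_index(value, borders):
    """Return the number of borders strictly below `value`.

    With N borders there are N + 1 possible buckets (0..N).
    """
    return bisect.bisect_left(borders, value)


borders = [0.5, 1.5, 2.5]  # three borders -> four buckets (illustrative values)
buckets = [bucket_index(v, borders) for v in [0.0, 1.0, 2.0, 3.0]]
```

Under this model, a larger border_count gives a finer partition of each numerical feature at the cost of more buckets to store and evaluate.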
Possible types
int
Default value
The default value depends on the processing unit type and other parameters:
- CPU: 254
- GPU in PairLogitPairwise and YetiRankPairwise modes: 32
- GPU in all other modes: 128
Supported processing units
CPU and GPU
feature_border_type
Description
The quantization mode for numerical features.
Possible values:
- Median
- Uniform
- UniformAndQuantiles
- MaxLogSum
- MinEntropy
- GreedyLogSum
Possible types
string
Default value
GreedyLogSum
Supported processing units
CPU and GPU
dev_efb_max_buckets
Description
The maximum bucket count in an exclusive features bundle.
Possible values are in the range [0, 65536].
Possible types
int
Default value
None
Supported processing units
CPU
nan_mode
Description
The method for processing missing values in the input dataset.
Possible values:
- "Forbidden": Missing values are not supported; their presence is interpreted as an error.
- "Min": Missing values are processed as the minimum value (less than all other values) for the feature.
- "Max": Missing values are processed as the maximum value (greater than all other values) for the feature.
Using the Min or Max value of this parameter guarantees that a split between missing values and other values is considered when selecting a new split in the tree.
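The three modes can be pictured with a minimal sketch. It is not CatBoost's implementation: negative and positive infinity are used here only as stand-ins for "less than all other values" and "greater than all other values", and the function name is hypothetical.

```python
import math


def apply_nan_mode(values, nan_mode="Min"):
    """Replace NaNs according to the mode, or raise if they are forbidden."""
    if not any(math.isnan(v) for v in values):
        return list(values)  # nothing to do when no value is missing
    if nan_mode == "Forbidden":
        raise ValueError("missing values are not supported in Forbidden mode")
    if nan_mode == "Min":
        fill = -math.inf  # stand-in for a value below every real value
    elif nan_mode == "Max":
        fill = math.inf  # stand-in for a value above every real value
    else:
        raise ValueError(f"unknown nan_mode: {nan_mode!r}")
    return [fill if math.isnan(v) else v for v in values]


filled_min = apply_nan_mode([1.0, float("nan"), 3.0], "Min")
filled_max = apply_nan_mode([1.0, float("nan"), 3.0], "Max")
```

Because the missing values end up strictly on one side of every real value, a single border can separate them from the rest, which is the guarantee described above.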
Possible types
string
Default value
Min
Supported processing units
CPU and GPU
input_borders
Description
Load custom quantization borders and missing value modes from a file (do not generate them).
Possible types
string
Default value
The file is not loaded; the values are generated.
Supported processing units
CPU and GPU
simple_ctr
Description
Quantization settings for simple categorical features. Use this parameter to specify the principles for defining the class of the object for regression tasks. By default, an object is considered to belong to the positive class if its label value is greater than the median of all label values in the dataset.
Format:
['CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N]',
'CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N]',
...]
Components:
- CtrType: The method for transforming categorical features to numerical features.
  Supported methods for training on CPU:
  - Borders
  - Buckets
  - BinarizedTargetMeanValue
  - Counter
  Supported methods for training on GPU:
  - Borders
  - Buckets
  - FeatureFreq
  - FloatTargetMeanValue
- TargetBorderCount: The number of borders for label value quantization. Only used for regression problems. Allowed values are integers from 1 to 255 inclusive. The default value is 1. This option is available for training on CPU only.
- TargetBorderType: The quantization type for the label value. Only used for regression problems.
  Possible values:
  - Median
  - Uniform
  - UniformAndQuantiles
  - MaxLogSum
  - MinEntropy
  - GreedyLogSum
  By default, MinEntropy.
  This option is available for training on CPU only.
- CtrBorderCount: The number of splits for categorical features. Allowed values are integers from 1 to 255 inclusive.
- CtrBorderType: The quantization type for categorical features.
  Supported values for training on CPU:
  - Uniform
  Supported values for training on GPU:
  - Median
  - Uniform
  - UniformAndQuantiles
  - MaxLogSum
  - MinEntropy
  - GreedyLogSum
- Prior: Use the specified priors during training (several values can be specified).
  Possible formats:
  - One number: Adds the value to the numerator.
  - Two slash-delimited numbers (for GPU only): Use this format to set a fraction. The first number is added to the numerator and the second is added to the denominator.
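The components above are joined with colons into a single description string. The following sketch assembles such a string from keyword arguments; the function and its argument names are hypothetical, and only the component names and their order are taken from the format shown earlier.

```python
def ctr_description(ctr_type, target_border_count=None, target_border_type=None,
                    ctr_border_count=None, ctr_border_type=None, priors=()):
    """Assemble a 'CtrType[:Key=Value]...' string (hypothetical helper)."""
    parts = [ctr_type]
    optional = [
        ("TargetBorderCount", target_border_count),
        ("TargetBorderType", target_border_type),
        ("CtrBorderCount", ctr_border_count),
        ("CtrBorderType", ctr_border_type),
    ]
    for key, value in optional:
        if value is not None:  # omitted components are left out of the string
            parts.append(f"{key}={value}")
    for prior in priors:  # several priors may be specified
        parts.append(f"Prior={prior}")
    return ":".join(parts)


simple = ctr_description("Borders", target_border_count=2)
with_priors = ctr_description("Counter", ctr_border_count=15, priors=["0.5", "1"])
```

Each returned string is one element of the list passed as simple_ctr. The helper does not validate that a component is supported on the chosen processing unit.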
Examples
- simple_ctr='Borders:TargetBorderCount=2'
  Two new features with differing quantization settings are generated. The first one concludes that an object belongs to the positive class when the label value exceeds the first border. The second one concludes that an object belongs to the positive class when the label value exceeds the second border.
  For example, if the label takes three different values (0, 1, 2), the first border is 0.5 while the second one is 1.5.
- simple_ctr='Buckets:TargetBorderCount=2'
  The number of features depends on the number of different labels. For example, three new features are generated if the label takes three different values (0, 1, 2). In this case, the first one concludes that an object belongs to the positive class when the value of the feature is equal to 0 or belongs to the bucket indexed 0. The second one concludes that an object belongs to the positive class when the value of the feature is equal to 1 or belongs to the bucket indexed 1, and so on.
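The Borders example can be checked by hand with a short sketch. It only reproduces the "label value exceeds the border" rule for the labels 0, 1, 2 and the borders 0.5 and 1.5 given above; it is an illustration, not how CatBoost computes the resulting features.

```python
def binarize_target(labels, borders):
    """For each border, mark a label as positive (1) if it exceeds that border."""
    return [[int(label > border) for label in labels] for border in borders]


labels = [0, 1, 2]
borders = [0.5, 1.5]
positive_flags = binarize_target(labels, borders)
# One row per border: [[0, 1, 1], [0, 0, 1]]
```

With the first border, labels 1 and 2 count as positive; with the second border, only label 2 does. Each row corresponds to one of the two generated features.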
Possible types
string
Supported processing units
CPU and GPU
combinations_ctr
Description
Quantization settings for combinations of categorical features.
Format:
['CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N]',
'CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N]',
...]
Components:
- CtrType: The method for transforming categorical features to numerical features.
  Supported methods for training on CPU:
  - Borders
  - Buckets
  - BinarizedTargetMeanValue
  - Counter
  Supported methods for training on GPU:
  - Borders
  - Buckets
  - FeatureFreq
  - FloatTargetMeanValue
- TargetBorderCount: The number of borders for label value quantization. Only used for regression problems. Allowed values are integers from 1 to 255 inclusive. The default value is 1. This option is available for training on CPU only.
- TargetBorderType: The quantization type for the label value. Only used for regression problems.
  Possible values:
  - Median
  - Uniform
  - UniformAndQuantiles
  - MaxLogSum
  - MinEntropy
  - GreedyLogSum
  By default, MinEntropy.
  This option is available for training on CPU only.
- CtrBorderCount: The number of splits for categorical features. Allowed values are integers from 1 to 255 inclusive.
- CtrBorderType: The quantization type for categorical features.
  Supported values for training on CPU:
  - Uniform
  Supported values for training on GPU:
  - Uniform
  - Median
- Prior: Use the specified priors during training (several values can be specified).
  Possible formats:
  - One number: Adds the value to the numerator.
  - Two slash-delimited numbers (for GPU only): Use this format to set a fraction. The first number is added to the numerator and the second is added to the denominator.
Possible types
string
Supported processing units
CPU and GPU
per_feature_ctr
Description
Per-feature quantization settings for categorical features.
Format:
['FeatureId:CtrType:[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N]',
'FeatureId:CtrType:[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N]',
...]
Components:
- FeatureId: A zero-based feature identifier.
Possible types
string
Supported processing units
CPU and GPU
ctr_target_border_count
Description
The maximum number of borders to use in target quantization for categorical features that need it. Allowed values are integers from 1 to 255 inclusively.
The value of the TargetBorderCount component overrides this parameter if it is specified for one of the following parameters:
- simple_ctr
- combinations_ctr
- per_feature_ctr
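The override rule can be sketched as follows. The function is a hypothetical illustration that inspects a single CTR description string; it assumes the colon-separated 'Key=Value' layout shown for simple_ctr and does not reflect CatBoost's actual option parsing.

```python
def effective_target_border_count(ctr_description, ctr_target_border_count):
    """Return TargetBorderCount from the description if present,
    otherwise fall back to the ctr_target_border_count parameter."""
    for part in ctr_description.split(":")[1:]:  # skip the leading CtrType
        key, _, value = part.partition("=")
        if key == "TargetBorderCount":
            return int(value)
    return ctr_target_border_count


overridden = effective_target_border_count("Borders:TargetBorderCount=2", 5)
fallback = effective_target_border_count("Borders:CtrBorderCount=15", 5)
```

In the first call the description's own TargetBorderCount (2) wins; in the second, the description does not set it, so the parameter value (5) applies.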
Possible types
int
Default value
Number_of_classes - 1 for Multiclassification problems when training on CPU, 1 otherwise
Supported processing units
CPU and GPU
task_type
Description
The processing unit type to use for training.
Possible values:
- CPU
- GPU
Possible types
string
Default value
CPU
Supported processing units
CPU and GPU
used_ram_limit
Description
Attempt to limit the amount of used CPU RAM.
Note:
- This option affects only the CTR calculation memory usage.
- In some cases it is impossible to limit the amount of CPU RAM used in accordance with the specified value.
Format:
<size><measure of information>
Supported measures of information (case-insensitive):
- MB
- KB
- GB
For example:
2gb
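To make the format concrete, the sketch below parses a value such as 2gb into a byte count. It is purely illustrative: the function name is hypothetical, and the use of binary multiples (1 KB = 1024 bytes) is an assumption that the text above does not confirm.

```python
import re

# Assumption: binary multiples; the documentation does not state the base.
_UNIT_FACTORS = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_ram_limit(text):
    """Parse '<size><measure of information>' (case-insensitive) into bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*(kb|mb|gb)\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized RAM limit: {text!r}")
    size, unit = match.groups()
    return int(size) * _UNIT_FACTORS[unit.lower()]


limit_bytes = parse_ram_limit("2gb")
```

Both 2gb and 2GB parse to the same value here, reflecting the case-insensitive measures listed above.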
Possible types
int
Default value
None (memory usage is not limited)
Supported processing units
CPU