quantize

Quantize the given pool.

Method call format

quantize(ignored_features=None,
         per_float_feature_quantization=None,
         border_count=None,
         max_bin=None,
         feature_border_type=None,
         dev_efb_max_buckets=None,
         nan_mode=None,
         input_borders=None,
         simple_ctr=None,
         combinations_ctr=None,
         per_feature_ctr=None,
         ctr_target_border_count=None,
         task_type=None,
         used_ram_limit=None)
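
A minimal usage sketch, assuming that the catboost package is installed and that the method is called on a Pool object built from in-memory data. The toy dataset and the parameter values are illustrative only. The call is skipped if the package cannot be imported.

```python
# Keyword arguments taken from the signature above; the values are illustrative.
quantize_params = {
    "border_count": 32,
    "feature_border_type": "Uniform",
    "nan_mode": "Min",
}

try:
    from catboost import Pool
except ImportError:
    Pool = None  # catboost is not installed, so the call below is skipped

if Pool is not None:
    # Toy dataset: four objects with two numerical features each.
    pool = Pool(
        data=[[0.1, 3.0], [0.4, 1.0], [0.7, 2.0], [0.9, 5.0]],
        label=[0, 1, 0, 1],
    )
    pool.quantize(**quantize_params)
    print(pool.is_quantized())
```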

Parameters

ignored_features

Description

Feature indices or names to exclude from training. If at least one of the passed values cannot be converted to a number or a range of numbers, all passed values are treated as feature names. Otherwise, all passed values are treated as feature indices.

Specifics:

  • Non-negative indices that do not match any features are successfully ignored. For example, if five features are defined for the objects in the dataset and this parameter is set to 42, the corresponding non-existing feature is successfully ignored.

  • The identifier corresponds to the feature's index. Feature indices used in training and in feature importance are numbered from 0 to featureCount – 1. If a file is used as input data, any non-feature column types are ignored when calculating these indices. For example, suppose each row in the input file contains data in the following order: cat feature<\t>label value<\t>num feature. Then for the row rock<\t>0<\t>42, the identifier for the rock feature is 0, and for the 42 feature it is 1.

  • The addition of a non-existing feature name raises an error.

For example, use the following construction to ignore the features indexed 1, 2, 7, 42, 43, 44, and 45:

[1,2,7,42,43,44,45]
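
The index rule for file input described in the specifics above can be sketched in plain Python. The helper name feature_indices and the column type labels are hypothetical and serve only to illustrate how non-feature columns are skipped when indices are assigned.

```python
def feature_indices(column_types, feature_types=("Categ", "Num")):
    """Map column positions to feature indices, skipping non-feature columns."""
    mapping = {}
    next_index = 0
    for position, column_type in enumerate(column_types):
        if column_type in feature_types:
            mapping[position] = next_index
            next_index += 1
    return mapping


# Column layout from the example: cat feature, label value, num feature.
layout = ["Categ", "Label", "Num"]
print(feature_indices(layout))
```

With this layout, the categorical column at position 0 gets index 0 and the numerical column at position 2 gets index 1, so the label column does not consume an index.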

Possible types

list

Default value

None

Supported processing units

CPU and GPU

per_float_feature_quantization

Description

The quantization settings for the given list of features (one or more).

Examples:

  •   per_float_feature_quantization='0:border_count=1024'

    In this example, the feature indexed 0 has 1024 borders.

  •   per_float_feature_quantization=['0:1024']

    In this example, the feature indexed 0 has 1024 borders.

    The following example is equivalent to the one given above:

    per_float_feature_quantization=['0:border_count=1024']

  •   per_float_feature_quantization=['0:border_count=1024','1:border_count=1024']

    In this example, features indexed 0 and 1 have 1024 borders.
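
The description strings shown in the examples above can be generated rather than typed by hand. The following is a small sketch; the helper name float_quantization is hypothetical and only covers the border_count setting.

```python
def float_quantization(feature_index, border_count):
    """Build one 'index:border_count=N' description string."""
    return f"{feature_index}:border_count={border_count}"


# The same number of borders for the features indexed 0 and 1.
settings = [float_quantization(index, 1024) for index in (0, 1)]
print(settings)
```

The resulting list can then be passed as the per_float_feature_quantization argument.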

Possible types

list of strings

Default value

None

Supported processing units

CPU and GPU

border_count

Alias: max_bin

Description

The number of splits for numerical features. Allowed values are integers from 1 to 65535 inclusively.

Possible types

int

Default value

The default value depends on the processing unit type and other parameters:

  • CPU: 254
  • GPU in PairLogitPairwise and YetiRankPairwise modes: 32
  • GPU in all other modes: 128
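
The default selection listed above can be written as a small lookup. This is a sketch only: the function name is hypothetical, and it assumes that the mode is identified by the name of the loss function.

```python
# GPU modes that use the smaller default, as listed above.
PAIRWISE_GPU_MODES = {"PairLogitPairwise", "YetiRankPairwise"}


def default_border_count(task_type="CPU", loss_function=None):
    """Return the documented default for border_count."""
    if task_type == "CPU":
        return 254
    if loss_function in PAIRWISE_GPU_MODES:
        return 32
    return 128


print(default_border_count("GPU", "YetiRankPairwise"))
```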

Supported processing units

CPU and GPU

feature_border_type

Description

The quantization mode for numerical features.

Possible values:

  • Median
  • Uniform
  • UniformAndQuantiles
  • MaxLogSum
  • MinEntropy
  • GreedyLogSum

Possible types

string

Default value

GreedyLogSum
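
To make the idea of borders concrete, the sketch below quantizes a numerical feature in plain Python. It is not CatBoost's implementation. It assumes that the Uniform mode places borders at equal distances between the minimum and maximum values, and that a value's bucket is the number of borders lying strictly below it. Both function names are hypothetical.

```python
from bisect import bisect_left


def uniform_borders(values, border_count):
    """Place border_count evenly spaced borders between min and max."""
    lowest, highest = min(values), max(values)
    step = (highest - lowest) / (border_count + 1)
    return [lowest + step * (i + 1) for i in range(border_count)]


def bucket(value, borders):
    """Return the number of borders that are strictly below the value."""
    return bisect_left(borders, value)


values = [0.0, 1.5, 2.5, 4.0]
borders = uniform_borders(values, border_count=3)
buckets = [bucket(value, borders) for value in values]
print(borders, buckets)
```

The other modes differ only in how the borders are chosen; the mapping from a value to a bucket stays the same in this sketch.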

Supported processing units

CPU and GPU

dev_efb_max_buckets

Description

The maximum bucket count in an exclusive features bundle.

Possible values are in the range [0, 65536].

Possible types

int

Default value

None

Supported processing units

CPU

nan_mode

Description

The method for processing missing values in the input dataset.

Possible values:

  • "Forbidden" — Missing values are not supported, their presence is interpreted as an error.
  • "Min" — Missing values are processed as the minimum value (less than all other values) for the feature. It is guaranteed that a split that separates missing values from all other values is considered when selecting trees.
  • "Max" — Missing values are processed as the maximum value (greater than all other values) for the feature. It is guaranteed that a split that separates missing values from all other values is considered when selecting trees.

Using the Min or Max value of this parameter guarantees that a split between missing values and other values is considered when selecting a new split in the tree.
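
The three modes can be illustrated with a short sketch that places missing values below or above all other values. It is a simplified model of the behavior described above, not the library's code, and the function name place_missing is hypothetical.

```python
import math


def place_missing(values, nan_mode="Min"):
    """Replace NaN values according to the chosen nan_mode."""
    if nan_mode not in ("Forbidden", "Min", "Max"):
        raise ValueError(f"unknown nan_mode: {nan_mode!r}")
    placed = []
    for value in values:
        if not math.isnan(value):
            placed.append(value)
        elif nan_mode == "Forbidden":
            raise ValueError("missing value found while nan_mode is 'Forbidden'")
        elif nan_mode == "Min":
            placed.append(-math.inf)  # less than all other values
        else:
            placed.append(math.inf)  # greater than all other values
    return placed


print(place_missing([1.0, float("nan"), 3.0], "Min"))
```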

Possible types

string

Default value

Min

Supported processing units

CPU and GPU

input_borders

Description

Load custom quantization borders and missing value modes from a file (do not generate them).

Possible types

string

Default value

The file is not loaded, the values are generated

Supported processing units

CPU and GPU

simple_ctr

Description

Quantization settings for simple categorical features. Use this parameter to specify the principles for defining the class of the object for regression tasks. By default, an object is considered to belong to the positive class if its label value is greater than the median of all label values in the dataset.

Format:

['CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N]',
 'CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N]',
  ...]

Components:

  • CtrType — The method for transforming categorical features to numerical features.

    Supported methods for training on CPU:

    • Borders
    • Buckets
    • BinarizedTargetMeanValue
    • Counter

    Supported methods for training on GPU:

    • Borders
    • Buckets
    • FeatureFreq
    • FloatTargetMeanValue
  • TargetBorderCount — The number of borders for label value quantization. Only used for regression problems. Allowed values are integers from 1 to 255 inclusively. The default value is 1.

    This option is available for training on CPU only.

  • TargetBorderType — The quantization type for the label value. Only used for regression problems.

    Possible values:

    • Median
    • Uniform
    • UniformAndQuantiles
    • MaxLogSum
    • MinEntropy
    • GreedyLogSum

    By default, MinEntropy.

    This option is available for training on CPU only.

  • CtrBorderCount — The number of splits for categorical features. Allowed values are integers from 1 to 255 inclusively.

  • CtrBorderType — The quantization type for categorical features.

    Supported values for training on CPU:

    • Uniform

    Supported values for training on GPU:

    • Median
    • Uniform
    • UniformAndQuantiles
    • MaxLogSum
    • MinEntropy
    • GreedyLogSum
  • Prior — Use the specified priors during training (several values can be specified).

    Possible formats:

    • One number — Adds the value to the numerator.
    • Two slash-delimited numbers (for GPU only) — Use this format to set a fraction. The first number is added to the numerator and the second is added to the denominator.

Examples

  •   simple_ctr='Borders:TargetBorderCount=2'

    Two new features with differing quantization settings are generated. The first one concludes that an object belongs to the positive class when the label value exceeds the first border. The second one concludes that an object belongs to the positive class when the label value exceeds the second border.

    For example, if the label takes three different values (0, 1, 2), the first border is 0.5 while the second one is 1.5.

  •   simple_ctr='Buckets:TargetBorderCount=2'

    The number of features depends on the number of different labels. For example, three new features are generated if the label takes three different values (0, 1, 2). In this case, the first one concludes that an object belongs to the positive class when the value of the feature is equal to 0 or belongs to the bucket indexed 0. The second one concludes that an object belongs to the positive class when the value of the feature is equal to 1 or belongs to the bucket indexed 1, and so on.
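
The Borders example above can be checked with a few lines of Python. The sketch assumes the two borders from that example (0.5 and 1.5) and uses a hypothetical helper that returns one positive-class flag per border.

```python
def positive_class_flags(label, borders):
    """One flag per border: True when the label value exceeds the border."""
    return [label > border for border in borders]


borders = [0.5, 1.5]  # borders for the label values 0, 1, 2
for label in (0, 1, 2):
    print(label, positive_class_flags(label, borders))
```

Each position in the returned list corresponds to one of the two generated features.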

Possible types

string

Supported processing units

CPU and GPU

combinations_ctr

Description

Quantization settings for combinations of categorical features.

Format:

['CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N]',
 'CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N]',
  ...]

Components:

  • CtrType — The method for transforming categorical features to numerical features.

    Supported methods for training on CPU:

    • Borders
    • Buckets
    • BinarizedTargetMeanValue
    • Counter

    Supported methods for training on GPU:

    • Borders
    • Buckets
    • FeatureFreq
    • FloatTargetMeanValue
  • TargetBorderCount — The number of borders for label value quantization. Only used for regression problems. Allowed values are integers from 1 to 255 inclusively. The default value is 1.

    This option is available for training on CPU only.

  • TargetBorderType — The quantization type for the label value. Only used for regression problems.

    Possible values:

    • Median
    • Uniform
    • UniformAndQuantiles
    • MaxLogSum
    • MinEntropy
    • GreedyLogSum

    By default, MinEntropy.

    This option is available for training on CPU only.

  • CtrBorderCount — The number of splits for categorical features. Allowed values are integers from 1 to 255 inclusively.

  • CtrBorderType — The quantization type for categorical features.

    Supported values for training on CPU:

    • Uniform

    Supported values for training on GPU:

    • Uniform
    • Median
  • Prior — Use the specified priors during training (several values can be specified).

    Possible formats:

    • One number — Adds the value to the numerator.
    • Two slash-delimited numbers (for GPU only) — Use this format to set a fraction. The first number is added to the numerator and the second is added to the denominator.
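
A description string in the format above can be assembled from its components. The following sketch uses a hypothetical helper, ctr_description, whose keyword names mirror the components. It only joins the parts and does not validate them against the CPU or GPU restrictions.

```python
def ctr_description(ctr_type, target_border_count=None, target_border_type=None,
                    ctr_border_count=None, ctr_border_type=None, priors=()):
    """Join the given components into a single CTR description string."""
    parts = [ctr_type]
    if target_border_count is not None:
        parts.append(f"TargetBorderCount={target_border_count}")
    if target_border_type is not None:
        parts.append(f"TargetBorderType={target_border_type}")
    if ctr_border_count is not None:
        parts.append(f"CtrBorderCount={ctr_border_count}")
    if ctr_border_type is not None:
        parts.append(f"CtrBorderType={ctr_border_type}")
    # Several priors can be specified; each becomes its own component.
    parts.extend(f"Prior={prior}" for prior in priors)
    return ":".join(parts)


print(ctr_description("Borders", target_border_count=2))
print(ctr_description("Buckets", ctr_border_type="Uniform", priors=("0.5",)))
```

The same strings can be used for simple_ctr, since both parameters share this format.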

Possible types

string

Supported processing units

CPU and GPU

per_feature_ctr

Description

Per-feature quantization settings for categorical features.

Format:

['FeatureId:CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N]',
 'FeatureId:CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N]',
  ...]

Components:

  • FeatureId — A zero-based feature identifier.

Possible types

string

Supported processing units

CPU and GPU

ctr_target_border_count

Description

The maximum number of borders to use in target quantization for categorical features that need it. Allowed values are integers from 1 to 255 inclusively.

The value of the TargetBorderCount component overrides this parameter if it is specified for one of the following parameters:

  • simple_ctr
  • combinations_ctr
  • per_feature_ctr
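
The override rule above can be sketched as follows. The helper name is hypothetical; it reads the TargetBorderCount component from a CTR description string and falls back to ctr_target_border_count when the component is absent.

```python
def effective_target_border_count(ctr_description, ctr_target_border_count=1):
    """Return TargetBorderCount from the description, else the fallback."""
    # Skip the first component, then look for name=value pairs.
    for component in ctr_description.split(":")[1:]:
        name, _, value = component.partition("=")
        if name == "TargetBorderCount":
            return int(value)
    return ctr_target_border_count


print(effective_target_border_count("Borders:TargetBorderCount=2", 5))
print(effective_target_border_count("Borders", 5))
```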

Possible types

int

Default value

Number_of_classes - 1 for Multiclassification problems when training on CPU, 1 otherwise

Supported processing units

CPU and GPU

task_type

Description

The processing unit type to use for training.

Possible values:

  • CPU
  • GPU

Possible types

string

Default value

CPU

Supported processing units

CPU and GPU

used_ram_limit

Description

Attempt to limit the amount of used CPU RAM.

Alert

  • This option affects only the CTR calculation memory usage.
  • In some cases it is impossible to limit the amount of CPU RAM used in accordance with the specified value.

Format:

<size><measure of information>

Supported measures of information (case-insensitive):

  • MB
  • KB
  • GB

For example:

2gb
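
The format above can be parsed with a short helper. This is a sketch under two assumptions that the text does not state: the size is a whole number, and the measures are binary multiples (1 KB = 1024 bytes). The function name parse_ram_limit is hypothetical.

```python
import re

# Assumed binary multiples; the text does not say which convention applies.
UNITS = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_ram_limit(text):
    """Convert a '<size><measure>' string such as '2gb' to bytes."""
    match = re.fullmatch(r"(\d+)\s*(kb|mb|gb)", text.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unsupported size: {text!r}")
    size, unit = match.groups()
    return int(size) * UNITS[unit.lower()]


print(parse_ram_limit("2gb"))
```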

Possible types

int

Default value

None (memory usage is not limited)

Supported processing units

CPU