Overview

These parameters are for the Python package, R package and Command-line version.

For the Python package, several parameters have aliases. For example, the iterations parameter has the following synonyms: num_boost_round, n_estimators, num_trees. Using different names of the same parameter simultaneously raises an error.
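The alias rule can be illustrated with a minimal sketch. The helper `find_alias_conflicts` is hypothetical (it is not part of CatBoost) and covers only some of the alias groups listed on this page; it simply reports parameters that were passed under more than one name.

```python
# Alias groups copied from this page; the helper is illustrative only
# and is not a CatBoost function.
ALIAS_GROUPS = [
    {"iterations", "num_boost_round", "n_estimators", "num_trees"},
    {"learning_rate", "eta"},
    {"random_seed", "random_state"},
    {"depth", "max_depth"},
]


def find_alias_conflicts(params):
    """Return a list of name groups that refer to the same parameter."""
    conflicts = []
    for group in ALIAS_GROUPS:
        used = sorted(group.intersection(params))
        if len(used) > 1:
            conflicts.append(used)
    return conflicts


ok_params = {"iterations": 500, "eta": 0.1}
bad_params = {"iterations": 500, "n_estimators": 300}

print(find_alias_conflicts(ok_params))
print(find_alias_conflicts(bad_params))
```

A parameter dictionary like `bad_params` is the kind of input that the package rejects, because iterations and n_estimators name the same setting.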

Training or inference on CUDA-enabled GPUs requires NVIDIA driver version 450.80.02 or higher.

Common parameters

loss_function

Command-line: --loss-function

Alias: objective

The metric to use in training. The specified value also determines the machine learning problem to solve. Some metrics support optional parameters (see the Objectives and metrics section for details on each metric).

custom_metric

Command-line: --custom-metric

Metric values to output during training. These functions are not optimized and are displayed for informational purposes only. Some metrics support optional parameters (see the Objectives and metrics section for details on each metric).

eval_metric

Command-line: --eval-metric

The metric used for overfitting detection (if enabled) and best model selection (if enabled). Some metrics support optional parameters (see the Objectives and metrics section for details on each metric).

iterations

Command-line: -i, --iterations

Aliases: num_boost_round, n_estimators, num_trees

The maximum number of trees that can be built when solving machine learning problems.

learning_rate

Command-line: -w, --learning-rate

Alias: eta

The learning rate.

Used for reducing the gradient step.

random_seed

Command-line: -r, --random-seed

Alias: random_state

The random seed used for training.

l2_leaf_reg

Command-line: --l2-leaf-reg, --l2-leaf-regularizer

Alias: reg_lambda

Coefficient at the L2 regularization term of the cost function.

bootstrap_type

Command-line: --bootstrap-type

Bootstrap type. Defines the method for sampling the weights of objects.

bagging_temperature

Command-line: --bagging-temperature

Defines the settings of the Bayesian bootstrap. It is used by default in classification and regression modes.

subsample

Command-line: --subsample

Sample rate for bagging.

sampling_frequency

Command-line: --sampling-frequency

Frequency to sample weights and objects when building trees.

sampling_unit

Command-line: --sampling-unit

The sampling scheme.

mvs_reg

Command-line: --mvs-reg

Affects the weight of the denominator and can be used to balance between importance sampling and Bernoulli sampling (setting it to 0 implies importance sampling, and setting it to ∞ implies Bernoulli sampling).

random_strength

Command-line: --random-strength

The amount of randomness to use for scoring splits when the tree structure is selected. Use this parameter to avoid overfitting the model.

use_best_model

Command-line: --use-best-model

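If this parameter is set, the number of trees saved in the resulting model is chosen using the validation dataset: the model keeps the trees up to the iteration with the optimal value of the evaluation metric.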

best_model_min_trees

Command-line: --best-model-min-trees

The minimum number of trees that the best model should have. If set, the output model contains at least the given number of trees even if the optimal value of the evaluation metric on the validation dataset is achieved with a smaller number of trees.
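The interaction between use_best_model and best_model_min_trees can be sketched as follows. This is a simplified illustration, not CatBoost's implementation: the function `select_tree_count` is hypothetical, and it assumes that only tree counts of at least best_model_min_trees are considered when looking for the optimal metric value.

```python
def select_tree_count(metric_values, min_trees=1, lower_is_better=True):
    """Pick how many trees to keep from per-iteration validation metrics.

    metric_values[i] is the validation metric after i + 1 trees.
    Tree counts below min_trees are skipped (assumed behaviour).
    """
    # Clamp the starting point so at least one candidate remains.
    start = min(max(min_trees, 1), len(metric_values)) - 1
    candidates = range(start, len(metric_values))
    if lower_is_better:
        best = min(candidates, key=lambda i: metric_values[i])
    else:
        best = max(candidates, key=lambda i: metric_values[i])
    return best + 1


validation_loss = [0.9, 0.5, 0.6, 0.55, 0.7]

print(select_tree_count(validation_loss))               # best model overall
print(select_tree_count(validation_loss, min_trees=3))  # at least 3 trees
```

With these values, the unconstrained choice keeps 2 trees, while requiring at least 3 trees moves the choice to 4 trees.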

depth

Command-line: -n, --depth

Alias: max_depth

Depth of the trees.

grow_policy

Command-line: --grow-policy

The tree growing policy. Defines how to perform greedy tree construction.

min_data_in_leaf

Command-line: --min-data-in-leaf

Alias: min_child_samples

The minimum number of training samples in a leaf. CatBoost does not search for new splits in leaves with a sample count less than the specified value.

max_leaves

Command-line: --max-leaves

Alias: num_leaves

The maximum number of leaves in the resulting tree. Can be used only with the Lossguide growing policy.

ignored_features

Command-line: -I, --ignore-features

Feature indices or names to exclude from the training. It is assumed that all passed values are feature names if at least one of the passed values cannot be converted to a number or a range of numbers. Otherwise, it is assumed that all passed values are feature indices.

Specifics:

  • Non-negative indices that do not match any features are successfully ignored. For example, if five features are defined for the objects in the dataset and this parameter is set to 42, the corresponding non-existing feature is successfully ignored.

  • The identifier corresponds to the feature's index. Feature indices used in train and feature importance are numbered from 0 to featureCount – 1. If a file is used as input data then any non-feature column types are ignored when calculating these indices. For example, each row in the input file contains data in the following order: cat feature<\t>label value<\t>num feature. So for the row rock<\t>0<\t>42, the identifier for the rock feature is 0, and for the 42 feature it's 1.

  • The addition of a non-existing feature name raises an error.
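The index rule described in the list above can be sketched in a few lines. The column type names (`Categ`, `Num`, `Label`) and the helper `feature_indices` are assumptions made for this illustration; the point is only that non-feature columns are skipped when feature indices are assigned.

```python
# Column types treated as features in this sketch (assumed names).
FEATURE_TYPES = {"Num", "Categ"}


def feature_indices(column_types):
    """Map column position -> feature index, skipping non-feature columns."""
    mapping = {}
    next_index = 0
    for position, column_type in enumerate(column_types):
        if column_type in FEATURE_TYPES:
            mapping[position] = next_index
            next_index += 1
    return mapping


# Row layout from the example: cat feature, label value, num feature.
columns = ["Categ", "Label", "Num"]
row = "rock\t0\t42".split("\t")

indices = feature_indices(columns)
for position, index in indices.items():
    print(f"feature index {index}: {row[position]}")
```

The label column takes no index, so rock gets index 0 and 42 gets index 1, as in the example row.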

one_hot_max_size

Command-line: --one-hot-max-size

Use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value. Ctrs are not calculated for such features.
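As a small illustration of the threshold rule, the sketch below lists which categorical columns would be one-hot encoded for a given one_hot_max_size. The helper `one_hot_features` and the sample data are hypothetical; the comparison (number of distinct values less than or equal to the parameter value) follows the description above.

```python
def one_hot_features(columns, one_hot_max_size):
    """Return names of categorical columns with few enough distinct values."""
    return sorted(
        name
        for name, values in columns.items()
        if len(set(values)) <= one_hot_max_size
    )


data = {
    "color": ["red", "blue", "red", "green"],  # 3 distinct values
    "city": ["a", "b", "c", "d"],              # 4 distinct values
    "flag": ["y", "n", "y", "y"],              # 2 distinct values
}

print(one_hot_features(data, one_hot_max_size=3))
```

With a threshold of 3, color and flag are one-hot encoded, while city is left to the other categorical feature processing.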

has_time

Command-line: --has-time

Use the order of objects in the input data (do not perform random permutations during the Transforming categorical features to numerical features and Choosing the tree structure stages).

rsm

Command-line: --rsm

Alias: colsample_bylevel

Random subspace method. The percentage of features to use at each split selection, when features are selected over again at random.

nan_mode

Command-line: --nan-mode

The method for processing missing values in the input dataset.

input_borders

Command-line: --input-borders-file

Load custom quantization borders and missing value modes from a file (do not generate them).

output_borders

Command-line: --output-borders-file

Save quantization borders for the current dataset to a file.

fold_permutation_block

Command-line: --fold-permutation-block

Objects in the dataset are grouped in blocks before the random permutations. This parameter defines the size of the blocks.
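A rough sketch of a block-wise permutation is shown below. It assumes that consecutive objects form blocks of the given size and that whole blocks are shuffled as units; the function `block_permutation` is hypothetical and does not reproduce CatBoost's internal permutation logic.

```python
import random


def block_permutation(n_objects, block_size, seed=0):
    """Shuffle consecutive blocks of object indices, keeping each block intact."""
    rng = random.Random(seed)
    blocks = [
        list(range(start, min(start + block_size, n_objects)))
        for start in range(0, n_objects, block_size)
    ]
    rng.shuffle(blocks)  # blocks move; order inside a block is preserved
    return [index for block in blocks for index in block]


order = block_permutation(n_objects=10, block_size=3)
print(order)
```

Larger blocks mean fewer units to shuffle, so neighbouring objects stay together more often.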

leaf_estimation_method

Command-line: --leaf-estimation-method

The method used to calculate the values in leaves.

leaf_estimation_iterations

Command-line: --leaf-estimation-iterations

This parameter regulates how many steps are done in every tree when calculating leaf values.

leaf_estimation_backtracking

Command-line: --leaf-estimation-backtracking

When the value of the leaf_estimation_iterations parameter is greater than 1, CatBoost makes several gradient or Newton steps when calculating the resulting leaf values of a tree.

fold_len_multiplier

Command-line: --fold-len-multiplier

Coefficient for changing the length of folds.

approx_on_full_history

Command-line: --approx-on-full-history

The principles for calculating the approximated values.

class_weights

Command-line: --class-weights

Class weights. The values are used as multipliers for the object weights. This parameter can be used for solving binary classification and multiclassification problems.

class_names

Class names. Allows redefining the default values when using the MultiClass and Logloss metrics.

auto_class_weights

Command-line: --auto-class-weights

Automatically calculate class weights based either on the total weight or the total number of objects in each class. The values are used as multipliers for the object weights.

Supported values:

  • None — All class weights are set to 1

  • Balanced:

    $CW_k = \displaystyle\frac{\max_{c=1}^{K}\left(\sum_{t_i=c} w_i\right)}{\sum_{t_i=k} w_i}$

  • SqrtBalanced:

    $CW_k = \sqrt{\displaystyle\frac{\max_{c=1}^{K}\left(\sum_{t_i=c} w_i\right)}{\sum_{t_i=k} w_i}}$
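The Balanced and SqrtBalanced formulas can be checked on a small example. The sketch below is a plain-Python illustration, not CatBoost code: the function `auto_class_weights` is hypothetical, `labels` plays the role of the class labels t_i, and `weights` plays the role of the object weights w_i (all equal to 1 when omitted).

```python
import math
from collections import defaultdict


def auto_class_weights(labels, weights=None, mode="Balanced"):
    """Compute per-class weights from the total object weight of each class."""
    if weights is None:
        weights = [1.0] * len(labels)
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight  # sum of w_i over objects with t_i == label
    largest = max(totals.values())  # numerator: the largest class total
    if mode == "None":
        return {label: 1.0 for label in totals}
    if mode == "Balanced":
        return {label: largest / total for label, total in totals.items()}
    if mode == "SqrtBalanced":
        return {label: math.sqrt(largest / total) for label, total in totals.items()}
    raise ValueError(f"unsupported mode: {mode}")


labels = [0, 0, 0, 0, 1]  # class 0 is four times larger than class 1

print(auto_class_weights(labels, mode="Balanced"))
print(auto_class_weights(labels, mode="SqrtBalanced"))
```

The largest class always gets weight 1; here the minority class gets weight 4 with Balanced and weight 2 with SqrtBalanced.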

scale_pos_weight

The weight for class 1 in binary classification. The value is used as a multiplier for the weights of objects from class 1.

boosting_type

Command-line: --boosting-type

Boosting scheme.

boost_from_average

Command-line: --boost-from-average

Initialize approximate values by best constant value for the specified loss function.

langevin

Command-line: --langevin

Enables the Stochastic Gradient Langevin Boosting mode.

diffusion_temperature

Command-line: --diffusion-temperature

The diffusion temperature of the Stochastic Gradient Langevin Boosting mode.

posterior_sampling

Command-line: --posterior-sampling

If this parameter is set, several related options are set automatically and the model parameters are checked, so that uncertainty predictions with good theoretical properties can be obtained.

allow_const_label

Command-line: --allow-const-label

Use it to train models with datasets that have equal label values for all objects.

score_function

Command-line: --score-function

The score type used to select the next split during the tree construction.

monotone_constraints

Command-line: --monotone-constraints

Impose monotonic constraints on numerical features.

Possible values:

  • 1 — Increasing constraint on the feature. The algorithm forces the model to be a non-decreasing function of this feature.

  • -1 — Decreasing constraint on the feature. The algorithm forces the model to be a non-increasing function of this feature.

  • 0 — Constraints are disabled.
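What each constraint value means for model output can be expressed as a simple check. The helper `satisfies_constraint` is hypothetical and assumes that predictions are compared for distinct values of one feature while all other features are held fixed.

```python
def satisfies_constraint(feature_values, predictions, constraint):
    """Check predictions against a monotone constraint of 1, -1 or 0."""
    # Order predictions by the feature value they were computed for.
    ordered = [p for _, p in sorted(zip(feature_values, predictions))]
    steps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if constraint == 1:   # non-decreasing in the feature
        return all(step >= 0 for step in steps)
    if constraint == -1:  # non-increasing in the feature
        return all(step <= 0 for step in steps)
    if constraint == 0:   # no constraint
        return True
    raise ValueError("constraint must be 1, -1 or 0")


x = [1, 2, 3, 4]
preds = [0.1, 0.1, 0.4, 0.9]

print(satisfies_constraint(x, preds, 1))
print(satisfies_constraint(x, preds, -1))
```

Equal consecutive predictions are allowed under both 1 and -1, because the constraints are non-strict.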

feature_weights

Command-line: --feature-weights

Per-feature multiplication weights used when choosing the best split. The score of each candidate is multiplied by the weights of features from the current split.

Non-negative float values are supported for each weight.

Supported formats for setting the value of this parameter:

first_feature_use_penalties

Command-line: --first-feature-use-penalties

Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model.

Refer to the Per-object and per-feature penalties section for details on applying different score penalties.

Non-negative float values are supported for each penalty.

fixed_binary_splits

Command-line: --fixed-binary-splits

A list of indices of binary features to put at the top of each tree.

penalties_coefficient

Command-line: --penalties-coefficient

A single-value common coefficient to multiply all penalties.

per_object_feature_penalties

Command-line: --per-object-feature-penalties

Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time.

Refer to the Per-object and per-feature penalties section for details on applying different score penalties.

Non-negative float values are supported for each penalty.
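The way feature weights and penalties act on a split score can be sketched as below. This composition is an assumption based only on the parameter descriptions in this section (multiply by the feature weight, subtract the penalties scaled by penalties_coefficient); the function `adjusted_score` and its argument names are hypothetical, and CatBoost's exact internal formula may differ.

```python
def adjusted_score(
    score,
    feature_weight=1.0,
    first_use_penalty=0.0,
    feature_already_used=False,
    per_object_penalty=0.0,
    n_first_use_objects=0,
    penalties_coefficient=1.0,
):
    """Apply a feature weight and penalties to a candidate split score."""
    penalty = 0.0
    if not feature_already_used:
        # Charged once, when the feature first appears in the model.
        penalty += first_use_penalty
    # Charged per object that uses the feature for the first time.
    penalty += per_object_penalty * n_first_use_objects
    return score * feature_weight - penalties_coefficient * penalty


new_feature = adjusted_score(
    10.0,
    feature_weight=0.5,
    first_use_penalty=1.0,
    per_object_penalty=0.01,
    n_first_use_objects=100,
    penalties_coefficient=2.0,
)
print(new_feature)
```

Setting penalties_coefficient to 0 in this sketch leaves only the feature weight in effect.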

model_shrink_rate

Command-line: --model-shrink-rate

The constant used to calculate the coefficient for multiplying the model on each iteration.

model_shrink_mode

Command-line: --model-shrink-mode

Determines how the actual model shrinkage coefficient is calculated at each iteration.

CTR settings

simple_ctr

Quantization settings for simple categorical features. Use this parameter to specify the principles for defining the class of the object for regression tasks. By default, an object is considered to belong to the positive class if its label value is greater than the median of all label values of the dataset.
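The default rule for regression labels can be written out directly. The function name `binarize_by_median` is illustrative; the rule itself (positive class if the label is strictly greater than the dataset median) is the one stated above.

```python
from statistics import median


def binarize_by_median(labels):
    """Assign class 1 to labels strictly greater than the median, else 0."""
    border = median(labels)
    return [1 if value > border else 0 for value in labels]


labels = [3.0, 1.0, 4.0, 1.5, 9.0]  # median is 3.0

print(binarize_by_median(labels))
```

The object whose label equals the median falls into the negative class, because the comparison is strict.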

combinations_ctr

Quantization settings for combinations of categorical features.

per_feature_ctr

Per-feature quantization settings for categorical features.

ctr_target_border_count

The maximum number of borders to use in target quantization for categorical features that need it. Allowed values are integers from 1 to 255 inclusively.

counter_calc_method

The method for calculating the Counter CTR type.

max_ctr_complexity

The maximum number of features that can be combined.

ctr_leaf_count_limit

The maximum number of leaves with categorical features. If the quantity exceeds the specified value, some of the leaves are discarded.

store_all_simple_ctr

Ignore categorical features, which are not used in feature combinations, when choosing candidates for exclusion.

final_ctr_computation_mode

Final CTR computation mode.

Input file settings

-f, --learn-set

The path to the input file that contains the dataset description.

-t, --test-set

A comma-separated list of input files that contain the validation dataset description (the format must be the same as used in the training dataset).

--cd, --column-description

The path to the input file that contains the columns description.

--learn-pairs

The path to the input file that contains the pairs description for the training dataset.

--test-pairs

The path to the input file that contains the pairs description for the validation dataset.

--learn-group-weights

The path to the input file that contains the weights of groups. Refer to the Group weights section for format details.

--test-group-weights

The path to the input file that contains the weights of groups for the validation dataset.

--learn-baseline

The path to the input file that contains baseline values for the training dataset.

--test-baseline

The path to the input file that contains baseline values for the validation dataset.

--delimiter

The delimiter character used to separate the data in the dataset description input file.

--has-header

Read the column names from the first line of the dataset description file if this parameter is set.

--params-file

The path to the input JSON file that contains the training parameters, for example:
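A parameters file of this kind can be produced with a few lines of Python. This is a sketch under the assumption that the JSON keys are the parameter names listed on this page; the values and the file name `params.json` are arbitrary examples.

```python
import json
import os
import tempfile

# Example values only; keys are parameter names from this page.
params = {
    "iterations": 500,
    "learning_rate": 0.1,
    "depth": 6,
}

# Write the file into a temporary directory to keep the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "params.json")
with open(path, "w", encoding="utf-8") as handle:
    json.dump(params, handle, indent=2)

# Read it back to confirm the file holds valid JSON.
with open(path, encoding="utf-8") as handle:
    loaded = json.load(handle)

print(loaded)
```

The resulting path is what would be passed on the command line as the value of this option.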

--nan-mode

The method for processing missing values in the input dataset.

Multiclassification settings

classes_count

Command-line: --classes-count

The upper limit for the numeric class label. Defines the number of classes for multiclassification.

--class-names

This parameter is only for Command-line.

Class names. Allows redefining the default values when using the MultiClass and Logloss metrics.

Output settings

logging_level

Command-line: --logging-level

The logging level to output to stdout.

metric_period

Command-line: --metric-period

The frequency of iterations to calculate the values of objectives and metrics.

Using this parameter speeds up training, because metrics are calculated less often.
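The effect of the period can be pictured by listing the iterations at which metrics would be calculated. The sketch assumes that metrics are calculated on every iteration divisible by the period and on the final iteration; the helper `metric_iterations` is hypothetical and the exact schedule used by CatBoost is not stated in this section.

```python
def metric_iterations(iterations, metric_period):
    """List zero-based iterations at which metrics are calculated (assumed rule)."""
    return [
        i
        for i in range(iterations)
        if i % metric_period == 0 or i == iterations - 1
    ]


print(metric_iterations(iterations=10, metric_period=4))
print(len(metric_iterations(iterations=1000, metric_period=50)))
```

Under this assumption, a period of 50 over 1000 iterations reduces the number of metric calculations from 1000 to 21.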

verbose

Command-line: --verbose

Alias: verbose_eval

The purpose of this parameter depends on the type of the given value:

train_dir

Command-line: --train-dir

The directory for storing the files generated during training.

model_size_reg

Command-line: --model-size-reg

The model size regularization coefficient. The larger the value, the smaller the model size. Refer to the Model size regularization coefficient section for details.

This regularization is needed only for models with categorical features (other models are small).

allow_writing_files

Allow writing analytical and snapshot files during training.

save_snapshot

Enable snapshotting for restoring the training progress after an interruption.

snapshot_file

The name of the file to save the training progress information in. This file is used for recovering training after an interruption.

snapshot_interval

The interval between saving snapshots in seconds.

roc_file

The name of the output file to save the ROC curve points to.

Overfitting detection settings

early_stopping_rounds

Sets the overfitting detector type to Iter and stops the training after the specified number of iterations since the iteration with the optimal metric value.
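The Iter stopping rule can be simulated on a list of validation metric values. This is a simplified sketch, not CatBoost's detector: the function `iter_detector` is hypothetical, it assumes a metric where lower is better by default, and it stops once the given number of iterations has passed since the best one.

```python
def iter_detector(metric_values, early_stopping_rounds, lower_is_better=True):
    """Return (best_iteration, last_trained_iteration), both zero-based."""
    best_iteration = 0
    for iteration, value in enumerate(metric_values):
        best_value = metric_values[best_iteration]
        improved = value < best_value if lower_is_better else value > best_value
        if improved:
            best_iteration = iteration
        elif iteration - best_iteration >= early_stopping_rounds:
            # No improvement for the allowed number of iterations: stop here.
            return best_iteration, iteration
    return best_iteration, len(metric_values) - 1


history = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5]

print(iter_detector(history, early_stopping_rounds=3))
```

In this run, training stops at iteration 5, three iterations after the best value at iteration 2, so the later improvement to 0.5 is never reached. A larger value trades longer training for a smaller chance of stopping too early.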

od_type

Command-line: --od-type

The type of the overfitting detector to use.

od_pval

Command-line: --od-pval

The threshold for the IncToDec overfitting detector type.

od_wait

Command-line: --od-wait

The number of iterations to continue the training after the iteration with the optimal metric value.

Performance settings

thread_count

Command-line: -T, --thread-count

The number of threads to use during the training.

used_ram_limit

Command-line: --used-ram-limit

Attempt to limit the amount of used CPU RAM.

gpu_ram_part

Command-line: --gpu-ram-part

How much of the GPU RAM to use for training.

pinned_memory_size

Command-line: --pinned-memory-size

How much pinned (page-locked) CPU RAM to use per GPU.

gpu_cat_features_storage

Command-line: --gpu-cat-features-storage

The method for storing the categorical features' values.

data_partition

Command-line: --data-partition

The method for splitting the input dataset between multiple workers.

Processing unit settings

task_type

Command-line: --task-type

The processing unit type to use for training.

devices

Command-line: --devices

IDs of the GPU devices to use for training (indices are zero-based).

Quantization settings

target_border

Command-line: --target-border

If set, defines the border for converting target values to 0 and 1.

border_count

Command-line: -x, --border-count

Alias: max_bin

The number of splits for numerical features. Allowed values are integers from 1 to 65535 inclusively.
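To show what the splits do, the sketch below maps numerical values to bucket indices for a given sorted list of borders. The borders here are arbitrary, the helper `quantize` is hypothetical, and the convention that a value moves to the next bucket only when it is strictly greater than a border is an assumption of this example.

```python
from bisect import bisect_left


def quantize(values, borders):
    """Map each value to the number of borders it is strictly greater than."""
    return [bisect_left(borders, value) for value in values]


borders = [1.5, 3.0, 7.0]  # 3 borders give 4 buckets
values = [0.2, 1.5, 2.0, 3.5, 10.0]

print(quantize(values, borders))
```

A larger border count gives finer buckets at the cost of more candidate splits to evaluate.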

feature_border_type

Command-line: --feature-border-type

The quantization mode for numerical features.

per_float_feature_quantization

Command-line: --per-float-feature-quantization

The quantization description for the specified feature or list of features.

Text processing parameters

These parameters are only for the Python package and Command-line version.

tokenizers

Command-line: --tokenizers

Tokenizers used to preprocess Text type feature columns before creating the dictionary.

dictionaries

Command-line: --dictionaries

Dictionaries used to preprocess Text type feature columns.

Format:

feature_calcers

Command-line: --feature-calcers

Feature calcers used to calculate new features based on preprocessed Text type feature columns.

Format:

text_processing

Command-line: --text-processing

A JSON specification of tokenizers, dictionaries and feature calcers, which determine how text features are converted into a list of float features.

Example

Refer to the description of the following parameters for details on supported values:

Visualization settings

These parameters are only for the Python package.

name

The experiment name to display in visualization tools.