Overview
- Common parameters
- loss_function
- custom_metric
- eval_metric
- iterations
- learning_rate
- random_seed
- l2_leaf_reg
- bootstrap_type
- bagging_temperature
- subsample
- sampling_frequency
- sampling_unit
- mvs_reg
- random_strength
- use_best_model
- best_model_min_trees
- depth
- grow_policy
- min_data_in_leaf
- max_leaves
- ignored_features
- one_hot_max_size
- has_time
- rsm
- nan_mode
- input_borders
- output_borders
- fold_permutation_block
- leaf_estimation_method
- leaf_estimation_iterations
- leaf_estimation_backtracking
- fold_len_multiplier
- approx_on_full_history
- class_weights
- class_names
- auto_class_weights
- scale_pos_weight
- boosting_type
- boost_from_average
- langevin
- diffusion_temperature
- posterior_sampling
- allow_const_label
- score_function
- monotone_constraints
- feature_weights
- first_feature_use_penalties
- fixed_binary_splits
- penalties_coefficient
- per_object_feature_penalties
- model_shrink_rate
- model_shrink_mode
- CTR settings
- Input file settings
- Multiclassification settings
- Output settings
- Overfitting detection settings
- Performance settings
- Processing unit settings
- Quantization settings
- Text processing parameters
- Visualization settings
These parameters are for the Python package, R package and Command-line version.
For the Python package several parameters have aliases. For example, the --iterations parameter has the following synonyms: num_boost_round, n_estimators, num_trees. Simultaneous usage of different names of one parameter raises an error.
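The alias rule can be illustrated with a minimal sketch. The helper resolve_aliases and the ALIASES table are hypothetical names written only for this example; they are not part of CatBoost, and the table lists only aliases named on this page.

```python
# Hypothetical alias table built from aliases named on this page
# (an illustration, not CatBoost internals).
ALIASES = {
    "num_boost_round": "iterations",
    "n_estimators": "iterations",
    "num_trees": "iterations",
    "eta": "learning_rate",
    "random_state": "random_seed",
    "reg_lambda": "l2_leaf_reg",
    "max_depth": "depth",
}


def resolve_aliases(params):
    """Map every alias to its canonical name; reject two names for one parameter."""
    resolved = {}
    for name, value in params.items():
        canonical = ALIASES.get(name, name)
        if canonical in resolved:
            raise ValueError(f"parameter '{canonical}' is specified more than once")
        resolved[canonical] = value
    return resolved


print(resolve_aliases({"n_estimators": 500, "eta": 0.05}))
```

Passing both iterations and num_trees to this helper raises ValueError, which mirrors the documented behaviour of rejecting different names of one parameter.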
Training or inference on CUDA-enabled GPUs requires NVIDIA Driver of version 450.80.02 or higher.
Common parameters
loss_function
Command-line: --loss-function
Alias: objective
The metric to use in training. The specified value also determines the machine learning problem to solve. Some metrics support optional parameters (see the Objectives and metrics section for details on each metric).
custom_metric
Command-line: --custom-metric
Metric values to output during training. These functions are not optimized and are displayed for informational purposes only. Some metrics support optional parameters (see the Objectives and metrics section for details on each metric).
eval_metric
Command-line: --eval-metric
The metric used for overfitting detection (if enabled) and best model selection (if enabled). Some metrics support optional parameters (see the Objectives and metrics section for details on each metric).
iterations
Command-line: -i, --iterations
Aliases: num_boost_round, n_estimators, num_trees
The maximum number of trees that can be built when solving machine learning problems.
learning_rate
Command-line: -w, --learning-rate
Alias: eta
The learning rate.
Used for reducing the gradient step.
random_seed
Command-line: -r, --random-seed
Alias: random_state
The random seed used for training.
l2_leaf_reg
Command-line: --l2-leaf-reg, --l2-leaf-regularizer
Alias: reg_lambda
Coefficient at the L2 regularization term of the cost function.
bootstrap_type
Command-line: --bootstrap-type
Bootstrap type. Defines the method for sampling the weights of objects.
bagging_temperature
Command-line: --bagging-temperature
Defines the settings of the Bayesian bootstrap. It is used by default in classification and regression modes.
subsample
Command-line: --subsample
Sample rate for bagging.
sampling_frequency
Command-line: --sampling-frequency
Frequency to sample weights and objects when building trees.
sampling_unit
Command-line: --sampling-unit
The sampling scheme.
mvs_reg
Command-line: --mvs-reg
Affects the weight of the denominator and can be used for balancing between importance sampling and Bernoulli sampling (setting it to 0 implies importance sampling, and setting it to infinity implies Bernoulli sampling).
random_strength
Command-line: --random-strength
The amount of randomness to use for scoring splits when the tree structure is selected. Use this parameter to avoid overfitting the model.
use_best_model
Command-line: --use-best-model
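If this parameter is set, the number of trees saved in the resulting model is determined by the validation dataset: trees built after the iteration with the optimal value of the eval_metric are not saved.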
best_model_min_trees
Command-line: --best-model-min-trees
The minimal number of trees that the best model should have. If set, the output model contains at least the given number of trees even if the optimal value of the evaluation metric on the validation dataset is achieved with a smaller number of trees.
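A minimal sketch of how use_best_model and best_model_min_trees could interact. It assumes that a lower metric value is better and that the best iteration is searched only among models that already contain the minimal number of trees; both points are assumptions of this example, and trees_to_keep is a hypothetical helper, not a CatBoost function.

```python
def trees_to_keep(validation_errors, best_model_min_trees=1):
    """Return how many trees to keep, assuming a lower metric value is better.

    Assumption: only models with at least `best_model_min_trees` trees
    are candidates for the best model.
    """
    best_count, best_error = None, float("inf")
    for iteration, error in enumerate(validation_errors):
        tree_count = iteration + 1
        if tree_count >= best_model_min_trees and error < best_error:
            best_count, best_error = tree_count, error
    return best_count


errors = [0.9, 0.5, 0.4, 0.45, 0.6, 0.55]
print(trees_to_keep(errors))                          # 3
print(trees_to_keep(errors, best_model_min_trees=5))  # 6
```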
depth
Command-line: -n, --depth
Alias: max_depth
Depth of the trees.
grow_policy
Command-line: --grow-policy
The tree growing policy. Defines how to perform greedy tree construction.
min_data_in_leaf
Command-line: --min-data-in-leaf
Alias: min_child_samples
The minimum number of training samples in a leaf. CatBoost does not search for new splits in leaves with samples count less than the specified value.
max_leaves
Command-line: --max-leaves
Alias: num_leaves
The maximum number of leaves in the resulting tree. Can be used only with the Lossguide growing policy.
ignored_features
Command-line: -I, --ignore-features
Feature indices or names to exclude from the training. It is assumed that all passed values are feature names if at least one of the passed values can not be converted to a number or a range of numbers. Otherwise, it is assumed that all passed values are feature indices.
Specifics:
- Non-negative indices that do not match any features are successfully ignored. For example, if five features are defined for the objects in the dataset and this parameter is set to 42, the corresponding non-existing feature is successfully ignored.
- The identifier corresponds to the feature's index. Feature indices used in train and feature importance are numbered from 0 to featureCount – 1. If a file is used as input data then any non-feature column types are ignored when calculating these indices. For example, each row in the input file contains data in the following order: cat feature<\t>label value<\t>num feature. So for the row rock<\t>0<\t>42, the identifier for the rock feature is 0, and for the 42 feature it's 1.
- The addition of a non-existing feature name raises an error.
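The index numbering described above can be sketched as follows. The column type names and the helper feature_indices are assumptions made for this example; the point is only that non-feature columns are skipped when indices are assigned.

```python
# Hypothetical column type names for this sketch; only these count as features.
FEATURE_COLUMN_TYPES = {"Num", "Categ"}


def feature_indices(column_types):
    """Map the column position in the file to the zero-based feature index."""
    mapping = {}
    for column, column_type in enumerate(column_types):
        if column_type in FEATURE_COLUMN_TYPES:
            mapping[column] = len(mapping)
    return mapping


# Row layout from the text: cat feature <\t> label value <\t> num feature
print(feature_indices(["Categ", "Label", "Num"]))  # {0: 0, 2: 1}
```

The label column in position 1 gets no index, so the numerical feature in position 2 is feature 1.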
one_hot_max_size
Command-line: --one-hot-max-size
Use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value. CTRs are not calculated for such features.
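A minimal sketch of the selection rule, using made-up column data. The helper one_hot_encoded_features is hypothetical and only shows the comparison against one_hot_max_size.

```python
def one_hot_encoded_features(columns, one_hot_max_size):
    """Return names of categorical columns whose distinct-value count fits the limit."""
    return [
        name
        for name, values in columns.items()
        if len(set(values)) <= one_hot_max_size
    ]


columns = {
    "color": ["red", "green", "red", "blue"],  # 3 distinct values
    "city": ["Oslo", "Lima", "Rome", "Kyiv"],  # 4 distinct values
}
print(one_hot_encoded_features(columns, one_hot_max_size=3))  # ['color']
```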
has_time
Command-line: --has-time
Use the order of objects in the input data (do not perform random permutations during the Transforming categorical features to numerical features and Choosing the tree structure stages).
rsm
Command-line: --rsm
Alias: colsample_bylevel
Random subspace method. The percentage of features to use at each split selection, when features are selected over again at random.
nan_mode
Command-line: --nan-mode
The method for processing missing values in the input dataset.
input_borders
Command-line: --input-borders-file
Load custom quantization borders and missing value modes from a file (do not generate them).
output_borders
Command-line: --output-borders-file
Save quantization borders for the current dataset to a file.
fold_permutation_block
Command-line: --fold-permutation-block
Objects in the dataset are grouped in blocks before the random permutations. This parameter defines the size of the blocks.
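A minimal sketch of a block-wise permutation. It assumes that blocks consist of consecutive objects and that the order inside a block is preserved; both are assumptions of this example, and block_permutation is a hypothetical helper.

```python
import random


def block_permutation(object_count, block_size, seed=0):
    """Shuffle whole blocks of consecutive objects; order inside a block is kept."""
    blocks = [
        list(range(start, min(start + block_size, object_count)))
        for start in range(0, object_count, block_size)
    ]
    random.Random(seed).shuffle(blocks)
    return [index for block in blocks for index in block]


order = block_permutation(object_count=10, block_size=3, seed=42)
print(order)
```

With a block size of 1 this reduces to an ordinary random permutation of the objects.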
leaf_estimation_method
Command-line: --leaf-estimation-method
The method used to calculate the values in leaves.
leaf_estimation_iterations
Command-line: --leaf-estimation-iterations
This parameter regulates how many steps are done in every tree when calculating leaf values.
leaf_estimation_backtracking
Command-line: --leaf-estimation-backtracking
When the value of the leaf_estimation_iterations parameter is greater than 1, CatBoost makes several gradient or Newton steps when calculating the resulting leaf values of a tree.
fold_len_multiplier
Command-line: --fold-len-multiplier
Coefficient for changing the length of folds.
approx_on_full_history
Command-line: --approx-on-full-history
The principles for calculating the approximated values.
class_weights
Command-line: --class-weights
Class weights. The values are used as multipliers for the object weights. This parameter can be used for solving binary classification and multiclassification problems.
class_names
Class names. Allows redefining the default values when using the MultiClass and Logloss metrics.
auto_class_weights
Command-line: --auto-class-weights
Automatically calculate class weights based either on the total weight or the total number of objects in each class. The values are used as multipliers for the object weights.
Supported values:
- None: all class weights are set to 1
- Balanced
- SqrtBalanced
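A minimal sketch of the three modes. It assumes that the Balanced weight of a class is the largest total class weight divided by the total weight of that class, and that SqrtBalanced is the square root of that ratio; these formulas are not stated on this page and are an assumption of the example. The helper auto_class_weights is hypothetical.

```python
import math
from collections import defaultdict


def auto_class_weights(labels, object_weights=None, mode="Balanced"):
    """Compute per-class weights from the total object weight of each class."""
    if object_weights is None:
        object_weights = [1.0] * len(labels)  # fall back to object counts
    totals = defaultdict(float)
    for label, weight in zip(labels, object_weights):
        totals[label] += weight
    largest = max(totals.values())
    if mode == "None":
        return {label: 1.0 for label in totals}
    if mode == "Balanced":
        return {label: largest / total for label, total in totals.items()}
    if mode == "SqrtBalanced":
        return {label: math.sqrt(largest / total) for label, total in totals.items()}
    raise ValueError(f"unsupported mode: {mode}")


labels = [0, 0, 0, 0, 1]
print(auto_class_weights(labels, mode="Balanced"))      # {0: 1.0, 1: 4.0}
print(auto_class_weights(labels, mode="SqrtBalanced"))  # {0: 1.0, 1: 2.0}
```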
scale_pos_weight
The weight for class 1 in binary classification. The value is used as a multiplier for the weights of objects from class 1.
boosting_type
Command-line: --boosting-type
Boosting scheme.
boost_from_average
Command-line: --boost-from-average
Initialize approximate values by best constant value for the specified loss function.
langevin
Command-line: --langevin
Enables the Stochastic Gradient Langevin Boosting mode.
diffusion_temperature
Command-line: --diffusion-temperature
The diffusion temperature of the Stochastic Gradient Langevin Boosting mode.
posterior_sampling
Command-line: --posterior-sampling
If this parameter is set, several options are specified automatically and the model parameters are checked so that uncertainty predictions with good theoretical properties can be obtained.
allow_const_label
Command-line: --allow-const-label
Use it to train models with datasets that have equal label values for all objects.
score_function
Command-line: --score-function
The score type used to select the next split during the tree construction.
monotone_constraints
Command-line: --monotone-constraints
Impose monotonic constraints on numerical features.
Possible values:
- 1: Increasing constraint on the feature. The algorithm forces the model to be a non-decreasing function of this feature.
- -1: Decreasing constraint on the feature. The algorithm forces the model to be a non-increasing function of this feature.
- 0: Constraints are disabled.
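The meaning of the three values can be checked with a small sketch. The helper satisfies_constraint is hypothetical; it takes model predictions computed at increasing values of one feature (all other features fixed) and verifies the corresponding constraint.

```python
def satisfies_constraint(predictions, constraint):
    """Check predictions taken at increasing feature values against a constraint.

    constraint: 1 (non-decreasing), -1 (non-increasing) or 0 (no constraint).
    """
    pairs = list(zip(predictions, predictions[1:]))
    if constraint == 1:
        return all(a <= b for a, b in pairs)
    if constraint == -1:
        return all(a >= b for a, b in pairs)
    if constraint == 0:
        return True
    raise ValueError("constraint must be 1, -1 or 0")


print(satisfies_constraint([0.1, 0.1, 0.4], 1))   # True
print(satisfies_constraint([0.1, 0.4, 0.3], 1))   # False
print(satisfies_constraint([0.1, 0.4, 0.3], 0))   # True
```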
feature_weights
Command-line: --feature-weights
Per-feature multiplication weights used when choosing the best split. The score of each candidate is multiplied by the weights of features from the current split.
Non-negative float values are supported for each weight.
Supported formats for setting the value of this parameter:
first_feature_use_penalties
Command-line: --first-feature-use-penalties
Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model.
Refer to the Per-object and per-feature penalties section for details on applying different score penalties.
Non-negative float values are supported for each penalty.
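A minimal sketch of the first-use penalty described above. The helper penalized_score and its arguments are hypothetical; the example only shows that the penalty is subtracted when the candidate split introduces a feature that the model has not used yet.

```python
def penalized_score(score, feature, used_features, penalties):
    """Subtract the feature's penalty only when the feature is new to the model."""
    if feature in used_features:
        return score
    return score - penalties.get(feature, 0.0)


penalties = {2: 0.5}  # feature index -> penalty (arbitrary example value)
print(penalized_score(3.0, 2, used_features=set(), penalties=penalties))  # 2.5
print(penalized_score(3.0, 2, used_features={2}, penalties=penalties))    # 3.0
```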
fixed_binary_splits
Command-line: --fixed-binary-splits
A list of indices of binary features to put at the top of each tree.
penalties_coefficient
Command-line: --penalties-coefficient
A single-value common coefficient to multiply all penalties.
per_object_feature_penalties
Command-line: --per-object-feature-penalties
Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time.
Refer to the Per-object and per-feature penalties section for details on applying different score penalties.
Non-negative float values are supported for each penalty.
model_shrink_rate
Command-line: --model-shrink-rate
The constant used to calculate the coefficient for multiplying the model on each iteration.
model_shrink_mode
Command-line: --model-shrink-mode
Determines how the actual model shrinkage coefficient is calculated at each iteration.
CTR settings
simple_ctr
Quantization settings for simple categorical features. Use this parameter to specify the principles for defining the class of the object for regression tasks. By default, it is considered that an object belongs to the positive class if its label value is greater than the median of all label values of the dataset.
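The default rule for regression labels can be sketched as follows. The helper binarize_by_median is hypothetical and only illustrates the comparison with the median.

```python
from statistics import median


def binarize_by_median(labels):
    """Return 1 for labels strictly greater than the median, otherwise 0."""
    border = median(labels)
    return [1 if value > border else 0 for value in labels]


print(binarize_by_median([1.0, 2.0, 3.0, 10.0]))  # [0, 0, 1, 1]
```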
combinations_ctr
Quantization settings for combinations of categorical features.
per_feature_ctr
Per-feature quantization settings for categorical features.
ctr_target_border_count
The maximum number of borders to use in target quantization for categorical features that need it. Allowed values are integers from 1 to 255 inclusively.
counter_calc_method
The method for calculating the Counter CTR type.
max_ctr_complexity
The maximum number of features that can be combined.
ctr_leaf_count_limit
The maximum number of leaves with categorical features. If the quantity exceeds the specified value, some of the leaves are discarded.
store_all_simple_ctr
Ignore categorical features, which are not used in feature combinations, when choosing candidates for exclusion.
final_ctr_computation_mode
Final CTR computation mode.
Input file settings
-f, --learn-set
The path to the input file that contains the dataset description.
-t, --test-set
A comma-separated list of input files that contain the validation dataset description (the format must be the same as used in the training dataset).
--cd, --column-description
The path to the input file that contains the columns description.
--learn-pairs
The path to the input file that contains the pairs description for the training dataset.
--test-pairs
The path to the input file that contains the pairs description for the validation dataset.
--learn-group-weights
The path to the input file that contains the weights of groups. Refer to the Group weights section for format details.
--test-group-weights
The path to the input file that contains the weights of groups for the validation dataset.
--learn-baseline
The path to the input file that contains baseline values for the training dataset.
--test-baseline
The path to the input file that contains baseline values for the validation dataset.
--delimiter
The delimiter character used to separate the data in the dataset description input file.
--has-header
Read the column names from the first line of the dataset description file if this parameter is set.
--params-files
The path to the input JSON file that contains the training parameters, for example:
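A minimal sketch that writes such a file from Python. The parameter names are taken from this page; the values and the file name params.json are arbitrary choices for the example.

```python
import json
import os
import tempfile

# Parameter names come from this page; the values are arbitrary examples.
params = {"iterations": 500, "learning_rate": 0.1, "depth": 6}

# Write the parameters to a JSON file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "params.json")
with open(path, "w") as f:
    json.dump(params, f, indent=2)

# Show the resulting file content.
with open(path) as f:
    print(f.read())
```

The resulting path can then be supplied as the value of the command-line option described above.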
--nan-mode
The method for processing missing values in the input dataset.
Multiclassification settings
classes_count
Command-line: --classes-count
The upper limit for the numeric class label. Defines the number of classes for multiclassification.
--class-names
This parameter is only for Command-line.
Class names. Allows redefining the default values when using the MultiClass and Logloss metrics.
Output settings
logging_level
Command-line: --logging-level
The logging level to output to stdout.
metric_period
Command-line: --metric-period
The frequency of iterations to calculate the values of objectives and metrics.
The usage of this parameter speeds up the training.
verbose
Command-line: --verbose
Alias: verbose_eval
The purpose of this parameter depends on the type of the given value:
train_dir
Command-line: --train-dir
The directory for storing the files generated during training.
model_size_reg
Command-line: --model-size-reg
The model size regularization coefficient. The larger the value, the smaller the model size. Refer to the Model size regularization coefficient section for details.
This regularization is needed only for models with categorical features (other models are small).
allow_writing_files
Allow writing analytical and snapshot files during training.
save_snapshot
Enable snapshotting for restoring the training progress after an interruption.
snapshot_file
The name of the file to save the training progress information in. This file is used for recovering training after an interruption.
snapshot_interval
The interval between saving snapshots in seconds.
roc_file
The name of the output file to save the ROC curve points to.
Overfitting detection settings
early_stopping_rounds
Sets the overfitting detector type to Iter and stops the training after the specified number of iterations since the iteration with the optimal metric value.
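The stopping rule can be sketched as follows, assuming that a lower metric value is better. The helper stop_iteration is hypothetical and is not the CatBoost implementation of the Iter detector.

```python
def stop_iteration(metric_values, early_stopping_rounds):
    """Return the iteration at which training stops, or None if it never does.

    Assumes a lower metric value is better.
    """
    best_value, best_iteration = float("inf"), -1
    for iteration, value in enumerate(metric_values):
        if value < best_value:
            best_value, best_iteration = value, iteration
        elif iteration - best_iteration >= early_stopping_rounds:
            return iteration
    return None


values = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.50]
print(stop_iteration(values, early_stopping_rounds=3))  # 5
```

In this example the best value is reached at iteration 2, so training stops at iteration 5 and the later improvement at iteration 6 is never seen.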
od_type
Command-line: --od-type
The type of the overfitting detector to use.
od_pval
Command-line: --od-pval
The threshold for the IncToDec overfitting detector type.
od_wait
Command-line: --od-wait
The number of iterations to continue the training after the iteration with the optimal metric value.
Performance settings
thread_count
Command-line: -T, --thread-count
The number of threads to use during the training.
used_ram_limit
Command-line: --used-ram-limit
Attempt to limit the amount of used CPU RAM.
gpu_ram_part
Command-line: --gpu-ram-part
How much of the GPU RAM to use for training.
pinned_memory_size
Command-line: --pinned-memory-size
How much pinned (page-locked) CPU RAM to use per GPU.
gpu_cat_features_storage
Command-line: --gpu-cat-features-storage
The method for storing the categorical features' values.
data_partition
Command-line: --data-partition
The method for splitting the input dataset between multiple workers.
Processing unit settings
task_type
Command-line: --task-type
The processing unit type to use for training.
devices
Command-line: --devices
IDs of the GPU devices to use for training (indices are zero-based).
Quantization settings
target_border
Command-line: --target-border
If set, defines the border for converting target values to 0 and 1.
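A minimal sketch of the conversion. It assumes that values strictly greater than the border become 1 and all other values become 0; how a value equal to the border is treated is an assumption of this example. The helper binarize_target is hypothetical.

```python
def binarize_target(values, target_border):
    """Assumption: values strictly greater than the border become 1, others 0."""
    return [1 if value > target_border else 0 for value in values]


print(binarize_target([0.2, 0.5, 0.9], target_border=0.5))  # [0, 0, 1]
```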
border_count
Command-line: -x, --border-count
Alias: max_bin
The number of splits for numerical features. Allowed values are integers from 1 to 65535 inclusively.
feature_border_type
Command-line: --feature-border-type
The quantization mode for numerical features.
per_float_feature_quantization
Command-line: --per-float-feature-quantization
The quantization description for the specified feature or list of features.
Text processing parameters
These parameters are only for the Python package and Command-line version.
tokenizers
Command-line: --tokenizers
Tokenizers used to preprocess Text type feature columns before creating the dictionary.
dictionaries
Command-line: --dictionaries
Dictionaries used to preprocess Text type feature columns.
Format:
feature_calcers
Command-line: --feature-calcers
Feature calcers used to calculate new features based on preprocessed Text type feature columns.
Format:
text_processing
Command-line: --text-processing
A JSON specification of tokenizers, dictionaries and feature calcers, which determine how text features are converted into a list of float features.
Refer to the description of the following parameters for details on supported values:
Visualization settings
These parameters are only for the Python package.
name
The experiment name to display in visualization tools.