Calculate object importance
Purpose
Calculate the effect of objects from the training dataset on the optimized metric values for the objects from the validation dataset:
- Positive values reflect that the optimized metric increases.
- Negative values reflect that the optimized metric decreases.
The higher the deviation from 0, the bigger the impact that an object has on the optimized metric.
This mode is an implementation of the approach described in the Finding Influential Training Samples for Gradient Boosted Decision Trees paper.
Execution format
catboost ostr [optional parameters]
Options
-m, --model-file, --model-path
Description
The name of the input file with the description of the model obtained as the result of training.
Default value
model.bin
--model-format
Description
The format of the input model.
Possible values:
- CatboostBinary.
- AppleCoreML (only datasets without categorical features are currently supported).
- json (multiclassification models are not currently supported). Refer to the CatBoost JSON model tutorial for format details.
Default value
CatboostBinary
-f, --learn-set
Description
The path to the input file that contains the dataset description.
Format:
[scheme://]<path>
scheme
(optional) defines the type of the input dataset. Possible values:
- quantized:// for a catboost.Pool quantized pool.
- libsvm:// for a dataset in the extended libsvm format.
If omitted, a dataset in the Native CatBoost Delimiter-separated values format is expected.
path
defines the path to the dataset description.
Default value
Required parameter (the path must be specified).
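The [scheme://]<path> format described above can be illustrated with a small sketch. The helper below is hypothetical and is not CatBoost's own parser; the file names and the "default" label for the scheme-less case are assumptions made for the example.

```shell
# Hypothetical helper: split a dataset argument of the form [scheme://]<path>.
# It only illustrates the documented format; it is not part of CatBoost.
split_dataset_arg() {
  arg=$1
  case "$arg" in
    *://*)
      scheme=${arg%%://*}   # text before the first "://"
      path=${arg#*://}      # text after the first "://"
      ;;
    *)
      scheme="default"      # label used here for the delimiter-separated format
      path=$arg
      ;;
  esac
  printf '%s %s\n' "$scheme" "$path"
}

split_dataset_arg "quantized://train.bin"
split_dataset_arg "libsvm://train.libsvm"
split_dataset_arg "train.tsv"
```

The third call has no scheme, which corresponds to the case where the Native CatBoost Delimiter-separated values format is expected.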
-t, --test-set
Description
The path to the input file that contains the validation dataset description (the format must be the same as used in the training dataset).
Default value
Required parameter
--column-description, --cd
Description
The path to the input file that contains the columns description.
Default value
If omitted, it is assumed that the first column in the file with the dataset description defines the label value, and the other columns are the values of numerical features.
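As a sketch of overriding this default, a columns description file can be created as follows. The layout (one tab-separated line per column, giving the zero-based column index and the column type) follows the CatBoost columns description format; the file name train.cd and the choice of columns are assumptions made for the example.

```shell
# Write a hypothetical columns description file (tab-separated index and type).
# Column 0 holds the label and column 3 is a categorical feature;
# columns that are not listed are treated as numerical features.
printf '0\tLabel\n3\tCateg\n' > train.cd
cat train.cd
```

The resulting file would then be passed with --column-description train.cd (or --cd train.cd).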
-o, --output-path
Description
The path to the output file with calculated metrics.
Default value
output.tsv
-T, --thread-count
Description
The number of threads to calculate object importance.
Optimizes the speed of execution. This parameter doesn't affect results.
Default value
The number of processor cores
--delimiter
Description
The delimiter character used to separate the data in the dataset description input file.
Only single-character delimiters are supported. If the specified value contains more than one character, only the first one is used.
Note
Used only if the dataset is given in the Delimiter-separated values format.
Default value
The input data is assumed to be tab-separated
--has-header
Description
Read the column names from the first line of the dataset description file if this parameter is set.
Default value
False (the first line is supposed to have the same data as the rest of them)
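For comma-separated files that start with a header row, the two options above might be combined as in the sketch below. The command is only assembled and printed, not executed; all file names are hypothetical.

```shell
# Assemble (without running) an invocation for comma-separated input files
# whose first line contains column names. File names are placeholders.
cmd="catboost ostr -m model.bin -f train.csv -t validation.csv --cd train.cd --delimiter , --has-header"
echo "$cmd"
```

Printing the command first makes it easy to check the arguments before running it against real data.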
--update-method
Description
The method used to calculate object importance. It sets the trade-off between the accuracy and the speed of the calculation.
Possible values:
- SinglePoint — The fastest and least accurate method.
- TopKLeaves — Specify the number of leaves. The higher the value, the more accurate and the slower the calculation.
- AllPoints — The slowest and most accurate method.
Supported parameters:
- top: Defines the number of leaves to use for the TopKLeaves update method. See the Finding Influential Training Samples for Gradient Boosted Decision Trees paper for more details.
For example, the following value sets the method to TopKLeaves and limits the number of leaves to 3:
TopKLeaves:top=3
Default value
SinglePoint
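To compare the accuracy and speed trade-off of the three update methods, one run per method could be set up as sketched below. The function name, file names, and thread count are assumptions; the commands are printed rather than executed so that they can be reviewed first.

```shell
# Print one catboost ostr command per update method.
# print_ostr_commands is a hypothetical helper; file names are placeholders.
print_ostr_commands() {
  for method in SinglePoint TopKLeaves:top=3 AllPoints; do
    # Drop the ":top=3" suffix so the output file name stays simple.
    out="ostr_${method%%:*}.tsv"
    echo catboost ostr -m model.bin -f train.tsv -t validation.tsv \
      --update-method "$method" -o "$out" -T 4
  done
}

print_ostr_commands
```

Each run writes to its own output file, so the results of the faster methods can be checked against AllPoints.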