Calculate object importance

Purpose

Calculate the effect of objects from the training dataset on the optimized metric values for the objects from the validation dataset:

Positive values reflect that the optimized metric increases.
Negative values reflect that the optimized metric decreases.
The higher the deviation from 0, the bigger the impact that an object has on the optimized metric.
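
The sign and magnitude rules above can be illustrated with a short sketch. The scores below are made-up numbers for five hypothetical training objects, not output from a real run, and the helper names are chosen only for this example.

```python
# Hypothetical importance scores keyed by training object index
# (illustrative values only, not produced by CatBoost).
scores = {0: 0.12, 1: -0.40, 2: 0.03, 3: -0.07, 4: 0.25}


def rank_by_impact(scores):
    """Sort (index, score) pairs by absolute deviation from 0, largest first."""
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


def direction(score):
    """Describe the sign of a score using the convention stated above."""
    if score > 0:
        return "metric increases"
    if score < 0:
        return "metric decreases"
    return "no effect"


ranking = rank_by_impact(scores)
for index, score in ranking:
    print(index, score, direction(score))
```

Sorting by absolute value puts the objects with the largest impact first, regardless of whether they raise or lower the optimized metric.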

This mode implements the approach described in the paper "Finding Influential Training Samples for Gradient Boosted Decision Trees".

Execution format

catboost ostr [optional parameters]
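
As a rough illustration of the execution format, the following sketch assembles a `catboost ostr` argument list in Python using only flags described on this page. The file names `train.tsv` and `test.tsv` and the helper `build_ostr_command` are assumptions made for the example; the sketch only prints the command and does not run it.

```python
import shlex


def build_ostr_command(learn_set, test_set, model_file="model.bin",
                       output_path="output.tsv", extra_args=()):
    """Assemble the argument list for `catboost ostr` (hypothetical helper)."""
    command = [
        "catboost", "ostr",
        "--model-file", model_file,
        "--learn-set", learn_set,    # required
        "--test-set", test_set,      # required
        "--output-path", output_path,
    ]
    command.extend(extra_args)
    return command


# Placeholder dataset paths; replace them with real files.
command = build_ostr_command("train.tsv", "test.tsv")
print(shlex.join(command))
```

If the `catboost` binary is available on `PATH`, the resulting list could be passed to `subprocess.run(command, check=True)`.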

Options

-m, --model-file, --model-path

Description

The name of the input file with the description of the model obtained as the result of training.

Default value

model.bin

--model-format

Description

The format of the input model.
Possible values:

CatboostBinary.
AppleCoreML (only datasets without categorical features are currently supported).
json (multiclassification models are not currently supported). Refer to the CatBoost JSON model tutorial for format details.

Default value

CatboostBinary

-f, --learn-set

Description

The path to the input file that contains the dataset description.

Format:

[scheme://]<path>

scheme (optional) defines the type of the input dataset. Possible values:
- quantized:// for a quantized catboost.Pool.
- libsvm:// for a dataset in the extended libsvm format.

If the scheme is omitted, a dataset in the Native CatBoost Delimiter-separated values format is expected.

path defines the path to the dataset description.
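
The `[scheme://]<path>` format can be sketched as a small parser. This is an illustration of the format only, not CatBoost's own parsing code; the function name and the sample paths are hypothetical, and only the two schemes listed above are recognized here.

```python
# Schemes taken from the list above.
KNOWN_SCHEMES = ("quantized", "libsvm")


def split_dataset_path(value):
    """Split '[scheme://]<path>' into (scheme, path).

    The scheme is None when it is omitted, which corresponds to the
    delimiter-separated values format.
    """
    scheme, separator, rest = value.partition("://")
    if separator and scheme in KNOWN_SCHEMES:
        return scheme, rest
    return None, value


print(split_dataset_path("libsvm://data/train.libsvm"))
print(split_dataset_path("train.tsv"))
```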

Default value

Required parameter (the path must be specified).

-t, --test-set

Description

The path to the input file that contains the validation dataset description (the format must be the same as used in the training dataset).

Default value

Required parameter (the path must be specified).

--column-description, --cd

Description

The path to the input file that contains the columns description.

Default value

If omitted, it is assumed that the first column in the file with the dataset description defines the label value, and the other columns are the values of numerical features.
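
To make the default column layout concrete, the sketch below interprets one tab-separated row the way the default is described: first column as the label, remaining columns as numerical features. The sample row and the function name are made up for illustration.

```python
def split_row_default(line, delimiter="\t"):
    """Interpret a row when no column description file is given."""
    fields = line.rstrip("\n").split(delimiter)
    label = fields[0]                                   # first column: label value
    features = [float(value) for value in fields[1:]]   # the rest: numerical features
    return label, features


# Hypothetical row: label "1" followed by three numerical features.
label, features = split_row_default("1\t0.5\t3\t-2.25\n")
print(label, features)
```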

-o, --output-path

Description

The path to the output file with calculated metrics.

Default value

output.tsv

-T, --thread-count

Description

The number of threads to use when calculating object importance.

Optimizes the speed of execution. This parameter doesn't affect results.

Default value

The number of processor cores

--delimiter

Description

The delimiter character used to separate the data in the dataset description input file.
Only single-character delimiters are supported. If the specified value contains more than one character, only the first one is used.

Note

Used only if the dataset is given in the Delimiter-separated values format.

Default value

The input data is assumed to be tab-separated
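
The first-character rule for `--delimiter` can be mimicked in a few lines. This is a sketch of the documented behavior, not CatBoost's implementation; the sample data and the helper name are assumptions, and rejecting an empty value is a choice made for this example.

```python
import csv
import io


def effective_delimiter(value):
    """Return the character that is actually used as the delimiter."""
    if not value:
        raise ValueError("delimiter must not be empty")
    return value[0]  # only the first character is used


# Hypothetical comma-separated dataset with two rows.
sample = "1,0.5,3\n0,1.5,4\n"
reader = csv.reader(io.StringIO(sample), delimiter=effective_delimiter(",;"))
rows = list(reader)
print(rows)
```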

--has-header

Description

Read the column names from the first line of the dataset description file if this parameter is set.

Default value

False (the first line is assumed to contain data, like the rest of the lines)

--update-method

Description

The method used for the calculation. It determines the trade-off between accuracy and speed.

Possible values:

SinglePoint: the fastest and least accurate method.
TopKLeaves: requires the number of leaves to be specified. The higher the value, the more accurate and the slower the calculation.
AllPoints: the slowest and most accurate method.

Supported parameters:

top: the number of leaves to use for the TopKLeaves update method. See the paper "Finding Influential Training Samples for Gradient Boosted Decision Trees" for more details.

For example, the following value sets the method to TopKLeaves and limits the number of leaves to 3:

TopKLeaves:top=3

Default value

SinglePoint
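
The `Method[:top=N]` value format for `--update-method` can be sketched as a small validator. The function below is hypothetical and only reflects the values listed on this page; in particular, rejecting `top` for methods other than TopKLeaves is an assumption of this sketch, not confirmed CatBoost behavior.

```python
VALID_METHODS = ("SinglePoint", "TopKLeaves", "AllPoints")


def parse_update_method(value="SinglePoint"):
    """Split 'Method[:top=N]' into (method, top); top is None when absent."""
    method, _, params = value.partition(":")
    if method not in VALID_METHODS:
        raise ValueError(f"unknown update method: {method}")
    top = None
    if params:
        key, _, number = params.partition("=")
        # Assumption: 'top' is the only parameter and applies to TopKLeaves only.
        if method != "TopKLeaves" or key != "top":
            raise ValueError(f"unsupported parameter: {params}")
        top = int(number)
    return method, top


print(parse_update_method("TopKLeaves:top=3"))
print(parse_update_method())
```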
