select_features

Purpose

Select the best features and drop harmful features from the dataset.

Method call format

model.select_features(
                X,
                y=None,
                eval_set=None,
                features_for_select=None,
                num_features_to_select=None,
                algorithm=None,
                steps=None,
                shap_calc_type=None,
                train_final_model=True,
                verbose=None,
                logging_level=None,
                plot=False,
                log_cout=sys.stdout,
                log_cerr=sys.stderr)

Parameters

X

Description

The description is different for each group of possible types.

Possible types

catboost.Pool

The input training dataset.

Note

If a nontrivial value of the cat_features parameter is specified in the constructor of this class, CatBoost checks that the categorical feature indices given in the constructor match those specified in this Pool.

list, numpy.ndarray, pandas.DataFrame, pandas.Series

The input training dataset in the form of a two-dimensional feature matrix.

pandas.SparseDataFrame, scipy.sparse.spmatrix (all subclasses except dia_matrix)

The input training dataset in the form of a two-dimensional sparse feature matrix.

Default value

Required parameter

Supported processing units

CPU and GPU

y

Description

The target variables (in other words, the objects' label values) for the training dataset.

Must be in the form of a one- or two-dimensional array. The type of data in the array depends on the machine learning task being solved:

Binary classification
One-dimensional array containing one of:
- Booleans, integers or strings that represent the labels of the classes (only two unique values).
- Numeric values.
  The interpretation of numeric values depends on the selected loss function:
  - Logloss — The value is considered a positive class if it is strictly greater than the value of the target_border training parameter. Otherwise, it is considered a negative class.
  - CrossEntropy — The value is interpreted as the probability that the dataset object belongs to the positive class. Possible values are in the range [0; 1].
Multiclassification
One-dimensional array of integers or strings that represent the labels of the classes.
Multi label classification
Two-dimensional array. The first index is for a label/class, the second index is for an object.

Possible values depend on the selected loss function:
- MultiLogloss — Only {0, 1} or {False, True} values are allowed that specify whether an object belongs to the class corresponding to the first index.
- MultiCrossEntropy — Numerical values in the range [0; 1] that are interpreted as the probability that the dataset object belongs to the class corresponding to the first index.

Note

Do not use this parameter if the input training dataset (specified in the X parameter) type is catboost.Pool.
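
As a rough illustration of the Logloss rule described above, the following sketch converts numeric targets to classes with a strict comparison against a border. The helper name binarize_targets and the border value 0.5 are assumptions made for this example. CatBoost applies the target_border training parameter internally, and this is not its implementation.

```python
def binarize_targets(values, target_border):
    """Return 1 for values strictly greater than target_border, else 0.

    Hypothetical helper that mirrors the documented Logloss rule;
    it is not part of the CatBoost API.
    """
    return [1 if value > target_border else 0 for value in values]


# A value equal to the border falls into the negative class,
# because the comparison is strict.
labels = binarize_targets([0.2, 0.5, 0.7, 1.0], target_border=0.5)
print(labels)
```

The point to notice is the boundary case: a target exactly equal to the border is treated as negative.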

Possible types

list
numpy.ndarray
pandas.DataFrame
pandas.Series

Default value

None

Supported processing units

CPU and GPU

eval_set

Description

The validation dataset or datasets used for the following processes:

overfitting detector
best iteration selection
monitoring metrics' changes

Possible types

catboost.Pool
tuple (X, y)
string (path to the dataset file)

Default value

None

Supported processing units

CPU and GPU

features_for_select

Description

Features which participate in the selection. The following formats are supported:

A list with indices, names, index ranges, name ranges. For example: [0, 3, 5, 6, '10-15', 'City', 'Player1-Player11'].
A string with indices, names, index ranges, name ranges. Values are separated by commas, for example: 0,3,5,6,10-15,City,Player1-Player11.
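
The index part of the string format can be read as in the sketch below. It assumes that ranges are inclusive, which is consistent with '0-99' covering all 100 features in the example on this page. It handles only indices and index ranges; names and name ranges are left out because resolving them requires the dataset's feature names. expand_index_spec is a hypothetical helper, not a CatBoost function.

```python
def expand_index_spec(spec):
    """Expand a string such as '0,3,10-12' into a list of feature indices.

    Hypothetical helper for illustration only. Assumes every item is
    either a single index or an inclusive 'first-last' index range.
    """
    indices = []
    for item in spec.split(","):
        item = item.strip()
        if "-" in item:
            first, last = item.split("-")
            indices.extend(range(int(first), int(last) + 1))
        else:
            indices.append(int(item))
    return indices


print(expand_index_spec("0,3,5,6,10-15"))
```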

Possible types

list
string

Default value

Required parameter

Supported processing units

CPU and GPU

num_features_to_select

Description

The number of features to select from features_for_select.

Possible types

int

Default value

Required parameter

Supported processing units

CPU and GPU

steps

Description

The number of times the model is trained during selection. Use more steps for more accurate selection.

Possible types

int

Default value

1

Supported processing units

CPU and GPU

algorithm

Description

The feature selection algorithm. All options perform Recursive Feature Elimination and differ in how feature importance is calculated:

RecursiveByPredictionValuesChange — the fastest algorithm and the least accurate method (not recommended for ranking losses).
RecursiveByLossFunctionChange — the optimal option according to accuracy/speed balance.
RecursiveByShapValues — the most accurate method.
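
The general idea of Recursive Feature Elimination over several steps can be sketched in plain Python. This is a simplified illustration, not CatBoost's implementation: importance_fn stands in for training a model and scoring the remaining features, and the rule that each step removes an equal share of the surplus features is an assumption made for this sketch only.

```python
def recursive_elimination(features, num_to_select, steps, importance_fn):
    """Drop the least important features over several steps.

    importance_fn(remaining) must return a dict {feature: score}; in a
    real setting it would retrain a model on the remaining features.
    Returns (selected, eliminated).
    """
    remaining = list(features)
    eliminated = []
    for step in range(steps):
        surplus = len(remaining) - num_to_select
        if surplus <= 0:
            break
        steps_left = steps - step
        # Assumption: spread the surplus evenly (ceiling division).
        n_drop = -(-surplus // steps_left)
        scores = importance_fn(remaining)
        ranked = sorted(remaining, key=lambda f: scores[f])  # least important first
        dropped = set(ranked[:n_drop])
        eliminated.extend(ranked[:n_drop])
        remaining = [f for f in remaining if f not in dropped]
    return remaining, eliminated


# Toy importance: a feature's score is its own index, so low indices go first.
selected, eliminated = recursive_elimination(
    features=range(10),
    num_to_select=4,
    steps=3,
    importance_fn=lambda feats: {f: f for f in feats},
)
print(selected, eliminated)
```

Because importance is recomputed at every step, more steps let the ranking adapt as features are removed, at the cost of more training runs.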

Possible types

EFeaturesSelectionAlgorithm

Default value

RecursiveByShapValues

Supported processing units

CPU and GPU

shap_calc_type

Description

The method used to calculate SHAP values. Options are listed in order of increasing accuracy:

Approximate
Regular
Exact

Used in RecursiveByLossFunctionChange and RecursiveByShapValues.

Possible types

EShapCalcType

Default value

Regular

Supported processing units

CPU and GPU

train_final_model

Description

If True, a model is trained on the selected features after feature selection completes.

Possible types

bool

Default value

True

Supported processing units

CPU and GPU

verbose

Alias: verbose_eval

Description

The purpose of this parameter depends on the type of the given value:

bool — Defines the logging level:
- True corresponds to the Verbose logging level
- False corresponds to the Silent logging level
int — Use the Verbose logging level and set the logging period to the value of this parameter.

Alert

Do not use this parameter with the logging_level parameter.
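
The type-dependent behaviour described above can be restated as a small function. resolve_verbose is a hypothetical helper written for this page, not a CatBoost function; it returns the logging level together with the logging period, using None where the value does not set a period.

```python
def resolve_verbose(verbose):
    """Map a verbose value to (logging level, logging period).

    Hypothetical helper that restates the rules above; the period is
    None when the value does not set one.
    """
    # bool must be tested first: in Python, bool is a subclass of int.
    if isinstance(verbose, bool):
        return ("Verbose" if verbose else "Silent", None)
    if isinstance(verbose, int):
        return ("Verbose", verbose)
    raise TypeError("verbose must be bool or int")


print(resolve_verbose(True))
print(resolve_verbose(100))
```

The order of the checks matters in Python: testing for int first would also match True and False.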

Possible types

bool
int

Default value

Supported processing units

CPU and GPU

logging_level

Description

The logging level to output to stdout.

Possible values:

Silent — Do not output any logging information to stdout.
Verbose — Output the following data to stdout:
- optimized metric
- elapsed time of training
- remaining time of training
Info — Output additional information and the number of trees.
Debug — Output debugging information.

Possible types

string

Default value

None (corresponds to the Verbose logging level)

Supported processing units

CPU and GPU

plot

Description

Plot the following information during training:

the metric values;
the custom loss values;
the loss function change during feature selection;
the time elapsed since training started;
the remaining time until the end of training.

This option can be used if training is performed in Jupyter Notebook.

Possible types

bool

Default value

False

Supported processing units

CPU

log_cout

Output stream or callback for logging.

Possible types

callable Python object
Python object providing the write() method

Default value

sys.stdout

log_cerr

Error stream or callback for logging.

Possible types

callable Python object
Python object providing the write() method

Default value

sys.stderr
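
Both accepted kinds of logging target for log_cout and log_cerr can be written in a few lines. The sketch below defines an object with a write() method and a plain callable, then feeds each a sample message to show what it collects. The names LineCollector and on_error, and the sample messages, are made up for this example; the commented call assumes a model and training data already exist.

```python
class LineCollector:
    """Minimal object with a write() method that stores log text in memory."""

    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)


errors = []


def on_error(text):
    """Callable that receives error log text."""
    errors.append(text)


collector = LineCollector()

# Intended use (assumes `model` and `train_pool` are defined elsewhere):
# summary = model.select_features(train_pool, features_for_select='0-99',
#                                 num_features_to_select=10,
#                                 log_cout=collector, log_cerr=on_error)

# Simulate what a logger would do with each kind of target.
collector.write("step 1 finished\n")
on_error("warning: example message\n")
print(collector.chunks, errors)
```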

Return value

Dict with four fields:

selected_features — a list with indices of selected features.
selected_features_names — a list with names of selected features, if feature names were specified.
eliminated_features — a list with indices of eliminated features.
eliminated_features_names — a list with names of eliminated features, if feature names were specified.
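
One common use of the returned dict is to reduce a feature matrix to the selected columns. The sketch below uses a hand-written summary with illustrative values, not real output from select_features, and a small list-of-rows matrix.

```python
# Illustrative values only; a real call returns the indices chosen for your data.
summary = {
    "selected_features": [0, 2],
    "selected_features_names": ["F0", "F2"],
    "eliminated_features": [1, 3],
    "eliminated_features_names": ["F1", "F3"],
}

rows = [
    [1.0, 9.0, 3.0, 7.0],
    [2.0, 8.0, 4.0, 6.0],
]

# Keep only the selected columns, in the order reported by the summary.
selected = summary["selected_features"]
reduced_rows = [[row[i] for i in selected] for row in rows]
print(reduced_rows)
```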

Examples

from catboost import CatBoostRegressor, Pool, EShapCalcType, EFeaturesSelectionAlgorithm
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic regression data: 100 features, 20 of them informative.
X, y = make_regression(n_samples=1000, n_features=100, n_informative=20, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=0)
feature_names = ['F{}'.format(i) for i in range(train_X.shape[1])]
train_pool = Pool(train_X, train_y, feature_names=feature_names)
test_pool = Pool(test_X, test_y, feature_names=feature_names)

model = CatBoostRegressor(iterations=1000, random_seed=0)
# Keep 10 of the 100 candidate features, eliminating the rest over 3 steps.
summary = model.select_features(
    train_pool,
    eval_set=test_pool,
    features_for_select='0-99',
    num_features_to_select=10,
    steps=3,
    algorithm=EFeaturesSelectionAlgorithm.RecursiveByShapValues,
    shap_calc_type=EShapCalcType.Regular,
    train_final_model=True,
    logging_level='Silent',
    plot=True  # plotting is intended for Jupyter Notebook
)

The following is a chart plotted with Jupyter Notebook for the given example.
