FeaturesData
class FeaturesData(num_feature_data=None,
cat_feature_data=None,
num_feature_names=None,
cat_feature_names=None)
Purpose
Stores feature data in an optimized form for subsequent passing to the Pool constructor. Creating a pool from this representation is much faster than from a generic numpy.ndarray, pandas.DataFrame or pandas.Series when the dataset contains both numerical and categorical features and most of them are numerical. For datasets that contain only numerical features, pass a numpy.ndarray with dtype numpy.float32 to get similar performance.
Warning
FeaturesData performs no checks on the input data. Use it only if you are confident that the data is prepared correctly and you want to avoid spending additional time on checks. Otherwise, pass the input dataset and target variables directly to the Pool class.
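Since the class itself does not validate its input, a caller may want to run a few cheap checks before constructing it. The following is a minimal sketch of such a pre-flight check, based only on the parameter requirements listed on this page. `check_features_data_inputs` is a hypothetical helper and is not part of CatBoost.

```python
import numpy as np


def check_features_data_inputs(num_feature_data=None, cat_feature_data=None):
    """Hypothetical helper: raise ValueError if an array breaks the documented requirements."""
    row_counts = []

    if num_feature_data is not None:
        if not isinstance(num_feature_data, np.ndarray) or num_feature_data.ndim != 2:
            raise ValueError("num_feature_data must be a 2-D numpy.ndarray")
        if num_feature_data.dtype != np.float32:
            raise ValueError("num_feature_data must have dtype numpy.float32")
        row_counts.append(num_feature_data.shape[0])

    if cat_feature_data is not None:
        if not isinstance(cat_feature_data, np.ndarray) or cat_feature_data.ndim != 2:
            raise ValueError("cat_feature_data must be a 2-D numpy.ndarray")
        if cat_feature_data.dtype != object:
            raise ValueError("cat_feature_data must have dtype object")
        # Every categorical value is expected to be a string (str or UTF-8 bytes).
        if not all(isinstance(v, (str, bytes)) for v in cat_feature_data.ravel()):
            raise ValueError("cat_feature_data must contain only str or bytes values")
        row_counts.append(cat_feature_data.shape[0])

    # Both arrays describe the same objects, so their row counts must match.
    if len(set(row_counts)) > 1:
        raise ValueError("num_feature_data and cat_feature_data have different object counts")


num = np.array([[1, 4], [4, 5]], dtype=np.float32)
cat = np.array([["a", "b"], ["c", "d"]], dtype=object)
check_features_data_inputs(num, cat)  # passes silently
```

Running the check once on a sample of the data during development, and skipping it in production, keeps the speed benefit of FeaturesData.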
Parameters
num_feature_data
Description
Numerical features for all objects from the dataset, in the form of a numpy.ndarray of shape (object_count x num_feature_count) with dtype numpy.float32.
Possible types
numpy.ndarray
Default value
None (the dataset does not contain numerical features)
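The shape and dtype requirement above can be met with a plain NumPy conversion. The sketch below assumes the source values are numeric and fit in single precision. `to_num_feature_data` is a hypothetical helper name used only for illustration.

```python
import numpy as np


def to_num_feature_data(values):
    """Hypothetical helper: return a float32 array of shape (object_count, num_feature_count)."""
    arr = np.asarray(values, dtype=np.float32)
    if arr.ndim == 1:
        # Treat a flat sequence as a single feature column.
        arr = arr.reshape(-1, 1)
    return arr


num_feature_data = to_num_feature_data([[1, 4, 5, 6], [4, 5, 6, 7], [30, 40, 50, 60]])
print(num_feature_data.dtype, num_feature_data.shape)  # float32 (3, 4)
```

Note that converting from float64 to float32 discards precision, so values that differ only beyond about seven significant digits become indistinguishable.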
cat_feature_data
Description
Categorical features for all objects from the dataset, in the form of a numpy.ndarray of shape (object_count x cat_feature_count) with dtype object.
The elements must be of bytes type and should contain UTF-8 encoded strings.
Categorical features must be passed as strings, for example:
data=FeaturesData(cat_feature_data=np.array([['a','c'], ['b', 'c']], dtype=object))
Using other data types (for example, int32) raises an error.
Possible types
numpy.ndarray
Default value
None (the dataset does not contain categorical features)
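Raw categorical columns often hold integers or mixed values, which do not satisfy the string requirement above. The sketch below shows one way to build a suitable array by converting every value to `str`, matching the example given in the description. `to_cat_feature_data` is a hypothetical helper, and the choice of `str` over UTF-8 encoded `bytes` is an assumption made for this illustration.

```python
import numpy as np


def to_cat_feature_data(rows):
    """Hypothetical helper: convert every value to str and store it in a 2-D object array."""
    as_strings = [[str(value) for value in row] for row in rows]
    return np.array(as_strings, dtype=object)


cat_feature_data = to_cat_feature_data([[1, "a"], [2, "b"], [1, "c"]])
print(cat_feature_data.dtype, cat_feature_data.shape)  # object (3, 2)
```

If bytes are preferred, each value could be encoded with `str(value).encode("utf-8")` instead.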
num_feature_names
Description
The names of numerical features in the form of a sequence of strings or bytes.
If the string is represented by the bytes type, it must be UTF-8 encoded.
Possible types
- list of strings
- list of bytes
Default value
None (the num_feature_names data attribute is set to a list of empty strings)
cat_feature_names
Description
The names of categorical features in the form of a sequence of strings or bytes.
If the string is represented by the bytes type, it must be UTF-8 encoded.
Possible types
- list of strings
- list of bytes
Default value
None (the cat_feature_names data attribute is set to a list of empty strings)
Specifics
- The order of features in the created Pool is the following:
  [num_features (if any present)][cat_features (if any present)]
- The feature data must be passed in the same order when applying the trained model.
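The ordering rule above means that feature indices in the Pool can differ from the column positions in the original table. The sketch below computes the resulting order, assuming the numerical and categorical arrays are built by taking columns in their original relative order. `pool_feature_order` and the column names are hypothetical.

```python
def pool_feature_order(column_names, cat_columns):
    """Hypothetical helper: list columns in the order they take in the created Pool."""
    cat_set = set(cat_columns)
    num_names = [name for name in column_names if name not in cat_set]
    cat_names = [name for name in column_names if name in cat_set]
    # Numerical features come first, categorical features follow.
    return num_names + cat_names


order = pool_feature_order(
    ["city", "age", "income", "device"],
    cat_columns=["city", "device"],
)
print(order)  # ['age', 'income', 'city', 'device']
```

Here `city` is column 0 in the source table but feature 2 in the Pool, which matters when reading feature importances or passing feature indices to other parameters.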
Methods
Method | Description |
---|---|
get_cat_feature_count | Return the number of categorical features contained in the dataset. |
get_feature_count | Return the total number of features (both numerical and categorical) contained in the dataset. |
get_feature_names | Return the names of features from the dataset. |
get_num_feature_count | Return the number of numerical features contained in the dataset. |
get_object_count | Return the number of objects contained in the dataset. |
Usage examples
CatBoostClassifier with FeaturesData
import numpy as np
from catboost import CatBoostClassifier, FeaturesData
# Initialize data
train_data = FeaturesData(
    num_feature_data=np.array([[1, 4, 5, 6], [4, 5, 6, 7], [30, 40, 50, 60]], dtype=np.float32),
    cat_feature_data=np.array([["a", "b"], ["a", "b"], ["c", "d"]], dtype=object)
)
train_labels = [1, 1, -1]
test_data = FeaturesData(
    num_feature_data=np.array([[2, 4, 6, 8], [1, 4, 50, 60]], dtype=np.float32),
    cat_feature_data=np.array([["a", "b"], ["a", "d"]], dtype=object)
)
# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=2, learning_rate=1, depth=2, loss_function='Logloss')
# Fit model
model.fit(train_data, train_labels)
# Get predicted classes
preds_class = model.predict(test_data)
# Get predicted probabilities for each class
preds_proba = model.predict_proba(test_data)
# Get predicted RawFormulaVal
preds_raw = model.predict(test_data, prediction_type='RawFormulaVal')
CatBoostClassifier with Pool and FeaturesData
import numpy as np
from catboost import CatBoostClassifier, FeaturesData, Pool
# Initialize data
train_data = Pool(
    data=FeaturesData(
        num_feature_data=np.array([[1, 4, 5, 6],
                                   [4, 5, 6, 7],
                                   [30, 40, 50, 60]],
                                  dtype=np.float32),
        cat_feature_data=np.array([["a", "b"],
                                   ["a", "b"],
                                   ["c", "d"]],
                                  dtype=object)
    ),
    label=[1, 1, -1]
)
test_data = Pool(
    data=FeaturesData(
        num_feature_data=np.array([[2, 4, 6, 8],
                                   [1, 4, 50, 60]],
                                  dtype=np.float32),
        cat_feature_data=np.array([["a", "b"],
                                   ["a", "d"]],
                                  dtype=object)
    )
)
# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=2,
                           learning_rate=1,
                           depth=2,
                           loss_function='Logloss')
# Fit model
model.fit(train_data)
# Get predicted classes
preds_class = model.predict(test_data)
# Get predicted probabilities for each class
preds_proba = model.predict_proba(test_data)
# Get predicted RawFormulaVal
preds_raw = model.predict(test_data, prediction_type='RawFormulaVal')