Text features
CatBoost supports numerical, categorical, text, and embedding features.
Text features are used to build new numeric features. See the Transforming text features to numerical features section for details.
Choose the implementation below for details on the methods and parameters required to start using text features.
Python package
Class / method
Parameters
text_features
A one-dimensional array of text column indices (specified as integers) or names (specified as strings).
Use only if the data parameter is a two-dimensional feature matrix (has one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series).
If any elements in this array are specified as names instead of indices, names for all columns must be provided. To do this, either use the feature_names parameter of this constructor to explicitly specify them or pass a pandas.DataFrame with column names specified in the data parameter.
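The naming rule above can be sketched as follows. The `resolve_text_features` helper, the column names, and the sample rows are hypothetical and only illustrate the rule; `build_pool` shows how the same values might be passed to `catboost.Pool`, assuming the catboost package is installed (the import is kept inside the function so the rest of the snippet runs without it).

```python
def resolve_text_features(text_features, feature_names=None):
    """Hypothetical helper: map text column names or indices to integer indices."""
    resolved = []
    for item in text_features:
        if isinstance(item, str):
            # Names can only be resolved when names for all columns are known.
            if feature_names is None:
                raise ValueError(
                    "text_features contains names, so names for all columns are required"
                )
            resolved.append(feature_names.index(item))
        else:
            resolved.append(int(item))
    return resolved


# Hypothetical two-dimensional feature matrix with one numeric and two text columns.
feature_names = ["age", "title", "review"]
rows = [
    [25, "great phone", "battery lasts two days"],
    [41, "poor case", "cracked after a week"],
]
labels = [1, 0]

text_by_name = resolve_text_features(["title", "review"], feature_names)
text_by_index = resolve_text_features([1, 2])
print(text_by_name, text_by_index)


def build_pool():
    """Pass the same data to catboost.Pool (requires the catboost package)."""
    from catboost import Pool

    return Pool(
        data=rows,
        label=labels,
        feature_names=feature_names,
        text_features=["title", "review"],
    )
```

Specifying the text columns as `[1, 2]` instead of names would not require `feature_names`.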
Text processing parameters
Supported training parameters:
tokenizers
Description
Tokenizers used to preprocess Text type feature columns before creating the dictionary.
Format:
[{
'TokenizerId1': <value>,
'option_name_1': <value>,
..
'option_name_N': <value>,}]
TokenizerId: The unique name of the tokenizer.
option_name: One of the supported tokenizer options.
Note
This parameter works with the dictionaries and feature_calcers parameters.
For example, if a single tokenizer, three dictionaries and two feature calcers are given, a total of 6 new groups of features are created for each original text feature (1 × 3 × 2 = 6).
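The count in the note can be checked by enumerating the combinations. This is only a sketch of the arithmetic, not of how CatBoost builds the features internally; the identifier "Trigram" is a hypothetical third dictionary added to match the numbers in the note.

```python
from itertools import product

tokenizer_ids = ["Space"]                           # one tokenizer
dictionary_ids = ["Unigram", "Bigram", "Trigram"]   # three dictionaries ("Trigram" is hypothetical)
calcer_names = ["BoW", "NaiveBayes"]                # two feature calcers

# One group of new features per (tokenizer, dictionary, calcer) combination.
feature_groups = list(product(tokenizer_ids, dictionary_ids, calcer_names))

for tokenizer_id, dictionary_id, calcer_name in feature_groups:
    print(f"{tokenizer_id} -> {dictionary_id} -> {calcer_name}")

print(len(feature_groups))  # 1 * 3 * 2 = 6
```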
Usage example
tokenizers = [{
'tokenizerId': 'Space',
'delimiter': ' ',
'separator_type': 'ByDelimiter',
},{
'tokenizerId': 'Sense',
'separator_type': 'BySense',
}]
Possible types
list of json
Default value
–
Supported processing units
CPU and GPU
dictionaries
Description
Dictionaries used to preprocess Text type feature columns.
Format:
[{
'dictionaryId1': <value>,
'option_name_1': <value>,
..
'option_name_N': <value>,}]
DictionaryId: The unique name of the dictionary.
option_name: One of the supported dictionary options.
Note
This parameter works with the tokenizers and feature_calcers parameters.
For example, if a single tokenizer, three dictionaries and two feature calcers are given, a total of 6 new groups of features are created for each original text feature (1 × 3 × 2 = 6).
Usage example
dictionaries = [{
'dictionaryId': 'Unigram',
'max_dictionary_size': '50000',
'gram_count': '1',
},{
'dictionaryId': 'Bigram',
'max_dictionary_size': '50000',
'gram_count': '2',
}]
Possible types
list of json
Default value
–
Supported processing units
CPU and GPU
feature_calcers
Description
Feature calcers used to calculate new features based on preprocessed Text type feature columns.
Format:
['FeatureCalcerName[:option_name=option_value]',
]
- FeatureCalcerName: The required feature calcer.
- option_name: Additional options for feature calcers. Refer to the list of supported calcers for details on options available for each of them.
Note
This parameter works with the tokenizers and dictionaries parameters.
For example, if a single tokenizer, three dictionaries and two feature calcers are given, a total of 6 new groups of features are created for each original text feature (1 × 3 × 2 = 6).
Usage example
feature_calcers = [
'BoW:top_tokens_count=1000',
'NaiveBayes',
]
Possible types
list of strings
Default value
–
Supported processing units
CPU and GPU
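A minimal sketch of passing the three parameters together, assuming the catboost package is installed. The values, including the key spellings, are copied from the usage examples above and should be checked against the installed CatBoost version; `build_classifier` and the `iterations` value are illustrative. The import sits inside the function, so the snippet itself runs without training anything.

```python
# Parameter values copied from the usage examples in this section.
text_params = {
    "tokenizers": [
        {"tokenizerId": "Space", "delimiter": " ", "separator_type": "ByDelimiter"},
        {"tokenizerId": "Sense", "separator_type": "BySense"},
    ],
    "dictionaries": [
        {"dictionaryId": "Unigram", "max_dictionary_size": "50000", "gram_count": "1"},
        {"dictionaryId": "Bigram", "max_dictionary_size": "50000", "gram_count": "2"},
    ],
    "feature_calcers": [
        "BoW:top_tokens_count=1000",
        "NaiveBayes",
    ],
}


def build_classifier(text_features):
    """Create an unfitted classifier with the text settings above (requires catboost)."""
    from catboost import CatBoostClassifier

    return CatBoostClassifier(
        iterations=100,  # arbitrary value for illustration
        text_features=text_features,
        **text_params,
    )
```

A call such as `build_classifier(text_features=[1, 2])` would return a model ready for `fit`; the column indices here are placeholders.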
text_processing
Description
A JSON specification of tokenizers, dictionaries and feature calcers, which determine how text features are converted into a list of float features.
Refer to the description of the following parameters for details on supported values:
tokenizers
dictionaries
feature_calcers
Alert
Do not use this parameter with the following ones:
tokenizers
dictionaries
feature_calcers
Possible types
json
Default value
Supported processing units
CPU and GPU
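The restriction in the alert can be enforced in user code before the parameters reach CatBoost. The `check_text_params` helper below is hypothetical and is not part of the CatBoost API; it only illustrates the rule that text_processing must not be combined with tokenizers, dictionaries, or feature_calcers.

```python
EXCLUSIVE_WITH_TEXT_PROCESSING = ("tokenizers", "dictionaries", "feature_calcers")


def check_text_params(params):
    """Hypothetical pre-flight check for a dict of training parameters."""
    if params.get("text_processing") is None:
        return  # nothing to check when text_processing is not used
    conflicts = [
        name for name in EXCLUSIVE_WITH_TEXT_PROCESSING if params.get(name) is not None
    ]
    if conflicts:
        raise ValueError(
            "text_processing cannot be combined with: " + ", ".join(conflicts)
        )


# Accepted: only the three separate parameters are set.
check_text_params({"tokenizers": [], "feature_calcers": ["NaiveBayes"]})

# Rejected: text_processing is set together with feature_calcers.
try:
    check_text_params({"text_processing": {}, "feature_calcers": ["NaiveBayes"]})
except ValueError as error:
    print(error)
```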
Additional classes
Additional classes are provided for text processing:
Tokenizer
Class purpose:
Tokenize and process the strings.
Dictionary
Class purpose:
Process dictionaries. The text must be tokenized before working with dictionaries.
Command-line version
For the Train a model command:
--tokenizers
Key description:
Tokenizers used to preprocess Text type feature columns before creating the dictionary.
Format:
TokenizerId[:option_name=option_value]
TokenizerId: The unique name of the tokenizer.
option_name: One of the supported tokenizer options.
Note
This parameter works with the --dictionaries and --feature-calcers parameters.
For example, if a single tokenizer, three dictionaries and two feature calcers are given, a total of 6 new groups of features are created for each original text feature (1 × 3 × 2 = 6).
Usage example
--tokenizers "Space:delimiter= :separator_type=ByDelimiter,Sense:separator_type=BySense"
--dictionaries
Key description:
Dictionaries used to preprocess Text type feature columns.
Format:
DictionaryId[:option_name=option_value]
DictionaryId: The unique name of the dictionary.
option_name: One of the supported dictionary options.
Note
This parameter works with the --tokenizers and --feature-calcers parameters.
For example, if a single tokenizer, three dictionaries and two feature calcers are given, a total of 6 new groups of features are created for each original text feature (1 × 3 × 2 = 6).
Usage example
--dictionaries "Unigram:gram_count=1:max_dictionary_size=50000,Bigram:gram_count=2:max_dictionary_size=50000"
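The command-line value follows the Id[:option_name=option_value] format, with entries separated by commas. A small sketch of producing such a value from a Python-style specification is shown below; the `to_cli_value` helper is hypothetical and does not escape values that themselves contain `:` or `,`.

```python
def to_cli_value(specs, id_key):
    """Hypothetical helper: render specs as comma-separated Id[:option=value] items."""
    items = []
    for spec in specs:
        # Every key except the identifier becomes an option_name=option_value pair.
        options = [f"{name}={value}" for name, value in spec.items() if name != id_key]
        items.append(":".join([str(spec[id_key])] + options))
    return ",".join(items)


dictionaries = [
    {"dictionaryId": "Unigram", "gram_count": "1", "max_dictionary_size": "50000"},
    {"dictionaryId": "Bigram", "gram_count": "2", "max_dictionary_size": "50000"},
]

cli_value = to_cli_value(dictionaries, "dictionaryId")
print(f'--dictionaries "{cli_value}"')
```

The same helper could be applied to a tokenizer specification by passing its identifier key instead.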
--feature-calcers
Key description:
Feature calcers used to calculate new features based on preprocessed Text type feature columns.
Format:
FeatureCalcerName[:option_name=option_value]
- FeatureCalcerName: The required feature calcer.
- option_name: Additional options for feature calcers. Refer to the list of supported calcers for details on options available for each of them.
Note
This parameter works with the --tokenizers and --dictionaries parameters.
For example, if a single tokenizer, three dictionaries and two feature calcers are given, a total of 6 new groups of features are created for each original text feature (1 × 3 × 2 = 6).
Usage example
--feature-calcers BoW:top_tokens_count=1000,NaiveBayes
--text-processing
Key description:
A JSON specification of tokenizers, dictionaries and feature calcers, which determine how text features are converted into a list of float features.
Refer to the description of the following parameters for details on supported values:
--tokenizers
--dictionaries
--feature-calcers
Alert
Do not use this parameter with the following ones:
--tokenizers
--dictionaries
--feature-calcers