Text features

CatBoost supports numerical, categorical, text, and embeddings features.

Text features are used to build new numeric features. See the Transforming text features to numerical features section for details.

Choose an implementation below for details on the methods and parameters required to start using text features.

Python package

Class / method

Parameters

text_features

A one-dimensional array of text column indices (specified as integers) or names (specified as strings).

Use only if the data parameter is a two-dimensional feature matrix of one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series.

If any elements in this array are specified as names instead of indices, names for all columns must be provided. To do this, either use the feature_names parameter of this constructor to explicitly specify them or pass a pandas.DataFrame with column names specified in the data parameter.
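
The naming rule above can be illustrated with a short sketch. The helper resolve_text_features is hypothetical (it is not part of CatBoost) and only mimics the documented rule. build_pool shows how the same arguments could be passed to catboost.Pool, assuming the catboost package is installed.

```python
def resolve_text_features(text_features, feature_names=None):
    """Hypothetical helper: turn a mix of column indices and names into indices.

    Mirrors the documented rule: if any element is a name, names for all
    columns must be available (here through `feature_names`).
    """
    resolved = []
    for item in text_features:
        if isinstance(item, str):
            if feature_names is None:
                raise ValueError(
                    f"text feature {item!r} is given by name, "
                    "but no column names were provided"
                )
            # list.index raises ValueError if the name is unknown
            resolved.append(feature_names.index(item))
        else:
            resolved.append(int(item))
    return resolved


def build_pool(rows, labels, text_features, feature_names=None):
    """Pass the same arguments to catboost.Pool (requires catboost)."""
    # Imported lazily so that the helper above runs without catboost.
    from catboost import Pool

    return Pool(
        data=rows,
        label=labels,
        text_features=text_features,
        feature_names=feature_names,
    )


columns = ["price", "review"]
print(resolve_text_features([1], columns))         # by index
print(resolve_text_features(["review"], columns))  # by name
```

Both calls resolve to the same column. Calling resolve_text_features(["review"]) without column names raises ValueError, which is the situation the rule above is meant to prevent.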

Text processing parameters

Supported training parameters:

tokenizers

Description

Tokenizers used to preprocess Text type feature columns before creating the dictionary.

Format:

[{
'TokenizerId1': <value>,
'option_name_1': <value>,
..
'option_name_N': <value>,}]

TokenizerId — The unique name of the tokenizer.
option_name — One of the supported tokenizer options.

Note

This parameter is used together with the dictionaries and feature_calcers parameters.

For example, if one tokenizer, three dictionaries, and two feature calcers are specified, a total of 6 new groups of features are created for each original text feature ($1 \cdot 3 \cdot 2 = 6$).

Usage example

tokenizers = [{
	'tokenizer_id': 'Space',
	'delimiter': ' ',
	'separator_type': 'ByDelimiter',
},{
	'tokenizer_id': 'Sense',
	'separator_type': 'BySense',
}]
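
To make the ByDelimiter separator type in the example above more concrete, here is a rough stand-in written in plain Python. It is an approximation for illustration, not CatBoost's tokenizer: the function name and the skip_empty argument are assumptions of this sketch, and the BySense separator type is not reproduced here.

```python
def tokenize_by_delimiter(text, delimiter=" ", skip_empty=True):
    """Rough stand-in for a 'ByDelimiter' tokenizer: split on a fixed delimiter."""
    tokens = text.split(delimiter)
    if skip_empty:
        # Consecutive delimiters produce empty strings; drop them.
        tokens = [token for token in tokens if token]
    return tokens


print(tokenize_by_delimiter("cats  are great"))  # ['cats', 'are', 'great']
```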

Possible types

list of json

Default value

–

Supported processing units

CPU and GPU

dictionaries

Description

Dictionaries used to preprocess Text type feature columns.

Format:

[{
'dictionaryId1': <value>,
'option_name_1': <value>,
..
'option_name_N': <value>,}]

DictionaryId: the unique name of the dictionary.
option_name: one of the supported dictionary options.

Note

This parameter is used together with the tokenizers and feature_calcers parameters.

For example, if one tokenizer, three dictionaries, and two feature calcers are specified, a total of 6 new groups of features are created for each original text feature ($1 \cdot 3 \cdot 2 = 6$).

Usage example

dictionaries = [{
	'dictionary_id': 'Unigram',
	'max_dictionary_size': '50000',
	'gram_order': '1',
},{
	'dictionary_id': 'Bigram',
	'max_dictionary_size': '50000',
	'gram_order': '2',
}]
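
The gram_order and max_dictionary_size options in the example above can be pictured with a small plain-Python sketch. This is an illustration of the general idea (count n-grams, keep the most frequent ones, assign integer ids), not CatBoost's implementation. The function name is hypothetical, and the options are passed as integers here rather than as strings.

```python
from collections import Counter


def build_ngram_dictionary(tokenized_texts, gram_order=1, max_dictionary_size=50000):
    """Illustrative only: map the most frequent n-grams to integer ids."""
    counts = Counter()
    for tokens in tokenized_texts:
        # Slide a window of `gram_order` tokens over each text.
        for start in range(len(tokens) - gram_order + 1):
            counts[tuple(tokens[start:start + gram_order])] += 1
    most_common = counts.most_common(max_dictionary_size)
    return {ngram: idx for idx, (ngram, _) in enumerate(most_common)}


texts = [["cats", "are", "great"], ["cats", "are", "small"]]
unigrams = build_ngram_dictionary(texts, gram_order=1)
bigrams = build_ngram_dictionary(texts, gram_order=2)
print(len(unigrams), len(bigrams))  # 4 3
```

Lowering max_dictionary_size truncates the dictionary to the most frequent entries, which is the trade-off the option controls.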

Possible types

list of json

Default value

–

Supported processing units

CPU and GPU

feature_calcers

Description

Feature calcers used to calculate new features based on preprocessed Text type feature columns.

Format:

['FeatureCalcerName[:option_name=option_value]', ...]

FeatureCalcerName — The required feature calcer.
option_name — Additional options for feature calcers. Refer to the list of supported calcers for details on options available for each of them.

Note

This parameter is used together with the tokenizers and dictionaries parameters.

For example, if one tokenizer, three dictionaries, and two feature calcers are specified, a total of 6 new groups of features are created for each original text feature ($1 \cdot 3 \cdot 2 = 6$).

Usage example

feature_calcers = [
	'BoW:top_tokens_count=1000',
	'NaiveBayes',
]
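
The count of feature groups described in the notes is a plain product over the three parameters. The sketch below enumerates the combinations for the names used in the usage examples on this page. It only illustrates the counting rule from the notes and makes no claim about how CatBoost organizes the groups internally.

```python
from itertools import product

tokenizers = ["Space", "Sense"]
dictionaries = ["Unigram", "Bigram"]
feature_calcers = ["BoW", "NaiveBayes"]

# One group per (tokenizer, dictionary, feature calcer) combination.
groups = list(product(tokenizers, dictionaries, feature_calcers))
for tokenizer, dictionary, calcer in groups:
    print(f"{tokenizer} -> {dictionary} -> {calcer}")
print(len(groups))  # 8


def count_feature_groups(n_tokenizers, n_dictionaries, n_calcers):
    """Number of new feature groups per original text feature."""
    return n_tokenizers * n_dictionaries * n_calcers


print(count_feature_groups(1, 3, 2))  # 6, the case from the note
```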

Possible types

list of strings

Default value

–

Supported processing units

CPU and GPU

text_processing

Description

A JSON specification of tokenizers, dictionaries and feature calcers, which determine how text features are converted into a list of float features.

Example

Refer to the description of the following parameters for details on supported values:

tokenizers
dictionaries
feature_calcers

Alert

Do not use this parameter with the following ones:

tokenizers
dictionaries
feature_calcers
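
The restriction above can be enforced before training with a small pre-flight check. The function below is a hypothetical helper, not part of CatBoost. It assumes the training parameters are collected in a plain dict keyed by parameter name.

```python
SEPARATE_TEXT_PARAMS = ("tokenizers", "dictionaries", "feature_calcers")


def check_text_options(params):
    """Hypothetical check: reject text_processing combined with the separate parameters."""
    conflicting = sorted(set(SEPARATE_TEXT_PARAMS).intersection(params))
    if "text_processing" in params and conflicting:
        raise ValueError(
            "text_processing cannot be combined with: " + ", ".join(conflicting)
        )
    return params


# Accepted: only the separate parameters are used.
check_text_options({"tokenizers": [], "feature_calcers": ["BoW"]})

# Rejected: text_processing together with feature_calcers.
try:
    check_text_options({"text_processing": {}, "feature_calcers": ["BoW"]})
except ValueError as error:
    print(error)
```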

Possible types

json

Default value

Default value

Supported processing units

CPU and GPU

Additional classes

Additional classes are provided for text processing:

Tokenizer

Class purpose:

Tokenize and process the strings.

Dictionary

Class purpose:

Process dictionaries. The text must be tokenized before working with dictionaries.
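
The order of operations (tokenize first, then work with the dictionary) can be sketched as follows. This assumes the catboost package is installed and that the classes are importable from catboost.text_processing. The constructor arguments shown, including the spelling occurence_lower_bound, are written from memory and should be checked against the class reference before use.

```python
def tokenize_and_index(texts, max_dictionary_size=50000):
    """Tokenize raw strings, then fit a dictionary and map tokens to ids.

    Requires catboost; the import is local so that defining this function
    does not fail when the package is missing.
    """
    from catboost.text_processing import Dictionary, Tokenizer

    tokenizer = Tokenizer(
        lowercasing=True, separator_type="ByDelimiter", delimiter=" "
    )
    tokenized = [tokenizer.tokenize(text) for text in texts]

    # The dictionary is fitted on token lists, never on raw strings.
    dictionary = Dictionary(
        occurence_lower_bound=0, max_dictionary_size=max_dictionary_size
    )
    dictionary.fit(tokenized)
    return tokenized, dictionary.apply(tokenized)


if __name__ == "__main__":
    try:
        tokens, ids = tokenize_and_index(["Cats are great", "Cats are small"])
        print(tokens)
        print(ids)
    except ImportError:
        print("catboost is not installed; skipping the demo")
```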

Command-line version

For the Train a model command:

--tokenizers

Key description:

Tokenizers used to preprocess Text type feature columns before creating the dictionary.

Format:

TokenizerId[:option_name=option_value]

TokenizerId — The unique name of the tokenizer.
option_name — One of the supported tokenizer options.

Note

This parameter is used together with the --dictionaries and --feature-calcers parameters.

For example, if one tokenizer, three dictionaries, and two feature calcers are specified, a total of 6 new groups of features are created for each original text feature ($1 \cdot 3 \cdot 2 = 6$).

Usage example

--tokenizers "Space:delimiter= :separator_type=ByDelimiter,Sense:separator_type=BySense"

--dictionaries

Key description:

Dictionaries used to preprocess Text type feature columns.

Format:

DictionaryId[:option_name=option_value]

DictionaryId: the unique name of the dictionary.
option_name: one of the supported dictionary options.

Note

This parameter is used together with the --tokenizers and --feature-calcers parameters.

For example, if one tokenizer, three dictionaries, and two feature calcers are specified, a total of 6 new groups of features are created for each original text feature ($1 \cdot 3 \cdot 2 = 6$).

Usage example

--dictionaries "Unigram:gram_order=1:max_dictionary_size=50000,Bigram:gram_order=2:max_dictionary_size=50000"

--feature-calcers

Key description:

Feature calcers used to calculate new features based on preprocessed Text type feature columns.

Format:

FeatureCalcerName[:option_name=option_value]

FeatureCalcerName — The required feature calcer.
option_name — Additional options for feature calcers. Refer to the list of supported calcers for details on options available for each of them.

Note

This parameter is used together with the --tokenizers and --dictionaries parameters.

For example, if one tokenizer, three dictionaries, and two feature calcers are specified, a total of 6 new groups of features are created for each original text feature ($1 \cdot 3 \cdot 2 = 6$).

Usage example

--feature-calcers BoW:top_tokens_count=1000,NaiveBayes
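
The three command-line values above share one shape: comma-separated entries, each a name followed by optional :option_name=option_value pairs. The sketch below builds those strings from (name, options) pairs. The helper to_cli_value is hypothetical and only reproduces the format shown in the usage examples. Passing the result as a list of arguments (for example to subprocess.run) avoids shell quoting problems with the space delimiter.

```python
def to_cli_value(specs):
    """Join (name, options) pairs as Name[:option=value]..., comma-separated."""
    parts = []
    for name, options in specs:
        suffix = "".join(f":{key}={value}" for key, value in options.items())
        parts.append(name + suffix)
    return ",".join(parts)


tokenizers = [
    ("Space", {"delimiter": " ", "separator_type": "ByDelimiter"}),
    ("Sense", {"separator_type": "BySense"}),
]
dictionaries = [
    ("Unigram", {"gram_order": 1, "max_dictionary_size": 50000}),
    ("Bigram", {"gram_order": 2, "max_dictionary_size": 50000}),
]
feature_calcers = [("BoW", {"top_tokens_count": 1000}), ("NaiveBayes", {})]

# Only the text-related keys are assembled; other training keys are omitted.
text_args = [
    "--tokenizers", to_cli_value(tokenizers),
    "--dictionaries", to_cli_value(dictionaries),
    "--feature-calcers", to_cli_value(feature_calcers),
]
print(text_args)
```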

--text-processing

Key description:

A JSON specification of tokenizers, dictionaries and feature calcers, which determine how text features are converted into a list of float features.

Example

Refer to the description of the following parameters for details on supported values:

--tokenizers
--dictionaries
--feature-calcers

Alert

Do not use this parameter with the following ones:

--tokenizers
--dictionaries
--feature-calcers
