Text processing parameters

tokenizers
dictionaries
feature_calcers
text_processing

These parameters are available only in the Python package and the command-line version.

tokenizers

Command-line: --tokenizers

Description

Tokenizers used to preprocess Text type feature columns before creating the dictionary.

Format:

[{
'TokenizerId1': <value>,
'option_name_1': <value>,
...
'option_name_N': <value>,
}]

TokenizerId — The unique name of the tokenizer.
option_name — One of the supported tokenizer options.

Note

This parameter works together with the dictionaries and feature_calcers parameters.

For example, if one tokenizer, three dictionaries, and two feature calcers are given, a total of 6 new groups of features are created for each original text feature ($1 \cdot 3 \cdot 2 = 6$).
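The counting rule in this note can be reproduced in plain Python. The snippet below is only a sketch of the arithmetic: it does not call CatBoost, and the identifiers (including the hypothetical 'Trigram' dictionary) are illustrative placeholders.

```python
from itertools import product

# Illustrative identifiers only; 'Trigram' is a hypothetical third dictionary.
tokenizer_ids = ['Space']
dictionary_ids = ['Unigram', 'Bigram', 'Trigram']
calcer_names = ['BoW', 'NaiveBayes']

# One feature group per (tokenizer, dictionary, calcer) combination.
feature_groups = list(product(tokenizer_ids, dictionary_ids, calcer_names))

print(len(feature_groups))  # 1 * 3 * 2 = 6
```

Adding a second tokenizer to the list would double the count to 12, which is why the three parameters should be sized together.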

Usage example

tokenizers = [{
	'tokenizerId': 'Space',
	'delimiter': ' ',
	'separator_type': 'ByDelimiter',
},{
	'tokenizerId': 'Sense',
	'separator_type': 'BySense',
}]
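To give a rough idea of what a delimiter-based tokenizer such as 'Space' above produces, here is a minimal sketch in plain Python. The function name tokenize_by_delimiter is hypothetical, and dropping empty tokens is an assumption; the library's own tokenizer may treat repeated delimiters differently.

```python
def tokenize_by_delimiter(text, delimiter=' '):
    """Split text on a delimiter and drop empty tokens (assumed behavior)."""
    return [token for token in text.split(delimiter) if token]

# Two consecutive spaces yield an empty string, which the filter removes.
tokens = tokenize_by_delimiter('cats  chase mice')
print(tokens)  # ['cats', 'chase', 'mice']
```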

Type

list of json

Default value

–

Supported processing units

CPU

dictionaries

Command-line: --dictionaries

Description

Dictionaries used to preprocess Text type feature columns.

Format:

[{
'dictionaryId1': <value>,
'option_name_1': <value>,
...
'option_name_N': <value>,
}]

DictionaryId: The unique name of the dictionary.
option_name: One of the supported dictionary options.

Note

This parameter works together with the tokenizers and feature_calcers parameters.

For example, if one tokenizer, three dictionaries, and two feature calcers are given, a total of 6 new groups of features are created for each original text feature ($1 \cdot 3 \cdot 2 = 6$).

Usage example

dictionaries = [{
	'dictionaryId': 'Unigram',
	'max_dictionary_size': '50000',
	'gram_count': '1',
},{
	'dictionaryId': 'Bigram',
	'max_dictionary_size': '50000',
	'gram_count': '2',
}]
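The roles of gram_count and max_dictionary_size can be illustrated with a small conceptual sketch. This is not the library's implementation: build_dictionary is a hypothetical helper, it takes integers rather than the string values used in the example above, and keeping the most frequent n-grams is an assumption about how the size limit is applied.

```python
from collections import Counter

def build_dictionary(token_lists, gram_count=1, max_dictionary_size=50000):
    """Map the most frequent n-grams to integer ids (conceptual sketch)."""
    counts = Counter()
    for tokens in token_lists:
        # Slide a window of gram_count tokens over each document.
        for i in range(len(tokens) - gram_count + 1):
            counts[tuple(tokens[i:i + gram_count])] += 1
    # Keep at most max_dictionary_size entries, most frequent first.
    most_common = counts.most_common(max_dictionary_size)
    return {gram: idx for idx, (gram, _) in enumerate(most_common)}

docs = [['a', 'b', 'a'], ['a', 'b']]
unigrams = build_dictionary(docs, gram_count=1)
bigrams = build_dictionary(docs, gram_count=2)
print(unigrams)  # {('a',): 0, ('b',): 1}
print(bigrams)   # {('a', 'b'): 0, ('b', 'a'): 1}
```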

Type

list of json

Default value

–

Supported processing units

CPU

feature_calcers

Command-line: --feature-calcers

Description

Feature calcers used to calculate new features based on preprocessed Text type feature columns.

Format:

['FeatureCalcerName[:option_name=option_value]',
...
]

FeatureCalcerName — The required feature calcer.
option_name — Additional options for feature calcers. Refer to the list of supported calcers for details on options available for each of them.

Note

This parameter works together with the tokenizers and dictionaries parameters.

For example, if one tokenizer, three dictionaries, and two feature calcers are given, a total of 6 new groups of features are created for each original text feature ($1 \cdot 3 \cdot 2 = 6$).

Usage example

feature_calcers = [
	'BoW:top_tokens_count=1000',
	'NaiveBayes',
]
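The string format used above, a calcer name optionally followed by a colon and one option_name=option_value pair, can be unpacked with a few lines of Python. The helper parse_calcer is hypothetical and handles only the single-option form shown in the Format section; how several options would be written is not specified here.

```python
def parse_calcer(spec):
    """Split 'Name[:option_name=option_value]' into a name and an options dict."""
    name, _, option_part = spec.partition(':')
    options = {}
    if option_part:
        key, _, value = option_part.partition('=')
        options[key] = value  # values stay as strings, as in the spec
    return name, options

feature_calcers = ['BoW:top_tokens_count=1000', 'NaiveBayes']
parsed = [parse_calcer(spec) for spec in feature_calcers]
print(parsed)
# [('BoW', {'top_tokens_count': '1000'}), ('NaiveBayes', {})]
```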

Type

list of strings

Default value

–

Supported processing units

CPU

text_processing

Command-line: --text-processing

Description

A JSON specification of tokenizers, dictionaries and feature calcers, which determine how text features are converted into a list of float features.

Example

Refer to the description of the following parameters for details on supported values:

tokenizers
dictionaries
feature_calcers

Alert

Do not use this parameter with the following ones:

tokenizers
dictionaries
feature_calcers
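Because text_processing must not be combined with the three separate parameters, it can help to check a parameter dictionary before training. The sketch below is a plain-Python guard, not part of the library; check_text_params and SEPARATE_KEYS are hypothetical names.

```python
SEPARATE_KEYS = ('tokenizers', 'dictionaries', 'feature_calcers')

def check_text_params(params):
    """Raise ValueError if text_processing is mixed with the separate parameters."""
    if 'text_processing' in params:
        conflicts = [key for key in SEPARATE_KEYS if key in params]
        if conflicts:
            raise ValueError(
                'text_processing cannot be combined with: ' + ', '.join(conflicts)
            )
    return params

# Only the separate parameters are set, so this passes.
ok = check_text_params({
    'tokenizers': [{'tokenizerId': 'Space', 'separator_type': 'ByDelimiter', 'delimiter': ' '}],
    'feature_calcers': ['BoW:top_tokens_count=1000', 'NaiveBayes'],
})

# Mixing text_processing with dictionaries is rejected.
try:
    check_text_params({'text_processing': {}, 'dictionaries': []})
    conflict_detected = False
except ValueError:
    conflict_detected = True

print(conflict_detected)  # True
```

The validated dictionary could then be passed on to the training call of your choice.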

Type

json

Default value

Default value

Supported processing units

CPU
