Text processing parameters
These parameters are available only in the Python package and the command-line version.
tokenizers
Command-line: --tokenizers
Description
Tokenizers used to preprocess Text type feature columns before creating the dictionary.
Format:
[{
'TokenizerId': <value>,
'option_name_1': <value>,
..
'option_name_N': <value>,}]
TokenizerId: The unique name of the tokenizer.
option_name: One of the supported tokenizer options.
Note
This parameter works with the dictionaries and feature_calcers parameters.
For example, if a single tokenizer, three dictionaries and two feature calcers are given, a total of 6 new groups of features are created for each original text feature (1 × 3 × 2 = 6).
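The count in the note can be checked with a short sketch. The identifiers below are hypothetical and serve only to enumerate the combinations; the code does not call CatBoost.

```python
from itertools import product

# Hypothetical identifiers, used only to count combinations.
tokenizer_ids = ["Space"]
dictionary_ids = ["Unigram", "Bigram", "Trigram"]
calcer_names = ["BoW", "NaiveBayes"]

# One group of features per (tokenizer, dictionary, feature calcer) combination.
groups = list(product(tokenizer_ids, dictionary_ids, calcer_names))
print(len(groups))  # 1 * 3 * 2 = 6
```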
Usage example
tokenizers = [{
'tokenizerId': 'Space',
'delimiter': ' ',
'separator_type': 'ByDelimiter',
},{
'tokenizerId': 'Sense',
'separator_type': 'BySense',
}]
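To give a rough idea of what the 'Space' tokenizer in this example does, the following sketch splits a string on a delimiter. It is an illustration only, not CatBoost's implementation; the helper name split_by_delimiter is hypothetical, and dropping empty tokens is a choice made for this sketch.

```python
def split_by_delimiter(text, delimiter=" "):
    """Split text on the delimiter and drop empty tokens (illustration only)."""
    return [token for token in text.split(delimiter) if token]


tokens = split_by_delimiter("gradient boosting on decision trees")
print(tokens)  # ['gradient', 'boosting', 'on', 'decision', 'trees']
```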
Type
list of json
Default value
–
Supported processing units
CPU
dictionaries
Command-line: --dictionaries
Description
Dictionaries used to preprocess Text type feature columns.
Format:
[{
'DictionaryId': <value>,
'option_name_1': <value>,
..
'option_name_N': <value>,}]
DictionaryId: The unique name of the dictionary.
option_name: One of the supported dictionary options.
Note
This parameter works with the tokenizers and feature_calcers parameters.
For example, if a single tokenizer, three dictionaries and two feature calcers are given, a total of 6 new groups of features are created for each original text feature (1 × 3 × 2 = 6).
Usage example
dictionaries = [{
'dictionaryId': 'Unigram',
'max_dictionary_size': '50000',
'gram_count': '1',
},{
'dictionaryId': 'Bigram',
'max_dictionary_size': '50000',
'gram_count': '2',
}]
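As an illustration of how gram_count and max_dictionary_size might shape a dictionary, the sketch below builds n-gram dictionaries from already tokenized text. It is not CatBoost's implementation: the helper build_dictionary is hypothetical, it takes integer arguments rather than the string values used above, and keeping the most frequent n-grams is an assumption of this sketch.

```python
from collections import Counter


def build_dictionary(token_lists, gram_count, max_dictionary_size):
    """Map the most frequent n-grams to integer ids (illustration only)."""
    counts = Counter()
    for tokens in token_lists:
        # Slide a window of gram_count tokens over each document.
        for i in range(len(tokens) - gram_count + 1):
            counts[tuple(tokens[i:i + gram_count])] += 1
    most_common = counts.most_common(max_dictionary_size)
    return {gram: idx for idx, (gram, _) in enumerate(most_common)}


docs = [["cat", "sat", "down"], ["cat", "sat", "up"]]
unigrams = build_dictionary(docs, gram_count=1, max_dictionary_size=50000)
bigrams = build_dictionary(docs, gram_count=2, max_dictionary_size=50000)
print(len(unigrams), len(bigrams))  # 4 3
```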
Type
list of json
Default value
–
Supported processing units
CPU
feature_calcers
Command-line: --feature-calcers
Description
Feature calcers used to calculate new features based on preprocessed Text type feature columns.
Format:
['FeatureCalcerName[:option_name=option_value]',
]
FeatureCalcerName: The required feature calcer.
option_name: Additional options for feature calcers. Refer to the list of supported calcers for details on options available for each of them.
Note
This parameter works with the tokenizers and dictionaries parameters.
For example, if a single tokenizer, three dictionaries and two feature calcers are given, a total of 6 new groups of features are created for each original text feature (1 × 3 × 2 = 6).
Usage example
feature_calcers = [
'BoW:top_tokens_count=1000',
'NaiveBayes',
]
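The strings in this list can also be assembled programmatically. The sketch below assumes a hypothetical helper, calcer_spec, that follows the FeatureCalcerName[:option_name=option_value] format with at most one option; it is a convenience for illustration and not part of the CatBoost API.

```python
def calcer_spec(name, option_name=None, option_value=None):
    """Build a 'Name' or 'Name:option_name=option_value' string."""
    if option_name is None:
        return name
    return f"{name}:{option_name}={option_value}"


feature_calcers = [
    calcer_spec("BoW", "top_tokens_count", 1000),
    calcer_spec("NaiveBayes"),
]
print(feature_calcers)  # ['BoW:top_tokens_count=1000', 'NaiveBayes']
```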
Type
list of strings
Default value
–
Supported processing units
CPU
text_processing
Command-line: --text-processing
Description
A JSON specification of tokenizers, dictionaries and feature calcers, which determine how text features are converted into a list of float features.
Refer to the description of the following parameters for details on supported values:
tokenizers
dictionaries
feature_calcers
Alert
Do not use this parameter with the following ones:
tokenizers
dictionaries
feature_calcers
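A parameter dictionary can be checked against this restriction before training. The sketch below assumes a hypothetical helper, check_text_params, that operates on a plain dict of parameter names; it is not part of CatBoost.

```python
SEPARATE_PARAMS = ("tokenizers", "dictionaries", "feature_calcers")


def check_text_params(params):
    """Raise ValueError if text_processing is mixed with the separate parameters."""
    if "text_processing" in params:
        conflicts = [name for name in SEPARATE_PARAMS if name in params]
        if conflicts:
            raise ValueError(
                "text_processing cannot be combined with: " + ", ".join(conflicts)
            )
    return params


# Passes: only the separate parameters are set.
checked = check_text_params({
    "tokenizers": [{"tokenizerId": "Space", "separator_type": "ByDelimiter"}],
    "feature_calcers": ["NaiveBayes"],
})
print(sorted(checked))  # ['feature_calcers', 'tokenizers']
```

Calling check_text_params with both text_processing and, for example, tokenizers set raises a ValueError in this sketch.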
Type
json
Default value
Supported processing units
CPU