Tokenizer
class Tokenizer(lowercasing=None,
                lemmatizing=None,
                number_process_policy=None,
                number_token=None,
                separator_type=None,
                delimiter=None,
                split_by_set=None,
                skip_empty=None,
                token_types=None,
                sub_tokens_policy=None,
                languages=None)
Purpose
Tokenize and process input strings.
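A minimal usage sketch. The import path below is an assumption (it is not shown on this page) and matches the CatBoost catboost.text_processing module, which provides a Tokenizer with this signature:

from catboost.text_processing import Tokenizer  # assumed import path

# Lowercase tokens and split on the default whitespace delimiter.
tokenizer = Tokenizer(lowercasing=True)

print(tokenizer.tokenize("Split THIS string"))
# Expected under these assumptions: ['split', 'this', 'string']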
Parameters
lowercasing
Description
Convert tokens to lower case.
Data types
bool
Default value
Tokens are not converted to lower case
lemmatizing
Description
Perform lemmatization on tokens.
Data types
bool
Default value
Lemmatization is not performed
number_process_policy
Description
The strategy for processing numeric tokens. Possible values:
- Skip — Skip all numeric tokens.
- LeaveAsIs — Leave all numeric tokens as is.
- Replace — Replace all numeric tokens with a single special token. This token is specified in the number_token parameter.
Data types
string
Default value
LeaveAsIs
number_token
Description
The special token used to replace all numeric tokens.
This option can be used only if the number_process_policy parameter is set to Replace.
This token is not converted to lower case regardless of the value of the lowercasing parameter.
Data types
string
Default value
Numeric tokens are left as is
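A sketch of the Replace strategy combined with number_token. The '[NUM]' placeholder is an arbitrary illustrative value, and the import path is assumed as above:

from catboost.text_processing import Tokenizer  # assumed import path

# Replace every numeric token with an arbitrary placeholder. Note that
# the placeholder keeps its case even though lowercasing is enabled.
tokenizer = Tokenizer(
    lowercasing=True,
    number_process_policy='Replace',
    number_token='[NUM]',
)

print(tokenizer.tokenize("Shipped 3 boxes and 42 crates"))
# Expected under these assumptions:
# ['shipped', '[NUM]', 'boxes', 'and', '[NUM]', 'crates']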
separator_type
Description
The tokenization method. Possible values:
- ByDelimiter — Split by delimiter.
- BySense — Try to split the string by sense.
Data types
string
Default value
ByDelimiter with the delimiter set to " " (whitespace)
delimiter
Description
The symbol that is considered to be the delimiter.
Should be used if the separator_type parameter is set to ByDelimiter.
Data types
string
Default value
“ ” (whitespace)
split_by_set
Description
Treat each single character in the delimiter parameter as an individual delimiter.
Use this parameter to apply multiple delimiters at once.
Data types
bool
Default value
False (the whole string specified in the delimiter parameter is considered the delimiter)
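A sketch contrasting the default whole-string delimiter with split_by_set; the import path is assumed as above:

from catboost.text_processing import Tokenizer  # assumed import path

text = "a,b;c"

# With split_by_set=True, ',' and ';' act as two independent delimiters.
per_char = Tokenizer(delimiter=",;", split_by_set=True)
print(per_char.tokenize(text))
# Expected under these assumptions: ['a', 'b', 'c']

# With the default split_by_set=False, only the exact string ",;" splits,
# and this input contains no such substring.
whole = Tokenizer(delimiter=",;")
print(whole.tokenize(text))
# Expected under these assumptions: ['a,b;c']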
skip_empty
Description
Skip all empty tokens.
Data types
bool
Default value
True (empty tokens are skipped)
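A sketch of the skip_empty switch: consecutive delimiters produce empty tokens, which are dropped by default. The import path is assumed as above:

from catboost.text_processing import Tokenizer  # assumed import path

text = "a,,b"

keep_empty = Tokenizer(delimiter=",", skip_empty=False)
print(keep_empty.tokenize(text))
# Expected under these assumptions: ['a', '', 'b']

drop_empty = Tokenizer(delimiter=",")  # skip_empty defaults to True
print(drop_empty.tokenize(text))
# Expected under these assumptions: ['a', 'b']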
token_types
Description
The types of tokens to keep after tokenization.
Should be used if the separator_type parameter is set to BySense.
Possible values:
- Word
- Number
- Punctuation
- SentenceBreak
- ParagraphBreak
- Unknown
Data types
list
Default value
All supported types of tokens are kept
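A sketch of sense-based splitting that keeps only Word and Number tokens, filtering out punctuation and break tokens; the import path is assumed as above:

from catboost.text_processing import Tokenizer  # assumed import path

tokenizer = Tokenizer(
    lowercasing=True,
    separator_type='BySense',
    token_types=['Word', 'Number'],
)

print(tokenizer.tokenize("Hello, world! 42."))
# Expected under these assumptions: ['hello', 'world', '42']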
sub_tokens_policy
Description
The subtoken processing policy.
Should be used if the separator_type parameter is set to BySense.
Possible values:
- SingleToken — All subtokens are interpreted as a single token.
- SeveralTokens — All subtokens are interpreted as several tokens.
Data types
string
Default value
SingleToken
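A hedged sketch of the two subtoken policies. Whether a given input (here, a hyphenated compound) actually yields subtokens depends on the sense-based splitter, so the outputs are illustrative only; the import path is assumed as above:

from catboost.text_processing import Tokenizer  # assumed import path

text = "A state-of-the-art tokenizer"

single = Tokenizer(separator_type='BySense', sub_tokens_policy='SingleToken')
several = Tokenizer(separator_type='BySense', sub_tokens_policy='SeveralTokens')

print(single.tokenize(text))   # compound kept as one token (assumed)
print(several.tokenize(text))  # compound split into its parts (assumed)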
languages
Description
The list of languages to use.
Should be used if the separator_type parameter is set to BySense.
Data types
list of strings
Default value
All available languages are used (this significantly slows down tokenization)
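A sketch that restricts sense-based splitting to one language to avoid the all-languages slowdown. The identifier 'english' is an assumption; the accepted language names are not listed on this page, and the import path is assumed as above:

from catboost.text_processing import Tokenizer  # assumed import path

tokenizer = Tokenizer(
    separator_type='BySense',
    languages=['english'],  # assumed identifier, see the note above
)

print(tokenizer.tokenize("One sentence. Another sentence."))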
Methods
tokenize
Tokenize the input string.
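A final sketch of the tokenize method with an all-default configuration (whitespace splitting, empty tokens skipped); the import path is assumed as above:

from catboost.text_processing import Tokenizer  # assumed import path

tokenizer = Tokenizer()  # all defaults
print(tokenizer.tokenize("Tokenize the input string"))
# Expected under these assumptions: ['Tokenize', 'the', 'input', 'string']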