Tokenizer options

The following options are available for the tokenizers parameter (each option is set by its option_name):

lowercasing (bool)

  Convert tokens to lower case.

  Default: tokens are not converted to lower case.

number_process_policy (string)

  The strategy for processing numeric tokens. Possible values:

  • Skip — skip all numeric tokens.
  • LeaveAsIs — leave all numeric tokens as is.
  • Replace — replace all numeric tokens with a single special token. This token is specified in the number_token parameter.

  Default: LeaveAsIs

number_token (string)

  The special token that replaces all numeric tokens.

  This option applies only if number_process_policy is set to Replace.

  This token is not converted to lower case regardless of the value of the lowercasing parameter.

  Default: numeric tokens are left as is.

separator_type (string)

  The tokenization method. Possible values:

  • ByDelimiter — split by delimiter.
  • BySense — try to split the string by sense.

  Default: ByDelimiter with the delimiter set to " " (whitespace).

delimiter (string)

  The string that is considered to be the delimiter.

  This option applies only if separator_type is set to ByDelimiter.

  Default: " " (whitespace)

split_by_set (bool)

  Treat each character in the delimiter option as an individual delimiter. Use this option to apply multiple delimiters.

  Default: False (the whole string specified in the delimiter parameter is treated as a single delimiter).

skip_empty (bool)

  Skip all empty tokens.

  Default: True (empty tokens are skipped).

token_types (list)

  The types of tokens to keep after tokenization.

  This option applies only if separator_type is set to BySense.

  Possible values:
  • Word
  • Number
  • Punctuation
  • SentenceBreak
  • ParagraphBreak
  • Unknown

  Default: all supported token types are kept.

sub_tokens_policy (string)

  The subtoken processing policy.

  This option applies only if separator_type is set to BySense.

  Possible values:
  • SingleToken — all subtokens are interpreted as a single token.
  • SeveralTokens — all subtokens are interpreted as several tokens.

  Default: SingleToken
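A minimal sketch of a tokenizers configuration built from the options above. The surrounding container format (a list of dicts, each carrying a "tokenizer_id" key) is an assumption and not covered by this table; only the option names and values come from the documented list.

```python
# Hypothetical tokenizers configuration; "tokenizer_id" and the
# list-of-dicts shape are assumptions, the option names are documented.
tokenizers = [
    {
        # Split on whitespace, lowercase everything, drop empty tokens.
        "tokenizer_id": "Space",        # hypothetical identifier
        "separator_type": "ByDelimiter",
        "delimiter": " ",
        "lowercasing": "true",
        "skip_empty": "true",
    },
    {
        # Sense-based splitting that keeps only words and numbers and
        # replaces every number with a single placeholder token.
        "tokenizer_id": "Sense",        # hypothetical identifier
        "separator_type": "BySense",
        "token_types": ["Word", "Number"],
        "number_process_policy": "Replace",
        "number_token": "[NUM]",
        "sub_tokens_policy": "SingleToken",
    },
]
```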
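To make the delimiter-related options concrete, here is a pure-Python sketch of ByDelimiter tokenization under the options above. It illustrates the documented semantics of delimiter, split_by_set, skip_empty, lowercasing, and the numeric-token policies; it is not the library's implementation.

```python
import re

def tokenize_by_delimiter(text, delimiter=" ", split_by_set=False,
                          skip_empty=True, lowercasing=False,
                          number_process_policy="LeaveAsIs",
                          number_token="[NUM]"):
    """Illustrative ByDelimiter tokenizer following the documented options."""
    if split_by_set:
        # Each character of `delimiter` acts as its own delimiter.
        tokens = re.split("[" + re.escape(delimiter) + "]", text)
    else:
        # The whole `delimiter` string is a single delimiter.
        tokens = text.split(delimiter)
    if skip_empty:
        tokens = [t for t in tokens if t]
    if lowercasing:
        tokens = [t.lower() for t in tokens]
    if number_process_policy == "Skip":
        tokens = [t for t in tokens if not t.isdigit()]
    elif number_process_policy == "Replace":
        # number_token is inserted after lowercasing, so it is never
        # lowercased, matching the note on the number_token option.
        tokens = [number_token if t.isdigit() else t for t in tokens]
    return tokens
```

For example, `tokenize_by_delimiter("Price: 42 USD", lowercasing=True, number_process_policy="Replace")` yields `["price:", "[NUM]", "usd"]`, and passing `delimiter=",;"` with `split_by_set=True` splits on both `,` and `;`.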