tokenize

Tokenize the input string.

Method call format

tokenize(s)

Parameters

s

Description

The input string to tokenize.

Data types

String

Default value

Required parameter

Type of return value

A list of tokens.

Example

from catboost.text_processing import Tokenizer


text = "Still, I would love to see you at 12, if you don't mind"

tokenized = Tokenizer(lowercasing=True,
                      separator_type='BySense',
                      token_types=['Word', 'Number']).tokenize(text)

print(tokenized)

Output:

['still', 'i', 'would', 'love', 'to', 'see', 'you', 'at', '12', 'if', 'you', "don't", 'mind']
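
To get a feel for what this configuration does without installing CatBoost, the result can be roughly approximated in plain Python. This is only a sketch under stated assumptions, not CatBoost's implementation: the helper name approx_tokenize and the regular expression are hypothetical, and they are only meant to mimic lowercasing plus keeping word and number tokens for simple English input like the example text.

```python
import re

# Hypothetical pattern (an assumption, not CatBoost's rules):
# a word is a run of letters that may contain inner apostrophes ("don't"),
# a number is a run of digits. Everything else acts as a separator.
_TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*|[0-9]+")


def approx_tokenize(s):
    """Lowercase the string and return its word and number tokens."""
    return _TOKEN_RE.findall(s.lower())


text = "Still, I would love to see you at 12, if you don't mind"
print(approx_tokenize(text))
```

For the example text this sketch returns the same tokens as the output listed here. It is not guaranteed to agree with Tokenizer on other inputs, such as non-ASCII text or unusual punctuation.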