Dictionary
class Dictionary(token_level_type=None,
gram_order=None,
skip_step=None,
start_token_id=None,
end_of_word_policy=None,
end_of_sentence_policy=None,
occurence_lower_bound=None,
max_dictionary_size=None,
num_bpe_units=None,
skip_unknown=None,
dictionary_type='FrequencyBased')
Purpose
Builds and applies token dictionaries. The input text must be tokenized before it is passed to a dictionary.
Parameters
token_level_type
Description
The token level type. This parameter defines what should be considered a separate token.
Possible values:
- Word
- Letter
Data types
string
Default value
Word
gram_order
Description
The number of words or letters in each token.
For example, assume a dictionary must be built for the following set of tokens: ['maybe', 'some', 'other', 'time'].
If the token level type is set to Word and this parameter is set to 2, the following tokens are formed:
maybe some
some other
other time
Data types
int
Default value
1
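The n-gram formation described above can be sketched in plain Python. This is an illustration of the behavior only, not the library's implementation; the function name is made up:

```python
def build_ngrams(tokens, gram_order=1):
    """Join each run of gram_order consecutive tokens into one n-gram."""
    return [
        " ".join(tokens[i:i + gram_order])
        for i in range(len(tokens) - gram_order + 1)
    ]

tokens = ['maybe', 'some', 'other', 'time']
print(build_ngrams(tokens, gram_order=2))
# ['maybe some', 'some other', 'other time']
```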
skip_step
Description
The number of words or letters to skip when joining them into tokens. This parameter takes effect if the value of the gram_order parameter is strictly greater than 1.
For example, assume a dictionary must be built for the following set of tokens: ['maybe', 'some', 'other', 'time'].
If the token level type is set to Word, gram_order is set to 2, and this parameter is set to 1, the following tokens are formed:
maybe other
some time
Data types
int
Default value
0
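The skip_step behavior can be sketched by generalizing the n-gram example: joined tokens sit skip_step + 1 positions apart. Again, this is an illustrative sketch with a made-up function name, not the library's code:

```python
def build_ngrams_with_skip(tokens, gram_order=1, skip_step=0):
    """Join gram_order tokens into one n-gram, skipping skip_step
    tokens between each pair of joined tokens."""
    stride = skip_step + 1                 # distance between joined tokens
    span = (gram_order - 1) * stride       # index distance covered by one n-gram
    return [
        " ".join(tokens[i + j * stride] for j in range(gram_order))
        for i in range(len(tokens) - span)
    ]

tokens = ['maybe', 'some', 'other', 'time']
print(build_ngrams_with_skip(tokens, gram_order=2, skip_step=1))
# ['maybe other', 'some time']
```

With skip_step=0 this reduces to the plain bigram case ('maybe some', 'some other', 'other time').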
start_token_id
Description
The initial shift for the token identifier.
For example, assume a dictionary must be built for the following set of tokens: ['maybe', 'some', 'other', 'time'].
If this parameter is set to 42, the following identifiers are assigned to tokens:
- 42 — maybe
- 43 — some
- 44 — other
- 45 — time
Data types
int
Default value
0
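The identifier assignment above amounts to enumerating tokens starting from the shift. A minimal sketch (illustrative only, not the library's implementation):

```python
def assign_token_ids(tokens, start_token_id=0):
    """Assign consecutive identifiers to tokens, starting at start_token_id."""
    return {token: start_token_id + i for i, token in enumerate(tokens)}

ids = assign_token_ids(['maybe', 'some', 'other', 'time'], start_token_id=42)
print(ids)
# {'maybe': 42, 'some': 43, 'other': 44, 'time': 45}
```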
end_of_word_policy
Description
The policy for processing implicit tokens that mark the end of a word.
Possible values:
- Skip
- Insert
Data types
string
Default value
Insert
end_of_sentence_policy
Description
The policy for processing implicit tokens that mark the end of a sentence.
Possible values:
- Skip
- Insert
Data types
string
Default value
Skip
occurence_lower_bound
Description
The minimum number of times a token must occur in the text to be included in the dictionary.
Data types
int
Default value
50
max_dictionary_size
Description
The maximum number of tokens in the dictionary.
Data types
int
Default value
-1 (the size of the dictionary is not limited)
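Together, occurence_lower_bound and max_dictionary_size act as a two-stage filter on the token frequency counts: first drop rare tokens, then truncate to the most frequent ones. A rough pure-Python sketch of this idea (not the library's actual implementation):

```python
from collections import Counter

def filter_tokens(tokens, occurence_lower_bound=50, max_dictionary_size=-1):
    """Keep tokens occurring at least occurence_lower_bound times,
    then keep only the max_dictionary_size most frequent of those."""
    counts = Counter(tokens)
    frequent = [(t, c) for t, c in counts.most_common()
                if c >= occurence_lower_bound]
    if max_dictionary_size >= 0:            # -1 means "no size limit"
        frequent = frequent[:max_dictionary_size]
    return [t for t, _ in frequent]

print(filter_tokens(['a', 'a', 'a', 'b', 'b', 'c'], occurence_lower_bound=2))
# ['a', 'b']
```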
num_bpe_units
Description
The number of token pairs to combine into single tokens. The most frequent adjacent token pairs are merged and added to the dictionary as new tokens.
This parameter takes effect if the value of the dictionary_type parameter is set to Bpe.
Data types
int
Default value
0 (token pairs are not combined)
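The merge step of byte-pair encoding can be sketched as repeatedly replacing the most frequent adjacent token pair with a new combined token. This is a simplified illustration of the algorithm from the Sennrich et al. paper referenced below, not the library's code:

```python
from collections import Counter

def bpe_merge(tokens, num_bpe_units=0):
    """Perform up to num_bpe_units merges of the most frequent adjacent pair."""
    tokens = list(tokens)
    merges = []
    for _ in range(num_bpe_units):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break                          # no pair repeats; nothing to gain
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):             # replace every (a, b) with "ab"
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

print(bpe_merge(['a', 'b', 'a', 'b', 'c'], num_bpe_units=1))
# (['ab', 'ab', 'c'], [('a', 'b')])
```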
skip_unknown
Description
Skip unknown tokens when building the dictionary.
This parameter takes effect if the value of the dictionary_type parameter is set to Bpe.
Data types
bool
Default value
False (a special common token is assigned for all unknown tokens)
dictionary_type
Description
The dictionary type.
Possible values:
- FrequencyBased. Takes into account only the most frequent tokens. The lower limit of token occurrences and the maximum size of the dictionary are set in the occurence_lower_bound and max_dictionary_size parameters respectively.
- Bpe. Takes into account the most frequent tokens and then builds new tokens from combinations of the most frequent token pairs. Refer to the Neural Machine Translation of Rare Words with Subword Units paper for algorithm details. If selected, both the FrequencyBased and Bpe dictionaries are created.
Data types
string
Default value
FrequencyBased
Methods
fit
Train a dictionary.
apply
Apply a previously trained dictionary to the input text.
size
Return the size of the dictionary.
get_token
Return the token that corresponds to the given identifier.
get_tokens
Return tokens that correspond to the given identifiers.
get_top_tokens
Get the specified number of top most frequent tokens.
unknown_token_id
Get the identifier of the token that is assigned to all words that are not found in the dictionary.
end_of_sentence_token_id
Get the identifier of the last token in the sentence.
min_unused_token_id
Get the smallest unused token identifier.
load
Load the dictionary from a file.
save
Save the dictionary to a file.
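To show how the core methods fit together, here is a toy frequency-based dictionary in plain Python mirroring part of the interface above (fit, apply, size, get_token, unknown_token_id). It is a deliberately simplified sketch for illustration, not the actual class:

```python
from collections import Counter

class ToyDictionary:
    """A tiny frequency-based dictionary illustrating the method semantics."""

    def __init__(self, occurence_lower_bound=1, start_token_id=0):
        self.occurence_lower_bound = occurence_lower_bound
        self.start_token_id = start_token_id
        self._token_to_id = {}
        self._id_to_token = {}

    def fit(self, tokenized_texts):
        """Build the dictionary from lists of tokens."""
        counts = Counter(t for text in tokenized_texts for t in text)
        kept = [t for t, c in counts.most_common()
                if c >= self.occurence_lower_bound]
        self._token_to_id = {t: self.start_token_id + i
                             for i, t in enumerate(kept)}
        self._id_to_token = {i: t for t, i in self._token_to_id.items()}
        return self

    def apply(self, tokenized_texts):
        """Map each token to its identifier; unknown tokens share one id."""
        unknown = self.unknown_token_id()
        return [[self._token_to_id.get(t, unknown) for t in text]
                for text in tokenized_texts]

    def size(self):
        return len(self._token_to_id)

    def get_token(self, token_id):
        return self._id_to_token[token_id]

    def unknown_token_id(self):
        return self.start_token_id + len(self._token_to_id)

d = ToyDictionary().fit([['maybe', 'some', 'other', 'time']])
print(d.size())                      # 4
print(d.apply([['maybe', 'xyzzy']]))  # unknown token maps to id 4
```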