Dictionary

class Dictionary(token_level_type=None,
                 gram_order=None,
                 skip_step=None,
                 start_token_id=None,
                 end_of_word_policy=None,
                 end_of_sentence_policy=None,
                 occurence_lower_bound=None,
                 max_dictionary_size=None,
                 num_bpe_units=None,
                 skip_unknown=None,
                 dictionary_type='FrequencyBased')

Purpose

Build and apply token dictionaries. The text must be tokenized before it is passed to a dictionary.

Parameters

token_level_type

Description

The token level type. This parameter defines what should be considered a separate token.
Possible values:

  • Word
  • Letter

Data types

string

Default value

Word

gram_order

Description

The number of words or letters in each token.

For example, let's assume that it is required to build a dictionary for the following set of tokens: ['maybe', 'some', 'other', 'time'].

If the token level type is set to Word and this parameter is set to 2, the following tokens are formed:

  • maybe some
  • some other
  • other time

Data types

int

Default value

1

skip_step

Description

The number of words or letters to skip when joining them into tokens. This parameter takes effect only if the value of the gram_order parameter is strictly greater than 1.

For example, let's assume that it is required to build a dictionary for the following set of tokens: ['maybe', 'some', 'other', 'time'].

If the token level type is set to Word, gram_order is set to 2 and this parameter is set to 1, the following tokens are formed:

  • maybe other
  • some time
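The interaction between gram_order and skip_step can be sketched in plain Python. This is only an illustration of the behaviour described above, not the library's implementation; the helper name build_ngrams is hypothetical.

```python
def build_ngrams(tokens, gram_order=1, skip_step=0):
    """Join `gram_order` items into one token, skipping `skip_step` items between them."""
    stride = skip_step + 1               # distance between joined items
    span = (gram_order - 1) * stride     # distance from the first to the last joined item
    return [
        " ".join(tokens[start + k * stride] for k in range(gram_order))
        for start in range(len(tokens) - span)
    ]


tokens = ["maybe", "some", "other", "time"]
bigrams = build_ngrams(tokens, gram_order=2)
skip_bigrams = build_ngrams(tokens, gram_order=2, skip_step=1)
print(bigrams)       # ['maybe some', 'some other', 'other time']
print(skip_bigrams)  # ['maybe other', 'some time']
```

With gram_order set to 1 the helper returns the input unchanged, which matches the note that skip_step has no effect in that case.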

Data types

int

Default value

0

start_token_id

Description

The offset for token identifiers: numbering starts at this value.

For example, let's assume that it is required to build a dictionary for the following set of tokens: ['maybe', 'some', 'other', 'time'].

If this parameter is set to 42, the following identifiers are assigned to tokens:

  • 42 — maybe
  • 43 — some
  • 44 — other
  • 45 — time
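A minimal sketch of the numbering in this example, assuming identifiers are handed out consecutively in the order the tokens are listed. The helper name assign_ids is hypothetical, and the ordering the library actually uses may differ (for instance, it may depend on token frequency).

```python
def assign_ids(tokens, start_token_id=0):
    """Give each distinct token a consecutive identifier, starting at start_token_id."""
    ids = {}
    for token in tokens:
        if token not in ids:
            # Assumption: identifiers follow first-appearance order.
            ids[token] = start_token_id + len(ids)
    return ids


token_ids = assign_ids(["maybe", "some", "other", "time"], start_token_id=42)
print(token_ids)  # {'maybe': 42, 'some': 43, 'other': 44, 'time': 45}
```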

Data types

int

Default value

0

end_of_word_policy

Description

The policy for processing implicit tokens that mark the end of a word.

Possible values:

  • Skip
  • Insert

Data types

string

Default value

Insert

end_of_sentence_policy

Description

The policy for processing implicit tokens that mark the end of a sentence.

Possible values:

  • Skip
  • Insert

Data types

string

Default value

Skip

occurence_lower_bound

Description

The minimum number of times a token must occur in the text to be included in the dictionary.

Data types

int

Default value

50

max_dictionary_size

Description

The maximum number of tokens in the dictionary.
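The combined effect of occurence_lower_bound and max_dictionary_size can be sketched as a frequency filter followed by a size cap. This is an illustration only: the helper name select_tokens is hypothetical, and treating the lower bound as inclusive is an assumption, not something this page states.

```python
from collections import Counter


def select_tokens(tokens, occurence_lower_bound=50, max_dictionary_size=-1):
    """Keep tokens that occur often enough, most frequent first, up to the size cap."""
    counts = Counter(tokens)
    kept = [
        token
        for token, count in counts.most_common()
        if count >= occurence_lower_bound  # assumption: the bound is inclusive
    ]
    if max_dictionary_size >= 0:           # -1 means the size is not limited
        kept = kept[:max_dictionary_size]
    return kept


text = "a b a c a b d".split()
frequent = select_tokens(text, occurence_lower_bound=2)
capped = select_tokens(text, occurence_lower_bound=1, max_dictionary_size=3)
print(frequent)  # ['a', 'b']
print(capped)    # ['a', 'b', 'c']
```

Note that with the default lower bound of 50, a small corpus yields an empty selection in this sketch, so short experiments usually need a lower value.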

Data types

int

Default value

-1 (the size of the dictionary is not limited)

num_bpe_units

Description

The number of token pairs that should be combined into a single token. At each step, the most frequent pair of adjacent tokens is combined into one and added to the dictionary as a new token.

This parameter takes effect if the value of the dictionary_type parameter is set to Bpe.
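The pair-merging idea can be sketched as a loop that repeatedly fuses the most frequent adjacent pair. This is a simplified illustration of byte pair encoding, not the library's implementation; the function name merge_most_frequent_pairs and its return shape are hypothetical.

```python
from collections import Counter


def merge_most_frequent_pairs(sequences, num_bpe_units):
    """Run up to num_bpe_units merge steps and return (sequences, new_units)."""
    sequences = [list(seq) for seq in sequences]
    new_units = []
    for _ in range(num_bpe_units):
        # Count every pair of adjacent units across all sequences.
        pair_counts = Counter(
            pair for seq in sequences for pair in zip(seq, seq[1:])
        )
        if not pair_counts:
            break
        (left, right), _count = pair_counts.most_common(1)[0]
        merged = left + right
        new_units.append(merged)
        # Replace each occurrence of the chosen pair with the merged unit.
        for i, seq in enumerate(sequences):
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and seq[j] == left and seq[j + 1] == right:
                    out.append(merged)
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            sequences[i] = out
    return sequences, new_units


merged_sequences, units = merge_most_frequent_pairs(["abab", "abc"], num_bpe_units=1)
print(units)             # ['ab']
print(merged_sequences)  # [['ab', 'ab'], ['ab', 'c']]
```

With num_bpe_units set to 0 the loop body never runs, which corresponds to the default of not combining token pairs.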

Data types

int

Default value

0 (token pairs are not combined)

skip_unknown

Description

Skip unknown tokens when building the dictionary.

This parameter takes effect if the value of the dictionary_type parameter is set to Bpe.

Data types

bool

Default value

False (a special common token is assigned for all unknown tokens)

dictionary_type

Description

The dictionary type.

Possible values:

  • FrequencyBased. Takes into account only the most frequent tokens. The lower limit of token occurrences required for inclusion and the maximum size of the dictionary are set in the occurence_lower_bound and max_dictionary_size parameters respectively.
  • Bpe. Takes into account the most frequent tokens and then makes new tokens from combinations of the most frequent token pairs. Refer to the paper "Neural Machine Translation of Rare Words with Subword Units" for algorithm details. If selected, both the FrequencyBased and Bpe dictionaries are created.

Data types

string

Default value

FrequencyBased

Methods

fit

Train a dictionary.

apply

Apply a previously trained dictionary to the input text.

size

Return the size of the dictionary.

get_token

Return the token that corresponds to the given identifier.

get_tokens

Return tokens that correspond to the given identifiers.

get_top_tokens

Get the specified number of the most frequent tokens.

unknown_token_id

Get the identifier of the token that is assigned to all words not found in the dictionary.

end_of_sentence_token_id

Get the identifier of the last token in the sentence.

min_unused_token_id

Get the smallest unused token identifier.

load

Load the dictionary from a file.

save

Save the dictionary to a file.
