Dictionary
class Dictionary(token_level_type=None,
gram_order=None,
skip_step=None,
start_token_id=None,
end_of_word_policy=None,
end_of_sentence_policy=None,
occurence_lower_bound=None,
max_dictionary_size=None,
num_bpe_units=None,
skip_unknown=None,
dictionary_type='FrequencyBased')
Purpose
Builds and applies token dictionaries. The input text must be tokenized before it is passed to a dictionary.
Parameters
token_level_type
Description
The token level type. This parameter defines what should be considered a separate token.
Possible values:
- Word
- Letter
Data types
string
Default value
Word
gram_order
Description
The number of words or letters in each token.
For example, assume a dictionary must be built for the following set of tokens: ['maybe', 'some', 'other', 'time'].
If the token level type is set to Word and this parameter is set to 2, the following tokens are formed:
maybe some
some other
other time
Data types
int
Default value
1
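The n-gram formation described above can be sketched in plain Python. This is an illustration of the behavior only, not the library's implementation; the function name is made up:

```python
def build_ngrams(tokens, gram_order=1):
    """Join each run of gram_order consecutive tokens into one n-gram."""
    return [
        " ".join(tokens[i:i + gram_order])
        for i in range(len(tokens) - gram_order + 1)
    ]

tokens = ['maybe', 'some', 'other', 'time']
print(build_ngrams(tokens, gram_order=2))
# ['maybe some', 'some other', 'other time']
```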
skip_step
Description
The number of words or letters to skip when joining them into tokens. This parameter takes effect if the value of the gram_order parameter is strictly greater than 1.
For example, assume a dictionary must be built for the following set of tokens: ['maybe', 'some', 'other', 'time'].
If the token level type is set to Word, gram_order is set to 2, and this parameter is set to 1, the following tokens are formed:
maybe other
some time
Data types
int
Default value
0
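The skip_step behavior can be sketched by generalizing the n-gram example: joined tokens sit skip_step + 1 positions apart. Again, this is an illustrative sketch with a made-up function name, not the library's code:

```python
def build_ngrams_with_skip(tokens, gram_order=1, skip_step=0):
    """Join gram_order tokens into one n-gram, skipping skip_step
    tokens between each pair of joined tokens."""
    stride = skip_step + 1                 # distance between joined tokens
    span = (gram_order - 1) * stride       # index distance covered by one n-gram
    return [
        " ".join(tokens[i + j * stride] for j in range(gram_order))
        for i in range(len(tokens) - span)
    ]

tokens = ['maybe', 'some', 'other', 'time']
print(build_ngrams_with_skip(tokens, gram_order=2, skip_step=1))
# ['maybe other', 'some time']
```

With skip_step=0 this reduces to the plain bigram case ('maybe some', 'some other', 'other time').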
start_token_id
Description
The initial shift for the token identifier.
For example, assume a dictionary must be built for the following set of tokens: ['maybe', 'some', 'other', 'time'].
If this parameter is set to 42, the following identifiers are assigned to tokens:
- 42 — maybe
- 43 — some
- 44 — other
- 45 — time
Data types
int
Default value
0
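The identifier assignment above amounts to enumerating tokens starting from the shift. A minimal sketch (illustrative only, not the library's implementation):

```python
def assign_token_ids(tokens, start_token_id=0):
    """Assign consecutive identifiers to tokens, starting at start_token_id."""
    return {token: start_token_id + i for i, token in enumerate(tokens)}

ids = assign_token_ids(['maybe', 'some', 'other', 'time'], start_token_id=42)
print(ids)
# {'maybe': 42, 'some': 43, 'other': 44, 'time': 45}
```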
end_of_word_policy
Description
The policy for processing implicit tokens that mark the end of a word.
Possible values:
- Skip
- Insert
Data types
string
Default value
Insert
end_of_sentence_policy
Description
The policy for processing implicit tokens that mark the end of a sentence.
Possible values:
- Skip
- Insert
Data types
string
Default value
Skip
occurence_lower_bound
Description
The minimum number of times a token must occur in the text to be included in the dictionary.
Data types
int
Default value
50
max_dictionary_size
Description
The maximum number of tokens in the dictionary.
Data types
int
Default value
-1 (the size of the dictionary is not limited)
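Together, occurence_lower_bound and max_dictionary_size act as a two-stage filter on the token frequency counts: first drop rare tokens, then truncate to the most frequent ones. A rough pure-Python sketch of this idea (not the library's actual implementation):

```python
from collections import Counter

def filter_tokens(tokens, occurence_lower_bound=50, max_dictionary_size=-1):
    """Keep tokens occurring at least occurence_lower_bound times,
    then keep only the max_dictionary_size most frequent of those."""
    counts = Counter(tokens)
    frequent = [(t, c) for t, c in counts.most_common()
                if c >= occurence_lower_bound]
    if max_dictionary_size >= 0:            # -1 means "no size limit"
        frequent = frequent[:max_dictionary_size]
    return [t for t, _ in frequent]

print(filter_tokens(['a', 'a', 'a', 'b', 'b', 'c'], occurence_lower_bound=2))
# ['a', 'b']
```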
num_bpe_units
Description
The number of token pairs to combine into single tokens. The most frequent adjacent token pairs are merged and added to the dictionary as new tokens.
This parameter takes effect if the value of the dictionary_type parameter is set to Bpe.
Data types
int
Default value
0 (token pairs are not combined)
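The merge step of byte-pair encoding can be sketched as repeatedly replacing the most frequent adjacent token pair with a new combined token. This is a simplified illustration of the algorithm from the Sennrich et al. paper referenced below, not the library's code:

```python
from collections import Counter

def bpe_merge(tokens, num_bpe_units=0):
    """Perform up to num_bpe_units merges of the most frequent adjacent pair."""
    tokens = list(tokens)
    merges = []
    for _ in range(num_bpe_units):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break                          # no pair repeats; nothing to gain
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):             # replace every (a, b) with "ab"
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

print(bpe_merge(['a', 'b', 'a', 'b', 'c'], num_bpe_units=1))
# (['ab', 'ab', 'c'], [('a', 'b')])
```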
skip_unknown
Description
Skip unknown tokens when building the dictionary.
This parameter takes effect if the value of the dictionary_type parameter is set to Bpe.
Data types
bool
Default value
False (a special common token is assigned for all unknown tokens)
dictionary_type
Description
The dictionary type.
Possible values:
- FrequencyBased. Takes into account only the most frequent tokens. The lower limit of token occurrences and the maximum size of the dictionary are set in the occurence_lower_bound and max_dictionary_size parameters respectively.
- Bpe. Takes into account the most frequent tokens and then builds new tokens from combinations of the most frequent token pairs. Refer to the Neural Machine Translation of Rare Words with Subword Units paper for algorithm details. If selected, both the FrequencyBased and Bpe dictionaries are created.
Data types
string
Default value
FrequencyBased
Methods
fit
Train a dictionary.
apply
Apply a previously trained dictionary to the input text.
size
Return the size of the dictionary.
get_token
Return the token that corresponds to the given identifier.
get_tokens
Return tokens that correspond to the given identifiers.
get_top_tokens
Get the specified number of top most frequent tokens.
unknown_token_id
Get the identifier of the token that is assigned to all words that are not found in the dictionary.
end_of_sentence_token_id
Get the identifier of the last token in the sentence.
min_unused_token_id
Get the smallest unused token identifier.
load
Load the dictionary from a file.
save
Save the dictionary to a file.
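To show how the core methods fit together, here is a toy frequency-based dictionary in plain Python mirroring part of the interface above (fit, apply, size, get_token, unknown_token_id). It is a deliberately simplified sketch for illustration, not the actual class:

```python
from collections import Counter

class ToyDictionary:
    """A tiny frequency-based dictionary illustrating the method semantics."""

    def __init__(self, occurence_lower_bound=1, start_token_id=0):
        self.occurence_lower_bound = occurence_lower_bound
        self.start_token_id = start_token_id
        self._token_to_id = {}
        self._id_to_token = {}

    def fit(self, tokenized_texts):
        """Build the dictionary from lists of tokens."""
        counts = Counter(t for text in tokenized_texts for t in text)
        kept = [t for t, c in counts.most_common()
                if c >= self.occurence_lower_bound]
        self._token_to_id = {t: self.start_token_id + i
                             for i, t in enumerate(kept)}
        self._id_to_token = {i: t for t, i in self._token_to_id.items()}
        return self

    def apply(self, tokenized_texts):
        """Map each token to its identifier; unknown tokens share one id."""
        unknown = self.unknown_token_id()
        return [[self._token_to_id.get(t, unknown) for t in text]
                for text in tokenized_texts]

    def size(self):
        return len(self._token_to_id)

    def get_token(self, token_id):
        return self._id_to_token[token_id]

    def unknown_token_id(self):
        return self.start_token_id + len(self._token_to_id)

d = ToyDictionary().fit([['maybe', 'some', 'other', 'time']])
print(d.size())                      # 4
print(d.apply([['maybe', 'xyzzy']]))  # unknown token maps to id 4
```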