BPE Dictionary
Contains
The trained BPE dictionary.
Format
Each line describes a single new token, using the following tab-separated fields:
<token_id1><\t><token_id2><\t><number_of_occurrences><\t><token>
- token_id1: The token ID of the first part of the new token.
- token_id2: The token ID of the second part of the new token.
- number_of_occurrences: The number of times that the token is found in the input text.
- token: The value of the token.
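As an illustration of the line format above, the following minimal Python sketch splits one dictionary line into its four fields. The names BpeEntry and parse_bpe_line are hypothetical and are not part of any library. The sketch assumes that the fields are separated by single tab characters, as the format string indicates.

```python
from typing import NamedTuple


class BpeEntry(NamedTuple):
    """One line of the BPE dictionary (hypothetical helper type)."""
    token_id1: int
    token_id2: int
    number_of_occurrences: int
    token: str


def parse_bpe_line(line: str) -> BpeEntry:
    # Split on the first three tabs only, so the token value
    # is kept whole even if it contains spaces.
    id1, id2, count, token = line.rstrip("\n").split("\t", 3)
    return BpeEntry(int(id1), int(id2), int(count), token)


entry = parse_bpe_line("0\t5\t1\tHow high\n")
print(entry.token_id1, entry.token_id2, entry.token)
```

Limiting the split to three tabs is a design choice of this sketch: everything after the third tab is treated as the token value.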
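The value of a token identifier indicates where the token comes from:
- Smaller than the number of objects in the frequency-based dictionary: the token is taken from the frequency-based dictionary.
- Greater than the number of objects in the frequency-based dictionary: the token is taken from the BPE dictionary, except for the identifiers immediately after the frequency-based range, which are reserved for special tokens (the unknown and end-of-sentence tokens in the example below).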
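The rule above can be sketched as a small Python function. The name token_origin is hypothetical. The default of two reserved identifiers is an assumption taken from the example on this page, where the two identifiers after the frequency-based range are reserved for the unknown and end-of-sentence tokens; other configurations may differ.

```python
def token_origin(token_id: int, frequency_dict_size: int, reserved_count: int = 2) -> str:
    """Classify a token identifier by the dictionary it points to.

    reserved_count = 2 is an assumption based on the example on this page.
    """
    if token_id < frequency_dict_size:
        return "frequency"
    if token_id < frequency_dict_size + reserved_count:
        return "reserved"
    return "bpe"


# With 11 objects in the frequency-based dictionary:
for token_id in (10, 11, 12, 13):
    print(token_id, token_origin(token_id, 11))
```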
Example
The following is the frequency-based dictionary:
{"end_of_word_token_policy":"Insert","skip_step":"0","start_token_id":"0","token_level_type":"Word","dictionary_format":"id_count_token","end_of_sentence_token_policy":"Skip","gram_order":"1"}
11
0 1 How
1 1 It's
2 1 Today
3 1 and
4 1 forever
5 1 high
6 1 moon
7 1 snowing
8 1 the
9 1 today
10 1 tomorrow
The following is the BPE dictionary:
0 5 1 How high
1 7 1 It's snowing
2 10 1 Today tomorrow
3 4 1 and forever
8 6 1 the moon
14 9 1 It's snowing today
13 17 1 How high the moon
15 16 1 Today tomorrow and forever
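Identifiers in the range [0, 10] point to tokens from the frequency-based dictionary.
Identifiers 11 and 12 are reserved for the unknown and end-of-sentence tokens, respectively.
Identifiers starting from 13 point to tokens from the BPE dictionary, numbered in the order in which they are listed. For example, the entry with identifiers 14 and 9 joins "It's snowing" (identifier 14, the second BPE entry) with "today" (identifier 9 in the frequency-based dictionary).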
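To make the identifier arithmetic in this example concrete, the following Python sketch rebuilds each BPE token from its two parts. The data is copied from the two dictionaries above. The function name expand is hypothetical, and joining the parts with a space is an assumption that matches the word-level tokens shown here.

```python
# Tokens of the frequency-based dictionary, indexed by identifier (0..10).
frequency_tokens = ["How", "It's", "Today", "and", "forever", "high",
                    "moon", "snowing", "the", "today", "tomorrow"]

# (token_id1, token_id2) pairs of the BPE dictionary, in file order.
bpe_pairs = [(0, 5), (1, 7), (2, 10), (3, 4), (8, 6),
             (14, 9), (13, 17), (15, 16)]

# Identifiers 11 and 12 are reserved, so BPE identifiers start at 13.
first_bpe_id = len(frequency_tokens) + 2


def expand(token_id: int) -> list:
    """Return the list of words that a token identifier stands for."""
    if token_id < len(frequency_tokens):
        return [frequency_tokens[token_id]]
    if token_id < first_bpe_id:
        raise ValueError(f"identifier {token_id} is reserved")
    left, right = bpe_pairs[token_id - first_bpe_id]
    return expand(left) + expand(right)


for offset in range(len(bpe_pairs)):
    token_id = first_bpe_id + offset
    print(token_id, " ".join(expand(token_id)))
```

Each printed line should match the token column of the corresponding BPE dictionary entry, for example identifier 19 expands to "How high the moon".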