BPE Dictionary

    Contains

    The trained BPE dictionary.

    Format

    Each line describes a single new token, formed by merging two existing tokens, as tab-separated fields:

    <token_id1><\t><token_id2><\t><number_of_occurrences><\t><token>
    
    • token_id1 — The token ID of the first part of the new token.

    • token_id2 — The token ID of the second part of the new token.

    • number_of_occurrences — The number of times that a token is found in the input text.

    • token — The value of the token.
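As an illustration of the line format above, here is a minimal Python sketch that splits one dictionary line into its four fields. The names `BpeEntry` and `parse_bpe_line` are hypothetical and not part of any product API. The sketch assumes the fields are separated by single tab characters, as the format line states.

```python
from typing import NamedTuple


class BpeEntry(NamedTuple):
    """One line of the BPE dictionary (hypothetical helper type)."""
    token_id1: int
    token_id2: int
    number_of_occurrences: int
    token: str


def parse_bpe_line(line: str) -> BpeEntry:
    """Split one tab-separated dictionary line into its four fields."""
    # maxsplit=3 keeps any tab inside the token text in the last field.
    fields = line.rstrip("\r\n").split("\t", 3)
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated fields, got {len(fields)}")
    id1, id2, count, token = fields
    return BpeEntry(int(id1), int(id2), int(count), token)


entry = parse_bpe_line("14\t9\t1\tIt's snowing today\n")
print(entry.token_id1, entry.token_id2, entry.number_of_occurrences, entry.token)
# prints: 14 9 1 It's snowing today
```

Splitting with a limit of three keeps spaces inside the token value intact, which matters for merged tokens such as "It's snowing today".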

    The value of a token identifier indicates the origin of the token: the frequency-based dictionary, a reserved token, or a line of the BPE dictionary.

    Example

    The following is the frequency-based dictionary:

    {"end_of_word_token_policy":"Insert","skip_step":"0","start_token_id":"0","token_level_type":"Word","dictionary_format":"id_count_token","end_of_sentence_token_policy":"Skip","gram_order":"1"}
    11
    0       1	How
    1       1	It's
    2       1	Today
    3       1	and
    4       1	forever
    5       1	high
    6       1	moon
    7       1	snowing
    8       1	the
    9       1	today
    10      1	tomorrow
    

    The following is the BPE dictionary:

    0      5     1	How high
    1      7     1	It's snowing
    2      10    1	Today tomorrow
    3      4     1	and forever
    8      6     1	the moon
    14     9     1	It's snowing today
    13     17    1	How high the moon
    15     16    1	Today tomorrow and forever
    

    Identifiers in the range [0, 10] point to tokens from the frequency-based dictionary.

    Identifiers 11 and 12 are reserved for the unknown and end-of-sentence tokens, respectively.

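    Identifiers starting from 13 point to tokens from the BPE dictionary, numbered in line order: 13 is "How high", 14 is "It's snowing", and so on up to 20. For example, the line with identifiers 14 and 9 merges "It's snowing" with "today".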
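The identifier ranges described above can be checked with a short Python sketch that expands any identifier back to the words of the frequency-based dictionary. The data is copied from the example. The names `FREQ_TOKENS`, `BPE_PAIRS`, `BPE_TOKENS`, and `expand` are hypothetical. The sketch assumes that BPE identifiers are assigned in line order right after the two reserved identifiers, and that merged parts are joined with a single space, as the word-level example suggests.

```python
from typing import List

# Tokens of the frequency-based dictionary from the example, indexed by ID 0..10.
FREQ_TOKENS = ["How", "It's", "Today", "and", "forever", "high",
               "moon", "snowing", "the", "today", "tomorrow"]

# Two IDs after the frequency-based tokens are reserved (11 and 12 here).
RESERVED_IDS = 2

# (token_id1, token_id2) of each BPE dictionary line, in line order.
BPE_PAIRS = [(0, 5), (1, 7), (2, 10), (3, 4), (8, 6), (14, 9), (13, 17), (15, 16)]

# Token values of the same lines, used only to verify the expansion.
BPE_TOKENS = ["How high", "It's snowing", "Today tomorrow", "and forever",
              "the moon", "It's snowing today", "How high the moon",
              "Today tomorrow and forever"]

FIRST_BPE_ID = len(FREQ_TOKENS) + RESERVED_IDS  # 13 in this example


def expand(token_id: int) -> List[str]:
    """Recursively expand an ID into frequency-based dictionary tokens."""
    if token_id < len(FREQ_TOKENS):
        return [FREQ_TOKENS[token_id]]
    if token_id < FIRST_BPE_ID:
        raise ValueError(f"ID {token_id} is reserved and has no parts")
    id1, id2 = BPE_PAIRS[token_id - FIRST_BPE_ID]
    return expand(id1) + expand(id2)


# Every BPE line should expand to exactly its own token value.
for offset, token in enumerate(BPE_TOKENS):
    assert " ".join(expand(FIRST_BPE_ID + offset)) == token

print(expand(19))
# prints: ['How', 'high', 'the', 'moon']
```

Identifier 19 resolves through 13 ("How high") and 17 ("the moon"), which shows how later lines build on tokens defined by earlier lines.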