Frequency Based Dictionary

Contains

The trained Frequency Based Dictionary.

Header format

The first row of the output file contains the training parameters as a single-line JSON object.

Format:

{"key_1":"value_1","key_2":"value_2",...,"key_N":"value_N"}
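As an illustration, the header row can be read with any JSON parser. The sketch below uses Python's standard json module. The header string is a shortened, hypothetical sample whose keys are taken from the example in this section, and the code assumes that all values are stored as JSON strings, as they are in that example.

```python
import json

# Hypothetical, shortened header row (keys borrowed from the example in this section).
header_line = '{"token_level_type":"Word","gram_order":"1","start_token_id":"0"}'

params = json.loads(header_line)

# Values are assumed to be strings, so numeric parameters are converted explicitly.
gram_order = int(params["gram_order"])
start_token_id = int(params["start_token_id"])

print(params["token_level_type"], gram_order, start_token_id)
```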

Dictionary format

The second row contains the number of tokens in the dictionary.

Each row starting from the third contains information regarding a single token.

Format:

<token_ID><\t><number_of_occurrences><\t><token>

token_ID: A zero-based token identifier. Tokens are sorted in case-sensitive order.
number_of_occurrences: The number of times that the token is found in the input text.
token: The value of the token.
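The ordering can be illustrated with a short Python sketch. It assumes that "case-sensitive order" means plain code-point comparison, in which uppercase letters sort before lowercase ones. This assumption is consistent with the example in this section but is not stated by the format itself. The token list is taken from that example.

```python
# Tokens from the example in this section, in arbitrary input order.
tokens = ["the", "moon", "How", "today", "Today", "It's",
          "high", "and", "forever", "snowing", "tomorrow"]

# Python's default string sort compares code points, so "Today" sorts before "and".
ordered = sorted(tokens)

# Assign zero-based identifiers in sorted order.
for token_id, token in enumerate(ordered):
    print(f"{token_id}\t{token}")
```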

Example

{"end_of_word_token_policy":"Insert","skip_step":"0","start_token_id":"0","token_level_type":"Word","dictionary_format":"id_count_token","end_of_sentence_token_policy":"Skip","gram_order":"1"}
11
0	1	How
1	1	It's
2	1	Today
3	1	and
4	1	forever
5	1	high
6	1	moon
7	1	snowing
8	1	the
9	1	today
10	1	tomorrow
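A file in this layout can be read back with a few lines of code. The following is a minimal sketch, not an official reader: the parse_dictionary function and the Token type are hypothetical names introduced here, and the sample text is a shortened version of the example above. It assumes tab-separated token rows and a JSON header, as described in this section.

```python
import json
from typing import NamedTuple


class Token(NamedTuple):
    token_id: int
    count: int
    token: str


def parse_dictionary(text: str):
    """Parse a dictionary file: JSON header, token count, then token rows."""
    lines = text.split("\n")
    params = json.loads(lines[0])      # row 1: training parameters
    size = int(lines[1])               # row 2: number of tokens
    tokens = []
    for line in lines[2:]:             # rows 3+: one token per row
        if not line:
            continue                   # skip a trailing empty line
        token_id, count, token = line.split("\t", 2)
        tokens.append(Token(int(token_id), int(count), token))
    if len(tokens) != size:
        raise ValueError(f"expected {size} tokens, found {len(tokens)}")
    return params, tokens


# Shortened sample in the same layout as the example above.
sample = (
    '{"token_level_type":"Word","dictionary_format":"id_count_token","gram_order":"1"}\n'
    "3\n"
    "0\t1\tHow\n"
    "1\t1\tIt's\n"
    "2\t1\tToday\n"
)

params, tokens = parse_dictionary(sample)
for t in tokens:
    print(t.token_id, t.count, t.token)
```

Checking the parsed row count against the second row is a cheap way to detect a truncated file.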
