apply
Apply a previously trained dictionary to the input text.
Method call format
apply(data,
tokenizer=None,
unknown_token_policy=None)
Parameters
data
Description
The input text to apply the dictionary to.
Zero-, one-, or two-dimensional array-like data.
Data types
string, numpy.ndarray, pandas.DataFrame
Default value
Required parameter
tokenizer
Description
The tokenizer for text processing.
If this parameter is specified and one-dimensional data is passed, each element of the input is treated as a sentence and is tokenized.
Data types
Tokenizer
Default value
None (the input data is considered tokenized)
unknown_token_policy
Description
The policy for processing unknown tokens.
Possible values:
- Skip: All unknown tokens are omitted from the resulting list of token IDs (an unknown token yields an empty value).
- Insert: All unknown tokens are mapped to one shared ID, which is equal to the number of tokens in the dictionary.
Data types
string
Default value
Skip
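The difference between the two policies can be illustrated with a minimal pure-Python sketch. This is not CatBoost's implementation: the helper `apply_with_policy`, the `toy_dictionary` mapping, and the specific token IDs are hypothetical and only mirror the behavior described above.

```python
from typing import Dict, List


def apply_with_policy(token_to_id: Dict[str, int],
                      sentences: List[List[str]],
                      unknown_token_policy: str = "Skip") -> List[List[int]]:
    """Hypothetical helper that mimics the documented Skip/Insert policies."""
    if unknown_token_policy not in ("Skip", "Insert"):
        raise ValueError("unknown_token_policy must be 'Skip' or 'Insert'")

    # Insert: every unknown token shares one ID equal to the dictionary size.
    unknown_id = len(token_to_id)

    result = []
    for tokens in sentences:
        ids = []
        for token in tokens:
            if token in token_to_id:
                ids.append(token_to_id[token])
            elif unknown_token_policy == "Insert":
                ids.append(unknown_id)
            # Skip: the unknown token contributes nothing to the output.
        result.append(ids)
    return result


# Toy mapping; these IDs are illustrative assumptions, not real CatBoost output.
toy_dictionary = {"heir": 0, "his": 1, "tender": 2, "whatever": 3}
sentences = [["might"], ["bear"], ["his"], ["memory"]]

skipped = apply_with_policy(toy_dictionary, sentences, "Skip")
inserted = apply_with_policy(toy_dictionary, sentences, "Insert")

print(skipped)   # [[], [], [1], []]
print(inserted)  # [[4], [4], [1], [4]]
```

With Skip, the positions of unknown tokens are lost, so the output rows can be shorter than the input. With Insert, every input token keeps a slot in the output, which can be preferable when token positions matter downstream.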
Type of return value
A one- or two-dimensional array with token IDs.
Example
from catboost.text_processing import Dictionary
dictionary = Dictionary(occurence_lower_bound=0)\
    .fit(["his", "tender", "heir", "whatever"])
applied_model = dictionary.apply(["might", "bear", "his", "memory"])
print(applied_model)
Output:
[[], [], [1], []]
An example with input string tokenization
from catboost.text_processing import Dictionary, Tokenizer
# Tokenizer used to split the training string into tokens
tokenizer = Tokenizer()
dictionary = Dictionary(occurence_lower_bound=0)\
    .fit(["his tender heir whatever"], tokenizer=tokenizer)
applied_model = dictionary.apply(["might", "bear", "his", "memory"])
print(applied_model)
Output:
[[], [], [1], []]