apply

Apply a previously trained dictionary to the input text.

Method call format

apply(data,
      tokenizer=None,
      unknown_token_policy=None)

Parameters

data

Description

The input text to apply the dictionary to.

Zero-, one-, or two-dimensional array-like data.

Data types

string, numpy.array, pandas.DataFrame

Default value

Required parameter

tokenizer

Description

The tokenizer for text processing.

If this parameter is specified and one-dimensional data is passed, each element of the input is treated as a sentence and is tokenized.

Data types

Tokenizer

Default value

None (the input data is considered tokenized)
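The effect of the default can be sketched in plain Python (an illustration only, not the CatBoost implementation): without a tokenizer the input is assumed to be pre-tokenized, while with one each element of a one-dimensional input is split into tokens first. Here `str.split` stands in for a real Tokenizer.

```python
# Illustrative sketch, not the CatBoost implementation:
# tokenizer=None means the input is already a list of tokens;
# otherwise each element is treated as a sentence and tokenized.

def tokenize_if_needed(data, tokenizer=None):
    if tokenizer is None:
        return data  # input data is considered tokenized
    return [tokenizer(sentence) for sentence in data]

split_on_space = str.split  # stands in for a Tokenizer here

print(tokenize_if_needed(["his", "tender"]))
# ['his', 'tender']
print(tokenize_if_needed(["his tender heir"], split_on_space))
# [['his', 'tender', 'heir']]
```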

unknown_token_policy

Description

The policy for processing unknown tokens.

Possible values:

  • Skip — Unknown tokens are omitted from the resulting list of token IDs.
  • Insert — All unknown tokens are mapped to the same ID, which is equal to the number of tokens in the dictionary.

Data types

string

Default value

Skip
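The difference between the two policies can be sketched in plain Python (an illustration of the mapping described above, not the CatBoost implementation), assuming a dictionary that maps known tokens to IDs:

```python
# Illustrative sketch, not the CatBoost implementation:
# Skip omits unknown tokens; Insert maps every unknown token to a
# single shared ID equal to the number of tokens in the dictionary.

def apply_policy(tokens, token_to_id, policy="Skip"):
    unknown_id = len(token_to_id)  # Insert: ID equals the dictionary size
    ids = []
    for token in tokens:
        if token in token_to_id:
            ids.append(token_to_id[token])
        elif policy == "Insert":
            ids.append(unknown_id)
        # policy == "Skip": the unknown token is omitted
    return ids

vocab = {"heir": 0, "his": 1, "tender": 2, "whatever": 3}
print(apply_policy(["might", "bear", "his"], vocab, "Skip"))    # [1]
print(apply_policy(["might", "bear", "his"], vocab, "Insert"))  # [4, 4, 1]
```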

Type of return value

A one- or two-dimensional array with token IDs.

Example

from catboost.text_processing import Dictionary

dictionary = Dictionary(occurence_lower_bound=0)\
    .fit(["his", "tender", "heir", "whatever"])

applied_model = dictionary.apply(["might", "bear", "his", "memory"])

print(applied_model)

Output:

[[], [], [1], []]

An example with input string tokenization

from catboost.text_processing import Dictionary, Tokenizer

tokenized = Tokenizer()

dictionary = Dictionary(occurence_lower_bound=0)\
    .fit(["his tender heir whatever"], tokenizer=tokenized)

applied_model = dictionary.apply(["might bear his memory"], tokenizer=tokenized)

print(applied_model)

Output:

[[1]]