Text processing JSON specification example
"text_processing_options" : {
"tokenizers" : [{
"tokenizer_id" : "Space",
"delimiter" : " ",
"lowercasing" : "true"
}],
"dictionaries" : [{
"dictionary_id" : "BiGram",
"gram_order" : "2"
}, {
"dictionary_id" : "Word",
"gram_order" : "1"
}],
"feature_processing" : {
"default" : [{
"dictionaries_names" : ["Word"],
"feature_calcers" : ["BoW"],
"tokenizers_names" : ["Space"]
}],
"1" : [{
"tokenizers_names" : ["Space"],
"dictionaries_names" : ["BiGram", "Word"],
"feature_calcers" : ["BoW"]
}, {
"tokenizers_names" : ["Space"],
"dictionaries_names" : ["Word"],
"feature_calcers" : ["NaiveBayes"]
}]
}
}
In this example:
-
A single split-by-delimiter tokenizer is specified. It lowercases tokens after splitting.
-
Two dictionaries: unigram (identified
Word
) and bigram (identifiedBiGram
). -
Two feature calcers are specified for the second text feature:
- BoW, which uses the
BiGram
andWord
dictionaries. - NaiveBayes, which uses the
Word
dictionary.
- BoW, which uses the
-
A single feature calcer is specified for all other text features: BoW, which uses the
Word
dictionary.