Text processing JSON specification example

"text_processing_options" : {
    "tokenizers" : [{
        "tokenizer_id" : "Space",
        "delimiter" : " ",
        "lowercasing" : "true"
    }],

    "dictionaries" : [{
        "dictionary_id" : "BiGram",
        "gram_order" : "2"
    }, {
        "dictionary_id" : "Word",
        "gram_order" : "1"
    }],

    "feature_processing" : {
        "default" : [{
            "dictionaries_names" : ["Word"],
            "feature_calcers" : ["BoW"],
            "tokenizers_names" : ["Space"]
        }],
        
        "1" : [{
            "tokenizers_names" : ["Space"],
            "dictionaries_names" : ["BiGram", "Word"],
            "feature_calcers" : ["BoW"]
        }, {
            "tokenizers_names" : ["Space"],
            "dictionaries_names" : ["Word"],
            "feature_calcers" : ["NaiveBayes"]
        }]
    }
}
In this example:
  • A single split-by-delimiter tokenizer is specified. It lowercases tokens after splitting.
  • Two dictionaries: unigram (identified “Word”) and bigram (identified “BiGram”).
  • Two feature calcers are specified for the second text feature:
    • BoW, which uses the “BiGram” and “Word” dictionaries.
    • NaiveBayes, which uses the “Word” dictionary.
  • A single feature calcer is specified for all other text features: BoW, which uses the “Word” dictionary.