Text processing JSON specification example

"text_processing_options" : {
    "tokenizers" : [{
        "tokenizer_id" : "Space",
        "delimiter" : " ",
        "lowercasing" : "true"
    }],

    "dictionaries" : [{
        "dictionary_id" : "BiGram",
        "gram_order" : "2"
    }, {
        "dictionary_id" : "Word",
        "gram_order" : "1"
    }],

    "feature_processing" : {
        "default" : [{
            "dictionaries_names" : ["Word"],
            "feature_calcers" : ["BoW"],
            "tokenizers_names" : ["Space"]
        }],

        "1" : [{
            "tokenizers_names" : ["Space"],
            "dictionaries_names" : ["BiGram", "Word"],
            "feature_calcers" : ["BoW"]
        }, {
            "tokenizers_names" : ["Space"],
            "dictionaries_names" : ["Word"],
            "feature_calcers" : ["NaiveBayes"]
        }]
    }
}

In this example:

  • A single split-by-delimiter tokenizer is specified. It lowercases tokens after splitting.

  • Two dictionaries: unigram (identified Word) and bigram (identified BiGram).

  • Two feature calcers are specified for the second text feature:

    • BoW, which uses the BiGram and Word dictionaries.
    • NaiveBayes, which uses the Word dictionary.
  • A single feature calcer is specified for all other text features: BoW, which uses the Word dictionary.