Text processing JSON specification example

    "text_processing_options" : {
        "tokenizers" : [{
            "tokenizer_id" : "Space",
            "delimiter" : " ",
            "lowercasing" : "true"
        }],
    
        "dictionaries" : [{
            "dictionary_id" : "BiGram",
            "gram_order" : "2"
        }, {
            "dictionary_id" : "Word",
            "gram_order" : "1"
        }],
    
        "feature_processing" : {
            "default" : [{
                "dictionaries_names" : ["Word"],
                "feature_calcers" : ["BoW"],
                "tokenizers_names" : ["Space"]
            }],
            
            "1" : [{
                "tokenizers_names" : ["Space"],
                "dictionaries_names" : ["BiGram", "Word"],
                "feature_calcers" : ["BoW"]
            }, {
                "tokenizers_names" : ["Space"],
                "dictionaries_names" : ["Word"],
                "feature_calcers" : ["NaiveBayes"]
            }]
        }
    }
    

    In this example:

    • A single split-by-delimiter tokenizer is specified. It lowercases tokens after splitting.

    • Two dictionaries: unigram (identified Word) and bigram (identified BiGram).

    • Two feature calcers are specified for the second text feature:

      • BoW, which uses the BiGram and Word dictionaries.
      • NaiveBayes, which uses the Word dictionary.
    • A single feature calcer is specified for all other text features: BoW, which uses the Word dictionary.