There are two main sources of uncertainty: data uncertainty (also known as aleatoric uncertainty) and knowledge uncertainty (also known as epistemic uncertainty).
Data uncertainty arises due to the inherent complexity of the data, such as additive noise or overlapping classes. Importantly, data uncertainty cannot be reduced by collecting more training data.
Knowledge uncertainty arises when the model is given an input from a region that is either sparsely covered by the training data or far from the training data.
A single model trained with the PosteriorSampling parameter can be split into N models, called a virtual ensemble, which return N predicted values when applied to a document.
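For a document, consider the vector of probabilities $p = (p_0, \dots, p_{N-1})$ predicted by an ensemble of N models:
- Total uncertainty: $H(\bar{p}) = -\bar{p} \ln \bar{p} - (1 - \bar{p}) \ln(1 - \bar{p})$, where $\bar{p} = \frac{1}{N} \sum_{i=0}^{N-1} p_i$ and $H$ is the entropy.
- Data uncertainty: $\overline{H}(p) = \frac{1}{N} \sum_{i=0}^{N-1} H(p_i)$.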
- Knowledge uncertainty = Total uncertainty - Data uncertainty.
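The decomposition for probabilities can be illustrated with a small sketch in plain Python. It assumes binary classification, where each ensemble member outputs the probability of the positive class; the function names `binary_entropy` and `classification_uncertainty` are hypothetical helpers written for this example, not part of any library API.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli distribution with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # the limit of -p * ln(p) as p -> 0 is 0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def classification_uncertainty(probs):
    """Return (total, data, knowledge) uncertainty for one document.

    probs: positive-class probabilities, one per ensemble member
    (assumed binary classification).
    """
    n = len(probs)
    mean_p = sum(probs) / n
    total = binary_entropy(mean_p)                     # entropy of the mean
    data = sum(binary_entropy(p) for p in probs) / n   # mean of the entropies
    knowledge = total - data
    return total, data, knowledge


# Members agree that the classes overlap: uncertainty is all data uncertainty.
print(classification_uncertainty([0.5, 0.5, 0.5, 0.5]))

# Members are individually confident but disagree: knowledge uncertainty is large.
print(classification_uncertainty([0.05, 0.95, 0.05, 0.95]))
```

Both inputs have the same mean probability and therefore the same total uncertainty; only the split between data and knowledge uncertainty differs.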
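For a document, consider the vector of predicted values $a = (a_0, \dots, a_{N-1})$.
If the model was trained with the RMSEWithUncertainty loss function, the ensemble also predicts a vector of variances $s = (s_0, \dots, s_{N-1})$.
- Data uncertainty: $\bar{s} = \frac{1}{N} \sum_{i=0}^{N-1} s_i$.
- Knowledge uncertainty: $\mathrm{Var}(a) = \frac{1}{N} \sum_{i=0}^{N-1} (a_i - \bar{a})^2$, where $\bar{a} = \frac{1}{N} \sum_{i=0}^{N-1} a_i$.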
- Total uncertainty = Data uncertainty + Knowledge uncertainty.
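For predicted values and variances, the same bookkeeping can be sketched with the standard library. This is a minimal illustration, assuming the knowledge term is the population variance (divisor N) of the ensemble's predicted values; `regression_uncertainty` is a hypothetical helper name, not a library function.

```python
from statistics import fmean, pvariance


def regression_uncertainty(values, variances):
    """Return (total, data, knowledge) uncertainty for one document.

    values:    predicted values, one per ensemble member
    variances: predicted variances, one per ensemble member
    """
    if len(values) != len(variances):
        raise ValueError("values and variances must have the same length")
    data = fmean(variances)        # average predicted variance
    knowledge = pvariance(values)  # spread of predictions, divisor N (assumed)
    return data + knowledge, data, knowledge


total, data, knowledge = regression_uncertainty(
    [1.0, 2.0, 3.0, 4.0],  # ensemble members disagree on the value
    [0.5, 0.5, 0.5, 0.5],  # each member predicts the same noise level
)
print(total, data, knowledge)
```

If every member predicted the same value, the knowledge term would be zero and the total would reduce to the average predicted variance.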
Related papers:

- NGBoost: Natural Gradient Boosting for Probabilistic Prediction (2020)
T. Duan et al.
- Uncertainty in Gradient Boosting via Ensembles (2020)
A. Ustimenko, L. Prokhorenkova and A. Malinin
arXiv preprint arXiv:2006.10562