CatBoost papers at NeurIPS 2018
In December 2018, at the NeurIPS conference in Montreal, the Yandex team presented two papers related to CatBoost, an open-source machine learning library developed by Yandex.
The first paper, “CatBoost: unbiased boosting with categorical features”, describes two of the most important features of CatBoost. These features help avoid target leakage, a special kind of overfitting, both in the gradient boosting algorithm itself and when converting categorical features to numerical ones for more effective use within machine learning algorithms. Both features rely on the ordering principle: we order the training examples (by time for temporal data, randomly otherwise) and, to obtain a prediction within a boosting iteration or the value of a numerical feature for some example, we use only the examples that precede it rather than all of them, which makes the obtained value unbiased.
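The ordering principle for categorical features can be illustrated with a minimal sketch. It is not the CatBoost implementation or its API: the function name and the `prior`, `prior_weight`, and `order` parameters are assumptions made for illustration, and the smoothed-mean formula is one common way to define such a statistic. Each example's category is replaced by a target statistic computed only from examples that precede it in the chosen order, so the example's own target never contributes to its encoded value.

```python
import random


def ordered_target_statistic(categories, targets, prior,
                             prior_weight=1.0, order=None, seed=0):
    """Encode a categorical feature using only preceding examples.

    Illustrative sketch only; names and parameters are hypothetical.
    `order` lists example indices from earliest to latest (for example,
    sorted by timestamp). If it is None, a random permutation is used.
    """
    n = len(categories)
    if order is None:
        order = list(range(n))
        random.Random(seed).shuffle(order)

    target_sums = {}   # running sum of targets per category
    counts = {}        # running number of examples per category
    encoded = [0.0] * n

    for idx in order:
        cat = categories[idx]
        s = target_sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # Smoothed mean over *earlier* examples of the same category.
        # With no earlier examples this falls back to the prior.
        encoded[idx] = (s + prior_weight * prior) / (c + prior_weight)
        # Only now does the current example's target become visible
        # to the examples that follow it in the order.
        target_sums[cat] = s + targets[idx]
        counts[cat] = c + 1

    return encoded


if __name__ == "__main__":
    cats = ["a", "a", "b", "a"]
    ys = [1, 0, 1, 1]
    # Use the natural order, as one would for data sorted by time.
    print(ordered_target_statistic(cats, ys, prior=0.5, order=[0, 1, 2, 3]))
```

A plain (non-ordered) target statistic would compute each category's mean over all examples, including the one being encoded, and that is exactly where the target leaks into the feature.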
In the experiments described in the paper, each of these features improves the quality of classification models trained by CatBoost. Moreover, combining them allows CatBoost to significantly outperform XGBoost and LightGBM in quality.
The second paper, with the rather provocative title "Why every GBDT speed benchmark is wrong", provides a comprehensive study of different ways to benchmark the speed of gradient boosted decision tree (GBDT) algorithms. It shows the main problems with several straightforward ways of benchmarking, explains why speed benchmarking is a challenging task, and provides a set of reasonable requirements that a benchmark should meet to be fair and useful.
We were happy to hear a lot of questions from the audience about our results and the library itself. If you are interested too, see the CatBoost library source code at http://github.com/catboost/catboost and the benchmarks source code at https://github.com/catboost/benchmarks.