CatBoost is a fast, scalable, high-performance open-source library for gradient boosting on decision trees.


CatBoost papers at NeurIPS 2018

December 17, 2018

In December 2018, at the NeurIPS conference in Montreal, the Yandex team presented two papers related to CatBoost, an open-source machine learning library developed by Yandex.

The first paper, “CatBoost: unbiased boosting with categorical features”, describes two of the most important techniques in CatBoost: one avoids target leakage, a special kind of overfitting, in the gradient boosting algorithm, and the other converts categorical features to numerical ones for more effective use within machine learning algorithms. Both rely on the ordering principle: we order the training examples (by time for temporal data, and randomly otherwise) and, to obtain a prediction within a boosting iteration or the value of a numerical feature for some example, we use not all of the examples but only the preceding ones, which makes the obtained value unbiased.
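To make the ordering principle concrete, here is a minimal sketch of encoding a categorical feature with ordered target statistics. The function name, the smoothed-mean formula, and the prior parameters are illustrative assumptions, not CatBoost's exact implementation; the key point is that example i is encoded using only the targets of examples that precede it in the ordering.

```python
import numpy as np

def ordered_target_statistic(categories, targets, prior=0.5, prior_weight=1.0):
    """Illustrative ordered target statistic (not CatBoost's exact scheme).

    For each example i, the encoding uses only the examples that precede i
    in the (temporal or random) ordering, so the encoded value never
    depends on example i's own target and stays unbiased.
    """
    counts = {}  # category -> number of preceding occurrences
    sums = {}    # category -> sum of targets over preceding occurrences
    encoded = np.empty(len(categories), dtype=float)
    for i, (c, y) in enumerate(zip(categories, targets)):
        n = counts.get(c, 0)
        s = sums.get(c, 0.0)
        # Smoothed mean target over the preceding examples of category c;
        # with no preceding occurrences, it falls back to the prior.
        encoded[i] = (s + prior_weight * prior) / (n + prior_weight)
        counts[c] = n + 1
        sums[c] = s + y
    return encoded

# Toy usage, with a random ordering already applied to the data.
print(ordered_target_statistic(["a", "b", "a", "a", "b"], [1, 0, 1, 0, 1]))
```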

In the experiments described in the paper, each technique individually improves the quality of classification models trained by CatBoost. Moreover, the combination of the two allows CatBoost to significantly outperform XGBoost and LightGBM in quality.

The second paper, with the rather provocative title “Why every GBDT speed benchmark is wrong”, provides a comprehensive study of the different ways to benchmark the speed of gradient boosted decision tree implementations. It shows the main problems with several straightforward benchmarking approaches, explains why speed benchmarking is a challenging task, and provides a set of reasonable requirements for a benchmark to be fair and useful.

We were happy to receive many questions from the audience about our results and about the library itself. If you are also interested, see the CatBoost source code at http://github.com/catboost/catboost and the benchmarks source code at https://github.com/catboost/benchmarks.

Latest News

Review of the 0.10.x and 0.9.x releases

The CatBoost team continues to ship improvements and speedups. What new and interesting features have we added in our two latest releases, and why is it worth trying CatBoost now? We discuss it in this post.

New ways to explore your data

A superb new tool for exploring feature importance, a new algorithm for finding the most influential training samples, the ability to save your model as C++ or Python code, and more. Check out the CatBoost v0.8 details inside!
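As an aside, exporting a trained model as standalone code takes a single save_model call with the desired format. The snippet below is a minimal sketch; the tiny synthetic dataset and output file names are made up for illustration.

```python
from catboost import CatBoostClassifier, Pool

# Tiny made-up dataset, just enough to fit a model.
train = Pool(data=[[1, 4], [2, 5], [3, 6], [4, 7]], label=[0, 0, 1, 1])
model = CatBoostClassifier(iterations=10, verbose=False)
model.fit(train)

# Export the trained model as standalone Python or C++ code.
model.save_model("model.py", format="python")
model.save_model("model.cpp", format="cpp")
```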

CatBoost on GPU talk at GTC 2018

Come and listen to our talk about the fastest GPU implementation of gradient boosting at GTC 2018 Silicon Valley! GTC takes place on March 26–29 and is an excellent opportunity to learn more about CatBoost performance on GPU.
