0.10.x and 0.9.x releases review
We have just released version 0.10.3. Let's dive into the details and discuss the new features added in our two latest major releases, 0.9.x and 0.10.x. We will also mention the breaking changes, so that you know where behaviour now differs.
Modes and objectives
The 0.9 release introduces several powerful ranking objectives: PairLogitPairwise, YetiRankPairwise and QueryCrossEntropy (GPU only). Furthermore, the ranking objectives QuerySoftMax and PairLogit were improved on CPU, and we added support for group weights.
The 0.10 release introduces a GPU implementation of both MultiClass modes (it's very fast) and reduces the memory consumption of our ranking modes.
Another new feature is that our pairwise modes now support automatic pairs generation.
You can look into the ranking tutorial to learn ranking and to experiment with all our ranking modes. The automatic pairs generation is mentioned in the tutorial together with other important details about ranking.
And finally, it is now possible to use MultiClass with string labels. In the cmd-line version they are inferred from the dataset, both on CPU and on GPU.
CatBoost now automatically calculates a good learning rate at the start of training for binary classification mode. After training has finished, you can analyse the results and adjust the selected learning rate, but it will already be a good value.
In addition, we improved accuracy for datasets with weights. These improvements are published in our 0.9 release.
As usual, we improved CatBoost in several ways. You can find benchmark numbers for some of the improvements below.
Time to load the data
Release 0.10 introduces a new way of loading data in Python using the FeaturesData structure. Using FeaturesData speeds up loading data both for training and for prediction. It is especially important for prediction, where it gives a speedup of around 10 to 20 times in Python.
We have examples of using FeaturesData structure in the classification tutorial.
We implemented speedups for all the modes of CPU training. In the table below you can see the cumulative improvement of CPU speed on several popular datasets:
| Dataset | CatBoost v0.8 | CatBoost v0.10.3 |
|---|---|---|
| Epsilon (classification) | 356 sec | 268 sec |
| MSRank (regression) | 291 sec | 183 sec |
Training of ranking modes on GPU has also been sped up. We implemented a 50% training speedup for datasets with many features and relatively few objects.
Metrics and feature importance
It is now easier to investigate your model, because we implemented a 2x speedup for AUC calculation in eval_metrics.
Furthermore, we sped up feature importance calculation. It used to be a bottleneck for GPU training; now it is not.
And finally, we implemented speedups of metric calculation on GPU. For example, on one of our internal datasets, training with the AUC eval metric and a test dataset of 2M objects went from 7 seconds to 0.2 seconds per iteration.
Applying the model
Applying the model is now 1.5x faster.
Release 0.9 contains improvements for model analysis, especially integration with SHAP. Let's list them:
- Support of feature combinations inside our SHAP values implementation.
- Support of MultiClass.
- The R package now provides SHAP values.
- 100x speedup for SHAP values calculation. Please note that in version 0.9 we removed the `Doc` type for feature importances, because SHAP values are now fast to calculate. Furthermore, calculation of SHAP values in CatBoost is much faster than in other GBDT libraries, because we use symmetric trees as base predictors. So you will not have to struggle with slow model analysis. To experiment with SHAP values, take a look at our tutorial and at the SHAP GitHub page with tutorials.
Furthermore, we added the prettified parameter to get_feature_importance(). With prettified=True the function returns the features with their names, sorted in descending order of importance.
Starting from 0.9 CatBoost supports CUDA 9.1 only and starting from 0.10 it has a static linkage with CUDA, so you don't have to install CUDA to get CatBoost working on GPU. Hooray!
Good news for R users. We implemented GPU support in our R package. Use task_type='GPU' parameter to enable GPU training.
Furthermore it's possible now to calculate and visualise custom_metric during training on GPU the same way as it was done for CPU. Use our Jupyter visualization, CatBoost viewer or TensorBoard.
And several additional features:
- Support for external borders on GPU for cmd-line version.
- Added get_gpu_device_count() method to python package. This is a way to check if your CUDA devices are available.
We added the use_weights parameter to metrics. By default all metrics except AUC use weights, but you can now disable this. Note also that both the 0.9 and 0.10 releases introduce a lot of new metrics; take a look at them here.
Simple but useful feature. We've added snapshot time intervals. In many cases it will work faster if you save snapshot every 5 or 10 minutes (default value) instead of saving it on every iteration.
We have also added empty values support. Now empty value of a feature is treated as a NaN value.
To simplify work with the model it is possible now to save any meta-information into it.
Python package improvements
One of the major things that many people asked for was sklearn GridSearchCV support. We introduced it in the 0.10 release. To use GridSearchCV with a dataset that has categorical features, pass the categorical feature indices when constructing the estimator, and then use the estimator in GridSearchCV. GridSearchCV support and better sklearn support in general were the reason for several breaking changes we made in 0.10:
- Removed the model file argument from the estimator constructor.
- Removed the following attributes and replaced them with functions:
  - is_fitted_ => is_fitted()
  - metadata_ => get_metadata()
The next requested feature that we implemented is pool slicing. See the documentation for the method pool.slice(doc_indices).
We made several additions for ROC curves: a new utility method, get_roc_curve, builds the ROC curve, and the decision boundary can now be selected automatically from it. You can select the best classification boundary given the maximum FPR or FNR that you allow the model. Take a look at catboost.select_threshold(self, data=None, curve=None, FPR=None, FNR=None, thread_count=-1). You can also calculate FPR and FNR for each boundary value.
New ways of making predictions
The latest releases provide several new ways to apply the model. It is possible to save a model as Python code and as JSON. For Java users we have implemented a JNI wrapper.
Since 0.10 release we don't support Python 3.4 anymore. Please use Python 3.5, 3.6, 3.7 versions.
CatBoostClassifier and CatBoostRegressor get_params() method now returns only the params that were explicitly set when constructing the object. That means that CatBoostClassifier and CatBoostRegressor get_params() will not contain 'loss_function' if it was not specified.
We have removed calc_feature_importance parameter from Python and R. The reason for that is that now feature importance calculation is almost free, so we always calculate feature importances. Previously you could disable it using this parameter if it was slowing down your training.
In R package we have changed parameter name `target` to `label` in method save_pool().
We have also made a lot of stability improvements, have improved usability of the library, added new parameter synonyms and improved input data validations.
This is not the full list of changes and improvements. Please visit our release page to learn more. As you can see, we did a lot of work to bring all this to you. Thank you for your feedback, issues and PRs. Stay tuned with CatBoost!