New ways to explore your data
It’s time to release CatBoost v0.8. The aim of this release is to provide efficient tools for data and model exploration.
First of all, CatBoost now calculates per-object feature importances using the SHAP values algorithm from the ‘Consistent feature attribution for tree ensembles’ paper. As you can see in the picture below, it's very easy to understand the influence of each feature on a given object. See the tutorial for more details.
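Here is a minimal sketch of obtaining per-object SHAP values from Python, assuming the fstr_type='ShapValues' option of get_feature_importance; the toy dataset is made up purely for illustration:

```python
from catboost import CatBoostClassifier, Pool

# Toy data, made up for illustration.
X = [[1, 4], [2, 5], [3, 6], [4, 7]]
y = [0, 0, 1, 1]

train_pool = Pool(X, y)
model = CatBoostClassifier(iterations=50, logging_level='Silent')
model.fit(train_pool)

# One row per object: a SHAP value per feature, plus a final column
# holding the expected (base) value of the model prediction.
shap_values = model.get_feature_importance(train_pool, fstr_type='ShapValues')
print(shap_values.shape)  # (n_objects, n_features + 1)
```

The per-feature values in each row sum (together with the base value) to the model's raw prediction for that object, which is what makes them easy to read off one object at a time.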
Secondly, CatBoost now has a new algorithm for finding the most influential training samples for a given object. This mode calculates the effect of each object from the training dataset on the optimized metric values for the objects from the input dataset:
- Positive values reflect that the optimized metric increases.
- Negative values reflect that the optimized metric decreases.
The further a value deviates from 0, the bigger the impact that object has on the optimized metric. The method implements the approach described in the 'Finding Influential Training Samples for Gradient Boosted Decision Trees' paper. See the get_object_importance model method in the Python package and the ostr mode in the command-line version, as shown in the sketch below. A tutorial for Python is also available.
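A minimal sketch of calling get_object_importance, again on made-up toy data; the positional arguments (input pool first, training pool second) and the top_size keyword are assumptions about the method's signature:

```python
from catboost import CatBoostClassifier, Pool

# Toy train/test pools, made up for illustration.
train_pool = Pool([[1, 4], [2, 5], [3, 6], [4, 7]], [0, 0, 1, 1])
test_pool = Pool([[2, 6], [3, 5]], [0, 1])

model = CatBoostClassifier(iterations=50, logging_level='Silent')
model.fit(train_pool)

# For each object in test_pool, score how strongly each training object
# affects the optimized metric: positive scores mean the metric increases,
# negative scores mean it decreases.
indices, scores = model.get_object_importance(test_pool, train_pool, top_size=3)
```

Sorting training objects by score is a quick way to surface mislabeled or anomalous samples that hurt the metric most.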
The third cool feature in the 0.8 release is ‘save model as code’. For now you can save a model as Python code with categorical features support and as C++ code without categorical features support (categorical features support for C++ is coming soon). Use --model-format CPP,Python in the command-line version and model.save_model(OUTPUT_PYTHON_MODEL_PATH, format="python") in Python.
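A minimal sketch of the Python export end to end; the output path is hypothetical:

```python
from catboost import CatBoostClassifier, Pool

# Toy data, made up for illustration.
train_pool = Pool([[1, 4], [2, 5], [3, 6], [4, 7]], [0, 0, 1, 1])
model = CatBoostClassifier(iterations=50, logging_level='Silent')
model.fit(train_pool)

# Export the trained model as standalone Python code
# (OUTPUT_PYTHON_MODEL_PATH is a hypothetical path).
OUTPUT_PYTHON_MODEL_PATH = 'catboost_model.py'
model.save_model(OUTPUT_PYTHON_MODEL_PATH, format='python')
```

The resulting file contains an apply function with the trees baked in, so predictions can be served without a CatBoost dependency.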
For more details, check out the release notes on GitHub. As usual, we are eager to see your feedback and contributions.