CatBoost is a machine learning algorithm that uses gradient boosting on decision trees.
This package provides classes that implement interfaces from the
Apache Spark Machine Learning Library (MLlib).
For binary and multiclass classification problems, use CatBoostClassifier;
for regression, use CatBoostRegressor.
These classes implement the usual fit method of org.apache.spark.ml.Predictor, which accepts a single
org.apache.spark.sql.DataFrame for training. You can also use another fit method that accepts
additional datasets for computing evaluation metrics and for overfitting detection, similarly to CatBoost's
other APIs.
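The two fit variants can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it assumes the classes are imported from ai.catboost.spark, that the default column names are "features" and "label", and that a local Spark session is acceptable. The object name FitExample and the toy data are hypothetical, and Pool is the dataset wrapper described below.

```scala
import org.apache.spark.ml.linalg.{SQLDataTypes, Vectors}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import ai.catboost.spark._

// Hypothetical example object; only the CatBoost and Spark classes come from the libraries.
object FitExample {
  // Assumed layout: a vector column "features" and a string column "label".
  private val schema = StructType(Seq(
    StructField("features", SQLDataTypes.VectorType),
    StructField("label", StringType)
  ))

  private def toDf(spark: SparkSession, rows: Seq[Row]): DataFrame =
    spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

  def run(spark: SparkSession): DataFrame = {
    val trainDf = toDf(spark, Seq(
      Row(Vectors.dense(0.1, 0.2, 0.11), "0"),
      Row(Vectors.dense(0.97, 0.82, 0.33), "1"),
      Row(Vectors.dense(0.13, 0.22, 0.23), "1"),
      Row(Vectors.dense(0.8, 0.62, 0.0), "0")
    ))
    val evalDf = toDf(spark, Seq(
      Row(Vectors.dense(0.22, 0.33, 0.9), "1"),
      Row(Vectors.dense(0.11, 0.1, 0.21), "0"),
      Row(Vectors.dense(0.77, 0.0, 0.0), "1")
    ))

    // Variant 1: the standard Predictor.fit on a single DataFrame.
    val plainModel = new CatBoostClassifier().setIterations(20).fit(trainDf)
    plainModel.transform(evalDf).show()

    // Variant 2: fit on a training Pool plus an array of evaluation Pools,
    // which are used for evaluation metrics and overfitting detection.
    val modelWithEval = new CatBoostClassifier()
      .setIterations(20)
      .fit(new Pool(trainDf), Array[Pool](new Pool(evalDf)))

    modelWithEval.transform(evalDf)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("FitExample").getOrCreate()
    run(spark).show()
    spark.stop()
  }
}
```

CatBoostRegressor is used the same way, with a numeric label column.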
This package also contains the Pool class, which is CatBoost's abstraction of a dataset.
It contains additional information compared to a simple org.apache.spark.sql.DataFrame.
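As one illustration of such additional information, a Pool can record which column holds per-object weights. The sketch below is written under assumptions: the setter names setLabelCol, setFeaturesCol and setWeightCol follow the usual Spark ML parameter conventions, the FloatType weight column is an assumed accepted type, and the object name PoolExample and the data are hypothetical.

```scala
import org.apache.spark.ml.linalg.{SQLDataTypes, Vectors}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{FloatType, StringType, StructField, StructType}
import ai.catboost.spark._

// Hypothetical example object wrapping a DataFrame into a Pool.
object PoolExample {
  def build(spark: SparkSession): Pool = {
    val schema = StructType(Seq(
      StructField("features", SQLDataTypes.VectorType),
      StructField("label", StringType),
      StructField("weight", FloatType) // assumption: weights are stored as floats
    ))
    val rows = Seq(
      Row(Vectors.dense(0.1, 0.2, 0.11), "0", 1.0f),
      Row(Vectors.dense(0.97, 0.82, 0.33), "1", 2.0f),
      Row(Vectors.dense(0.13, 0.22, 0.23), "1", 0.5f)
    )
    val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

    // The Pool keeps the DataFrame plus the role of each column.
    new Pool(df)
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setWeightCol("weight")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("PoolExample").getOrCreate()
    val pool = build(spark)
    pool.data.printSchema() // the underlying DataFrame stays accessible via data
    spark.stop()
  }
}
```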
It is also possible to create a Pool with quantized features before training by calling the quantize method.
This is useful if the dataset is used for training multiple times and the quantization parameters do not
change. A pre-quantized Pool lets you cache the quantized feature data and so avoid re-running
the feature quantization step at the start of each training.
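A rough sketch of this reuse pattern: quantize once, then train several models on the same quantized Pool. It assumes that quantize can be called without arguments to use default quantization parameters and that the default column names are "features" and "label"; the object name QuantizeExample, the iteration counts and the data are hypothetical.

```scala
import org.apache.spark.ml.linalg.{SQLDataTypes, Vectors}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import ai.catboost.spark._

// Hypothetical example object: one quantization pass shared by two trainings.
object QuantizeExample {
  def run(spark: SparkSession): Seq[CatBoostClassificationModel] = {
    val schema = StructType(Seq(
      StructField("features", SQLDataTypes.VectorType),
      StructField("label", StringType)
    ))
    val rows = Seq(
      Row(Vectors.dense(0.1, 0.2, 0.11), "0"),
      Row(Vectors.dense(0.97, 0.82, 0.33), "1"),
      Row(Vectors.dense(0.13, 0.22, 0.23), "1"),
      Row(Vectors.dense(0.8, 0.62, 0.0), "0")
    )
    val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

    // Quantize the raw features once, with default quantization parameters.
    val quantizedPool = new Pool(df).quantize()

    // Train two models with different settings on the same quantized data;
    // neither run has to repeat the quantization step.
    Seq(10, 20).map { iterations =>
      new CatBoostClassifier()
        .setIterations(iterations)
        .fit(quantizedPool, Array[Pool]())
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("QuantizeExample").getOrCreate()
    val models = run(spark)
    println(s"Trained ${models.size} models on one quantized Pool")
    spark.stop()
  }
}
```

This only pays off while the quantization parameters stay fixed; if they change, the Pool has to be quantized again.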
Detailed documentation is available at https://catboost.ai/docs/