Boosting is a method which builds a prediction model $F^T$ as an ensemble of weak learners: $F^T = \sum_{t=1}^{T} f^t$.
In our case, $f^t$ is a decision tree. Trees are built sequentially, and each new tree is built to approximate the negative gradients $g_i$ of the loss function $l$ at the predictions of the current ensemble: $g_i = -\left.\frac{\partial l(a, y_i)}{\partial a}\right|_{a = F^{T-1}(x_i)}$.
Thus, boosting performs a gradient descent optimization of the loss function $L$. The quality of the gradient approximation is measured by a score function $\mathrm{Score}(a, g)$, denoted $S(a, g)$ below.
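To make the setup concrete, here is a minimal sketch of one boosting step (not CatBoost's implementation): the negative gradients are computed at the current ensemble's predictions and become the targets that the next tree's values should approximate. The squared loss and all names below are illustrative assumptions.

```python
import numpy as np

def negative_gradients(predictions, targets):
    """g_i = -dl(a, y_i)/da evaluated at a = F^{T-1}(x_i), here for the
    illustrative squared loss l(a, y) = 0.5 * (a - y)^2, so g_i = y_i - a_i."""
    return targets - predictions

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100)
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.1, size=100)

current_predictions = np.zeros_like(y)      # F^{T-1}: an empty ensemble here
g = negative_gradients(current_predictions, y)

# The next tree f^T is chosen so that its values a_i = f^T(x_i) approximate
# the gradients g_i well according to the score function S(a, g).
```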
Let's suppose that it is required to add a new tree to the ensemble. A score function is required in order to choose between candidate trees. Given a candidate tree $f$, let $a_i$ denote $f(x_i)$, $w_i$ the weight of the $i$-th object, and $g_i$ the corresponding gradient of $l$. CatBoost supports several score functions; the one used throughout this section is the L2 score function, $S(a, g) = -\sum_i w_i (a_i - g_i)^2$, which measures how closely the tree's values approximate the gradients.
Let's suppose that it is required to find the structure for a tree $f$ of depth 1. The structure of such a tree is determined by the index $j$ of some feature and a border value $c$. Let $x_{i,j}$ be the value of the $j$-th feature on the $i$-th object, and let $a_{left}$ and $a_{right}$ be the values at the leaves of $f$. Then $f(x_i)$ equals $a_{left}$ if $x_{i,j} \le c$ and $a_{right}$ if $x_{i,j} > c$. Now the goal is to find the best $j$ and $c$ in terms of the chosen score function.
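For illustration, a depth-1 tree is just a single threshold rule; the small sketch below spells it out (the helper name is hypothetical).

```python
def stump_predict(x_row, j, c, a_left, a_right):
    """Depth-1 tree: a_left if the j-th feature value is <= the border c,
    otherwise a_right."""
    return a_left if x_row[j] <= c else a_right

print(stump_predict([0.3, 1.7], j=1, c=1.0, a_left=-0.5, a_right=0.5))  # 0.5
```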
For the L2 score function the formula takes the following form: $j^*, c^* = \arg\max_{j,c} S(a^*, g)$, where the leaf values are set to the weighted averages of the gradients, $a^*_{left} = \frac{\sum_{i:\, x_{i,j} \le c} w_i g_i}{\sum_{i:\, x_{i,j} \le c} w_i}$ and $a^*_{right} = \frac{\sum_{i:\, x_{i,j} > c} w_i g_i}{\sum_{i:\, x_{i,j} > c} w_i}$.
The latter argmax can be calculated by a brute-force search over all features $j$ and candidate border values $c$.
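A possible shape of that brute-force search is sketched below. It assumes the L2 score with the optimal leaf values set to the weighted averages of the gradients; the function names, the way borders are enumerated, and the toy data are illustrative, not CatBoost internals.

```python
import numpy as np

def l2_score(a, g, w):
    """L2 score S(a, g) = -sum_i w_i * (a_i - g_i)^2 (higher is better)."""
    return -np.sum(w * (a - g) ** 2)

def best_depth1_split(X, g, w):
    """Brute-force search over every feature j and every border c taken from
    the observed feature values; leaf values are the weighted averages of the
    gradients in the corresponding leaves."""
    n, n_features = X.shape
    best = (None, None, -np.inf)               # (j*, c*, score)
    for j in range(n_features):
        for c in np.unique(X[:, j])[:-1]:      # every value except the max
            left = X[:, j] <= c
            right = ~left
            a = np.empty(n)
            a[left] = np.average(g[left], weights=w[left])
            a[right] = np.average(g[right], weights=w[right])
            score = l2_score(a, g, w)
            if score > best[2]:
                best = (j, c, score)
    return best

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
g = np.where(X[:, 0] > 0.2, 1.0, -1.0) + rng.normal(scale=0.1, size=200)
w = np.ones(200)
print(best_depth1_split(X, g, w))              # expected: j* = 0, c* near 0.2
```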
The situation is slightly more complex when the tree depth is bigger than 1:
L2 score function: $S$ is converted into a sum over leaves, $S(a, g) = \sum_{leaf} S(a_{leaf}, g_{leaf})$. The next step is to find $j^*, c^* = \arg\max_{j,c} S(\bar{a}, g)$, where $\bar{a}$ are the optimal values in the leaves after the $(j, c)$ split.
Depthwise and Lossguide methods: $j, c$ are sets $\{j_k\}, \{c_k\}$, where $k$ is the index of the leaf, so the score function takes the form $S(\bar{a}, g) = \sum_{l \in leaves} S(\bar{a}(j_l, c_l), g_l)$. Since $S(leaf)$ is a convex function, the splits $(j_{k_1}, c_{k_1})$ and $(j_{k_2}, c_{k_2})$ for different leaves can be searched for separately by finding, for each leaf, the optimal $j^*, c^* = \arg\max_{j,c}\{S(leaf_{left}) + S(leaf_{right}) - S(leaf_{before\_split})\}$ (see the sketch after this list).
SymmetricTree method: the same $j, c$ are sought for every leaf, so the total sum over all leaves, $S(a, g) = \sum_{leaf} S(leaf)$, has to be optimized.
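The sketch below illustrates the difference, again under the L2 score with weighted-average leaf values: per-leaf gains $S(leaf_{left}) + S(leaf_{right}) - S(leaf_{before\_split})$ are maximized independently for Depthwise/Lossguide, while a single shared $(j, c)$ maximizing the summed gain over all leaves is chosen for SymmetricTree (equivalent to maximizing the total sum over leaves, since $S(leaf_{before\_split})$ does not depend on the candidate). All names and the candidate enumeration are illustrative assumptions.

```python
import numpy as np

def leaf_score(g, w):
    """L2 score of a leaf whose value is the weighted average of its gradients."""
    a_star = np.average(g, weights=w)
    return -np.sum(w * (a_star - g) ** 2)

def split_gain(x_j, g, w, c):
    """S(leaf_left) + S(leaf_right) - S(leaf_before_split) for border c."""
    left = x_j <= c
    right = ~left
    if not left.any() or not right.any():
        return -np.inf
    return (leaf_score(g[left], w[left]) + leaf_score(g[right], w[right])
            - leaf_score(g, w))

def best_split_per_leaf(leaves, candidates):
    """Depthwise/Lossguide: each leaf picks its own best (j, c)."""
    chosen = []
    for X_l, g_l, w_l in leaves:
        gains = [split_gain(X_l[:, j], g_l, w_l, c) for j, c in candidates]
        chosen.append(candidates[int(np.argmax(gains))])
    return chosen

def best_shared_split(leaves, candidates):
    """SymmetricTree: one (j, c) shared by all leaves, summed gain maximized."""
    totals = [sum(split_gain(X_l[:, j], g_l, w_l, c) for X_l, g_l, w_l in leaves)
              for j, c in candidates]
    return candidates[int(np.argmax(totals))]

rng = np.random.default_rng(2)
leaves = [(rng.normal(size=(50, 2)), rng.normal(size=50), np.ones(50))
          for _ in range(2)]
candidates = [(j, c) for j in range(2) for c in (-0.5, 0.0, 0.5)]
print(best_split_per_leaf(leaves, candidates))   # Depthwise/Lossguide choice
print(best_shared_split(leaves, candidates))     # SymmetricTree choice
```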
The summation is over $i$ such that the object $x_i$ falls into the considered leaf. These optimal values $\phi_{leaf}$ can then be used instead of the weighted averages of the gradients ($a^*_{left}$ and $a^*_{right}$ in the example above) in the same score functions.
CatBoost provides the following methods to affect the score with penalties:
Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model.
Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are split by the current candidate and for which the feature is used for the first time.
The final score is calculated as follows (a small sketch of this computation is given after the definitions below): $Score' = Score \cdot \prod_{f \in S} W_f - \sum_{f \in S} P_f \cdot U(f) - \sum_{f \in S} \sum_{x \in L} EP_f \cdot U(f, x)$, where:
$W_f$ is the feature weight
$P_f$ is the per-feature penalty
$EP_f$ is the per-object penalty
$S$ is the current split
$L$ is the current leaf
$U(f) = \begin{cases} 0, & \text{if } f \text{ was already used in the model} \\ 1, & \text{otherwise} \end{cases}$
$U(f, x) = \begin{cases} 0, & \text{if } f \text{ was already used for object } x \\ 1, & \text{otherwise} \end{cases}$
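A minimal sketch of this computation, with illustrative argument names and data structures (this is not CatBoost's internal API):

```python
def penalized_score(score, split_features, used_in_model, used_for_object,
                    feature_weight, feature_penalty, object_penalty,
                    objects_in_leaf):
    """Score' = Score * prod_{f in S} W_f
                - sum_{f in S} P_f * U(f)
                - sum_{f in S} sum_{x in L} EP_f * U(f, x)."""
    result = score
    for f in split_features:
        result *= feature_weight.get(f, 1.0)           # W_f
    for f in split_features:
        if f not in used_in_model:                     # U(f) == 1
            result -= feature_penalty.get(f, 0.0)      # P_f
        for x in objects_in_leaf:
            if f not in used_for_object.get(x, set()): # U(f, x) == 1
                result -= object_penalty.get(f, 0.0)   # EP_f
    return result

print(penalized_score(
    score=10.0,
    split_features=["f1"],
    used_in_model=set(),                    # f1 not in the model yet
    used_for_object={"obj1": {"f1"}},       # f1 already used for obj1 only
    feature_weight={"f1": 0.9},
    feature_penalty={"f1": 2.0},
    object_penalty={"f1": 0.1},
    objects_in_leaf=["obj1", "obj2"],
))  # 10.0 * 0.9 - 2.0 - 0.1 = 6.9
```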