Score functions

The common approach to solving supervised learning tasks is to minimize the loss function L:

L\left(f(x), y\right) = \sum\limits_{i} w_{i} \cdot l\left(f(x_{i}), y_{i}\right) + J(f) \text{, where}

  • l\left(f(x), y\right) is the value of the loss function at the point (x, y)
  • w_{i} is the weight of the i-th object
  • J(f) is the regularization term.

For example, these formulas take the following form for linear regression:

  • l\left(f(x), y\right) = \left((\theta, x) - y\right)^{2} (mean squared error)
  • J(f) = \lambda \lVert \theta \rVert_{2}^{2} (L2 regularization)

Gradient boosting

Boosting is a method which builds a prediction model F^{T} as an ensemble of weak learners: F^{T} = \sum\limits_{t=1}^{T} f^{t}.

In our case, f^{t} is a decision tree. Trees are built sequentially, and each next tree is built to approximate the negative gradients g_{i} of the loss function l at the predictions of the current ensemble:
g_{i} = -\frac{\partial l(a, y_{i})}{\partial a} \Bigr|_{a = F^{T-1}(x_{i})}
Thus, boosting performs a gradient descent optimization of the function L. The quality of the gradient approximation is measured by a score function Score(a, g) = S(a, g).
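For instance, with the squared-error loss l(a, y) = (a - y)^{2}, the negative gradient at the current prediction is 2\left(y_{i} - F^{T-1}(x_{i})\right). A minimal NumPy sketch (illustrative only, not CatBoost code):

```python
import numpy as np

def negative_gradients_mse(y_true, current_predictions):
    """Negative gradients of the squared-error loss l(a, y) = (a - y)^2,
    evaluated at the current ensemble predictions F^{T-1}(x_i)."""
    y = np.asarray(y_true, dtype=float)
    a = np.asarray(current_predictions, dtype=float)
    # dl/da = 2 * (a - y), so the negative gradient is 2 * (y - a).
    return 2.0 * (y - a)

# The next tree f^t is fitted to approximate these values.
g = negative_gradients_mse([1.0, 0.0, 2.0], [0.4, 0.1, 1.5])
```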

Types of score functions

Suppose that a new tree has to be added to the ensemble. A score function is required in order to choose between candidate trees. Given a candidate tree f, let a_{i} denote f(x_{i}), w_{i} the weight of the i-th object, and g_{i} the corresponding gradient of l. Let's consider the following score functions (a short illustrative sketch follows the list):

  • L2 = -\sum\limits_{i} w_{i} \cdot (a_{i} - g_{i})^{2}
  • Cosine = \displaystyle\frac{\sum w_{i} \cdot a_{i} \cdot g_{i}}{\sqrt{\sum w_{i}a_{i}^{2}} \cdot \sqrt{\sum w_{i}g_{i}^{2}}}
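Both score functions can be written down directly from the formulas above. A short illustrative sketch (not the library implementation), where a, g, and w are arrays of leaf values a_{i}, gradients g_{i}, and weights w_{i}:

```python
import numpy as np

def l2_score(a, g, w):
    """L2 score: the negative weighted squared distance between predictions and gradients."""
    a, g, w = (np.asarray(v, dtype=float) for v in (a, g, w))
    return -np.sum(w * (a - g) ** 2)

def cosine_score(a, g, w):
    """Cosine score: the weighted cosine similarity between predictions and gradients."""
    a, g, w = (np.asarray(v, dtype=float) for v in (a, g, w))
    return np.sum(w * a * g) / (np.sqrt(np.sum(w * a ** 2)) * np.sqrt(np.sum(w * g ** 2)))
```

In both cases a larger value means a better approximation of the gradients.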

Finding the optimal tree structure

Suppose that it is required to find the structure of a tree f of depth 1. The structure of such a tree is determined by the index j of some feature and a border value c. Let x_{i,j} be the value of the j-th feature on the i-th object, and let a_{left} and a_{right} be the values at the leaves of f. Then f(x_{i}) equals a_{left} if x_{i,j} \leq c and a_{right} if x_{i,j} > c. Now the goal is to find the best j and c in terms of the chosen score function.

For the L2 score function the formula takes the following form:

S(a, g) = -\sum\limits_{i} w_{i} (a_{i} - g_{i})^{2} = -\left( \displaystyle\sum\limits_{i: x_{i,j} \leq c} w_{i}(a_{left} - g_{i})^{2} + \sum\limits_{i: x_{i,j} > c} w_{i}(a_{right} - g_{i})^{2} \right)

Let's denote W_{left} = \displaystyle\sum_{i: x_{i,j} \leq c} w_{i} and W_{right} = \displaystyle\sum_{i: x_{i,j} > c} w_{i}.

The optimal values for alefta_{left} and arighta_{right} are the weighted averages:

  • a^{*}_{left} = \displaystyle\frac{\sum\limits_{i: x_{i,j} \leq c} w_{i} g_{i}}{W_{left}}
  • a^{*}_{right} = \displaystyle\frac{\sum\limits_{i: x_{i,j} > c} w_{i} g_{i}}{W_{right}}

After expanding the brackets and removing terms that are constant in the optimization:

j^{*}, c^{*} = argmax_{j, c} \left( W_{left} \cdot (a^{*}_{left})^{2} + W_{right} \cdot (a^{*}_{right})^{2} \right)

The latter argmax can be calculated by a brute-force search over all candidate pairs (j, c), as in the sketch below.
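A brute-force search for a depth-1 tree under the L2 score might look as follows (a sketch under the assumption that X is a dense NumPy matrix of feature values; not the library implementation):

```python
import numpy as np

def best_depth1_split(X, g, w):
    """Brute-force search for the best (j, c) pair of a depth-1 tree under the L2 score.

    X: (n_objects, n_features) matrix of feature values x_{i,j}.
    g: gradients g_i, w: object weights w_i.
    Returns the best feature index, border value, and score.
    """
    X, g, w = (np.asarray(v, dtype=float) for v in (X, g, w))
    best_j, best_c, best_score = None, None, -np.inf
    for j in range(X.shape[1]):
        # Every distinct feature value except the largest is a candidate border.
        for c in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= c
            W_left, W_right = w[left].sum(), w[~left].sum()
            a_left = np.sum(w[left] * g[left]) / W_left      # optimal left leaf value
            a_right = np.sum(w[~left] * g[~left]) / W_right  # optimal right leaf value
            score = W_left * a_left ** 2 + W_right * a_right ** 2
            if score > best_score:
                best_j, best_c, best_score = j, c, score
    return best_j, best_c, best_score
```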

The situation is slightly more complex when the tree depth is greater than 1:

  • L2 score function: S is converted into a sum over leaves: S(a, g) = \sum_{leaf} S(a_{leaf}, g_{leaf}). The next step is to find j^{*}, c^{*} = argmax_{j,c} S(\bar a, g), where \bar a are the optimal values in the leaves after the (j, c) split.
  • Depthwise and Lossguide methods: j and c are sets \{j_k\}, \{c_k\}, where k is the index of the leaf, so the score function S takes the following form: S(\bar a, g) = \sum_{l \in leaves} S(\bar a(j_l, c_l), g_l). Since S(leaf) is a convex function, the splits j_{k_1}, c_{k_1} and j_{k_2}, c_{k_2} for different leaves can be searched separately by finding, for each leaf, the optimal j^{*}, c^{*} = argmax_{j,c}\{S(leaf_{left}) + S(leaf_{right}) - S(leaf_{before\_split})\} (a sketch of this per-leaf gain follows the list).
  • SymmetricTree method: the same j and c are searched for all leaves at once, so the total sum over all leaves S(a, g) = \sum_{leaf} S(leaf) has to be optimized.
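For the L2 score with the optimal leaf value substituted, S(leaf) simplifies (up to a constant that cancels in the difference) to W_{leaf} \cdot (a^{*}_{leaf})^{2} = \left(\sum w_{i} g_{i}\right)^{2} / W_{leaf}, so the per-leaf gain used by the Depthwise and Lossguide policies can be sketched as follows (illustrative only):

```python
import numpy as np

def simplified_leaf_score(g, w):
    """L2 score of a leaf with the optimal value substituted, up to a constant:
    W_leaf * (a*_leaf)^2 = (sum w_i * g_i)^2 / W_leaf."""
    g, w = np.asarray(g, dtype=float), np.asarray(w, dtype=float)
    return np.sum(w * g) ** 2 / np.sum(w)

def split_gain(g_left, w_left, g_right, w_right):
    """Gain of splitting a leaf: S(leaf_left) + S(leaf_right) - S(leaf_before_split).
    The constant part of the L2 score cancels out in this difference."""
    return (simplified_leaf_score(g_left, w_left)
            + simplified_leaf_score(g_right, w_right)
            - simplified_leaf_score(np.concatenate([g_left, g_right]),
                                    np.concatenate([w_left, w_right])))
```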

Second-order leaf estimation method

Let's apply the Taylor expansion to the loss function at the point a^{t-1} = F^{t-1}(x):

L(a^{t-1}_{i} + \phi, y) \approx \displaystyle\sum w_{i} \left[ l_{i} + l^{'}_{i} \phi + \frac{1}{2} l^{''}_{i} \phi^{2} \right] + \frac{1}{2} \lambda \lVert \phi \rVert_{2}^{2} \text{, where:}

  • l_{i} = l(a^{t-1}_{i}, y_{i})
  • l^{'}_{i} = \frac{\partial l(a, y_{i})}{\partial a}\Bigr|_{a=a^{t-1}_{i}}
  • l^{''}_{i} = \frac{\partial^{2} l(a, y_{i})}{\partial a^{2}}\Bigr|_{a=a^{t-1}_{i}}
  • \lambda is the L2 regularization parameter

Since the first term is constant in the optimization, the formula takes the following form after regrouping by leaves:

\sum\limits_{leaf=1}^{L} \left( \sum\limits_{i \in leaf} w_{i} \left[ l_{i} + l^{'}_{i} \phi_{leaf} + \frac{1}{2} l^{''}_{i} \phi^{2}_{leaf} \right] + \frac{1}{2} \lambda \phi_{leaf}^{2} \right) \to \min

Then let's minimize this expression for each leaf independently:

\sum\limits_{i \in leaf} w_{i} \left[ l_{i} + l^{'}_{i} \phi_{leaf} + \frac{1}{2} l^{''}_{i} \phi^{2}_{leaf} \right] + \frac{1}{2} \lambda \phi_{leaf}^{2} \to \min

Differentiating with respect to the leaf value \phi_{leaf} and setting the derivative to zero:

\sum\limits_{i \in leaf} w_{i} \left[ l^{'}_{i} + l^{''}_{i} \phi_{leaf} \right] + \lambda \phi_{leaf} = 0

So, the optimal value of \phi_{leaf} is:

\phi^{*}_{leaf} = -\displaystyle\frac{\sum_{i} w_{i} l^{'}_{i}}{\sum_{i} w_{i} l^{''}_{i} + \lambda}

The summation is over the objects i that fall into the considered leaf. These optimal values of \phi_{leaf} can then be used instead of the weighted averages of the gradients (a^{*}_{left} and a^{*}_{right} in the example above) in the same score functions.
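As an illustration, the second-order leaf value can be computed directly from the first and second derivatives. The sketch below also shows the Logloss case, where l^{'}_{i} = p_{i} - y_{i} and l^{''}_{i} = p_{i}(1 - p_{i}) for the current predicted probability p_{i}; the data and the regularization value are placeholders, and this is not CatBoost code:

```python
import numpy as np

def newton_leaf_value(l1, l2, w, l2_reg):
    """Second-order leaf value: phi* = -(sum w_i * l'_i) / (sum w_i * l''_i + lambda).

    l1, l2: first and second derivatives of the loss at the current predictions
    for the objects in the leaf; w: object weights; l2_reg: lambda.
    """
    l1, l2, w = (np.asarray(v, dtype=float) for v in (l1, l2, w))
    return -np.sum(w * l1) / (np.sum(w * l2) + l2_reg)

# Example with Logloss: l'_i = p_i - y_i, l''_i = p_i * (1 - p_i),
# where p_i is the probability predicted by the current ensemble.
p = np.array([0.3, 0.8, 0.6])
y = np.array([0.0, 1.0, 1.0])
w = np.ones_like(p)
phi = newton_leaf_value(p - y, p * (1 - p), w, l2_reg=3.0)
```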

CatBoost score functions

CatBoost provides the following score functions:
  • L2: uses the first derivatives during the calculation.
  • Cosine: uses the first derivatives during the calculation. Cannot be used with the Lossguide tree growing policy.
  • NewtonL2: uses the second derivatives during the calculation. This may improve the resulting quality of the model.
  • NewtonCosine: uses the second derivatives during the calculation. This may improve the resulting quality of the model. Cannot be used with the Lossguide tree growing policy.

Per-object and per-feature penalties

CatBoost provides the following methods to affect the score with penalties:

  • Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model.

  • Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time.

The final score is calculated as follows (a sketch of this formula follows the definitions below):
Score' = Score \cdot \prod_{f \in S} W_{f} - \sum_{f \in S} P_{f} \cdot U(f) - \sum_{f \in S} \sum_{x \in L} EP_{f} \cdot U(f, x)

  • W_{f} is the feature weight
  • P_{f} is the per-feature penalty
  • EP_{f} is the per-object penalty
  • S is the current split
  • L is the current leaf
  • U(f) = \begin{cases} 0, & \text{if } f \text{ was already used in the model}\\ 1, & \text{otherwise} \end{cases}
  • U(f, x) = \begin{cases} 0, & \text{if } f \text{ was already used for object } x\\ 1, & \text{otherwise} \end{cases}
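The formula can be transcribed directly. The sketch below uses illustrative names and plain Python containers rather than the library's internals:

```python
def adjusted_score(score, split_features, leaf_objects,
                   feature_weight, feature_penalty, object_penalty,
                   used_in_model, used_for_object):
    """Apply feature weights and penalties to a raw split score.

    split_features: the features f in the current split S.
    leaf_objects: the objects x in the current leaf L.
    feature_weight, feature_penalty, object_penalty: dicts with W_f, P_f, EP_f.
    used_in_model: set of features already used in the model (U(f) = 0 for them).
    used_for_object: set of (f, x) pairs already used (U(f, x) = 0 for them).
    """
    result = score
    for f in split_features:
        result *= feature_weight[f]          # Score * prod W_f
    for f in split_features:
        if f not in used_in_model:
            result -= feature_penalty[f]     # - sum P_f * U(f)
        for x in leaf_objects:
            if (f, x) not in used_for_object:
                result -= object_penalty[f]  # - sum sum EP_f * U(f, x)
    return result
```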

Usage

Use the corresponding parameter to set the score function during the training:

Alert

The supported score functions vary depending on the processing unit type:

  • GPU — All score types

  • CPU — Cosine, L2

Python package: score_function

R package: score_function

Command-line interface: --score-function

Description

The score type used to select the next split during the tree construction.

Possible values:

  • Cosine (do not use this score type with the Lossguide tree growing policy)
  • L2
  • NewtonCosine (do not use this score type with the Lossguide tree growing policy)
  • NewtonL2
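For example, with the Python package the score function is passed as a training parameter. A minimal sketch: the dataset and the other parameter values are placeholders, and task_type='GPU' is used because the Newton* score functions are listed above as GPU-only:

```python
from catboost import CatBoostRegressor

# Illustrative example: train with the NewtonL2 score function.
model = CatBoostRegressor(
    iterations=100,
    depth=6,
    score_function='NewtonL2',  # one of: L2, Cosine, NewtonL2, NewtonCosine
    task_type='GPU',            # all score types are supported on GPU
    verbose=False,
)
# model.fit(X_train, y_train)  # X_train, y_train are placeholders for your data
```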