
8  Extreme Gradient Boosting

Extreme gradient boosting (XGB) has been extremely popular due to its superb performance (Chen and Guestrin 2016). It is a variant of gradient boosting and follows the same steps as the gradient boosting covered in Chapter 7. However, it has its own way of building a tree, one that is more mindful of avoiding over-fitting.

Note
  1. Set $f_0(X_i) = \frac{\sum_{i=1}^N y_i}{N}$ for all $i = 1, \dots, N$
  2. For $b$ = 1 to $B$,
     i. For $i = 1, \dots, N$, calculate $r_{i,b} = y_i - f_{b-1}(X_i)$
     ii. Build a tree the XGB way
     iii. Update $f_{b-1}(X_i)$ to $f_b(X_i)$
  3. Finally, set $\hat{f}(X_i) = f_B(X_i)$

XGB can be implemented with the XGBoost package (available for both R and Python). The package has several different ways of building trees (Step 2.ii). While we discuss only one of the available algorithms in this chapter, it is enough to convey the general idea of XGB.

See the XGBoost documentation for the various tree-building algorithms.
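
For instance, the tree construction algorithm can be selected through the tree_method parameter (e.g., "exact", "approx", "hist"). Below is a minimal sketch of passing it via params; the toy data here is made up purely to illustrate the interface (which is covered in more detail in the Implementation section), and is not part of the chapter's example.

library(xgboost)

#=== a made-up toy dataset, just to have something to train on ===#
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
y <- X[, 1] + rnorm(100)

#=== tree_method selects the tree-building algorithm ===#
fit <- xgb.train(
  params = list(objective = "reg:squarederror", tree_method = "hist", eta = 0.3),
  data = xgb.DMatrix(data = X, label = y),
  nrounds = 10
)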

8.1 Tree updating in XGB (general case)

Let $f_{i,b}(x_i)$ be the prediction for the $i$th observation at the $b$-th iteration. Further, let $I_j$ denote the set of observations that belong to leaf $j$ ($j = 1, \dots, J$) of the tree that is built at Step 2.ii. Then, at Step 2.iii, $f_{i,b-1}(x_i)$ is updated to $f_{i,b}(x_i)$ according to the following:

$$f_{i,b}(x_i) = f_{i,b-1}(x_i) + \eta \, w_{j(x_i)}$$

Ideally, we would find $w_{j(x_i)}$ such that it minimizes the following objective:

$$\Psi = \sum_{i=1}^N \left[L\left(y_i, f_{i,b-1}(x_i) + w_{j(x_i)}\right)\right] + \Omega(w_j) \tag{8.1}$$

where $L(\cdot)$ is the user-specified loss function, which is assumed to be differentiable, and $\Omega(w_j)$ is the regularization term. However, solving this minimization problem exactly can be too complicated depending on the loss function specification. Instead of Equation 8.1, XGB uses the second-order Taylor expansion of $L(\cdot)$ in $w_{j(x_i)}$.

$$\tilde{\Psi} = \sum_{i=1}^N \left[L\left(y_i, f_{i,b-1}(x_i)\right) + g_i w_{j(x_i)} + \frac{1}{2} h_i w_{j(x_i)}^2\right] + \Omega(w_j) \tag{8.2}$$

where $g_i = \frac{\partial L(y_i, p_i)}{\partial p_i}$ (first-order derivative) and $h_i = \frac{\partial^2 L(y_i, p_i)}{\partial p_i^2}$ (second-order derivative), both evaluated at $p_i = f_{i,b-1}(x_i)$. Since $L\left(y_i, f_{i,b-1}(x_i)\right)$ is just a constant, we can safely remove it from the objective function, which leads to

$$\tilde{\Psi}_t = \sum_{i=1}^N \left[g_i w_{j(x_i)} + \frac{1}{2} h_i w_{j(x_i)}^2\right] + \Omega(w_j) \tag{8.3}$$

Suppose the $L_2$ norm is used and $\Omega(w_j) = \gamma J + \frac{1}{2}\lambda \sum_{j=1}^J w_j^2$, where $J$ is the number of leaves of the tree.

Recall that $I_j$ denotes the set of observations that belong to leaf $j$ ($j = 1, \dots, J$). Then, Equation 8.3 can be written as follows:

$$\tilde{\Psi}_t = \sum_{j=1}^J \left[\left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2\right] + \gamma J \tag{8.4}$$

Remember that all the observations in the same leaf share the same prediction update. So, for all observations $i$ that belong to leaf $j$, the prediction update is denoted as $w_j$ in Equation 8.4. That is, $w_{j(x_i)} = w_j$ for any observation $i$ in leaf $j$.

For a given tree structure (denoted as q(x)), the leaves can be treated independently in minimizing this objective.

Taking the derivative of $\tilde{\Psi}_t$ with respect to $w_j$ and setting it to zero,

$$\left(\sum_{i \in I_j} g_i\right) + \left(\sum_{i \in I_j} h_i + \lambda\right) w_j = 0 \;\Longrightarrow\; w_j = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda} \tag{8.5}$$

The minimized value of $\tilde{\Psi}_t$ is then (obtained by plugging $w_j$ into Equation 8.4)

$$
\begin{aligned}
\tilde{\Psi}_t(q) &= \sum_{j=1}^J \left[\left(\sum_{i \in I_j} g_i\right)\left(-\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}\right) + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right)\left(\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}\right)^2\right] + \gamma J \\
&= \sum_{j=1}^J \left[-\frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \frac{1}{2}\frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda}\right] + \gamma J \\
&= -\frac{1}{2}\sum_{j=1}^J \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma J
\end{aligned} \tag{8.6}
$$

For notational convenience, we call $\frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda}$ the quality score for leaf $j$ and denote it by $Q_j$.

We could find the best tree structure by finding the $w_j(q)$ that minimize Equation 8.4 (given by Equation 8.5) and calculating $\tilde{\Psi}_t(q)$ according to Equation 8.6 for every possible tree structure, and then picking the tree structure $q(x)$ that has the lowest $\tilde{\Psi}_t(q)$.

However, it is impossible in practice to consider all possible tree structures. So, a greedy (myopic) approach that starts from a single leaf and iteratively splits leaves is used instead.

Consider splitting an existing leaf $p$ (wherever it may be located in the tree) into two leaves $L$ and $R$ according to splitting rule $s$ when there are $J$ existing leaves. Then, the resulting minimized objective is

$$-\frac{1}{2}\left[Q_L(s) + Q_R(s) + \Gamma\right] + \gamma (J+1)$$

where $\Gamma$ is the sum of the quality scores of all the leaves other than $L$ and $R$:

$$\Gamma = \sum_{j \notin \{L, R\}} Q_j$$

The minimized objective before splitting is

$$-\frac{1}{2}\left[Q_p + \Gamma\right] + \gamma J$$

So, the reduction in loss after the split is

$$G(s, p) = \frac{1}{2}\left[Q_L(s) + Q_R(s) - Q_p\right] - \gamma$$

Let’s call $G(s, p)$ simply the gain of splitting parent node $p$ using splitting rule $s$.

A more positive value of the gain $G(s, p)$ means a more successful split.

We calculate the gain for all the possible splits of p and pick the split that has the highest gain.

Different patterns of $I_L$ and $I_R$ arise from different variable-cutpoint combinations.

If the highest gain is negative, then leaf p is not split.
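
As a quick summary of the general case, here is a minimal R sketch (not the xgboost implementation; the function names are made up for this illustration) of the leaf weight (Equation 8.5), the quality score, and the gain of a candidate split, computed from the gradients g and hessians h of the observations in the parent leaf and a logical vector in_left indicating which observations go to the left child.

#=== leaf weight (Equation 8.5) and quality score for one leaf ===#
leaf_weight   <- function(g, h, lambda) - sum(g) / (sum(h) + lambda)
quality_score <- function(g, h, lambda) sum(g)^2 / (sum(h) + lambda)

#=== gain of splitting a parent leaf into left/right children ===#
split_gain <- function(g, h, in_left, lambda, gamma) {
  q_p <- quality_score(g, h, lambda) # parent leaf
  q_l <- quality_score(g[in_left], h[in_left], lambda) # candidate left leaf
  q_r <- quality_score(g[!in_left], h[!in_left], lambda) # candidate right leaf
  (q_l + q_r - q_p) / 2 - gamma
}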

Once the best tree is built, we update our prediction based on the $w_j$ values of the terminal nodes (leaves) of the tree. For observation $i$ that belongs to leaf $j$ of the tree,

$$f_{i,b} = f_{i,b-1} + \eta w_j \tag{8.7}$$

where η is the learning rate.

8.2 Tree updating in XGB (regression)

We now make the general tree-updating algorithm specific to regression problems, where the loss function is squared error: $L(y_i, p_i) = \frac{1}{2}(y_i - p_i)^2$, where $p_i$ is the predicted value for observation $i$.

First, let’s find $g_i$ and $h_i$ for $L(y_i, p_i) = \frac{1}{2}(y_i - p_i)^2$.

$$g_i = \frac{\partial L(y_i, p_i)}{\partial p_i} = -(y_i - p_i) \tag{8.8}$$

$$h_i = \frac{\partial^2 L(y_i, p_i)}{\partial p_i^2} = 1 \tag{8.9}$$

So, $g_i$ is simply the negative of the residual for observation $i$.
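
As a quick numerical sanity check of Equation 8.8 and Equation 8.9 (the values of y and p below are made up):

#=== squared-error loss ===#
L_se <- function(y, p) (y - p)^2 / 2

y <- 3; p <- 1; eps <- 1e-4
#=== first derivative: should be -(y - p) = -2 ===#
(L_se(y, p + eps) - L_se(y, p - eps)) / (2 * eps)
#=== second derivative: should be 1 ===#
(L_se(y, p + eps) - 2 * L_se(y, p) + L_se(y, p - eps)) / eps^2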

Now, suppose you are at iteration $b$ and the prediction for observation $i$ carried over from the previous iteration is $f_{i,b-1}(x_i)$. Further, let $r_{i,b}$ denote the residual $y_i - f_{i,b-1}(x_i)$.

Given this, here is how a tree is built at iteration b.

Note

Steps 2.ii and 2.iii of the XGB algorithm:

  1. For a given splitting rule $s$ among all the possible splits of parent node $p$ (denote the resulting leaves by $L$ and $R$), calculate the quality score ($Q$) and then the gain ($G$):

$$Q_j(s) = \frac{\left(\sum_{i \in I_j(s)} r_{i,b}\right)^2}{N_j + \lambda}, \quad j \in \{L, R\}$$

$$G(s) = \frac{1}{2}\left[Q_L(s) + Q_R(s) - Q_p\right] - \gamma$$

where $Q_p$ is the quality score of the parent node.

  2. Find the splitting rule that produces the highest gain. If the highest gain is positive, split the parent node according to the associated splitting rule; otherwise, do not split the parent node.

  3. Repeat steps 1 and 2 until no further splits are possible.

  4. For each leaf $j$, calculate the prediction update $w_j$ for the observations that belong to it (plugging Equation 8.9 and Equation 8.8 into Equation 8.5):

$$w_j = \frac{\sum_{i \in I_j} r_{i,b}}{\sum_{i \in I_j} 1 + \lambda} = \frac{\sum_{i \in I_j} r_{i,b}}{N_j + \lambda}$$

  5. Update the prediction for each observation according to

$$f_{i,b} = f_{i,b-1} + \eta w_j$$

That is, for a given leaf $j$, the prediction update ($w_j$) is the sum of the residuals of all the observations in leaf $j$ divided by the number of observations in leaf $j$ plus $\lambda$. When $\lambda = 0$, $w_j$ is simply the mean of the residuals in the leaf.
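
A quick numeric check of this statement (using made-up residuals for a hypothetical leaf):

resid_j <- c(-2, 1, 4) # residuals of the observations in a hypothetical leaf j
#=== with lambda = 0, w_j equals the mean residual ===#
sum(resid_j) / (length(resid_j) + 0) == mean(resid_j) # TRUE
#=== with lambda = 2, w_j is shrunk toward zero ===#
sum(resid_j) / (length(resid_j) + 2) # 0.6, smaller than the mean of 1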

8.3 Illustration of XGB for regression

Packages to load for replication

library(tidyverse)
library(data.table)

In order to further our understanding of the entire XGB algorithm, let’s take a look at a simple regression problem as an illustration. We consider the following four-observation dataset:

(
data <-
  data.table(
    y = c(-3, 7, 8, 12),
    x = c(1, 4, 6, 8)
  )
)
    y x
1: -3 1
2:  7 4
3:  8 6
4: 12 8
Code
(
g_0 <-
  ggplot(data) +
  geom_point(aes(y = y, x = x)) +
  theme_bw()
)

The first step ($b = 0$) is to make an initial prediction. This can be any number, but let’s use the mean of y and set it as the predicted value for all the observations.

(
f_0 <- mean(data$y) # f_0: the predicted value for all the observations
)
[1] 6

Let’s set γ, λ, and η to 10, 1, and 0.3, respectively.

gamma <- 10
lambda <- 1
eta <- 0.3

We have a single-leaf tree at the moment. And the quality score for this leaf is

The quality score for leaf $j$ is $\frac{\left(\sum_{i \in I_j} r_{i,b}\right)^2}{N_j + \lambda}$.

#=== get residuals ===#
data[, resid := y - f_0]

#=== get quality score ===#
(
q_0 <- (sum(data$resid))^2/(nrow(data) + lambda)
)
[1] 0

Quality score of the leaf is 0.

Since we are using the mean of y as the prediction, of course, the sum of the residuals is zero, which then means that the quality score is zero.

Now, we have three potential split patterns: {x, 2}, {x, 5}, {x, 7}.

{x, 2} means the leaf is split into two leaves: $\{x \,|\, x < 2\}$ and $\{x \,|\, x \geq 2\}$. Note that any cutpoint between 1 and 4 results in the same split.
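
One common way to enumerate the candidate cutpoints (a sketch of the idea, not necessarily how any particular implementation does it) is to take the midpoints of consecutive sorted unique values of x. Any cutpoint inside the same interval produces the identical partition, so 2.5, 5, and 7 would be equivalent to the 2, 5, and 7 used below.

x_sorted <- sort(unique(data$x))
(candidate_cutpoints <- (head(x_sorted, -1) + tail(x_sorted, -1)) / 2)
[1] 2.5 5.0 7.0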

Let’s consider them one by one.

8.3.0.1 Split: {x, 2}

Here is the graphical representation of the split:

Code
g_0 +
  geom_vline(xintercept = 2, color = "red") +
  annotate("text", x = 1.25, y = 6, label = "leaf L", color = "red") +
  annotate("text", x = 5, y = 6, label = "leaf R", color = "red") +
  theme_bw()

Code
DiagrammeR::grViz(
"
digraph {
  graph [ranksep = 0.2]
  node [shape = box]
    T1R [label = 'L: -9']
    T1L [label = 'R: 1 , 2 , 6']
    T0 [label = '-9, 1 , 2 , 6']
  edge [minlen = 2]
    T0->T1L
    T0->T1R
  { rank = same; T1R; T1L}
}
"
)

Let’s split the data.

#=== leaf L ===#
(
data_L_1 <- data[x < 2, ]
)
    y x resid
1: -3 1    -9
#=== leaf R ===#
(
data_R_1 <- data[x >= 2, ]
)
    y x resid
1:  7 4     1
2:  8 6     2
3: 12 8     6

Using the formula for $w_j$ derived above for the regression case,

$$w_j = \frac{\sum_{i \in I_j} r_{i,b}}{N_j + \lambda}$$

w_L <- (sum(data_L_1$resid))/(nrow(data_L_1) + lambda)
w_R <- (sum(data_R_1$resid))/(nrow(data_R_1) + lambda)

$$w_L = \frac{-9}{1+1} = -4.5 \qquad w_R = \frac{1+2+6}{3+1} = 2.25$$

Using the quality score formula for the regression case, the quality scores for the leaves are

$$Q_j = \frac{\left(\sum_{i \in I_j} r_{i,b}\right)^2}{N_j + \lambda}$$

q_L <- (sum(data_L_1$resid))^2/(nrow(data_L_1) + lambda)
q_R <- (sum(data_R_1$resid))^2/(nrow(data_R_1) + lambda)
Code
DiagrammeR::grViz(
  paste0(
  "
  digraph {
    graph [ranksep = 0.2]
    node [shape = box]
      T1R [label = 'L: -9 \n Q score = ", round(q_L, digits = 2), "']
      T1L [label = 'R: 1 , 2 , 6 \n Q score = ", round(q_R, digits = 2), "']
      T0 [label = '-9, 1 , 2 , 6']
    edge [minlen = 2]
      T0->T1L
      T0->T1R
    { rank = same; T1R; T1L}
  }
  "
  )

)

$$Q_L = \frac{(-9)^2}{1+1} = 40.5 \qquad Q_R = \frac{(1+2+6)^2}{3+1} = 20.25$$

Notice that the residuals are first summed and then squared in the numerator of the quality score (the higher, the better). This means that if the predictions are off in the same direction (meaning the observations are similar) within the leaf, the quality score is higher. On the other hand, if the predictions are off in both directions (meaning the observations are not similar), the residuals cancel each other out, resulting in a lower quality score. Since we would like to create leaves consisting of similar observations, a more successful split has a higher quality score.
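
A tiny numeric illustration of this point (two made-up residual pairs, with λ = 1):

#=== residuals off in the same direction: high quality score ===#
sum(c(3, 3))^2 / (2 + 1) # 12
#=== residuals that cancel each other out: low quality score ===#
sum(c(-3, 3))^2 / (2 + 1) # 0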

Finally, the gain of this split is

$$G(s, p) = \frac{1}{2}\left[Q_L + Q_R - Q_p\right] - \gamma$$

where $p$ is the leaf before the split (the parent) and $L$ and $R$ are the leaves created by splitting $p$ according to splitting rule $s$.

gain_1 <- (q_L + q_R - q_0)/2 - gamma

$$G_1 = \frac{40.5 + 20.25 - 0}{2} - 10 = 20.375$$

Now that we have gone through the process of finding the update value ($w$), quality score ($Q$), and gain ($G$) for a given split, let’s write a function that returns these measures for a given cutpoint before moving on to the next split candidate.

get_info <- function(data, cutpoint, lambda, gamma)
{
  #=== quality score of the parent leaf ===#
  q_0 <- (sum(data$resid))^2/(nrow(data) + lambda)

  #=== split the data at the cutpoint ===#
  data_L <- data[x < cutpoint, ]
  data_R <- data[x >= cutpoint, ]

  #=== leaf weights (prediction updates) ===#
  w_L <- (sum(data_L$resid))/(nrow(data_L) + lambda)
  w_R <- (sum(data_R$resid))/(nrow(data_R) + lambda)

  #=== quality scores of the new leaves ===#
  q_L <- (sum(data_L$resid))^2/(nrow(data_L) + lambda)
  q_R <- (sum(data_R$resid))^2/(nrow(data_R) + lambda)

  #=== gain of the split ===#
  gain <- (q_L + q_R - q_0)/2 - gamma

  return(list(
    w_L = w_L, 
    w_R = w_R, 
    q_L = q_L, 
    q_R = q_R, 
    gain = gain 
  ))
}

8.3.0.2 Split: {x, 5}

measures_2 <- get_info(data, 5, lambda, gamma)
Code
g_0 +
  geom_vline(xintercept = 5, color = "red") +
  annotate("text", x = 3, y = 6, label = "leaf L", color = "red") +
  annotate("text", x = 7, y = 6, label = "leaf R", color = "red") +
  theme_bw()

Code
DiagrammeR::grViz(
  paste0(
    "
    digraph {
      graph [ranksep = 0.2]
      node [shape = box]
        T1R [label = 'L: -9, 1 \n Q score = ", round(measures_2$q_L, digits = 2), "']
        T1L [label = 'R: 2 , 6 \n Q score = ", round(measures_2$q_R, digits = 2), "']
        T0 [label = '-9, 1 , 2 , 6']
      edge [minlen = 2]
        T0->T1L
        T0->T1R
      { rank = same; T1R; T1L}
    }
    "
  )
)

$$Q_L = \frac{(-9+1)^2}{2+1} = 21.33 \qquad Q_R = \frac{(2+6)^2}{2+1} = 21.33$$

$$G_2 = \frac{21.33 + 21.33 - 0}{2} - 10 = 11.33$$

8.3.0.3 Split: {x, 7}

measures_3 <- get_info(data, 7, lambda, gamma)
Code
g_0 +
  geom_vline(xintercept = 7, color = "red") +
  annotate("text", x = 4, y = 6, label = "leaf L", color = "red") +
  annotate("text", x = 8, y = 6, label = "leaf R", color = "red") +
  theme_bw()

Code
DiagrammeR::grViz(
  paste0(
    "
    digraph {
      graph [ranksep = 0.2]
      node [shape = box]
        T1R [label = 'L: -9, 1, 2 \n Q score = ", round(measures_3$q_L, digits = 2), "']
        T1L [label = 'R: 6 \n Q score = ", round(measures_3$q_R, digits = 2), "']
        T0 [label = '-9, 1 , 2 , 6']
      edge [minlen = 2]
        T0->T1L
        T0->T1R
      { rank = same; T1R; T1L}
    }
    "
  )
)

$$Q_L = \frac{(-9+1+2)^2}{3+1} = 9 \qquad Q_R = \frac{6^2}{1+1} = 18$$

$$G_3 = \frac{9 + 18 - 0}{2} - 10 = 3.5$$

Among all the splits we considered, the first one (Split: {x, 2}) has the highest gain. This is easy to confirm visually and shows that picking a split based on the gain measure indeed makes sense.

Now we consider how to split leaf R (leaf L cannot be split further as it has only one observation). We have two split candidates: {x, 5} and {x, 7}. Let’s get the gain measures using get_info().

#=== first split ===#
get_info(data_R_1, 5, lambda, gamma)$gain 
[1] -9.208333
#=== second split ===#
get_info(data_R_1, 7, lambda, gamma)$gain
[1] -9.625

So, neither of the splits has a positive gain value. Therefore, we do not adopt either of the splits. For this iteration (b=1), this is the end of tree building.

Note

If the value of γ were lower (say, 0), then we would have split leaf R further. With γ = 0, both candidate splits have positive gains, and the {x, 5} split would be adopted because its gain is higher.

get_info(data_R_1, 5, lambda, 0)$gain # first split
[1] 0.7916667
get_info(data_R_1, 7, lambda, 0)$gain # second split
[1] 0.375

As you can see, a higher value of γ leads to a more aggressive tree pruning.

So, the final tree for this iteration (b=1) is

Code
DiagrammeR::grViz(
  paste0(
  "
  digraph {
    graph [ranksep = 0.2]
    node [shape = box, width = 0.3, height = 0.15, fontsize = 3, fixedsize = TRUE, penwidth = 0.2]
      T1R [label = 'L: -9 \n w* = ", round(w_L, digits = 2), "']
      T1L [label = 'R: 1 , 2 , 6 \n w* = ", round(w_R, digits = 2), "']
      T0 [label = '-9, 1 , 2 , 6']
    edge [penwidth = 0.2, arrowsize = 0.3, len = 0.3]
      T0->T1L
      T0->T1R
    { rank = same; T1R; T1L}
  }
  "
  )

)

We now use w from this tree to update our prediction according to Equation 8.7.

$$f_{i,b} = f_{i,b-1} + \eta w_j$$

measures_1 <- get_info(data, 2, lambda, gamma)

Since the first observation is in L,

$$f_{i=1,b=1} = 6 + 0.3 \times (-4.5) = 4.65$$

Since the second, third, and fourth observations are in R,

$$f_{i=2,b=1} = 6 + 0.3 \times 2.25 = 6.675$$
$$f_{i=3,b=1} = 6 + 0.3 \times 2.25 = 6.675$$
$$f_{i=4,b=1} = 6 + 0.3 \times 2.25 = 6.675$$

data %>% 
  .[, f_0 := f_0] %>% 
  .[1, f_1 := f_0 + measures_1$w_L * eta] %>%
  .[2:4, f_1 := f_0 + measures_1$w_R * eta]

The prediction updates can be seen below. Though modest, the predictions have moved closer to the observed values.

Code
ggplot(data = data) +
  geom_point(aes(y = y, x = x, color = "observed")) +
  geom_point(aes(y = f_1, x = x, color = "after (f1)")) +
  geom_point(aes(y = f_0, x = x, color = "before (f0)")) +
  scale_color_manual(
    values = 
      c(
        "before (f0)" = "blue", 
        "after (f1)" = "red",
        "observed" = "black"
      ), 
    name = ""
  ) +
  geom_segment(
    aes(y = f_0, x = x, yend = f_1, xend = x), 
    color = "blue",
    arrow = arrow(length = unit(0.1, "cm"))
  ) +
  theme_bw()

Now, we move on to b=2. We first update residuals:

data[, resid := y - f_1]

data
    y x  resid f_0   f_1
1: -3 1 -7.650   6 4.650
2:  7 4  0.325   6 6.675
3:  8 6  1.325   6 6.675
4: 12 8  5.325   6 6.675

Just like at b=1, all the possible splits are {x, 2}, {x, 5}, {x, 7}. Let’s find the gain for each split.

lapply(
  c(2, 5, 7),
  function(x) get_info(data, x, lambda, gamma)$gain
)
[[1]]
[1] 10.66639

[[2]]
[1] 6.267458

[[3]]
[1] 1.543344

So, the first split is again the best split. Should we split the right leaf, which contains all the observations except the first one?

lapply(
  c(5, 7),
  function(x) get_info(data[2:4, ], x, lambda, gamma)$gain
)
[[1]]
[1] -8.684505

[[2]]
[1] -8.538672

All the splits have negative gains. So, we do not split this leaf just like at b=1.

So, the final tree for this iteration (b=2) is

Code
measures_b2 <- get_info(data, 2, lambda, gamma)


DiagrammeR::grViz(
  paste0(
  "
  digraph {
    graph [ranksep = 0.2]
    node [shape = box, width = 0.4, height = 0.15, fontsize = 3, fixedsize = TRUE, penwidth = 0.2]
      T1R [label = 'L: -7.65 \n w* = ", round(measures_b2$w_L, digits = 2), "']
      T1L [label = 'R: 0.33 , 1.33 , 5.33 \n w* = ", round(measures_b2$w_R, digits = 2), "']
      T0 [label = '-7.65, 0.33 , 1.33 , 5.33']
    edge [penwidth = 0.2, arrowsize = 0.3, len = 0.3]
      T0->T1L
      T0->T1R
    { rank = same; T1R; T1L}
  }
  "
  )

)
(Rendered tree: leaf L: -7.65 with w* = -3.83; leaf R: 0.33, 1.33, 5.33 with w* = 1.74)

Let’s now update our predictions.

data %>% 
  .[1, f_2 := f_1 + measures_b2$w_L * eta] %>%  
  .[2:4, f_2 := f_1 + measures_b2$w_R * eta] 
Code
ggplot(data = data) +
  geom_point(aes(y = y, x = x, color = "observed"), shape = 0) +
  geom_point(aes(y = f_2, x = x, color = "f2")) +
  geom_point(aes(y = f_1, x = x, color = "f1")) +
  geom_point(aes(y = f_0, x = x, color = "f0")) +
  scale_color_manual(
    values = 
      c(
        "f0" = "blue", 
        "f1" = "red",
        "f2" = "red",
        "observed" = "black"
      ), 
    name = ""
  ) +
  geom_segment(
    aes(y = f_0, x = x, yend = f_1, xend = x), 
    color = "blue",
    arrow = arrow(length = unit(0.2, "cm"))
  ) +
  geom_segment(
    aes(y = f_1, x = x, yend = f_2, xend = x), 
    color = "blue",
    arrow = arrow(length = unit(0.2, "cm"))
  ) +
  theme_bw()

Again, we made small improvements in our predictions. This process continues until a user-specified stopping criterion is met.
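
To tie the pieces together, here is a compact sketch that automates the manual steps above for this toy data (run_xgb_stumps() is a name made up for this illustration, not an xgboost function). It simplifies tree building to a single split per tree (a stump), which is what the pruning produced at b = 1 and b = 2 above.

run_xgb_stumps <- function(data, B = 10, eta = 0.3, lambda = 1, gamma = 10) {
  f <- rep(mean(data$y), nrow(data)) # f_0: initial prediction
  cutpoints <- c(2, 5, 7) # candidate splits of x (as in the text)
  for (b in 1:B) {
    resid <- data$y - f # residuals at iteration b
    #=== gain of each candidate split of the root ===#
    gains <- sapply(cutpoints, function(cp) {
      q_p <- sum(resid)^2 / (length(resid) + lambda)
      in_L <- data$x < cp
      q_L <- sum(resid[in_L])^2 / (sum(in_L) + lambda)
      q_R <- sum(resid[!in_L])^2 / (sum(!in_L) + lambda)
      (q_L + q_R - q_p) / 2 - gamma
    })
    if (max(gains) <= 0) break # no split with positive gain: stop
    #=== leaf weights of the best split and prediction update ===#
    in_L <- data$x < cutpoints[which.max(gains)]
    w_L <- sum(resid[in_L]) / (sum(in_L) + lambda)
    w_R <- sum(resid[!in_L]) / (sum(!in_L) + lambda)
    f <- f + eta * ifelse(in_L, w_L, w_R)
  }
  f
}

run_xgb_stumps(data.table(y = c(-3, 7, 8, 12), x = c(1, 4, 6, 8)))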

Tip
  • λ:
    • A higher value of λ leads to a lower value of prediction updates (w).
    • A higher value of λ leads to a lower value of quality score (Q), thus leading to a lower value of gain (G), which then leads to more aggressive pruning for a given value of γ (see the sketch after this tip).
  • γ:
    • A higher value of γ leads to more aggressive pruning.
  • η:
    • A higher value of η leads to faster learning.
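
For instance, the λ effect can be seen directly with the leaf-L residual from the first iteration of the illustration above (a small sketch):

resid_L <- -9 # residual of the single observation in leaf L at b = 1
sapply(c(0, 1, 10), function(lambda) resid_L / (1 + lambda)) # w_L: -9, -4.5, -0.82
sapply(c(0, 1, 10), function(lambda) resid_L^2 / (1 + lambda)) # Q_L: 81, 40.5, 7.36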

8.4 Implementation

You can use the xgboost package to implement XGB modeling.

library(xgboost)

The first task is to create an object of class xgb.DMatrix using the xgb.DMatrix() function. You provide the explanatory variable matrix to the data argument and the dependent variable vector to the label argument of xgb.DMatrix(), like below.

Let’s get the mlb1 data from the wooldridge package for demonstration.

library(wooldridge)
data(mlb1)

mlb1_dt <- 
  mlb1 %>% 
  data.table() %>% # turn into data.table 
  .[, salary := NULL] %>% # remove salary (use lsalary instead)
  na.omit() # remove observations with NA in any of the variables
mlb1_dm_X <- 
  xgb.DMatrix(
    data = as.matrix(mlb1_dt[, .(hruns, years, rbisyr, allstar, runsyr, hits, bavg)]),
    label = as.matrix(mlb1_dt[, lsalary])
  )

We can then use xgb.train() to train a model using the XGB algorithm.

xgb_fit <-
  xgb.train(
    data = mlb1_dm_X, # training data (an xgb.DMatrix holding features and label)
    nrounds = 100, # number of iterations (trees to add)
    eta = 1, # learning rate
    objective = "reg:squarederror" # objective function
  )
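
Once trained, predictions can be obtained with predict(). A minimal sketch of checking the in-sample fit (note that an RMSE computed on the training data says nothing about out-of-sample performance):

y_hat <- predict(xgb_fit, newdata = mlb1_dm_X) # predictions at the training points
sqrt(mean((mlb1_dt[, lsalary] - y_hat)^2)) # in-sample RMSE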

8.5 Resources

Chen, Tianqi, and Carlos Guestrin. 2016. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94.

  1. This helps for some of the commonly used loss functions.