8 Extreme Gradient Boosting
Extreme gradient boosting (XGB) has been extremely popular due to its superb performance (Chen and Guestrin 2016). It is a variant of gradient boosting and follows the same steps as gradient boosting covered in Chapter 7. However, it has its own way of building a tree, which is more mindful of avoiding over-fitting.
XGB can be implemented with the XGBoost package (available for both R and Python). The package offers several different ways of building trees (Step 2.ii of the gradient boosting algorithm in Chapter 7). While we discuss only one of the available algorithms in this chapter, it still conveys the general idea of XGB. See the XGBoost documentation for the various tree-building algorithms.
8.1 Tree updating in XGB (general case)
Let $f_{b-1}(\cdot)$ denote the model after iteration $b-1$, so that $f_{b-1}(x_i)$ is the current prediction for observation $i$, and let $l\bigl(y_i, f(x_i)\bigr)$ denote the loss for observation $i$. At iteration $b$, XGB adds a new tree $t_b(\cdot)$ to the model.

Ideally, it would be great to find the tree that minimizes

$$
L^{(b)} = \sum_{i=1}^n l\bigl(y_i, f_{b-1}(x_i) + t_b(x_i)\bigr) + \Omega(t_b),
$$

where $\Omega(t_b)$ is a penalty on the complexity of the new tree,

$$
\Omega(t) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2,
$$

where $T$ is the number of leaves of the tree, $w = (w_1, \dots, w_T)$ is the vector of leaf weights (the prediction assigned to each leaf), $\gamma$ penalizes the number of leaves, and $\lambda$ penalizes large leaf weights. Suppose the L2 norm is used and the loss is approximated by a second-order Taylor expansion around the current prediction $f_{b-1}(x_i)$. Dropping the terms that do not depend on the new tree,

$$
L^{(b)} \approx \sum_{i=1}^n \Bigl[\, g_i\, t_b(x_i) + \frac{1}{2} h_i\, t_b(x_i)^2 \,\Bigr] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2.
$$

Let $g_i = \partial_{f_{b-1}(x_i)}\, l\bigl(y_i, f_{b-1}(x_i)\bigr)$ and $h_i = \partial^2_{f_{b-1}(x_i)}\, l\bigl(y_i, f_{b-1}(x_i)\bigr)$ denote the first and second derivatives of the loss with respect to the current prediction.

Remember that all the observations in the same leaf share the same prediction. So, for all observations $i$ in leaf $j$ (collect them in the set $I_j$), $t_b(x_i) = w_j$. For a given tree structure (denoted as $q$, the rule that assigns each observation to a leaf), the objective can therefore be written as a sum over leaves:

$$
\tilde{L}^{(b)}(q) = \sum_{j=1}^T \Bigl[\, \Bigl(\sum_{i \in I_j} g_i\Bigr) w_j + \frac{1}{2}\Bigl(\sum_{i \in I_j} h_i + \lambda\Bigr) w_j^2 \,\Bigr] + \gamma T.
$$

Taking the derivative of $\tilde{L}^{(b)}(q)$ with respect to $w_j$ and setting it to zero gives the optimal weight of leaf $j$:

$$
w_j^* = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}.
$$

The minimized value of the objective for tree structure $q$ is then

$$
\tilde{L}^{(b)}(q) = -\frac{1}{2} \sum_{j=1}^T \frac{\bigl(\sum_{i \in I_j} g_i\bigr)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T.
$$

For notational convenience, we call

$$
Q_j = \frac{\bigl(\sum_{i \in I_j} g_i\bigr)^2}{\sum_{i \in I_j} h_i + \lambda}
$$

the quality score of leaf $j$ (the higher the score, the better the leaf).

We could find the best tree structure by finding the $q$ that minimizes $\tilde{L}^{(b)}(q)$. However, it is impossible to consider all possible tree structures practically. So, a greedy (myopic) approach that starts from a single leaf and iteratively splits leaves is used instead.

Consider splitting an existing leaf $I$ into a left leaf $I_L$ and a right leaf $I_R$ (so that $I = I_L \cup I_R$). After the split, the two new leaves contribute

$$
-\frac{1}{2}\bigl(Q_L + Q_R\bigr) + 2\gamma
$$

to the minimized objective, where $Q_L$ and $Q_R$ are the quality scores of the left and right leaves. The minimized objective before splitting is

$$
-\frac{1}{2} Q_I + \gamma.
$$

So, the reduction in loss after the split is

$$
G = \frac{1}{2}\bigl(Q_L + Q_R - Q_I\bigr) - \gamma.
$$

Let's call $G$ the gain of the split. A more positive value of gain ($G$) indicates a more beneficial split. We calculate the gain for all the possible splits of the leaf and pick the split with the highest gain. Different patterns of splits arise from splitting on different variables and at different cut points. If the highest gain is negative, then leaf $I$ is not split any further.

Once the best tree is built, then we update our prediction based on

$$
f_b(x_i) = f_{b-1}(x_i) + \eta \cdot w^*_{q(x_i)},
$$

where $\eta$ is the learning rate, which scales down the contribution of the new tree, and $w^*_{q(x_i)}$ is the weight of the leaf to which observation $i$ is assigned.
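To make these formulas concrete, below is a minimal sketch (our own illustration, not code from the XGBoost package) that computes the optimal leaf weight, the quality score, and the split gain from generic vectors of first and second derivatives. The helper names (leaf_weight(), quality_score(), split_gain()) and the g, h, and hyper-parameter values at the end are made up purely for demonstration.

#=== optimal weight of a leaf given its gradients g and hessians h ===#
leaf_weight <- function(g, h, lambda) -sum(g) / (sum(h) + lambda)

#=== quality score of a leaf ===#
quality_score <- function(g, h, lambda) sum(g)^2 / (sum(h) + lambda)

#=== gain of splitting a leaf into a left part (in_left == TRUE) and a right part ===#
split_gain <- function(g, h, in_left, lambda, gamma) {
  q_parent <- quality_score(g, h, lambda)
  q_left <- quality_score(g[in_left], h[in_left], lambda)
  q_right <- quality_score(g[!in_left], h[!in_left], lambda)
  (q_left + q_right - q_parent) / 2 - gamma
}

#=== made-up gradients and hessians for a five-observation leaf ===#
g <- c(-2, -1.5, 0.5, 1.5, 2.5)
h <- rep(1, 5)
in_left <- c(TRUE, TRUE, FALSE, FALSE, FALSE)

leaf_weight(g, h, lambda = 1)
split_gain(g, h, in_left, lambda = 1, gamma = 0.5)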
8.2 Tree updating in XGB (regression)
We now make the general tree-updating algorithm specific to regression problems, where the loss function is squared error:

$$
l\bigl(y_i, f(x_i)\bigr) = \frac{1}{2}\bigl(y_i - f(x_i)\bigr)^2.
$$

First, let's find $g_i$ and $h_i$ for this loss function:

$$
g_i = \frac{\partial}{\partial f_{b-1}(x_i)} \frac{1}{2}\bigl(y_i - f_{b-1}(x_i)\bigr)^2 = -\bigl(y_i - f_{b-1}(x_i)\bigr),
\qquad
h_i = \frac{\partial^2}{\partial f_{b-1}(x_i)^2} \frac{1}{2}\bigl(y_i - f_{b-1}(x_i)\bigr)^2 = 1.
$$

So, $g_i$ is simply the negative of the residual, $r_{i,b-1} = y_i - f_{b-1}(x_i)$, and $h_i$ is 1 for every observation.

Now, suppose you are at iteration $b$ and have obtained the residuals $r_{i,b-1}$ from the model after iteration $b-1$. Given this, here is how a tree is built at iteration $b$: plug $g_i = -r_{i,b-1}$ and $h_i = 1$ into the weight, quality score, and gain formulas from the previous section. That is, for a given leaf $I_j$ with $n_j$ observations,

$$
w_j^* = \frac{\sum_{i \in I_j} r_{i,b-1}}{n_j + \lambda}
\qquad \text{and} \qquad
Q_j = \frac{\bigl(\sum_{i \in I_j} r_{i,b-1}\bigr)^2}{n_j + \lambda},
$$

and the gain of a split is calculated from these quality scores exactly as before.
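As a quick sanity check (our own addition, with made-up numbers), we can verify numerically that the half squared-error loss indeed yields $g_i = -(y_i - f(x_i))$ and $h_i = 1$, using finite differences:

loss <- function(y, f) 0.5 * (y - f)^2 # half squared-error loss

y_i <- 7 # a made-up observation
f_i <- 6.5 # a made-up current prediction
eps <- 1e-4 # step size for finite differences

#=== numerical first and second derivatives with respect to the prediction ===#
g_i <- (loss(y_i, f_i + eps) - loss(y_i, f_i - eps)) / (2 * eps)
h_i <- (loss(y_i, f_i + eps) - 2 * loss(y_i, f_i) + loss(y_i, f_i - eps)) / eps^2

g_i # approximately -(y_i - f_i) = -0.5
h_i # approximately 1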
8.3 Illustration of XGB for regression
Packages to load for replication:

library(tidyverse)
library(data.table)
To further our understanding of the entire XGB algorithm, let’s walk through a simple regression problem as an illustration. We consider the following four-observation dataset:
(
data <- data.table(
  y = c(-3, 7, 8, 12),
  x = c(1, 4, 6, 8)
)
)
y x
1: -3 1
2: 7 4
3: 8 6
4: 12 8
Code
(
g_0 <- ggplot(data) +
  geom_point(aes(y = y, x = x)) +
  theme_bw()
)
First step ($b = 0$): we find the mean of $y$ and set it as the predicted value for all the observations.

(
f_0 <- mean(data$y) # f_0: the predicted value for all the observations
)
[1] 6
Let’s set $\gamma$, $\lambda$, and $\eta$ as follows:

gamma <- 10
lambda <- 1
eta <- 0.3
We have a single-leaf tree at the moment, and the quality score for this leaf is

$$
Q = \frac{\bigl(\sum_{i=1}^{4} r_i\bigr)^2}{4 + \lambda} \quad \text{(quality score for the single leaf holding all four observations)},
$$

where $r_i = y_i - f_0$ is the residual of observation $i$.
#=== get residuals ===#
data[, resid := y - f_0]

#=== get quality score ===#
(
q_0 <- (sum(data$resid))^2 / (nrow(data) + lambda)
)
[1] 0
The quality score of the leaf is 0. Since we are using the mean of $y$ as the initial prediction, the residuals sum to exactly zero, which is why the quality score is 0.

Now, we have three potential split patterns: {x < 2}, {x < 5}, and {x < 7}. Here, {x < 2} means the leaf is split into two leaves: leaf L holding the observations with x < 2 and leaf R holding the observations with x >= 2.
Let’s consider them one by one.
8.3.0.1 Split: {x < 2}
Here are the graphical representations of the split:
Code
g_0 +
  geom_vline(xintercept = 2, color = "red") +
  annotate("text", x = 1.25, y = 6, label = "leaf L", color = "red") +
  annotate("text", x = 5, y = 6, label = "leaf R", color = "red") +
  theme_bw()
Code
DiagrammeR::grViz(
  "
  digraph {
    graph [ranksep = 0.2]
    node [shape = box]
    T1R [label = 'L: -9']
    T1L [label = 'R: 1 , 2 , 6']
    T0 [label = '-9, 1 , 2 , 6']
    edge [minlen = 2]
    T0->T1L
    T0->T1R
    { rank = same; T1R; T1L}
  }
  "
)
Let’s split the data.
#=== leaf L ===#
(
data_L_1 <- data[x < 2, ]
)
y x resid
1: -3 1 -9
#=== leaf R ===#
(
data_R_1 <- data[x >= 2, ]
)
y x resid
1: 7 4 1
2: 8 6 2
3: 12 8 6
Using the $w^*$ formula for regression from Section 8.2, the leaf weights are

w_L <- sum(data_L_1$resid) / (nrow(data_L_1) + lambda)
w_R <- sum(data_R_1$resid) / (nrow(data_R_1) + lambda)
Using the quality score formula from Section 8.2, the quality scores for the leaves are

q_L <- (sum(data_L_1$resid))^2 / (nrow(data_L_1) + lambda)
q_R <- (sum(data_R_1$resid))^2 / (nrow(data_R_1) + lambda)
Code
DiagrammeR::grViz(
  paste0(
    "
    digraph {
      graph [ranksep = 0.2]
      node [shape = box]
      T1R [label = 'L: -9 \n Q score = ", round(q_L, digits = 2), "']
      T1L [label = 'R: 1 , 2 , 6 \n Q score = ", round(q_R, digits = 2), "']
      T0 [label = '-9, 1 , 2 , 6']
      edge [minlen = 2]
      T0->T1L
      T0->T1R
      { rank = same; T1R; T1L}
    }
    "
  )
)
Notice that residuals are first summed and then squared in the numerator of the quality score (the higher, the better). This means that if the predictions are off in the same direction among the observations within the leaf (meaning the observations are similar), the quality score is high. On the other hand, if the predictions are off in both directions (meaning the observations are not similar), the residuals cancel each other out, resulting in a lower quality score. Since we would like to create leaves consisting of similar observations, a more successful split has a higher quality score.
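As a small illustration of this point (our own example, with made-up residuals), compare two hypothetical two-observation leaves whose residuals have the same magnitudes but differ in sign agreement:

#=== residuals off in the same direction (similar observations) ===#
resid_similar <- c(4, 5)

#=== residuals off in opposite directions (dissimilar observations) ===#
resid_dissimilar <- c(4, -5)

sum(resid_similar)^2 / (length(resid_similar) + lambda) # 81 / 3 = 27
sum(resid_dissimilar)^2 / (length(resid_dissimilar) + lambda) # 1 / 3, much lower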
Finally, the gain of this split is

$$
G = \frac{1}{2}\bigl(Q_L + Q_R - Q_0\bigr) - \gamma,
$$

where $Q_0$ is the quality score of the parent leaf (the single leaf before the split).

gain_1 <- (q_L + q_R - q_0) / 2 - gamma
Now that we have gone through the process of finding the update values ($w_L$, $w_R$), quality scores, and gain for one split, let's write a function that returns all of these measures for a given cut point, so that we can evaluate the remaining splits more easily.
get_info <- function(data, cutpoint, lambda, gamma) {

  #=== quality score of the parent leaf ===#
  q_0 <- (sum(data$resid))^2 / (nrow(data) + lambda)

  #=== split the data at the cut point ===#
  data_L <- data[x < cutpoint, ]
  data_R <- data[x >= cutpoint, ]

  #=== leaf weights ===#
  w_L <- sum(data_L$resid) / (nrow(data_L) + lambda)
  w_R <- sum(data_R$resid) / (nrow(data_R) + lambda)

  #=== quality scores ===#
  q_L <- (sum(data_L$resid))^2 / (nrow(data_L) + lambda)
  q_R <- (sum(data_R$resid))^2 / (nrow(data_R) + lambda)

  #=== gain of the split ===#
  gain <- (q_L + q_R - q_0) / 2 - gamma

  return(list(
    w_L = w_L,
    w_R = w_R,
    q_L = q_L,
    q_R = q_R,
    gain = gain
  ))
}
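As a quick check (our own addition, not part of the original walkthrough), get_info() with a cut point of 2 reproduces the gain we just computed by hand for the {x < 2} split:

all.equal(get_info(data, 2, lambda, gamma)$gain, gain_1) # should return TRUE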
8.3.0.2 Split: {x < 5}

measures_2 <- get_info(data, 5, lambda, gamma)
Code
g_0 +
  geom_vline(xintercept = 5, color = "red") +
  annotate("text", x = 3, y = 6, label = "leaf L", color = "red") +
  annotate("text", x = 7, y = 6, label = "leaf R", color = "red") +
  theme_bw()
Code
DiagrammeR::grViz(
  paste0(
    "
    digraph {
      graph [ranksep = 0.2]
      node [shape = box]
      T1R [label = 'L: -9, 1 \n Q score = ", round(measures_2$q_L, digits = 2), "']
      T1L [label = 'R: 2 , 6 \n Q score = ", round(measures_2$q_R, digits = 2), "']
      T0 [label = '-9, 1 , 2 , 6']
      edge [minlen = 2]
      T0->T1L
      T0->T1R
      { rank = same; T1R; T1L}
    }
    "
  )
)
8.3.0.3 Split: {x < 7}

measures_3 <- get_info(data, 7, lambda, gamma)
Code
g_0 +
  geom_vline(xintercept = 7, color = "red") +
  annotate("text", x = 4, y = 6, label = "leaf L", color = "red") +
  annotate("text", x = 8, y = 6, label = "leaf R", color = "red") +
  theme_bw()
Code
DiagrammeR::grViz(
  paste0(
    "
    digraph {
      graph [ranksep = 0.2]
      node [shape = box]
      T1R [label = 'L: -9, 1, 2 \n Q score = ", round(measures_3$q_L, digits = 2), "']
      T1L [label = 'R: 6 \n Q score = ", round(measures_3$q_R, digits = 2), "']
      T0 [label = '-9, 1 , 2 , 6']
      edge [minlen = 2]
      T0->T1L
      T0->T1R
      { rank = same; T1R; T1L}
    }
    "
  )
)
Among all the splits we considered, the first one (Split: {x < 2}) has the highest gain. This is easy to confirm visually, and it shows that picking a split based on the gain measure indeed makes sense.
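To see this numerically as well (an extra check of our own, not in the original text), we can line up the gains of the three candidate splits in the order {x < 2}, {x < 5}, {x < 7}; the first is the largest:

c(gain_1, measures_2$gain, measures_3$gain)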
Now we consider how to split leaf R (leaf L cannot be split further as it has only one observation). We have two split candidates: {x < 5} and {x < 7}. Let’s get the gain measures using get_info().
#=== first split ===#
get_info(data_R_1, 5, lambda, gamma)$gain
[1] -9.208333
#=== second split ===#
get_info(data_R_1, 7, lambda, gamma)$gain
[1] -9.625
Neither of the splits has a positive gain value, so we do not adopt either of them and leaf R remains unsplit.

The final tree for this iteration ($b = 1$) therefore looks like this:
Code
DiagrammeR::grViz(
  paste0(
    "
    digraph {
      graph [ranksep = 0.2]
      node [shape = box, width = 0.3, height = 0.15, fontsize = 3, fixedsize = TRUE, penwidth = 0.2]
      T1R [label = 'L: -9 \n w* = ", round(w_L, digits = 2), "']
      T1L [label = 'R: 1 , 2 , 6 \n w* = ", round(w_R, digits = 2), "']
      T0 [label = '-9, 1 , 2 , 6']
      edge [penwidth = 0.2, arrowsize = 0.3, len = 0.3]
      T0->T1L
      T0->T1R
      { rank = same; T1R; T1L}
    }
    "
  )
)
We now use this tree to update our predictions. Let's first recover the measures of the chosen split ({x < 2}) with get_info().

measures_1 <- get_info(data, 2, lambda, gamma)
Since the first observation is in leaf L, its prediction is updated by adding $\eta \cdot w_L$. Since the second, third, and fourth observations are in leaf R, their predictions are updated by adding $\eta \cdot w_R$:
data %>%
  .[, f_0 := f_0] %>% # store the scalar f_0 as a column
  .[1, f_1 := f_0 + measures_1$w_L * eta] %>%
  .[2:4, f_1 := f_0 + measures_1$w_R * eta]
The prediction updates can be seen below. Though small, the changes move our predictions in the right direction.
Code
ggplot(data = data) +
  geom_point(aes(y = y, x = x, color = "observed")) +
  geom_point(aes(y = f_1, x = x, color = "after (f1)")) +
  geom_point(aes(y = f_0, x = x, color = "before (f0)")) +
  scale_color_manual(
    values =
      c(
        "before (f0)" = "blue",
        "after (f1)" = "red",
        "observed" = "black"
      ),
    name = ""
  ) +
  geom_segment(
    aes(y = f_0, x = x, yend = f_1, xend = x),
    color = "blue",
    arrow = arrow(length = unit(0.1, "cm"))
  ) +
  theme_bw()
Now, we move on to the second iteration ($b = 2$). First, we update the residuals using the new predictions ($f_1$):

data[, resid := y - f_1]
data
y x resid f_0 f_1
1: -3 1 -7.650 6 4.650
2: 7 4 0.325 6 6.675
3: 8 6 1.325 6 6.675
4: 12 8 5.325 6 6.675
Just like at $b = 1$, we have three potential split patterns: {x < 2}, {x < 5}, and {x < 7}. Let’s find the gain for each split.
lapply(
c(2, 5, 7),
function(x) get_info(data, x, lambda, gamma)$gain
)
[[1]]
[1] 10.66639
[[2]]
[1] 6.267458
[[3]]
[1] 1.543344
So, the first split is again the best one. Should we further split the right leaf, which holds all the observations except the first one?
lapply(
c(5, 7),
function(x) get_info(data[2:4, ], x, lambda, gamma)$gain
)
[[1]]
[1] -8.684505
[[2]]
[1] -8.538672
All the splits have negative gains, so we do not split this leaf, just like at $b = 1$.

The final tree for this iteration ($b = 2$) therefore looks like this:
Code
measures_b2 <- get_info(data, 2, lambda, gamma)

DiagrammeR::grViz(
  paste0(
    "
    digraph {
      graph [ranksep = 0.2]
      node [shape = box, width = 0.4, height = 0.15, fontsize = 3, fixedsize = TRUE, penwidth = 0.2]
      T1R [label = 'L: -7.65 \n w* = ", round(measures_b2$w_L, digits = 2), "']
      T1L [label = 'R: 0.33 , 1.33 , 5.33 \n w* = ", round(measures_b2$w_R, digits = 2), "']
      T0 [label = '-7.65, 0.33 , 1.33 , 5.33']
      edge [penwidth = 0.2, arrowsize = 0.3, len = 0.3]
      T0->T1L
      T0->T1R
      { rank = same; T1R; T1L}
    }
    "
  )
)
Let’s now update our predictions.
data %>%
  .[1, f_2 := f_1 + measures_b2$w_L * eta] %>%
  .[2:4, f_2 := f_1 + measures_b2$w_R * eta]
Code
ggplot(data = data) +
  geom_point(aes(y = y, x = x, color = "observed"), shape = 0) +
  geom_point(aes(y = f_2, x = x, color = "f2")) +
  geom_point(aes(y = f_1, x = x, color = "f1")) +
  geom_point(aes(y = f_0, x = x, color = "f0")) +
  scale_color_manual(
    values =
      c(
        "f0" = "blue",
        "f1" = "red",
        "f2" = "red",
        "observed" = "black"
      ),
    name = ""
  ) +
  geom_segment(
    aes(y = f_0, x = x, yend = f_1, xend = x),
    color = "blue",
    arrow = arrow(length = unit(0.2, "cm"))
  ) +
  geom_segment(
    aes(y = f_1, x = x, yend = f_2, xend = x),
    color = "blue",
    arrow = arrow(length = unit(0.2, "cm"))
  ) +
  theme_bw()
Again, we made small improvements in our predictions. This process continues until a user-specified stopping criterion is met.
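To connect the two iterations we just walked through, here is a minimal sketch (our own simplified code, not how the xgboost package is implemented) that wraps the per-iteration steps into a loop. The hypothetical helper run_xgb_stumps() grows a depth-one tree by picking the best of the three cut points, updates the predictions with the learning rate, and stops early once no split has a positive gain. It assumes data, get_info(), lambda, gamma, and eta are defined as above.

run_xgb_stumps <- function(data, n_iter = 10, cutpoints = c(2, 5, 7)) {

  data <- copy(data) # avoid modifying the original data.table by reference
  data[, f := mean(y)] # f_0: the mean of y as the initial prediction

  for (b in 1:n_iter) {

    data[, resid := y - f] # residuals given the current prediction

    #=== evaluate all candidate splits and pick the best one ===#
    gains <- sapply(cutpoints, function(cp) get_info(data, cp, lambda, gamma)$gain)

    if (max(gains) < 0) break # no split with a positive gain: stop (a simplification)

    best_cp <- cutpoints[which.max(gains)]
    measures <- get_info(data, best_cp, lambda, gamma)

    #=== update the predictions leaf by leaf ===#
    data[x < best_cp, f := f + eta * measures$w_L]
    data[x >= best_cp, f := f + eta * measures$w_R]
  }

  return(data[])
}

run_xgb_stumps(data, n_iter = 10)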
8.4 Implementation
You can use the xgboost package to implement XGB modeling.
library(xgboost)
The first task is to create a special matrix object of class xgb.DMatrix using the xgb.DMatrix() function. You provide the explanatory variable data matrix to the data option and the dependent variable matrix (vector) to the label option of xgb.DMatrix(), like below.
Let’s get the mlb1 data from the wooldridge package for demonstration.
library(wooldridge)
data(mlb1)
mlb1_dt <-
  mlb1 %>%
  data.table() %>% # turn into data.table
  .[, salary := NULL] %>% # remove salary (use lsalary instead)
  na.omit() # remove observations with NA in any of the variables
mlb1_dm_X <-
  xgb.DMatrix(
    data = as.matrix(mlb1_dt[, .(hruns, years, rbisyr, allstar, runsyr, hits, bavg)]),
    label = as.matrix(mlb1_dt[, lsalary])
  )
We can then use xgb.train() to train a model using the XGB algorithm.
xgb_fit <-
  xgb.train(
    data = mlb1_dm_X, # training data as xgb.DMatrix (features and label)
    nrounds = 100, # number of iterations (trees to add)
    eta = 1, # learning rate
    objective = "reg:squarederror" # objective function
  )
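Once the model is trained, predictions can be obtained with predict(). For example (this step is not shown above, but uses the standard predict method for fitted xgboost models), the fitted values for the training data are:

y_hat <- predict(xgb_fit, mlb1_dm_X) # predictions for the training data
head(y_hat)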
8.5 Resources