Packages to load for replication:

library(data.table)
library(tidyverse)
library(mgcv)
library(caret)
library(parallel)
library(fixest)
library(tidymodels)
library(ranger)
5 Bootstrap
5.1 What is it for?
Bootstrap can be used to quantify the uncertainty associated with an estimator. For example, you can use it to estimate the standard error (SE) of a coefficient of a linear model. Since closed-form solutions exist for that, bootstrap does not really bring any benefit in that case. However, bootstrap comes in handy when you do NOT have a closed-form solution. We will first demonstrate how bootstrap works using a linear model, and then apply it to a case where no closed-form solution is available.
5.2 How it works
Here are the general steps of a bootstrap:
- Step 1: Sample the data with replacement (you can sample the same observation more than once; you draw a ball and then put it back in the box)
- Step 2: Run a statistical analysis to estimate whatever quantity you are interested in estimating
- Repeat Steps 1 and 2 many times and store the estimates
- Derive uncertainty measures from the collection of estimates obtained above
Let’s demonstrate this using a very simple linear regression example.
Here is the data generating process:
set.seed(89343)
N <- 100
x <- rnorm(N)
y <- 2 * x + 2 * rnorm(N)
We would like to estimate the coefficient on \(x\) by applying OLS to the following model:
\[ y = \alpha + \beta x + \mu \]
We know from the econometric theory class that the SE of \(\hat{\beta}_{OLS}\) is \(\frac{\sigma}{\sqrt{SST_x}}\), where \(\sigma^2\) is the variance of the error term (\(\mu\)) and \(SST_x = \sum_{i=1}^N (x_i - \bar{x})^2\) (\(\bar{x}\) is the mean of \(x\)).
mean_x <- mean(x)
sst_x <- ((x - mean(x))^2) %>% sum()

(
se_bhat <- sqrt(4 / sst_x)
)
[1] 0.217933
So, we know that the true SE of \(\hat{\beta}_{OLS}\) is 0.217933. There is not really any point in using bootstrap in this case, but this is a good example to see if bootstrap works or not.
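As a quick sanity check (not part of the original derivation), the SE reported by lm() uses the same formula with \(\sigma\) replaced by its estimate, so it should be close to, but not exactly equal to, 0.217933:

#=== lm() reports the SE with sigma estimated, so it only approximates the true SE above ===#
summary(lm(y ~ x))$coefficients["x", "Std. Error"]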
Let’s implement a single iteration of the entire bootstrap steps (Steps 1 and 2).
#=== set up a dataset ===#
data <-
  data.table(
    y = y,
    x = x
  )
Now, draw observations with replacement so the resulting dataset has the same number of observations as the original dataset.
num_obs <- nrow(data)

#=== draw row numbers ===#
(
row_indices <- sample(seq_len(num_obs), num_obs, replace = TRUE)
)
[1] 29 56 39 83 52 66 70 19 4 34 28 34 81 32 9 95 99 86 4 79 30 15 41 97 43
[26] 89 60 41 16 19 66 96 34 91 86 67 75 28 74 50 71 95 74 87 58 27 9 65 80 41
[51] 71 64 21 47 45 77 97 94 72 50 23 10 33 45 14 17 82 56 33 75 70 63 78 81 64
[76] 16 84 90 2 17 5 46 53 37 93 85 72 63 10 35 42 20 70 49 74 32 25 73 76 32
Use the sampled indices to create a bootstrapped dataset:
You could also use bootstraps() from the rsample package.

temp_data <- data[row_indices, ]
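If you prefer the rsample route mentioned in the margin, a minimal sketch looks like this (boot_splits and temp_data_rsample are illustrative names; analysis() extracts the bootstrapped sample from a split, and rsample is loaded as part of tidymodels):

#=== rsample alternative (a sketch) ===#
boot_splits <- bootstraps(data, times = 1)
temp_data_rsample <- analysis(boot_splits$splits[[1]])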
Now, apply OLS to get a coefficient estimate on \(x\) using the bootstrapped dataset.
lm(y ~ x, data = temp_data)$coefficient["x"]
x
2.040957
This is the end of Steps 1 and 2. Now, let’s repeat this step 1000 times. First, we define a function that implements Steps 1 and 2.
get_beta <- function(i, data)
{
  num_obs <- nrow(data)

  #=== sample row numbers ===#
  row_indices <- sample(seq_len(num_obs), num_obs, replace = TRUE)

  #=== bootstrapped data ===#
  temp_data <- data[row_indices, ]

  #=== get coefficient ===#
  beta_hat <- lm(y ~ x, data = temp_data)$coefficient["x"]

  return(beta_hat)
}
Now repeat get_beta() many times:
beta_store <-
  lapply(
    1:1000,
    function(x) get_beta(x, data)
  ) %>%
  unlist()
Calculate the standard deviation of the bootstrap estimates of \(\hat{\beta}_{OLS}\):
sd(beta_store)
[1] 0.2090611
Not bad. What if we increase the number of observations to 1,000 instead of 100?
set.seed(67343)
#=== generate data ===#
N <- 1000
x <- rnorm(N)
y <- 2 * x + 2 * rnorm(N)

#=== set up a dataset ===#
data <-
  data.table(
    y = y,
    x = x
  )

#=== true SE ===#
mean_x <- mean(x)
sst_x <- sum(((x - mean(x))^2))

(
se_bhat <- sqrt(4 / sst_x)
)
[1] 0.06243842
#=== bootstrap-estimated SE ===#
beta_store <-
  lapply(
    1:1000,
    function(x) get_beta(x, data)
  ) %>%
  unlist()
sd(beta_store)
[1] 0.06147708
This is just a single simulation, so we cannot conclude from these experiments alone that bootstrap works better when the sample size is larger. But it is generally true that bootstrap does work better with a larger sample size.
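This claim can be checked more systematically by wrapping the data generation and the bootstrap in a small Monte Carlo experiment. Here is a minimal sketch along those lines (the function name mc_round(), the number of rounds, and the number of bootstrap replications are arbitrary choices, not part of the original exercise):

#=== a minimal Monte Carlo sketch: compare the bootstrap SE to the true SE ===#
mc_round <- function(i, N) {

  #=== simulate a fresh dataset ===#
  x <- rnorm(N)
  y <- 2 * x + 2 * rnorm(N)
  sim_data <- data.table(y = y, x = x)

  #=== true SE for this dataset (sigma = 2 is known here) ===#
  true_se <- sqrt(4 / sum((x - mean(x))^2))

  #=== bootstrap SE (only 100 replications to keep it fast) ===#
  boot_se <-
    lapply(1:100, function(j) get_beta(j, sim_data)) %>%
    unlist() %>%
    sd()

  data.table(N = N, true_se = true_se, boot_se = boot_se)
}

#=== 10 Monte Carlo rounds each for N = 100 and N = 1000 ===#
mc_results <-
  lapply(1:10, function(i) rbind(mc_round(i, 100), mc_round(i, 1000))) %>%
  rbindlist()

#=== average absolute error of the bootstrap SE by sample size ===#
mc_results[, .(mean_abs_error = mean(abs(boot_se - true_se))), by = N]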
5.3 A more complicated example
Consider a simple production function (e.g., yield response functions for agronomists):
\[ y = \beta_1 x + \beta_2 x^2 + \mu \]
- \(y\): output
- \(x\): input
- \(\mu\): error
The price of \(y\) is 5 and the cost of \(x\) is 2. Your objective is to identify the amount of input that maximizes profit. You do not know \(\beta_1\) and \(\beta_2\), and will be estimating them using the data you have collected. Letting \(\hat{\beta_1}\) and \(\hat{\beta_2}\) denote the estimates of \(\beta_1\) and \(\beta_2\), respectively, the mathematical expression of the optimization problem is:
\[ \max_x \;\; 5(\hat{\beta}_1 x + \hat{\beta}_2 x^2) - 2 x \]
The F.O.C. is
\[ 5\hat{\beta}_1 + 10 \hat{\beta}_2 x - 2 = 0 \]
So, the estimated profit-maximizing input level is \(\hat{x}^* = \frac{2-5\hat{\beta}_1}{10\hat{\beta}_2}\). What we are interested in knowing is the SE of \(x^*\). As you can see, it is a non-linear function of the coefficients, which makes it slightly harder than simply getting the SE of \(\hat{\beta_1}\) or \(\hat{\beta_2}\). However, bootstrap can easily get us an estimate of the SE of \(\hat{x}^*\). The bootstrap process will be very much the same as the first bootstrap example except that we will estimate \(x^*\) in each iteration instead of stopping at estimating just coefficients. Let’s work on a single iteration first.
Alternatively, you could use the delta method, which lets you find an estimate of the SE of a statistic that is a non-linear function of estimated coefficients.
Here is the data generating process:
set.seed(894334)
N <- 1000
x <- runif(N) * 3
ey <- 6 * x - 2 * x^2
mu <- 2 * rnorm(N)
y <- ey + mu

data <-
  data.table(
    x = x,
    y = y,
    ey = ey
  )
Under the data generating process, here is what the production function looks like:
ggplot(data = data) +
geom_line(aes(y = ey, x = x)) +
theme_bw()
num_obs <- nrow(data)
row_indices <- sample(seq_len(num_obs), num_obs, replace = TRUE)
boot_data <- data[row_indices, ]
reg <- lm(y ~ x + I(x^2), data = boot_data)
Now that we have estimated \(\beta_1\) and \(\beta_2\), we can easily estimate \(x^*\) using its analytical formula.
(
x_star <- (2 - 5 * reg$coef["x"]) / (10 * reg$coef["I(x^2)"])
)
x
1.40625
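As an aside, the delta method mentioned in the margin note gives a closed-form SE for \(\hat{x}^*\) without any resampling. Here is a minimal sketch, assuming the msm package (not loaded above) is available; reg_full is an illustrative name, and x1, x2, x3 refer to the coefficients in the order returned by coef():

#=== delta method sketch (assumes the msm package; fit on the full data, not a bootstrapped sample) ===#
library(msm)

reg_full <- lm(y ~ x + I(x^2), data = data)

#=== x1 = intercept, x2 = coefficient on x, x3 = coefficient on x^2 ===#
deltamethod(
  ~ (2 - 5 * x2) / (10 * x3),
  mean = coef(reg_full),
  cov = vcov(reg_full)
)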
We can repeat this many times to get a collection of \(x^*\) estimates and calculate the standard deviation.
get_x_star <- function(i)
{
  row_indices <- sample(seq_len(num_obs), num_obs, replace = TRUE)
  boot_data <- data[row_indices, ]
  reg <- lm(y ~ x + I(x^2), data = boot_data)
  x_star <- (2 - 5 * reg$coef["x"]) / (10 * reg$coef["I(x^2)"])

  return(x_star)
}
x_stars <-
  lapply(
    1:1000,
    get_x_star
  ) %>%
  unlist()
Here is the histogram:
hist(x_stars, breaks = 30)
So, it seems to follow a normal distribution. You can get the standard deviation of x_stars as an estimate of the SE of \(\hat{x}^*\).
sd(x_stars)
[1] 0.01783721
You can get the 95% confidence interval (CI) as follows:
quantile(x_stars, prob = c(0.025, 0.975))
2.5% 97.5%
1.364099 1.430915
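Since the distribution of x_stars looks roughly normal, a normal-approximation CI built from sd(x_stars) is another option (a minimal sketch; the percentile interval above is what we use in the rest of this section):

#=== normal-approximation 95% CI (an alternative to the percentile CI above) ===#
mean(x_stars) + c(-1.96, 1.96) * sd(x_stars)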
5.4 One more example with a non-parametric model
We now demonstrate how we can use bootstrap to get an estimate of the SE of \(\hat{x}^*\) when we use random forest (RF) as our regression method instead of OLS. When RF is used, we do not have any coefficients like the OLS case above. Even then, bootstrap allows us to estimate the SE of \(\hat{x}^*\).
RF will be covered in Section 6.2. Please take a quick glance at what RF is to confirm this point; no deep understanding of RF is necessary to appreciate the significance of bootstrap in this example. Please also note that there is no benefit in using RF in this example because we have only one variable, and a semi-parametric approach like gam() will certainly do better. RF is used here only because, unlike linear models, it does not give you coefficient estimates (there are no coefficients in the first place for RF).
The procedure is exactly the same except that we use RF to estimate the production function, and that we need to find the optimal input level numerically since, unlike the case above, no analytical formula is available.
We first implement a single iteration.
#=== get bootstrapped data ===#
row_indices <- sample(seq_len(num_obs), num_obs, replace = TRUE)
boot_data <- data[row_indices, ]

#=== train RF ===#
reg_rf <- ranger(y ~ x, data = boot_data)
Once the RF model is trained, we can predict yield at a range of values of \(x\), calculate profit at each of them, and then pick the value of \(x\) that maximizes the estimated profit.
#=== create series of x values at which yield will be predicted ===#
eval_data <- data.table(x = seq(0, 3, length = 1000))

#=== predict yield based on the trained RF ===#
eval_data[, y_hat := predict(reg_rf, eval_data)$predictions]
Here is what the estimated production function looks like:
#=== plot ===#
ggplot(data = eval_data) +
geom_line(aes(y = y_hat, x = x)) +
theme_bw()
Well, it is very spiky (we would need to tune the hyper-parameters using k-fold cross-validation, but more on that later; the quality of the RF estimation has nothing to do with the goal of this section).
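Just to illustrate that point (this is not used in the rest of the section), forcing larger terminal nodes through ranger's min.node.size argument already smooths the fit considerably; reg_rf_smooth and y_hat_smooth are names introduced only for this illustration:

#=== illustration only: larger terminal nodes give a smoother production function estimate ===#
reg_rf_smooth <- ranger(y ~ x, data = boot_data, min.node.size = 50)
eval_data[, y_hat_smooth := predict(reg_rf_smooth, eval_data)$predictions]

ggplot(data = eval_data) +
  geom_line(aes(y = y_hat_smooth, x = x)) +
  theme_bw()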
We can now predict profit at each value of \(x\).
#=== calculate profit ===#
eval_data[, profit_hat := 5 * y_hat - 2 * x]
head(eval_data)
x y_hat profit_hat
1: 0.000000000 -0.4807528 -2.403764
2: 0.003003003 -0.4807528 -2.409770
3: 0.006006006 -0.4807528 -2.415776
4: 0.009009009 1.0193781 5.078872
5: 0.012012012 1.0193781 5.072866
6: 0.015015015 1.2466564 6.203252
The only thing left for us to do is find the \(x\) value that maximizes profit.
eval_data[which.max(profit_hat), ]
x y_hat profit_hat
1: 1.273273 8.675033 40.82862
Okay, so 1.2732733 is the \(\hat{x}^*\) from this iteration.
As you might have guessed already, we can just repeat this step to get an estimate of the SE of \(\hat{x}^*_{RF}\).
get_x_star_rf <- function(i)
{
  print(i) # progress tracker

  #=== get bootstrapped data ===#
  row_indices <- sample(seq_len(num_obs), num_obs, replace = TRUE)
  boot_data <- data[row_indices, ]

  #=== train RF ===#
  reg_rf <- ranger(y ~ x, data = boot_data)

  #=== create series of x values at which yield will be predicted ===#
  eval_data <- data.table(x = seq(0, 3, length = 1000))

  #=== predict yield based on the trained RF ===#
  eval_data[, y_hat := predict(reg_rf, eval_data)$predictions]

  #=== calculate profit ===#
  eval_data[, profit_hat := 5 * y_hat - 2 * x]

  #=== find x_star_hat ===#
  x_star_hat <- eval_data[which.max(profit_hat), x]

  return(x_star_hat)
}
x_stars_rf <-
  mclapply(
    1:1000,
    get_x_star_rf,
    mc.cores = 12
  ) %>%
  unlist()
#=== Windows user ===#
# library(future.apply)
# plan("multisession", workers = detectCores() - 2)
# x_stars_rf <-
# future_lapply(
# 1:1000,
# get_x_star_rf
# ) %>%
# unlist()
Here are the estimated SE of \(\hat{x}^*_{RF}\) and its 95% CI.
sd(x_stars_rf)
[1] 0.3949154
quantile(x_stars_rf, prob = c(0.025, 0.975))
2.5% 97.5%
0.5465465 2.0570571
As you can see, the estimation of \(x^*\) is much less precise than with the previous OLS approach. This is likely because we are not doing a good job of tuning the hyper-parameters of RF (but, again, more on this later).
This concludes the illustration of the power of bootstrap for estimating the uncertainty of the statistic of interest (\(x^*\) here) when its analytical formula is non-linear or not even known.
5.5 Bag of Little Bootstraps
Bag of little bootstraps (BLB) is a variant of bootstrap that has desirable statistical properties similar to those of the regular bootstrap, but is more (computer) memory friendly (Kleiner et al. 2014).
When the dataset is large (say, 1 GB), bootstrap is computationally costly both in terms of time and memory because data of the same size is resampled many times (say, 1,000 times). BLB overcomes this problem.
Suppose you are interested in \(\eta\) (e.g., standard error, confidence interval) of the estimator \(\hat{\theta}\). Further, let \(N\) denote the sample size of the training data.
BLB works as follows:
- Split the dataset, without replacement, into \(S\) sets of subsamples of size \(B\) so that \(S \times B = N\).
- For each subsample, draw \(N\) (not \(B\)) observations with replacement \(M\) times, and find \(\eta\) (call it \(\hat{\eta}_s\)) for each of the subsample sets.
- Average \(\hat{\eta}_s\) across the subsample sets to get \(\hat{\eta}\).
5.5.1 Demonstrations
Let’s go back to the example discussed in Section 5.3, where the goal is to identify the confidence interval (\(\eta\)) of the economically optimal input rate (\(\theta\)).
Step 1: Split the dataset, without replacement, into \(S\) sets of subsamples of size \(B\) so that \(S \times B = N\).
Here, we set \(S\) at 5, so \(B = 200\).
N <- nrow(data)
S <- 5

(
subsamples <-
  data[sample(seq_len(N), N), ] %>%
  .[, subgroup := rep(1:S, each = N/S)] %>%
  split(.$subgroup)
)
$`1`
x y ey subgroup
1: 2.3733159 5.30861102 2.974639 1
2: 1.6275776 3.09436504 4.467448 1
3: 0.2254504 -0.00443899 1.251047 1
4: 1.9988901 3.51795123 4.002217 1
5: 2.6819931 2.65018050 1.705785 1
---
196: 0.6356858 1.70404729 3.005922 1
197: 0.5894847 5.41369203 2.841924 1
198: 1.1667444 5.82214614 4.277881 1
199: 0.9570687 4.01145812 3.910451 1
200: 2.6627629 5.73397413 1.795965 1
$`2`
x y ey subgroup
1: 1.8809192 4.57934497 4.2098011 2
2: 0.1314569 -0.61561180 0.7541794 2
3: 2.5253701 1.49742398 2.3972324 2
4: 0.8415726 5.29003218 3.6329468 2
5: 0.4067221 3.22079460 2.1094867 2
---
196: 0.3281349 5.33737092 1.7534644 2
197: 0.1399768 3.43767708 0.8006737 2
198: 2.6292985 -0.06800098 1.9493698 2
199: 2.3410270 3.73951071 3.0853472 2
200: 2.8142845 -2.75691735 1.0453126 2
$`3`
x y ey subgroup
1: 1.2296299 5.0020342 4.353800036 3
2: 0.8974602 5.2290199 3.773891666 3
3: 1.0734530 1.0161772 4.136115258 3
4: 2.8013991 2.7505559 1.112720709 3
5: 1.5841919 4.9322236 4.485823441 3
---
196: 0.1508607 1.6273472 0.859646069 3
197: 0.6124024 1.8253166 2.924341101 3
198: 2.9990628 -0.4053550 0.005621654 3
199: 2.8691630 -0.9073811 0.750785549 3
200: 1.0579288 6.2074265 4.109146183 3
$`4`
x y ey subgroup
1: 2.3998924 -0.8807641 2.880388 4
2: 1.0257974 6.6735166 4.050264 4
3: 1.1611026 3.5404186 4.270297 4
4: 1.0894566 2.5421898 4.162908 4
5: 0.1975354 0.3857778 1.107172 4
---
196: 0.7921199 7.0038608 3.497812 4
197: 2.6279086 1.9564617 1.955645 4
198: 1.4766912 2.8731862 4.498913 4
199: 0.6217361 2.1170408 2.957305 4
200: 2.0381772 1.8095725 3.920731 4
$`5`
x y ey subgroup
1: 0.5453160 10.3885051 2.677156938 5
2: 1.2077153 1.2218954 4.329139357 5
3: 1.1022868 1.8370493 4.183648383 5
4: 2.9998229 -0.5902619 0.001062293 5
5: 1.8845175 0.8490326 4.204292516 5
---
196: 1.5471997 4.8499174 4.495544367 5
197: 0.5813285 3.7268747 2.812085354 5
198: 1.5839438 7.7635746 4.485906861 5
199: 0.2343757 1.8121762 1.296390509 5
200: 1.9727159 1.9638589 4.053079371 5
You could alternatively do this:

fold_data <- vfold_cv(data, v = 5)

subsamples <-
  lapply(
    1:5,
    function(x) fold_data[x, ]$splits[[1]] %>% assessment()
  )
Step 2: For each subsample, draw \(N\) (not \(B\)) observations with replacement \(M\) times, and find \(\eta\) (call it \(\hat{\eta}_s\)) for each of the subsample sets.
Instead of creating bootstrapped samples \(M\) times, only one bootstrapped sample from the first subsample is created for now, and then the estimate of \(x^*\) is obtained.
(
boot_data <-
  sample(
    seq_len(nrow(subsamples[[1]])),
    N,
    replace = TRUE
  ) %>%
  subsamples[[1]][., ]
)
x y ey
1: 2.99982294 -0.5902619 0.001062293
2: 0.36841088 0.6414166 1.939012145
3: 0.36841088 0.6414166 1.939012145
4: 1.52659488 4.1079431 4.498585425
5: 2.68124652 1.4721212 1.709313320
---
996: 2.20473789 2.7760518 3.506689002
997: 0.67635816 6.1153578 3.143228227
998: 1.89649565 6.3148116 4.185582404
999: 0.03379467 -0.2925077 0.200483835
1000: 2.51539783 1.7992266 2.437934475
We can now train a linear model using the bootstrapped data and calculate \(\hat{x}^*\) based on it.
reg <- lm(y ~ x + I(x^2), data = boot_data)

(
x_star <- (2 - 5 * reg$coef["x"]) / (10 * reg$coef["I(x^2)"])
)
x
1.367197
We can loop this process over all 100 (\(M = 100\)) bootstrapped datasets from the first subsample set. Before that, let's define a function that conducts a single implementation of bootstrapping and estimating \(x^*\) from a single subsample set.
get_xstar_boot <- function(subsample)
{
  boot_data <-
    sample(
      seq_len(nrow(subsample)),
      N,
      replace = TRUE
    ) %>%
    subsample[., ]

  reg <- lm(y ~ x + I(x^2), data = boot_data)

  x_star <- (2 - 5 * reg$coef["x"]) / (10 * reg$coef["I(x^2)"])

  return(x_star)
}
This is a single iteration.
get_xstar_boot(subsamples[[1]])
x
1.400198
Now, let’s repeat this 100 (\(M = 100\)) times.
M <- 100

xstar_hats <-
  lapply(
    1:M,
    function(x) get_xstar_boot(subsamples[[1]])
  ) %>%
  unlist()
We can now get the 95% confidence interval of \(\hat{x}^*\).
cf_lower <- quantile(xstar_hats, prob = 0.025)
cf_upper <- quantile(xstar_hats, prob = 0.975)
So, [1.36848876286897, 1.44320748221065] is the 95% confidence interval estimate from the first subsample set. We repeat this for the other subsample sets and average the 95% CIs from them (get_CI() defined on the side).
This function finds the 95% confidence interval (\(\eta\)) from a single subsample set.

get_CI <- function(subsample) {

  xstar_hats <-
    lapply(
      1:M,
      function(x) get_xstar_boot(subsample)
    ) %>%
    unlist()

  cf_lower <- quantile(xstar_hats, prob = 0.025)
  cf_upper <- quantile(xstar_hats, prob = 0.975)

  return_data <-
    data.table(
      cf_lower = cf_lower,
      cf_upper = cf_upper
    )

  return(return_data)
}
(
ci_data <-
  lapply(
    subsamples,
    function(x) get_CI(x)
  ) %>%
  rbindlist()
)
cf_lower cf_upper
1: 1.374328 1.443085
2: 1.368850 1.443649
3: 1.365456 1.444492
4: 1.368362 1.440309
5: 1.362626 1.445573
Step 3:
By averaging the lower and upper bounds, we get our estimate of the lower and upper bound of the 95% CI of \(\hat{x}^*\).
ci_data[, .(mean(cf_lower), mean(cf_upper))]
V1 V2
1: 1.367924 1.443422
Note that the estimated CI is very similar to the one from the regular bootstrap obtained in Section 5.3.
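If x_stars from Section 5.3 is still in memory, the two intervals can be placed side by side for a quick comparison (this check is not part of the BLB procedure itself):

#=== BLB CI (averaged bounds) vs. the regular bootstrap percentile CI from Section 5.3 ===#
ci_data[, .(mean(cf_lower), mean(cf_upper))]
quantile(x_stars, prob = c(0.025, 0.975))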