class: center, middle, inverse, title-slide .title[ # Standard Error Estimation ] .author[ ### AECN 396/896-002 ] --- class: middle <style type="text/css"> @media print { .has-continuation { display: block !important; } } .remark-slide-content.hljs-github h1 { margin-top: 5px; margin-bottom: 25px; } .remark-slide-content.hljs-github { padding-top: 10px; padding-left: 30px; padding-right: 30px; } .panel-tabs { <!-- color: #062A00; --> color: #841F27; margin-top: 0px; margin-bottom: 0px; margin-left: 0px; padding-bottom: 0px; } .panel-tab { margin-top: 0px; margin-bottom: 0px; margin-left: 3px; margin-right: 3px; padding-top: 0px; padding-bottom: 0px; } .panelset .panel-tabs .panel-tab { min-height: 40px; } .remark-slide th { border-bottom: 1px solid #ddd; } .remark-slide thead { border-bottom: 0px; } .gt_footnote { padding: 2px; } .remark-slide table { border-collapse: collapse; } .remark-slide tbody { border-bottom: 2px solid #666; } .important { background-color: lightpink; border: 2px solid blue; font-weight: bold; } .remark-code { display: block; overflow-x: auto; padding: .5em; background: #ffe7e7; } .hljs-github .hljs { background: #f2f2fd; } .remark-inline-code { padding-top: 0px; padding-bottom: 0px; background-color: #e6e6e6; } .r.hljs.remark-code.remark-inline-code{ font-size: 0.9em } .left-full { width: 80%; height: 92%; float: left; } .left-code { width: 38%; height: 92%; float: left; } .right-plot { width: 60%; float: right; padding-left: 1%; } .left5 { width: 49%; height: 92%; float: left; } .right5 { width: 49%; float: right; padding-left: 1%; } .left3 { width: 29%; height: 92%; float: left; } .right7 { width: 69%; float: right; padding-left: 1%; } .left4 { width: 38%; height: 92%; float: left; } .right6 { width: 60%; float: right; padding-left: 1%; } ul li{ margin: 7px; } ul, li{ margin-left: 15px; padding-left: 0px; } ol li{ margin: 7px; } ol, li{ margin-left: 15px; padding-left: 0px; } </style> <style type="text/css"> .content-box { box-sizing: 
border-box; background-color: #e2e2e2; } .content-box-blue, .content-box-gray, .content-box-grey, .content-box-army, .content-box-green, .content-box-purple, .content-box-red, .content-box-yellow { box-sizing: border-box; border-radius: 5px; margin: 0 0 10px; overflow: hidden; padding: 0px 5px 0px 5px; width: 100%; } .content-box-blue { background-color: #F0F8FF; } .content-box-gray { background-color: #e2e2e2; } .content-box-grey { background-color: #F5F5F5; } .content-box-army { background-color: #737a36; } .content-box-green { background-color: #d9edc2; } .content-box-purple { background-color: #e2e2f9; } .content-box-red { background-color: #ffcccc; } .content-box-yellow { background-color: #fef5c4; } .content-box-blue .remark-inline-code, .content-box-blue .remark-inline-code, .content-box-gray .remark-inline-code, .content-box-grey .remark-inline-code, .content-box-army .remark-inline-code, .content-box-green .remark-inline-code, .content-box-purple .remark-inline-code, .content-box-red .remark-inline-code, .content-box-yellow .remark-inline-code { background: none; } .full-width { display: flex; width: 100%; flex: 1 1 auto; } </style> <style type="text/css"> blockquote, .blockquote { display: block; margin-top: 0.1em; margin-bottom: 0.2em; margin-left: 5px; margin-right: 5px; border-left: solid 10px #0148A4; border-top: solid 2px #0148A4; border-bottom: solid 2px #0148A4; border-right: solid 2px #0148A4; box-shadow: 0 0 6px rgba(0,0,0,0.5); /* background-color: #e64626; */ color: #e64626; padding: 0.5em; -moz-border-radius: 5px; -webkit-border-radius: 5px; } .blockquote p { margin-top: 0px; margin-bottom: 5px; } .blockquote > h1:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h2:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h3:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h4:first-of-type { margin-top: 0px; margin-bottom: 5px; } .text-shadow { text-shadow: 0 0 4px #424242; } </style> <style 
type="text/css"> /****************** * Slide scrolling * (non-functional) * not sure if it is a good idea anyway slides > slide { overflow: scroll; padding: 5px 40px; } .scrollable-slide .remark-slide { height: 400px; overflow: scroll !important; } ******************/ .scroll-box-8 { height:8em; overflow-y: scroll; } .scroll-box-10 { height:10em; overflow-y: scroll; } .scroll-box-12 { height:12em; overflow-y: scroll; } .scroll-box-14 { height:14em; overflow-y: scroll; } .scroll-box-16 { height:16em; overflow-y: scroll; } .scroll-box-18 { height:18em; overflow-y: scroll; } .scroll-box-20 { height:20em; overflow-y: scroll; } .scroll-box-24 { height:24em; overflow-y: scroll; } .scroll-box-30 { height:30em; overflow-y: scroll; } .scroll-output { height: 90%; overflow-y: scroll; } </style>

# Before we start ## Learning objectives Understand the consequences of violating the homoskedasticity assumption (and of correlated errors) and how to deal with these problems ## Table of contents 1. [Heteroskedasticity](#het) 2. [Clustered error](#cluster)

--- class: inverse, center, middle name: het # Heteroskedasticity <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html>

--- class: middle # Homoskedasticity and Heteroskedasticity .pull-left[ .content-box-red[**Homoskedasticity**] `$$Var(u|x) = \sigma^2$$` <br> .content-box-red[**Heteroskedasticity**] `$$Var(u|x) = f(x)$$` ]

--- class: middle # Visualization <img src="se_estimation_x_files/figure-html/unnamed-chunk-4-1.png" width="90%" style="display: block; margin: auto;" />

--- class: middle # Central Questions What are the consequences of assuming the error is homoskedastic when it is heteroskedastic in reality? + Estimation of coefficients `\((\hat{\beta}_j)\)`? + Estimation of the variance of `\(\hat{\beta}_j\)`?

--- class: middle # Coefficient estimators .content-box-green[**Question**] -- Are the OLS estimators unbiased when the error is heteroskedastic?
-- .content-box-green[**Answer**] -- Yes -- .content-box-green[**Why?**] -- We do not use the homoskedasticity assumption to prove that the OLS estimator is unbiased.

--- class: middle # Variance of the coefficient estimators We learned that when the homoskedasticity assumption holds, `\(Var(\hat{\beta}_j) = \frac{\sigma^2}{SST_x(1-R^2_j)}\)` -- We used the following as the estimator of `\(Var(\hat{\beta}_j)\)`: `\(\frac{\hat{\sigma}^2}{SST_x(1-R^2_j)}\)`, where `\(\hat{\sigma}^2 = \frac{\sum_{i=1}^{N} \hat{u}_i^2}{N-k-1}\)` -- <br> .content-box-red[**Important**]: By default, R and other statistical software use this formula to get estimates of the variance of `\(\hat{\beta}_j\)`. .content-box-green[**Note**]: Remember, we let `\(\widehat{Var(\hat{\beta_j})}\)` denote the estimator of the variance of `\(\hat{\beta}_j\)`.

--- class: middle # Variance of the coefficient estimators But, under heteroskedasticity, `\(Var(\hat{\beta}_j) \ne \frac{\sigma^2}{SST_x(1-R^2_j)}\)` -- .content-box-green[**Question**]: Is `\(E[\widehat{Var(\hat{\beta}_j)}]=E\Big[\frac{\hat{\sigma}^2}{SST_x(1-R^2_j)}\Big]=Var(\hat{\beta}_j)\)` under heteroskedasticity? -- .content-box-green[**Answer**]: No

--- class: middle # Variance of the coefficient estimators .content-box-green[**Question**]: So, what are the consequences of using `\(\widehat{Var(\hat{\beta}_j)}=\frac{\hat{\sigma}^2}{SST_x(1-R^2_j)}\)` under heteroskedasticity? .content-box-red[**Consequence**] Your hypothesis testing is going to be biased. What does it mean for hypothesis testing to be biased? Roughly speaking, it means that you reject (or fail to reject) the null hypothesis more often than you intend to.

--- class: middle # Consequence of heteroskedasticity on testing Let's run MC simulations to see the consequence of ignoring heteroskedasticity.
-- .content-box-green[**Model**] `\(y = 1 + \beta x + u\)`, where `\(\beta = 0\)` -- .content-box-red[**Test of interest**] + `\(H_0:\)` `\(\beta=0\)` + `\(H_1:\)` `\(\beta \ne 0\)` -- .content-box-green[**Question**] If you test the null hypothesis at the `\(5\%\)` significance level, what should be the probability that you reject the null hypothesis when it is actually true? `\(Pr(\mbox{reject} \;\; H_0|H_0 \;\; \mbox{is true})=?\)`

--- class: middle .content-box-green[**Question**] If you test the null hypothesis at the `\(5\%\)` significance level, what should be the probability that you reject the null hypothesis when it is actually true? <img src="se_estimation_x_files/figure-html/unnamed-chunk-5-1.png" width="40%" style="display: block; margin: auto;" /> -- .content-box-green[**Answer**] Since the null is true (we generate the data that way!), the probability you reject the null should be the same as the significance level, which is `\(5\%\)`.

--- class: middle # MC simulation: conceptual steps + generate a dataset so that `\(\beta_1\)` (the coefficient on `\(x\)`) is zero `$$y=\beta_0+\beta_1 x + u$$` + estimate the model and find `\(\hat{\beta}_1\)` and `\(\widehat{se(\hat{\beta}_1)}\)` + calculate the `\(t\)`-statistic `\(((\hat{\beta}_1-0)/\widehat{se(\hat{\beta}_1)})\)` and decide whether you reject the null or not + repeat the above 1000 times + check how often you reject the null (should be close to 50 times)

--- class: middle # MC simulation: R code ```r set.seed(927834) N <- 1000 # number of observations B <- 1000 # number of simulations b_hat_store <- rep(0, B) # beta hat storage t_stat_store <- rep(0, B) # t-stat storage c_value <- qt(0.975, N - 2) # critical value x <- runif(N, 0, 1) # x (fixed across iterations) for (i in 1:B){ #--- generate data ---# het_u <- 3 * rnorm(N, mean = 0, sd = 2 * x) # heteroskedastic error y <- 1 + het_u # y data_temp <- data.frame(y = y, x = x) #--- regression ---# ols_res <- lm(y ~ x, data = data_temp) b_hat <-
ols_res$coef['x'] # coef estimate on x b_hat_store[i] <- b_hat # save the coef estimate vcov_ols <- vcov(ols_res) # get the variance covariance matrix t_stat_store[i] <- b_hat / sqrt(vcov_ols['x', 'x']) # calculate t-stat } ```

--- class: middle # MC simulation: Results ```r #--- how many times do you reject? ---# reject_or_not <- abs(t_stat_store) > c_value mean(reject_or_not) ``` ``` ## [1] 0.108 ``` -- <br> .content-box-red[**Consequence of ignoring heteroskedasticity**] We rejected the null hypothesis 10.8% of the time, instead of `\(5\%\)`. -- + So, you are more likely to claim that `\(x\)` has a statistically significant impact than you are supposed to. -- + Here, the formula `\(\frac{\hat{\sigma}^2}{SST_x(1-R^2_j)}\)` under-estimated the true variance of the OLS estimators, which is why we over-rejected. -- + In general, though, the direction of the bias is ambiguous.

--- class: middle # How should we address this problem? + Now, we understand the consequence of heteroskedasticity: `\(\frac{\hat{\sigma}^2}{SST_x(1-R^2_j)}\)` is a biased estimator of `\(Var(\hat{\beta})\)`, which makes any testing based on it invalid. -- + Can we credibly estimate the variance of the OLS estimators?
-- <br> .content-box-red[**White-Huber-Eicker heteroskedasticity-robust standard error estimator**] -- + valid in the presence of heteroskedasticity of .red[unknown form] + heteroskedasticity-robust standard error estimator, in short

--- class: middle .content-box-red[**Heteroskedasticity-robust standard error estimator**] `\(\widehat{Var(\hat{\beta}_j)} = \frac{\sum_{i=1}^n \hat{r}^2_{i,j} \hat{u}^2_i}{SSR^2_j}\)` + `\(\hat{u}_i\)`: residual from regressing `\(y\)` on all the independent variables + `\(\hat{r}_{i,j}\)`: residual for the `\(i\)`th observation from regressing `\(x_j\)` on all other independent variables + `\(SSR_j\)`: the sum of squared residuals from regressing `\(x_j\)` on all other independent variables

--- class: middle We will spend .red[NO] time trying to understand what is going on inside the estimator. What you need is to + understand the consequence of heteroskedasticity + know there is an estimator that is appropriate under heteroskedasticity, meaning that it will give you the correct estimate of the variance of the OLS estimator + know how to use the heteroskedasticity-robust standard error estimator in practice using `\(R\)` (or some other software)

--- class: middle # In practice Here is the well-accepted procedure in econometric analysis: -- + Estimate the model using OLS (you do nothing special here) -- + Assume the error term is heteroskedastic and estimate the variance of the OLS estimators * There are tests of whether the error is heteroskedastic: the .red[Breusch-Pagan] test and the .red[White] test * In practice, almost nobody bothers to conduct these tests * We do not learn how to run these tests -- + Replace the estimates from `\(\widehat{Var(\hat{\beta})}_{default}\)` with those from `\(\widehat{Var(\hat{\beta})}_{robust}\)` for testing -- + But, we do not replace coefficient estimates (remember, coefficient estimation is still unbiased under heteroskedasticity)

--- class: middle # Implementation in R We use + the `fixest::se()` function from
the `fixest` package to estimate heteroskedasticity-robust standard errors + the `summary()` function to do tests of `\(\beta_j = 0\)` -- Let's run a regression using `MLB1.dta`. ```r #--- library ---# library(fixest) library(broom) library(wooldridge) #--- import the data ---# data("mlb1") #--- regression ---# reg_mlb <- feols(log(salary) ~ years + bavg, data = mlb1) ```

--- class: middle # Obtaining Heteroskedasticity-robust SE estimates .content-box-red[**General Syntax**] Here is the general syntax to obtain various types of VCOV (and se) estimates: ```r #* vcov vcov(regression result, vcov = "type of vcov") #* only the standard errors se(regression result, vcov = "type of vcov") ``` <br> .content-box-green[**heteroskedasticity-robust standard error estimation**] Specifically for White-Huber heteroskedasticity-robust VCOV and se estimates, ```r #* vcov vcov(reg_mlb, vcov = "hetero") ``` ``` ## (Intercept) years bavg ## (Intercept) 0.495103882 0.0058059916 -2.080065e-03 ## years 0.005805992 0.0003117152 -2.976110e-05 ## bavg -0.002080065 -0.0000297611 8.892009e-06 ``` ```r #* only the standard errors se(reg_mlb, vcov = "hetero") ``` ``` ## (Intercept) years bavg ## 0.703636186 0.017655458 0.002981947 ```

--- class: middle # Compare with the Default .content-box-red[**Default**] ```r ( se_hom <- se(reg_mlb) ) ``` ``` ## (Intercept) years bavg ## 0.343071420 0.013222511 0.001335294 ``` <br> .content-box-red[**Heteroskedasticity-robust**] ``` ## (Intercept) years bavg ## 0.703636186 0.017655458 0.002981947 ```

--- class: middle # Updating the test of coefficients being zero .content-box-red[**Default**] ```r tidy(reg_mlb) ``` ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 11.0 0.343 32.2 1.25e-106 ## 2 years 0.166 0.0132 12.6 3.16e- 30 ## 3 bavg 0.00539 0.00134 4.04 6.59e- 5 ``` <br> .content-box-red[**Heteroskedasticity-robust**] ```r tidy(reg_mlb, vcov = "hetero") ``` ``` ## # A tibble: 3 × 5 ## term estimate std.error
statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 11.0 0.704 15.7 2.13e-42 ## 2 years 0.166 0.0177 9.43 5.96e-19 ## 3 bavg 0.00539 0.00298 1.81 7.13e- 2 ```

--- class: middle # Reporting the regression results In presenting the regression results in a nicely formatted table, we used `modelsummary::msummary()`. We can easily swap the default se with the heteroskedasticity-robust se using the `statistic_override` option in `msummary()`. .left5[ ```r vcov_het <- vcov(reg_mlb, vcov = "hetero") vcov_homo <- vcov(reg_mlb) modelsummary::msummary( list(reg_mlb, reg_mlb), statistic_override = list(vcov_het, vcov_homo), # keep these options as they are stars = TRUE, gof_omit = "IC|Log|Adj|F|Pseudo|Within" ) ``` ] .right5[ <table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> (1) </th> <th style="text-align:center;"> (2) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 11.042*** </td> <td style="text-align:center;"> 11.042*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.704) </td> <td style="text-align:center;"> (0.343) </td> </tr> <tr> <td style="text-align:left;"> years </td> <td style="text-align:center;"> 0.166*** </td> <td style="text-align:center;"> 0.166*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.018) </td> <td style="text-align:center;"> (0.013) </td> </tr> <tr> <td style="text-align:left;"> bavg </td> <td style="text-align:center;"> 0.005+ </td> <td style="text-align:center;"> 0.005*** </td> </tr> <tr> <td style="text-align:left;box-shadow: 0px 1.5px"> </td> <td style="text-align:center;box-shadow: 0px 1.5px"> (0.003) </td> <td style="text-align:center;box-shadow: 0px 1.5px"> (0.001) </td> </tr> <tr> <td style="text-align:left;"> Num.Obs.
</td> <td style="text-align:center;"> 353 </td> <td style="text-align:center;"> 353 </td> </tr> <tr> <td style="text-align:left;"> R2 </td> <td style="text-align:center;"> 0.367 </td> <td style="text-align:center;"> 0.367 </td> </tr> <tr> <td style="text-align:left;"> RMSE </td> <td style="text-align:center;"> 0.94 </td> <td style="text-align:center;"> 0.94 </td> </tr> <tr> <td style="text-align:left;"> Std.Errors </td> <td style="text-align:center;"> Custom </td> <td style="text-align:center;"> Custom </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table> ] --- class: middle # R Demonstration Or, you could add the `vcov` option like below inside `feols()`. Then, you do not need `statistic_override` option to override the default VCOV estimates. ```r reg_mlb <- feols(log(salary) ~ years + bavg, vcov = "hetero", data = mlb1) tidy(reg_mlb) ``` ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 11.0 0.704 15.7 2.13e-42 ## 2 years 0.166 0.0177 9.43 5.96e-19 ## 3 bavg 0.00539 0.00298 1.81 7.13e- 2 ``` .left5[ ```r modelsummary::msummary( list(reg_mlb), # keep these options as they are stars = TRUE, gof_omit = "IC|Log|Adj|F|Pseudo|Within" ) ``` ] .right5[ <table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> (1) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 11.042*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.704) </td> </tr> <tr> <td style="text-align:left;"> years </td> <td style="text-align:center;"> 0.166*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.018) </td> </tr> <tr> <td style="text-align:left;"> 
bavg </td> <td style="text-align:center;"> 0.005+ </td> </tr> <tr> <td style="text-align:left;box-shadow: 0px 1.5px"> </td> <td style="text-align:center;box-shadow: 0px 1.5px"> (0.003) </td> </tr> <tr> <td style="text-align:left;"> Num.Obs. </td> <td style="text-align:center;"> 353 </td> </tr> <tr> <td style="text-align:left;"> R2 </td> <td style="text-align:center;"> 0.367 </td> </tr> <tr> <td style="text-align:left;"> RMSE </td> <td style="text-align:center;"> 0.94 </td> </tr> <tr> <td style="text-align:left;"> Std.Errors </td> <td style="text-align:center;"> Heteroskedasticity-robust </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table> ]

--- class: middle # Het-robust SE estimator: validation Does the heteroskedasticity-robust se estimator really work? Let's see using MC simulations: ```r set.seed(478954) #--- x fixed across iterations ---# x <- runif(N, 0, 1) # x for (i in 1:B){ #--- generate data ---# het_u <- 3 * rnorm(N, mean = 0, sd = 2 * x) # heteroskedastic error y <- 1 + het_u # y data_temp <- data.frame(y = y, x = x) #--- regression ---# ols_res <- feols(y ~ x, data = data_temp) b_hat <- ols_res$coefficient['x'] # coef estimate on x b_hat_store[i] <- b_hat # save the coef estimate * se_het <- se(ols_res, vcov = "hetero")["x"] # get the het-robust se t_stat_store[i] <- b_hat/se_het # calculate t-stat } ```

--- class: middle # MC simulation results ```r reject_or_not <- abs(t_stat_store) > c_value mean(reject_or_not) ``` ``` ## [1] 0.053 ``` Okay, not perfect. But, certainly better.
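
--- class: middle # Computing the robust estimator by hand (optional)

To connect the `se(ols_res, vcov = "hetero")` call back to the formula from earlier, here is a minimal sketch (not part of the lecture code; the data-generating setup is made up for illustration) that computes the White/Eicker-Huber (HC0) variance of the slope by hand for a simple regression. `fixest` applies a small-sample adjustment, so the two numbers are close but not identical.

```r
library(fixest)

#--- generate data with a heteroskedastic error (true slope is 0) ---#
set.seed(123)
n <- 500
x <- runif(n)
y <- 1 + rnorm(n, sd = 2 * x)

reg <- feols(y ~ x, data = data.frame(y = y, x = x))

u_hat <- resid(reg)  # residuals from regressing y on x
r_hat <- x - mean(x) # residuals from regressing x on the remaining regressor (the intercept)
SSR_x <- sum(r_hat^2)

#--- White (HC0) variance of the slope: sum(r^2 * u^2) / SSR^2 ---#
se_by_hand <- sqrt(sum(r_hat^2 * u_hat^2) / SSR_x^2)

se_by_hand
se(reg, vcov = "hetero")["x"] # close, but with a finite-sample adjustment
```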
--- class: inverse, center, middle name: cluster # Clustered Error <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html>

--- class: middle .content-box-red[**Clustered Error**] + Oftentimes, observations can be grouped into clusters + Errors within the cluster can be correlated

--- class: middle .content-box-red[**Example 1**] College GPA: cluster by college `\(GPA_{col} = \beta_0 + \beta_1 income + \beta_2 GPA_{hs} + u\)` + Your observations consist of students' GPA scores across many colleges + Because of some unobserved (omitted) school characteristics, error terms for the individuals in the same college might be correlated. * grading policy

--- class: middle .content-box-red[**Example 2**] Education Impacts on Income: cluster by individual + Your observations consist of 500 individuals, each tracked over 10 years + Because of some unobserved (omitted) individual characteristics, error terms for time-series observations within an individual might be correlated. * innate ability

--- class: middle # Consequences of clustered error .content-box-green[**Question**] Are the OLS estimators of the coefficients biased in the presence of clustered error? -- <br> .content-box-green[**Answer**] -- No. Correlation between `\(x\)` and `\(u\)` would hurt you, but correlation among the `\(u\)`s does not cause bias.

--- class: middle # Consequences of clustered error .content-box-green[**Question**] Are `\(\widehat{Var(\hat{\beta})}_{default}\)` unbiased estimators of `\(Var(\hat{\beta})\)`? -- <br> .content-box-green[**Answer**] -- No. `\(\widehat{Var(\hat{\beta})}_{default}\)` is unbiased only when the errors are homoskedastic and uncorrelated with each other, and clustered errors violate the no-correlation part.

--- class: middle # Consequences of clustered error .content-box-green[**Question**] Which has more information?
+ two errors that are independent + two errors that are correlated <br> .content-box-red[**Consequences**] + If you were to use `\(\widehat{Var(\hat{\beta})}_{default}\)` to estimate `\(Var(\hat{\beta})\)` in the presence of clustered error, you would (under/over)-estimate the true `\(Var(\hat{\beta})\)`. + This would lead to rejecting the null hypothesis (more/less) often than you are supposed to.

--- class: middle # MC simulations: conceptual steps Here are the conceptual steps of the MC simulations to see the consequence of clustered error. + generate data according to a generating process in which the error terms `\((u)\)` within a cluster (50 clusters in this example) are correlated and `\(\beta_1\)` is set to 0 in the model below: $$ `\begin{aligned} y = \beta_0 + \beta_1 x + u \end{aligned}` $$ + estimate the model and find `\(\hat{\beta}_1\)` and `\(\widehat{se(\hat{\beta}_1)}\)` + calculate the `\(t\)`-statistic `\((\hat{\beta}_1/\widehat{se(\hat{\beta}_1)})\)` for the (correct) null hypothesis of `\(\beta_1 = 0\)` + repeat steps 1-3 1000 times + see how many times out of 1000 you reject the null hypothesis `\(H_0:\)` `\(\beta_1=0\)`

--- class: middle # R code: Data Generating Process ```r #--- setup ---# library(MASS) # to use the mvrnorm() function later library(dplyr) # to use the pipe operator (%>%) N <- 2000 # total number of observations G <- 50 # number of groups Ng <- N/G # number of observations per group #--- error correlated within group ---# u <- mvrnorm( G, mu = rep(0, Ng), Sigma = matrix(10, nrow = Ng, ncol = Ng) + diag(Ng) ) %>% t() %>% c() #--- x correlated within group ---# x <- mvrnorm( G, mu = rep(0, Ng), Sigma = matrix(1, nrow = Ng, ncol = Ng) + diag(Ng) * .2 ) %>% t() %>% c() #--- other variables ---# y <- 1 + 0 * x + u #--- data.frame ---# data <- data.frame(y = y, x = x, group = rep(1:G, each = Ng)) ```

--- class: middle # Visualization of the clustered nature of the data <img src="se_estimation_x_files/figure-html/unnamed-chunk-23-1.png" width="60%" style="display: block;
margin: auto;" />

--- class: middle # R code: MC simulation ```r set.seed(58934) B <- 1000 t_stat_store <- rep(0, B) N <- 2000 # total number of observations G <- 50 # number of groups Ng <- N/G # number of observations per group for (i in 1:B){ #--- error correlated within group ---# u <- mvrnorm( G, mu = rep(0, Ng), Sigma = matrix(10, nrow = Ng, ncol = Ng) + diag(Ng) ) %>% t() %>% c() #--- x correlated within group ---# x <- mvrnorm( G, mu = rep(0, Ng), Sigma = matrix(1, nrow = Ng, ncol = Ng) + diag(Ng) * .2 ) %>% t() %>% c() #--- other variables ---# y <- 1 + 0 * x + u #--- data.frame ---# data <- data.frame(y = y, x = x, group = rep(1:G, each = Ng)) #--- OLS ---# reg <- feols(y ~ x, data = data) #--- get the default se ---# se_default <- se(reg)["x"] #--- calculate t-stat ---# t_stat <- reg$coefficient['x']/se_default t_stat_store[i] <- t_stat } ```

--- class: middle # MC simulations: results ```r c_value <- qt(0.975, N - 2) #--- how often do you reject the null ---# mean(abs(t_stat_store) > c_value) ``` ``` ## [1] 0.745 ``` -- <br> .content-box-red[**Important**]: + clustered error can severely bias your test results + it tends to make explanatory variables appear more significant than they truly are because the default estimator of the variance of the OLS estimator tends to greatly <span style = "color: red;"> under</span>-estimate the true variance of the OLS estimator.

--- class: middle # What to do? .content-box-red[**Cluster-robust standard error estimation**] There exist estimators of `\(Var(\hat{\beta})\)` that take into account the possibility that errors are clustered. + We call such estimators .blue[cluster-robust variance covariance estimators], denoted as `\((\widehat{Var(\hat{\beta})}_{cl})\)` + We call estimates from such estimators .blue[cluster-robust variance estimates]

--- class: middle # Cluster-robust standard error estimation I neither derive nor show the mathematical expressions of these estimators.
.content-box-red[**This is what you need to do**] + understand the consequence of clustered errors + know there are estimators that are appropriate under clustered error + know that the estimators we will learn take care of heteroskedasticity at the same time (so, they really are cluster- and heteroskedasticity-robust standard error estimators) + know how to use the estimators in `\(R\)` (or some other software)

--- class: middle # R implementation .content-box-red[**Cluster-robust standard error**] Similar to the `vcov` option for the White-Huber heteroskedasticity-robust se, we can use the `cluster` option to get the cluster-robust se.

--- class: middle # Before an R demonstration Let's take a look at the MLB data again. ```r dplyr::select(mlb1, salary, years, bavg, nl) %>% head() ``` ``` ## salary years bavg nl ## 1 6329213 12 289 1 ## 2 3375000 8 259 1 ## 3 3100000 5 299 1 ## 4 2900000 8 245 1 ## 5 1650000 12 258 1 ## 6 700000 17 286 1 ``` `nl` (1 if in the National League, 0 if in the American League) is the group variable we cluster around.

--- class: middle # R Demonstration .content-box-red[**Step 1**] Run a regression ```r reg_mlb <- feols(log(salary) ~ years + bavg, data = mlb1) ``` -- <br> .content-box-red[**Step 2**] Apply `vcov()` or `se()` with the `cluster =` option. ```r #* vcov clustered by nl vcov(reg_mlb, cluster = ~ nl) #* se clustered by nl se(reg_mlb, cluster = ~ nl) ```

--- class: middle # Compare .content-box-red[**Default**] ```r se(reg_mlb) ``` ``` ## (Intercept) years bavg ## 0.343071420 0.013222511 0.001335294 ``` <br> .content-box-red[**Cluster-robust standard error**] ```r se(reg_mlb, cluster = ~ nl) ``` ``` ## (Intercept) years bavg ## 0.2495162012 0.0125500623 0.0007155029 ```

--- class: middle # R Demonstration Or, you could add the `cluster` option like below inside `feols()`.
```r reg_mlb <- feols(log(salary) ~ years + bavg, cluster = ~ nl, data = mlb1) tidy(reg_mlb) ``` ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 11.0 0.250 44.3 0.0144 ## 2 years 0.166 0.0126 13.3 0.0479 ## 3 bavg 0.00539 0.000716 7.54 0.0840 ```

--- class: middle # In practice Just like the heteroskedasticity-present case before, + Estimate the model using OLS (you do nothing special here) + Assume the error term is clustered and/or heteroskedastic, and estimate the variance of the OLS estimators `\((Var(\hat{\beta}))\)` using cluster-robust standard error estimators + Replace the estimates from `\(\widehat{Var(\hat{\beta})}_{default}\)` with those from `\(\widehat{Var(\hat{\beta})}_{cl}\)` for testing + But, we do not replace coefficient estimates.

--- class: middle # But does it really work? Let's run MC simulations to see if the use of the cluster-robust standard error estimation method works.

--- class: middle # MC simulation results: R code ```r set.seed(58934) B <- 1000 t_stat_store <- rep(0, B) N <- 2000 # total number of observations G <- 50 # number of groups Ng <- N/G # number of observations per group for (i in 1:B){ #--- error correlated within group ---# u <- mvrnorm( G, mu = rep(0, Ng), Sigma = matrix(10, nrow = Ng, ncol = Ng) + diag(Ng) ) %>% t() %>% c() #--- x correlated within group ---# x <- mvrnorm( G, mu = rep(0, Ng), Sigma = matrix(1, nrow = Ng, ncol = Ng) + diag(Ng) * .2 ) %>% t() %>% c() #--- other variables ---# y <- 1 + 0 * x + u #--- data.frame ---# data <- data.frame(y = y, x = x, group = rep(1:G, each = Ng)) #--- OLS with cluster-robust se---# * reg <- feols(y ~ x, data = data, cluster = ~ group) #--- get the cluster-robust se ---# se_cl <- se(reg)["x"] #--- calculate t-stat ---# t_stat <- reg$coefficient['x']/se_cl t_stat_store[i] <- t_stat } ```

--- class: middle # MC simulation results ```r #--- critical value ---# c_value <- qt(0.95, N-2) #--- how often do you reject the
null ---# mean(abs(t_stat_store) > c_value) ``` ``` ## [1] 0.15 ``` -- Well, we are still rejecting more often than we should, but it is much better than the default VCOV estimator, which rejected 74.5% of the time. -- <br> .content-box-red[**Important**] + Cluster-robust standard error estimation gets better as the number of groups gets larger + Two groups (the MLB case) is far too small + As a rule of thumb, a \# of groups larger than 50 is sufficiently large, but we just saw that we still over-rejected the null of `\(\beta = 0\)` three times as often as we should.

--- class: middle ```r set.seed(58934) B <- 1000 t_stat_store <- rep(0, B) N <- 20000 # total number of observations *G <- 1000 # number of groups Ng <- N/G # number of observations per group for (i in 1:B){ #--- error correlated within group ---# u <- mvrnorm( G, mu = rep(0, Ng), Sigma = matrix(10, nrow = Ng, ncol = Ng) + diag(Ng) ) %>% t() %>% c() #--- x correlated within group ---# x <- mvrnorm( G, mu = rep(0, Ng), Sigma = matrix(1, nrow = Ng, ncol = Ng) + diag(Ng) * .2 ) %>% t() %>% c() #--- other variables ---# y <- 1 + 0 * x + u #--- data.frame ---# data <- data.frame(y = y, x = x, group = rep(1:G, each = Ng)) #--- OLS with cluster-robust se---# reg <- feols(y ~ x, data = data, cluster = ~ group) #--- get the cluster-robust se ---# se_cl <- se(reg)["x"] #--- calculate t-stat ---# t_stat <- reg$coefficient['x']/se_cl t_stat_store[i] <- t_stat } ```

--- class: middle # MC simulation results ```r #--- critical value ---# c_value <- qt(0.95, N-2) #--- how often do you reject the null ---# mean(abs(t_stat_store) > c_value) ``` ``` ## [1] 0.09 ``` Better. But, we are still over-rejecting. Don't forget it is certainly better than using the default!
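
--- class: middle # Recap: the three SE estimators side by side

As a recap, here is a quick way (not shown earlier in the lecture) to put the default, heteroskedasticity-robust, and cluster-robust standard errors for the same `mlb1` regression next to each other. Remember that with only two clusters (`nl`), the cluster-robust estimates should not be trusted.

```r
library(fixest)
library(wooldridge)
data("mlb1")

reg_mlb <- feols(log(salary) ~ years + bavg, data = mlb1)

#--- default (iid), het-robust, and cluster-robust se for the same coefficients ---#
rbind(
  default = se(reg_mlb),
  hetero  = se(reg_mlb, vcov = "hetero"),
  cluster = se(reg_mlb, cluster = ~ nl)
)
```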