class: center, middle, inverse, title-slide .title[ # Standard Error Estimation ] .author[ ### AECN 396/896-002 ] --- class: middle <style type="text/css"> @media print { .has-continuation { display: block !important; } } .remark-slide-content.hljs-github h1 { margin-top: 5px; margin-bottom: 25px; } .remark-slide-content.hljs-github { padding-top: 10px; padding-left: 30px; padding-right: 30px; } .panel-tabs { <!-- color: #062A00; --> color: #841F27; margin-top: 0px; margin-bottom: 0px; margin-left: 0px; padding-bottom: 0px; } .panel-tab { margin-top: 0px; margin-bottom: 0px; margin-left: 3px; margin-right: 3px; padding-top: 0px; padding-bottom: 0px; } .panelset .panel-tabs .panel-tab { min-height: 40px; } .remark-slide th { border-bottom: 1px solid #ddd; } .remark-slide thead { border-bottom: 0px; } .gt_footnote { padding: 2px; } .remark-slide table { border-collapse: collapse; } .remark-slide tbody { border-bottom: 2px solid #666; } .important { background-color: lightpink; border: 2px solid blue; font-weight: bold; } .remark-code { display: block; overflow-x: auto; padding: .5em; background: #ffe7e7; } .hljs-github .hljs { background: #f2f2fd; } .remark-inline-code { padding-top: 0px; padding-bottom: 0px; background-color: #e6e6e6; } .r.hljs.remark-code.remark-inline-code{ font-size: 0.9em } .left-full { width: 80%; height: 92%; float: left; } .left-code { width: 38%; height: 92%; float: left; } .right-plot { width: 60%; float: right; padding-left: 1%; } .left5 { width: 49%; height: 92%; float: left; } .right5 { width: 49%; float: right; padding-left: 1%; } .left3 { width: 29%; height: 92%; float: left; } .right7 { width: 69%; float: right; padding-left: 1%; } .left4 { width: 38%; height: 92%; float: left; } .right6 { width: 60%; float: right; padding-left: 1%; } ul li{ margin: 7px; } ul, li{ margin-left: 15px; padding-left: 0px; } ol li{ margin: 7px; } ol, li{ margin-left: 15px; padding-left: 0px; } </style> <style type="text/css"> .content-box { box-sizing: 
border-box; background-color: #e2e2e2; } .content-box-blue, .content-box-gray, .content-box-grey, .content-box-army, .content-box-green, .content-box-purple, .content-box-red, .content-box-yellow { box-sizing: border-box; border-radius: 5px; margin: 0 0 10px; overflow: hidden; padding: 0px 5px 0px 5px; width: 100%; } .content-box-blue { background-color: #F0F8FF; } .content-box-gray { background-color: #e2e2e2; } .content-box-grey { background-color: #F5F5F5; } .content-box-army { background-color: #737a36; } .content-box-green { background-color: #d9edc2; } .content-box-purple { background-color: #e2e2f9; } .content-box-red { background-color: #ffcccc; } .content-box-yellow { background-color: #fef5c4; } .content-box-blue .remark-inline-code, .content-box-blue .remark-inline-code, .content-box-gray .remark-inline-code, .content-box-grey .remark-inline-code, .content-box-army .remark-inline-code, .content-box-green .remark-inline-code, .content-box-purple .remark-inline-code, .content-box-red .remark-inline-code, .content-box-yellow .remark-inline-code { background: none; } .full-width { display: flex; width: 100%; flex: 1 1 auto; } </style> <style type="text/css"> blockquote, .blockquote { display: block; margin-top: 0.1em; margin-bottom: 0.2em; margin-left: 5px; margin-right: 5px; border-left: solid 10px #0148A4; border-top: solid 2px #0148A4; border-bottom: solid 2px #0148A4; border-right: solid 2px #0148A4; box-shadow: 0 0 6px rgba(0,0,0,0.5); /* background-color: #e64626; */ color: #e64626; padding: 0.5em; -moz-border-radius: 5px; -webkit-border-radius: 5px; } .blockquote p { margin-top: 0px; margin-bottom: 5px; } .blockquote > h1:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h2:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h3:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h4:first-of-type { margin-top: 0px; margin-bottom: 5px; } .text-shadow { text-shadow: 0 0 4px #424242; } </style> <style 
type="text/css"> /****************** * Slide scrolling * (non-functional) * not sure if it is a good idea anyway slides > slide { overflow: scroll; padding: 5px 40px; } .scrollable-slide .remark-slide { height: 400px; overflow: scroll !important; } ******************/ .scroll-box-8 { height:8em; overflow-y: scroll; } .scroll-box-10 { height:10em; overflow-y: scroll; } .scroll-box-12 { height:12em; overflow-y: scroll; } .scroll-box-14 { height:14em; overflow-y: scroll; } .scroll-box-16 { height:16em; overflow-y: scroll; } .scroll-box-18 { height:18em; overflow-y: scroll; } .scroll-box-20 { height:20em; overflow-y: scroll; } .scroll-box-24 { height:24em; overflow-y: scroll; } .scroll-box-30 { height:30em; overflow-y: scroll; } .scroll-output { height: 90%; overflow-y: scroll; } </style>

# Before we start ## Learning objectives Understand the consequences of violating the homoskedasticity assumption (and of correlated errors) and how to deal with these problems ## Table of contents 1. [Heteroskedasticity](#het) 2. [Clustered error](#cluster)

--- class: inverse, center, middle name: het # Heteroskedasticity <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html>

--- class: middle # Homoskedasticity and Heteroskedasticity .pull-left[ .content-box-red[**Homoskedasticity**] `$$Var(u|x) = \sigma^2$$` <br> .content-box-red[**Heteroskedasticity**] `$$Var(u|x) = f(x)$$` ]

--- class: middle # Visualization <img src="se_estimation_x_files/figure-html/unnamed-chunk-4-1.png" width="90%" style="display: block; margin: auto;" />

--- class: middle # Central Questions What are the consequences of assuming the error is homoskedastic when it is heteroskedastic in reality? + Estimation of coefficients `\((\hat{\beta}_j)\)`? + Estimation of the variance of `\(\hat{\beta}_j\)`?

--- class: middle # Coefficient estimators .content-box-green[**Question**] -- Are the OLS estimators unbiased when the error is heteroskedastic?
-- .content-box-green[**Answer**] -- Yes -- .content-box-green[**Why?**] -- We do not use the homoskedasticity assumption to prove that the OLS estimator is unbiased.

--- class: middle # Variance of the coefficient estimators We learned that when the homoskedasticity assumption holds, `\(Var(\hat{\beta}_j) = \frac{\sigma^2}{SST_x(1-R^2_j)}\)` -- We used the following as the estimator of `\(Var(\hat{\beta}_j)\)`: `\(\frac{\hat{\sigma}^2}{SST_x(1-R^2_j)}\)`, where `\(\hat{\sigma}^2 = \frac{\sum_{i=1}^{N} \hat{u}_i^2}{N-k-1}\)` -- <br> .content-box-red[**Important**]: By default, R and other statistical software use this formula to get estimates of the variance of `\(\hat{\beta}_j\)`. .content-box-green[**Note**]: Remember, we let `\(\widehat{Var(\hat{\beta_j})}\)` denote the estimator of the variance of `\(\hat{\beta}_j\)`.

--- class: middle # Variance of the coefficient estimators But, under heteroskedasticity, `\(Var(\hat{\beta}_j) \ne \frac{\sigma^2}{SST_x(1-R^2_j)}\)` -- .content-box-green[**Question**]: Is `\(E[\widehat{Var(\hat{\beta}_j)}]=E\Big[\frac{\hat{\sigma}^2}{SST_x(1-R^2_j)}\Big]=Var(\hat{\beta}_j)\)` under heteroskedasticity? -- .content-box-green[**Answer**]: No

--- class: middle # Variance of the coefficient estimators .content-box-green[**Question**]: So, what are the consequences of using `\(\widehat{Var(\hat{\beta}_j)}=\frac{\hat{\sigma}^2}{SST_x(1-R^2_j)}\)` under heteroskedasticity? .content-box-red[**Consequence**] Your hypothesis testing is going to be biased. What does it mean for hypothesis testing to be biased? Roughly speaking, it means that you reject (or fail to reject) the null hypothesis more often than you intend to.

--- class: middle # Consequence of heteroskedasticity on testing Let's run MC simulations to see the consequence of ignoring heteroskedasticity.
-- .content-box-green[**Model**] `\(y = 1 + \beta x + u\)`, where `\(\beta = 0\)` -- .content-box-red[**Test of interest**] + `\(H_0:\)` `\(\beta=0\)` + `\(H_1:\)` `\(\beta \ne 0\)` -- .content-box-green[**Question**] If you test the null hypothesis at the `\(5\%\)` significance level, what should be the probability that you reject the null hypothesis when it is actually true? `\(Pr(\mbox{reject} \;\; H_0|H_0 \;\; \mbox{is true})=?\)`

--- class: middle .content-box-green[**Question**] If you test the null hypothesis at the `\(5\%\)` significance level, what should be the probability that you reject the null hypothesis when it is actually true? <img src="se_estimation_x_files/figure-html/unnamed-chunk-5-1.png" width="40%" style="display: block; margin: auto;" /> -- .content-box-green[**Answer**] Since the null is true (we generate the data that way!), the probability you reject the null should be the same as the significance level, which is `\(5\%\)`.

--- class: middle # MC simulation: conceptual steps + generate a dataset so that `\(\beta_1\)` (the coefficient on `\(x\)`) is zero `$$y=\beta_0+\beta_1 x + u$$` + estimate the model and find `\(\hat{\beta}_1\)` and `\(\widehat{se(\hat{\beta}_1)}\)` + calculate the `\(t\)`-statistic `\(((\hat{\beta}_1-0)/\widehat{se(\hat{\beta}_1)})\)` and decide whether you reject the null or not + repeat the above 1000 times + check how often you reject the null (should be close to 50 times)

--- class: middle # MC simulation: R code ```r set.seed(927834) N <- 1000 # number of observations B <- 1000 # number of simulations b_hat_store <- rep(0, B) # beta hat storage t_stat_store <- rep(0, B) # t-stat storage c_value <- qt(0.975, N - 2) # critical value x <- runif(N, 0, 1) # x (fixed across iterations) for (i in 1:B){ #--- generate data ---# het_u <- 3 * rnorm(N, mean = 0, sd = 2 * x) # heteroskedastic error y <- 1 + het_u # y data_temp <- data.frame(y = y, x = x) #--- regression ---# ols_res <- lm(y ~ x, data = data_temp) b_hat <-
ols_res$coef['x'] # coef estimate on x b_hat_store[i] <- b_hat # save the coef estimate vcov_ols <- vcov(ols_res) # get the variance covariance matrix t_stat_store[i] <- b_hat / sqrt(vcov_ols['x', 'x']) # calculate t-stat } ```

--- class: middle # MC simulation: Results ```r #--- how many times do you reject? ---# reject_or_not <- abs(t_stat_store) > c_value mean(reject_or_not) ``` ``` ## [1] 0.108 ``` -- <br> .content-box-red[**Consequence of ignoring heteroskedasticity**] We rejected the null hypothesis 10.8% of the time, instead of `\(5\%\)`. -- + So, you are more likely to claim that `\(x\)` has a statistically significant impact than you are supposed to. -- + Here, the formula `\(\frac{\hat{\sigma}^2}{SST_x(1-R^2_j)}\)` under-estimated the true variance of the OLS estimators, which is why we over-rejected. -- + In general, though, the direction of the bias is ambiguous.

--- class: middle # How should we address this problem? + Now, we understand the consequence of heteroskedasticity: `\(\frac{\hat{\sigma}^2}{SST_x(1-R^2_j)}\)` is a biased estimator of `\(Var(\hat{\beta})\)`, which makes any testing based on it invalid. -- + Can we credibly estimate the variance of the OLS estimators?
-- <br> .content-box-red[**White-Huber-Eicker heteroskedasticity-robust standard error estimator**] -- + valid in the presence of heteroskedasticity of .red[unknown form] + heteroskedasticity-robust standard error estimator, in short

--- class: middle .content-box-red[**Heteroskedasticity-robust standard error estimator**] `\(\widehat{Var(\hat{\beta}_j)} = \frac{\sum_{i=1}^n \hat{r}^2_{i,j} \hat{u}^2_i}{SSR^2_j}\)` + `\(\hat{u}_i\)`: residual from regressing `\(y\)` on all the independent variables + `\(\hat{r}_{i,j}\)`: residual for the `\(i\)`th observation from regressing `\(x_j\)` on all other independent variables + `\(SSR_j\)`: the sum of squared residuals from regressing `\(x_j\)` on all other independent variables

--- class: middle We will spend .red[NO] time trying to understand what is going on inside the estimator. What you need is to + understand the consequence of heteroskedasticity + know there is an estimator that is appropriate under heteroskedasticity, meaning that it will give you the correct estimate of the variance of the OLS estimator + know how to use the heteroskedasticity-robust standard error estimator in practice using `\(R\)` (or some other software)

--- class: middle # In practice Here is the well-accepted procedure in econometric analysis: -- + Estimate the model using OLS (you do nothing special here) -- + Assume the error term is heteroskedastic and estimate the variance of the OLS estimators * There are tests of whether the error is heteroskedastic: the .red[Breusch-Pagan] test and the .red[White] test * In practice, almost nobody bothers to conduct these tests * We do not learn how to run these tests -- + Replace the estimates from `\(\widehat{Var(\hat{\beta})}_{default}\)` with those from `\(\widehat{Var(\hat{\beta})}_{robust}\)` for testing -- + But, we do not replace coefficient estimates (remember, coefficient estimation is still unbiased under heteroskedasticity)

--- class: middle # Implementation in R We use + the `fixest::se()` function from
the `fixest` package to estimate heteroskedasticity-robust standard errors + the `summary()` function to do tests of `\(\beta_j = 0\)` -- Let's run a regression using `MLB1.dta`. ```r #--- library ---# library(fixest) library(broom) library(wooldridge) #--- import the data ---# data("mlb1") #--- regression ---# reg_mlb <- feols(log(salary) ~ years + bavg, data = mlb1) ```

--- class: middle # Obtaining Heteroskedasticity-robust SE estimates .content-box-red[**General Syntax**] Here is the general syntax to obtain various types of VCOV (and se) estimates: ```r #* vcov vcov(regression result, vcov = "type of vcov") #* only the standard errors se(regression result, vcov = "type of vcov") ``` <br> .content-box-green[**heteroskedasticity-robust standard error estimation**] Specifically for White-Huber heteroskedasticity-robust VCOV and se estimates, ```r #* vcov vcov(reg_mlb, vcov = "hetero") ``` ``` ## (Intercept) years bavg ## (Intercept) 0.495103882 0.0058059916 -2.080065e-03 ## years 0.005805992 0.0003117152 -2.976110e-05 ## bavg -0.002080065 -0.0000297611 8.892009e-06 ``` ```r #* only the standard errors se(reg_mlb, vcov = "hetero") ``` ``` ## (Intercept) years bavg ## 0.703636186 0.017655458 0.002981947 ```

--- class: middle # Compare with the Default .content-box-red[**Default**] ```r ( se_hom <- se(reg_mlb) ) ``` ``` ## (Intercept) years bavg ## 0.343071420 0.013222511 0.001335294 ``` <br> .content-box-red[**Heteroskedasticity-robust**] ``` ## (Intercept) years bavg ## 0.703636186 0.017655458 0.002981947 ```

--- class: middle # Updating the test of coefficients being zero .content-box-red[**Default**] ```r tidy(reg_mlb) ``` ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 11.0 0.343 32.2 1.25e-106 ## 2 years 0.166 0.0132 12.6 3.16e- 30 ## 3 bavg 0.00539 0.00134 4.04 6.59e- 5 ``` <br> .content-box-red[**Heteroskedasticity-robust**] ```r tidy(reg_mlb, vcov = "hetero") ``` ``` ## # A tibble: 3 × 5 ## term estimate std.error
statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 11.0 0.704 15.7 2.13e-42 ## 2 years 0.166 0.0177 9.43 5.96e-19 ## 3 bavg 0.00539 0.00298 1.81 7.13e- 2 ```

--- class: middle # Reporting the regression results In presenting the regression results in a nicely formatted table, we used `modelsummary::msummary()`. We can easily swap the default se with the heteroskedasticity-robust se using the `statistic_override` option in `msummary()`. .left5[ ```r vcov_het <- vcov(reg_mlb, vcov = "hetero") vcov_homo <- vcov(reg_mlb) modelsummary::msummary( list(reg_mlb, reg_mlb), statistic_override = list(vcov_het, vcov_homo), # keep these options as they are stars = TRUE, gof_omit = "IC|Log|Adj|F|Pseudo|Within" ) ``` ] .right5[ <table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> (1) </th> <th style="text-align:center;"> (2) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 11.042*** </td> <td style="text-align:center;"> 11.042*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.704) </td> <td style="text-align:center;"> (0.343) </td> </tr> <tr> <td style="text-align:left;"> years </td> <td style="text-align:center;"> 0.166*** </td> <td style="text-align:center;"> 0.166*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.018) </td> <td style="text-align:center;"> (0.013) </td> </tr> <tr> <td style="text-align:left;"> bavg </td> <td style="text-align:center;"> 0.005+ </td> <td style="text-align:center;"> 0.005*** </td> </tr> <tr> <td style="text-align:left;box-shadow: 0px 1.5px"> </td> <td style="text-align:center;box-shadow: 0px 1.5px"> (0.003) </td> <td style="text-align:center;box-shadow: 0px 1.5px"> (0.001) </td> </tr> <tr> <td style="text-align:left;"> Num.Obs.
</td> <td style="text-align:center;"> 353 </td> <td style="text-align:center;"> 353 </td> </tr> <tr> <td style="text-align:left;"> R2 </td> <td style="text-align:center;"> 0.367 </td> <td style="text-align:center;"> 0.367 </td> </tr> <tr> <td style="text-align:left;"> RMSE </td> <td style="text-align:center;"> 0.94 </td> <td style="text-align:center;"> 0.94 </td> </tr> <tr> <td style="text-align:left;"> Std.Errors </td> <td style="text-align:center;"> Custom </td> <td style="text-align:center;"> Custom </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table> ] --- class: middle # R Demonstration Or, you could add the `vcov` option like below inside `feols()`. Then, you do not need `statistic_override` option to override the default VCOV estimates. ```r reg_mlb <- feols(log(salary) ~ years + bavg, vcov = "hetero", data = mlb1) tidy(reg_mlb) ``` ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 11.0 0.704 15.7 2.13e-42 ## 2 years 0.166 0.0177 9.43 5.96e-19 ## 3 bavg 0.00539 0.00298 1.81 7.13e- 2 ``` .left5[ ```r modelsummary::msummary( list(reg_mlb), # keep these options as they are stars = TRUE, gof_omit = "IC|Log|Adj|F|Pseudo|Within" ) ``` ] .right5[ <table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> (1) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 11.042*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.704) </td> </tr> <tr> <td style="text-align:left;"> years </td> <td style="text-align:center;"> 0.166*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.018) </td> </tr> <tr> <td style="text-align:left;"> 
bavg </td> <td style="text-align:center;"> 0.005+ </td> </tr> <tr> <td style="text-align:left;box-shadow: 0px 1.5px"> </td> <td style="text-align:center;box-shadow: 0px 1.5px"> (0.003) </td> </tr> <tr> <td style="text-align:left;"> Num.Obs. </td> <td style="text-align:center;"> 353 </td> </tr> <tr> <td style="text-align:left;"> R2 </td> <td style="text-align:center;"> 0.367 </td> </tr> <tr> <td style="text-align:left;"> RMSE </td> <td style="text-align:center;"> 0.94 </td> </tr> <tr> <td style="text-align:left;"> Std.Errors </td> <td style="text-align:center;"> Heteroskedasticity-robust </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table> ]

--- class: middle # Het-robust SE estimator: validation Does the heteroskedasticity-robust se estimator really work? Let's see using MC simulations: ```r set.seed(478954) #--- x fixed across iterations ---# x <- runif(N, 0, 1) # x for (i in 1:B){ #--- generate data ---# het_u <- 3 * rnorm(N, mean = 0, sd = 2 * x) # heteroskedastic error y <- 1 + het_u # y data_temp <- data.frame(y = y, x = x) #--- regression ---# ols_res <- feols(y ~ x, data = data_temp) b_hat <- ols_res$coefficient['x'] # coef estimate on x b_hat_store[i] <- b_hat # save the coef estimate * se_het <- se(ols_res, vcov = "hetero")["x"] # get the het-robust se t_stat_store[i] <- b_hat/se_het # calculate t-stat } ```

--- class: middle # MC simulation results ```r reject_or_not <- abs(t_stat_store) > c_value mean(reject_or_not) ``` ``` ## [1] 0.053 ``` Okay, not perfect. But, certainly better.
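
--- class: middle # Computing the robust estimator by hand (optional)

To connect the `se(ols_res, vcov = "hetero")` call back to the formula from earlier, here is a minimal sketch (not part of the lecture code; the data-generating setup is made up for illustration) that computes the White/Eicker-Huber (HC0) variance of the slope by hand for a simple regression. `fixest` applies a small-sample adjustment, so the two numbers are close but not identical.

```r
library(fixest)

#--- generate data with a heteroskedastic error (true slope is 0) ---#
set.seed(123)
n <- 500
x <- runif(n)
y <- 1 + rnorm(n, sd = 2 * x)

reg <- feols(y ~ x, data = data.frame(y = y, x = x))

u_hat <- resid(reg)  # residuals from regressing y on x
r_hat <- x - mean(x) # residuals from regressing x on the remaining regressor (the intercept)
SSR_x <- sum(r_hat^2)

#--- White (HC0) variance of the slope: sum(r^2 * u^2) / SSR^2 ---#
se_by_hand <- sqrt(sum(r_hat^2 * u_hat^2) / SSR_x^2)

se_by_hand
se(reg, vcov = "hetero")["x"] # close, but with a finite-sample adjustment
```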
--- class: inverse, center, middle name: cluster # Clustered Error <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html>

--- class: middle .content-box-red[**Clustered Error**] + Oftentimes, observations can be grouped into clusters + Errors within the cluster can be correlated

--- class: middle .content-box-red[**Example 1**] College GPA: cluster by college `\(GPA_{col} = \beta_0 + \beta_1 income + \beta_2 GPA_{hs} + u\)` + Your observations consist of students' GPA scores across many colleges + Because of some unobserved (omitted) school characteristics, error terms for the individuals in the same college might be correlated. * grading policy

--- class: middle .content-box-red[**Example 2**] Education Impacts on Income: cluster by individual + Your observations consist of 500 individuals, each tracked over 10 years + Because of some unobserved (omitted) individual characteristics, error terms for time-series observations within an individual might be correlated. * innate ability

--- class: middle # Consequences of clustered error .content-box-green[**Question**] Are the OLS estimators of the coefficients biased in the presence of clustered error? -- <br> .content-box-green[**Answer**] -- No. Correlation between `\(x\)` and `\(u\)` would hurt you, but correlation among the `\(u\)`s does not cause bias.

--- class: middle # Consequences of clustered error .content-box-green[**Question**] Are `\(\widehat{Var(\hat{\beta})}_{default}\)` unbiased estimators of `\(Var(\hat{\beta})\)`? -- <br> .content-box-green[**Answer**] -- No. `\(\widehat{Var(\hat{\beta})}_{default}\)` is unbiased only when the errors are homoskedastic and uncorrelated with each other, and clustered errors violate the no-correlation part.

--- class: middle # Consequences of clustered error .content-box-green[**Question**] Which has more information?
+ two errors that are independent + two errors that are correlated <br> .content-box-red[**Consequences**] + If you were to use `\(\widehat{Var(\hat{\beta})}_{default}\)` to estimate `\(Var(\hat{\beta})\)` in the presence of clustered error, you would (under/over)-estimate the true `\(Var(\hat{\beta})\)`. + This would lead to rejecting the null hypothesis (more/less) often than you are supposed to.

--- class: middle # MC simulations: conceptual steps Here are the conceptual steps of the MC simulations to see the consequence of clustered error. + generate data according to a generating process in which the error terms `\((u)\)` within a cluster (50 clusters in this example) are correlated and `\(\beta_1\)` is set to 0 in the model below: $$ `\begin{aligned} y = \beta_0 + \beta_1 x + u \end{aligned}` $$ + estimate the model and find `\(\hat{\beta}_1\)` and `\(\widehat{se(\hat{\beta}_1)}\)` + calculate the `\(t\)`-statistic `\((\hat{\beta}_1/\widehat{se(\hat{\beta}_1)})\)` for the (correct) null hypothesis of `\(\beta_1 = 0\)` + repeat steps 1-3 1000 times + see how many times out of 1000 you reject the null hypothesis `\(H_0:\)` `\(\beta_1=0\)`

--- class: middle # R code: Data Generating Process ```r #--- setup ---# library(MASS) # to use the mvrnorm() function later library(dplyr) # to use the pipe operator (%>%) N <- 2000 # total number of observations G <- 50 # number of groups Ng <- N/G # number of observations per group #--- error correlated within group ---# u <- mvrnorm( G, mu = rep(0, Ng), Sigma = matrix(10, nrow = Ng, ncol = Ng) + diag(Ng) ) %>% t() %>% c() #--- x correlated within group ---# x <- mvrnorm( G, mu = rep(0, Ng), Sigma = matrix(1, nrow = Ng, ncol = Ng) + diag(Ng) * .2 ) %>% t() %>% c() #--- other variables ---# y <- 1 + 0 * x + u #--- data.frame ---# data <- data.frame(y = y, x = x, group = rep(1:G, each = Ng)) ```

--- class: middle # Visualization of the clustered nature of the data <img src="se_estimation_x_files/figure-html/unnamed-chunk-23-1.png" width="60%" style="display: block;
margin: auto;" />

--- class: middle # R code: MC simulation ```r set.seed(58934) B <- 1000 t_stat_store <- rep(0, B) N <- 2000 # total number of observations G <- 50 # number of groups Ng <- N/G # number of observations per group for (i in 1:B){ #--- error correlated within group ---# u <- mvrnorm( G, mu = rep(0, Ng), Sigma = matrix(10, nrow = Ng, ncol = Ng) + diag(Ng) ) %>% t() %>% c() #--- x correlated within group ---# x <- mvrnorm( G, mu = rep(0, Ng), Sigma = matrix(1, nrow = Ng, ncol = Ng) + diag(Ng) * .2 ) %>% t() %>% c() #--- other variables ---# y <- 1 + 0 * x + u #--- data.frame ---# data <- data.frame(y = y, x = x, group = rep(1:G, each = Ng)) #--- OLS ---# reg <- feols(y ~ x, data = data) #--- get the default se ---# se_default <- se(reg)["x"] #--- calculate t-stat ---# t_stat <- reg$coefficient['x']/se_default t_stat_store[i] <- t_stat } ```

--- class: middle # MC simulations: results ```r c_value <- qt(0.975, N - 2) #--- how often do you reject the null ---# mean(abs(t_stat_store) > c_value) ``` ``` ## [1] 0.745 ``` -- <br> .content-box-red[**Important**]: + clustered error can severely bias your test results + it tends to make explanatory variables appear more significant than they truly are because the default estimator of the variance of the OLS estimator tends to greatly <span style = "color: red;"> under</span>-estimate the true variance of the OLS estimator.

--- class: middle # What to do? .content-box-red[**Cluster-robust standard error estimation**] There exist estimators of `\(Var(\hat{\beta})\)` that take into account the possibility that errors are clustered. + We call such estimators .blue[cluster-robust variance covariance estimators], denoted as `\((\widehat{Var(\hat{\beta})}_{cl})\)` + We call estimates from such estimators .blue[cluster-robust variance estimates]

--- class: middle # Cluster-robust standard error estimation I neither derive nor show the mathematical expressions of these estimators.
.content-box-red[**This is what you need to do**] + understand the consequence of clustered errors + know there are estimators that are appropriate under clustered error + know that the estimators we will learn take care of heteroskedasticity at the same time (so, they really are cluster- and heteroskedasticity-robust standard error estimators) + know how to use the estimators in `\(R\)` (or some other software)

--- class: middle # R implementation .content-box-red[**Cluster-robust standard error**] Similar to the `vcov` option for the White-Huber heteroskedasticity-robust se, we can use the `cluster` option to get the cluster-robust se.

--- class: middle # Before an R demonstration Let's take a look at the MLB data again. ```r dplyr::select(mlb1, salary, years, bavg, nl) %>% head() ``` ``` ## salary years bavg nl ## 1 6329213 12 289 1 ## 2 3375000 8 259 1 ## 3 3100000 5 299 1 ## 4 2900000 8 245 1 ## 5 1650000 12 258 1 ## 6 700000 17 286 1 ``` `nl` (1 if in the National League, 0 if in the American League) is the group variable we cluster around.

--- class: middle # R Demonstration .content-box-red[**Step 1**] Run a regression ```r reg_mlb <- feols(log(salary) ~ years + bavg, data = mlb1) ``` -- <br> .content-box-red[**Step 2**] Apply `vcov()` or `se()` with the `cluster =` option. ```r #* vcov clustered by nl vcov(reg_mlb, cluster = ~ nl) #* se clustered by nl se(reg_mlb, cluster = ~ nl) ```

--- class: middle # Compare .content-box-red[**Default**] ```r se(reg_mlb) ``` ``` ## (Intercept) years bavg ## 0.343071420 0.013222511 0.001335294 ``` <br> .content-box-red[**Cluster-robust standard error**] ```r se(reg_mlb, cluster = ~ nl) ``` ``` ## (Intercept) years bavg ## 0.2495162012 0.0125500623 0.0007155029 ```

--- class: middle # R Demonstration Or, you could add the `cluster` option like below inside `feols()`.
```r reg_mlb <- feols(log(salary) ~ years + bavg, cluster = ~ nl, data = mlb1) tidy(reg_mlb) ``` ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 11.0 0.250 44.3 0.0144 ## 2 years 0.166 0.0126 13.3 0.0479 ## 3 bavg 0.00539 0.000716 7.54 0.0840 ```

--- class: middle # In practice Just like the heteroskedasticity-present case before, + Estimate the model using OLS (you do nothing special here) + Assume the error term is clustered and/or heteroskedastic, and estimate the variance of the OLS estimators `\((Var(\hat{\beta}))\)` using cluster-robust standard error estimators + Replace the estimates from `\(\widehat{Var(\hat{\beta})}_{default}\)` with those from `\(\widehat{Var(\hat{\beta})}_{cl}\)` for testing + But, we do not replace coefficient estimates.

--- class: middle # But does it really work? Let's run MC simulations to see if the use of the cluster-robust standard error estimation method works.

--- class: middle # MC simulation results: R code ```r set.seed(58934) B <- 1000 t_stat_store <- rep(0, B) N <- 2000 # total number of observations G <- 50 # number of groups Ng <- N/G # number of observations per group for (i in 1:B){ #--- error correlated within group ---# u <- mvrnorm( G, mu = rep(0, Ng), Sigma = matrix(10, nrow = Ng, ncol = Ng) + diag(Ng) ) %>% t() %>% c() #--- x correlated within group ---# x <- mvrnorm( G, mu = rep(0, Ng), Sigma = matrix(1, nrow = Ng, ncol = Ng) + diag(Ng) * .2 ) %>% t() %>% c() #--- other variables ---# y <- 1 + 0 * x + u #--- data.frame ---# data <- data.frame(y = y, x = x, group = rep(1:G, each = Ng)) #--- OLS with cluster-robust se---# * reg <- feols(y ~ x, data = data, cluster = ~ group) #--- get the cluster-robust se ---# se_cl <- se(reg)["x"] #--- calculate t-stat ---# t_stat <- reg$coefficient['x']/se_cl t_stat_store[i] <- t_stat } ```

--- class: middle # MC simulation results ```r #--- critical value ---# c_value <- qt(0.95, N-2) #--- how often do you reject the
null ---# mean(abs(t_stat_store) > c_value) ``` ``` ## [1] 0.15 ``` -- Well, we are still rejecting more often than we should, but it is much better than the default VCOV estimator, which rejected 74.5% of the time. -- <br> .content-box-red[**Important**] + Cluster-robust standard error estimation gets better as the number of groups gets larger + Two groups (the MLB case) is far too small + As a rule of thumb, a \# of groups larger than 50 is sufficiently large, but we just saw that we still over-rejected the null of `\(\beta = 0\)` three times as often as we should.

--- class: middle ```r set.seed(58934) B <- 1000 t_stat_store <- rep(0, B) N <- 20000 # total number of observations *G <- 1000 # number of groups Ng <- N/G # number of observations per group for (i in 1:B){ #--- error correlated within group ---# u <- mvrnorm( G, mu = rep(0, Ng), Sigma = matrix(10, nrow = Ng, ncol = Ng) + diag(Ng) ) %>% t() %>% c() #--- x correlated within group ---# x <- mvrnorm( G, mu = rep(0, Ng), Sigma = matrix(1, nrow = Ng, ncol = Ng) + diag(Ng) * .2 ) %>% t() %>% c() #--- other variables ---# y <- 1 + 0 * x + u #--- data.frame ---# data <- data.frame(y = y, x = x, group = rep(1:G, each = Ng)) #--- OLS with cluster-robust se---# reg <- feols(y ~ x, data = data, cluster = ~ group) #--- get the cluster-robust se ---# se_cl <- se(reg)["x"] #--- calculate t-stat ---# t_stat <- reg$coefficient['x']/se_cl t_stat_store[i] <- t_stat } ```

--- class: middle # MC simulation results ```r #--- critical value ---# c_value <- qt(0.95, N-2) #--- how often do you reject the null ---# mean(abs(t_stat_store) > c_value) ``` ``` ## [1] 0.09 ``` Better. But, we are still over-rejecting. Don't forget it is certainly better than using the default!
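
--- class: middle # Recap: the three SE estimators side by side

As a recap, here is a quick way (not shown earlier in the lecture) to put the default, heteroskedasticity-robust, and cluster-robust standard errors for the same `mlb1` regression next to each other. Remember that with only two clusters (`nl`), the cluster-robust estimates should not be trusted.

```r
library(fixest)
library(wooldridge)
data("mlb1")

reg_mlb <- feols(log(salary) ~ years + bavg, data = mlb1)

#--- default (iid), het-robust, and cluster-robust se for the same coefficients ---#
rbind(
  default = se(reg_mlb),
  hetero  = se(reg_mlb, vcov = "hetero"),
  cluster = se(reg_mlb, cluster = ~ nl)
)
```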