class: center, middle, inverse, title-slide # OLS Asymptotics ### AECN 396/896-002 --- class: middle <style type="text/css"> @media print { .has-continuation { display: block !important; } } .remark-slide-content.hljs-github h1 { margin-top: 5px; margin-bottom: 25px; } .remark-slide-content.hljs-github { padding-top: 10px; padding-left: 30px; padding-right: 30px; } .panel-tabs { <!-- color: #062A00; --> color: #841F27; margin-top: 0px; margin-bottom: 0px; margin-left: 0px; padding-bottom: 0px; } .panel-tab { margin-top: 0px; margin-bottom: 0px; margin-left: 3px; margin-right: 3px; padding-top: 0px; padding-bottom: 0px; } .panelset .panel-tabs .panel-tab { min-height: 40px; } .remark-slide th { border-bottom: 1px solid #ddd; } .remark-slide thead { border-bottom: 0px; } .gt_footnote { padding: 2px; } .remark-slide table { border-collapse: collapse; } .remark-slide tbody { border-bottom: 2px solid #666; } .important { background-color: lightpink; border: 2px solid blue; font-weight: bold; } .remark-code { display: block; overflow-x: auto; padding: .5em; background: #ffe7e7; } .hljs-github .hljs { background: #f2f2fd; } .remark-inline-code { padding-top: 0px; padding-bottom: 0px; background-color: #e6e6e6; } .r.hljs.remark-code.remark-inline-code{ font-size: 0.9em } .left-full { width: 80%; height: 92%; float: left; } .left-code { width: 38%; height: 92%; float: left; } .right-plot { width: 60%; float: right; padding-left: 1%; } .left5 { width: 49%; height: 92%; float: left; } .right5 { width: 49%; float: right; padding-left: 1%; } .left3 { width: 29%; height: 92%; float: left; } .right7 { width: 69%; float: right; padding-left: 1%; } .left4 { width: 38%; height: 92%; float: left; } .right6 { width: 60%; float: right; padding-left: 1%; } ul li{ margin: 7px; } ul, li{ margin-left: 15px; padding-left: 0px; } ol li{ margin: 7px; } ol, li{ margin-left: 15px; padding-left: 0px; } </style> <style type="text/css"> .content-box { box-sizing: border-box; background-color: #e2e2e2; } .content-box-blue, .content-box-gray, .content-box-grey, .content-box-army, .content-box-green, .content-box-purple, .content-box-red, .content-box-yellow { box-sizing: border-box; border-radius: 5px; margin: 0 0 10px; overflow: hidden; padding: 0px 5px 0px 5px; width: 100%; } .content-box-blue { background-color: #F0F8FF; } .content-box-gray { background-color: #e2e2e2; } .content-box-grey { background-color: #F5F5F5; } .content-box-army { background-color: #737a36; } .content-box-green { background-color: #d9edc2; } .content-box-purple { background-color: #e2e2f9; } .content-box-red { background-color: #ffcccc; } .content-box-yellow { background-color: #fef5c4; } .content-box-blue .remark-inline-code, .content-box-blue .remark-inline-code, .content-box-gray .remark-inline-code, .content-box-grey .remark-inline-code, .content-box-army .remark-inline-code, .content-box-green .remark-inline-code, .content-box-purple .remark-inline-code, .content-box-red .remark-inline-code, .content-box-yellow .remark-inline-code { background: none; } .full-width { display: flex; width: 100%; flex: 1 1 auto; } </style> <style type="text/css"> blockquote, .blockquote { display: block; margin-top: 0.1em; margin-bottom: 0.2em; margin-left: 5px; margin-right: 5px; border-left: solid 10px #0148A4; border-top: solid 2px #0148A4; border-bottom: solid 2px #0148A4; border-right: solid 2px #0148A4; box-shadow: 0 0 6px rgba(0,0,0,0.5); /* background-color: #e64626; */ color: #e64626; padding: 0.5em; -moz-border-radius: 5px; -webkit-border-radius: 5px; } .blockquote p { margin-top: 0px; margin-bottom: 5px; } .blockquote > h1:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h2:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h3:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h4:first-of-type { margin-top: 0px; margin-bottom: 5px; } .text-shadow { text-shadow: 0 0 4px #424242; } </style> <style type="text/css"> /****************** * Slide scrolling * (non-functional) * not sure if it is a good idea anyway slides > slide { overflow: scroll; padding: 5px 40px; } .scrollable-slide .remark-slide { height: 400px; overflow: scroll !important; } ******************/ .scroll-box-8 { height:8em; overflow-y: scroll; } .scroll-box-10 { height:10em; overflow-y: scroll; } .scroll-box-12 { height:12em; overflow-y: scroll; } .scroll-box-14 { height:14em; overflow-y: scroll; } .scroll-box-16 { height:16em; overflow-y: scroll; } .scroll-box-18 { height:18em; overflow-y: scroll; } .scroll-box-20 { height:20em; overflow-y: scroll; } .scroll-box-24 { height:24em; overflow-y: scroll; } .scroll-box-30 { height:30em; overflow-y: scroll; } .scroll-output { height: 90%; overflow-y: scroll; } </style> # Before we start ## Learning objectives Understand the consequences of the violation of the homoskedasticity assumption and how to deal with the problem ## Table of contents 1. [Review on statistical hypothesis testing](#review) 2. [Testing (linear model)](#linear) 3. [Confidence interval](#conf) --- class: middle # OLS Asymptotics .content-box-red[**Large Sample Properties of OLS**] + Properties of OLS that hold only when the sample size is infinite .blue[very] large + (loosely put) How OLS estimators behave when the number of observations goes .blue[infinite (really large)] <br> .content-box-red[**Small Sample Properties of OLS**] Under certain conditions: + OLS estimators are unbiased + OLS estimators are efficient These hold .blue[whatever the sample size is]. --- class: inverse, center, middle name: review # Consistency <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html> --- class: middle # Consistency .content-box-green[**Verbally (and very loosely)**] An estimator is .blue[consistent] if the probability that the estimator produces the true parameter is 1 when sample size is infinite. --- class: middle # MC simulation: consistency of OLS estimators OLS estimator of the coefficient on `\(x\)` in the following model with all `\(MLR.1\)` through `\(MLR.4\)` satisfied: `$$y_i = \beta_0 + \beta_1 x_i + u_i$$` with all the conditions necessary for the unbiasedness property of OLS satisfied. --- class: middle # Conceptual steps of the MC simulations + Generate data according to `\(y_i = \beta_0 + \beta_1 x_i + u_i\)` + Estimate the coefficients and store them + Repeat the above experiment 1000 times + Examine how the coefficient estimates are distributed -- <br> .content-box-green[**What you should see is**] As `\(N\)` gets larger (more observations), the distribution of `\(\hat{\beta}_1\)` get more tightly centered around its true value (here, `\(1\)`) --- class: middle ```r #--- Preparation ---# B <- 1000 # the number of iterations N_list <- c(100, 1000, 10000) # sample size N_len <- length(N_list) estimate_storage <- matrix(0, B, 3) # estimates storage for (j in 1:N_len) { temp_N <- N_list[j] for (i in 1:B) { #--- generate data ---# x <- rnorm(temp_N) # indep var 1 u <- rnorm(temp_N) * 0.2 # error y <- 1 + x + u # dependent variable 1 data <- data.frame(y = y, x = x) #--- OLS ---# reg <- lm(y ~ x, data = data) # OLS #--- store coef estimates ---# estimate_storage[i, j] <- reg$coef[2] } } ``` --- class: middle ```r plot_data <- melt(data.frame(estimate_storage)) #--- create a figure ---# g_co_ex <- ggplot(data = plot_data) + geom_density(aes(x = value, fill = variable), alpha = 0.4) + geom_vline(xintercept = 1) + xlim(0.9, 1.1) + scale_fill_discrete( name = "Sample Size", labels = c("N=100 ", "N=1,000 ", "N=10,000") ) + theme( legend.position = "bottom" ) ``` --- class: middle <img src="data:image/png;base64,#asymptotics_x_files/figure-html/unnamed-chunk-5-1.png" width="80%" style="display: block; margin: auto;" /> --- class: middle .content-box-red[**Consistency of OLS estimators**] Under `\(MLR.1\)` through `\(MLR.4\)`, OLS estimators are consistent --- class: middle # MC simulations: Inconsistency of OLS estimators .content-box-red[**Conceptual steps of MC simulations**] + generate data ($N$ observations) according to `\(y_i = \beta_0 + \beta_1 x_i + u_i\)` with `\(E[u_i|x_i]\ne 0\)` + Estimate the coefficients and store them + repeat the above experiment 1000 times + examine how the coefficient estimates are distributed -- .content-box-green[**Question**] Would the bias disappear as N gets larger? --- class: middle ```r #--- Preparation ---# N_list <- c(100, 1000, 10000) # sample size N_len <- length(N_list) estimate_storage <- matrix(0, B, 3) # estimates storage for (j in 1:N_len) { temp_N <- N_list[j] for (i in 1:B) { #--- generate data ---# * mu <- rnorm(temp_N) # shared term between x and u x <- rnorm(temp_N) + 0.5 * mu # << u <- rnorm(temp_N) + 0.5 * mu # << y <- 1 + x + u # dependent variable data <- data.frame(y = y, x = x) #--- OLS ---# reg <- lm(y ~ x, data = data) # OLS #--- store coef estimates ---# estimate_storage[i, j] <- reg$coef[2] } } ``` --- class: middle ```r #--- wide to long format ---# plot_data <- melt(data.frame(estimate_storage)) #--- create a figure ---# g_inco_ex <- ggplot(data = plot_data) + geom_density(aes(x = value, fill = variable), alpha = 0.4) + geom_vline(xintercept = 1) + scale_fill_discrete( name = "Sample Size", labels = c("N=100 ", "N=1,000 ", "N=10,000") ) + theme( legend.position = "bottom" ) ``` --- class: middle <img src="data:image/png;base64,#asymptotics_x_files/figure-html/unnamed-chunk-8-1.png" width="80%" style="display: block; margin: auto;" /> --- class: inverse, center, middle name: review # Asymptotic Normality <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html> --- class: middle # Testing When we talked about hypothesis testing, we made the following assumption: .content-box-red[**Normality assumption**] The population error `\(u\)` is .blue[independent] of the explanatory variables `\(x_1,\dots,x_k\)` and is .blue[normally] distributed with zero mean and variance `\(\sigma^2\)`: `\(u\sim Normal(0,\sigma^2)\)` -- <br> .content-box-green[**Remember**] + If the normality assumption is violated, t-statistic and F-statistic we constructed before are no longer distributed as t-distribution and F-distribution, respectively + So, whenever `\(MLR.6\)` is violated, our t- and F-tests are invalid --- class: middle .content-box-green[**Fortunately**] You can continue to use t- and F-tests because (slightly transformed) OLS estimators are .blue[approximately] normally distributed when the sample size is .blue[large enough]. --- class: middle # Central Limit Theorem (CLT) Suppose `\(\{x_1,x_2,\dots\}\)` is a sequence of idetically independently distributed random variables with `\(E[x_i]=\mu\)` and `\(Var[x_i]=\sigma^2<\infty\)`. Then, as `\(n\)` approaches infinity, `$$\sqrt{n}(\frac{1}{n} \sum_{i=1}^n x_i-\mu)\xrightarrow[]{d} N(0,\sigma^2)$$` -- .content-box-green[**Verbally**] Sample mean less its expected value multiplied by `\(\sqrt{n}\)` (square root of the sample size) is going to be distributed as Normal distribution as `\(n\)` goes infinity where its expected value is 0 and variance is the variance of `\(x\)`. --- class: middle .content-box-red[**Example**] `\(x_i \sim Bern[p=0.3]\)` 1 with probability `\(p\)` and 0 with probability `\(1-p\)`. + `\(E[x_i] = p = 0.3\)` + `\(Var[x_i] (\sigma^2) = p(1-p) = 0.21\)` -- <br> .content-box-green[**According to CLT**] `\(\Big(\sqrt{n}(\frac{1}{n} \sum_{i=1}^n x_i-\mu)\xrightarrow[]{d} N(0,\sigma^2)\Big)\)` `\(\sqrt{n}(\frac{1}{n} \sum_{i=1}^n x_i-0.3)\xrightarrow[]{d} N(0,0.21)\)` --- class: middle # MC simulations: CLT .content-box-red[**Conceptual steps of the MC simulation**] + draw `\(n\)` observations from `\(x_i\sim Bern(0.3)\)` + find its mean, subtract the expected value (here, `\(E[x_i]=0.3\)`), multiply by `\(\sqrt{n}\)` `\((\sqrt{n}(\frac{1}{n} \sum_{i=1}^n x_i-\mu)\)` + store the calculated value + repeat the above experiment 1000 times + examine how the calculated values are distributed -- <br> .content-box-green[**What you should see is**] As `\(N\)` gets larger (more observations), the distribution of `\(\sqrt{n}(\frac{1}{n} \sum_{i=1}^n x_i-0.3\)`) looks more and more like `\(N(0,0.21)\)` --- class: middle .content-box-red[**MC simulations (N = 10)**] ```r set.seed(893269) #--- the number of observations ---# # this is what we change *N <- 10 # number of observations B <- 1000 # number of iterations p <- 0.3 # mean of the Bernoulli distribution storage <- rep(0, B) for (i in 1:B) { #--- draw from Bern[0.3] (x distributed as Bern[0.3]) ---# x_seq <- runif(N) <= p #--- sample mean ---# x_mean <- mean(x_seq) #--- normalize ---# lhs <- sqrt(N) * (x_mean - p) #--- save lhs to storage ---# storage[i] <- lhs } ``` --- class: middle .content-box-red[**Visualization**] ```r data_pdf <- data.frame( x = seq(-2, 2, length = 1000), y = dnorm(seq(-2, 2, length = 1000), sd = sqrt(p * (1 - p))) ) g_N_10 <- ggplot() + geom_density( data = data.frame(x = storage), aes(x = x, color = "sample distribution") ) + geom_line( data = data_pdf, aes(y = y, x = x, color = "pdf of N(0,0.21)") ) + scale_color_manual( values = c("sample distribution" = "blue", "pdf of N(0,0.21)" = "red"), name = "" ) + theme( legend.position = "bottom" ) ``` --- class: middle .content-box-red[**MC simulations (N = 10)**] <img src="data:image/png;base64,#asymptotics_x_files/figure-html/unnamed-chunk-11-1.png" width="80%" style="display: block; margin: auto;" /> --- class: middle .content-box-red[**MC simulations (N = 100)**] <img src="data:image/png;base64,#asymptotics_x_files/figure-html/unnamed-chunk-13-1.png" width="80%" style="display: block; margin: auto;" /> --- class: middle .content-box-red[**MC simulations (N = 10000)**] <img src="data:image/png;base64,#asymptotics_x_files/figure-html/unnamed-chunk-15-1.png" width="80%" style="display: block; margin: auto;" /> --- class: middle .content-box-red[**Important**] CLT holds for .blue[any] distribution of `\(x_i\)` .blue[as long as it has a finite expected value and variance]. --- class: middle Under assumptions `\(MLR.1\)` through `\(MLR.5\)` (MLR.6 not necessary!!), .content-box-red[**Asymptotic Normality of OLS**] `\(\sqrt{n}(\hat{\beta}_j-\beta_j)\xrightarrow[]{a} N(0,\sigma^2/\alpha_j^2)\)` where `\(\alpha_j^2=plim(\frac{1}{n}\sum_{i=1}^n r^2_{i,j})\)`, where `\(r^2_{i,j}\)` are the residuals from regressing `\(x_j\)` on the other independent variables. <br> .content-box-red[**Consistency**] `\(\hat{\sigma}^2\equiv \frac{1}{n-k-1}\sum_{i=1}^n \hat{u}_i^2\)` is a consistent estimator of of `\(\sigma^2\)` `\((Var(u))\)` --- class: middle .content-box-red[**Further**] + `\((\hat{\beta}_j-\beta_j)/se(\hat{\beta}_j) \xrightarrow[]{a} N(0,1)\)` + `\((\hat{\beta}_j-\beta_j)/\widehat{se(\hat{\beta}_j)} \xrightarrow[]{a} N(0,1)\)`, where `\(\widehat{se(\hat{\beta}_j)}=\sqrt{\frac{\hat{\sigma}^2}{SST_j(1-R_j^2)}}\)` --- class: middle .content-box-red[**Small sample (any sample size)**] Under `\(MLR.1\)` through `\(MLR.5\)` .blue[and] `\(MLR.6\)` `\((u_i\sim N(0,\sigma^2))\)`, + `\((\hat{\beta}_j-\beta_j)/sd(\hat{\beta}_j) \sim N(0,1)\)` + `\((\hat{\beta}_j-\beta_j)/se(\hat{\beta}_j) \sim t_{n-k-1}\)` <br> .content-box-red[**Large sample (when `\(n\)` goes infinity)**] Under `\(MLR.1\)` through `\(MLR.5\)` .blue[without] `\(MLR.6\)`, + `\((\hat{\beta}_j-\beta_j)/sd(\hat{\beta}_j) \xrightarrow[]{a} N(0,1)\)` + `\((\hat{\beta}_j-\beta_j)/se(\hat{\beta}_j) \xrightarrow[]{a} N(0,1)\)` --- class: middle # Testing under large sample .content-box-red[**It turns out,**] You can proceed exactly the same way as you did before (practically speaking)!! + calculate `\((\hat{\beta}_j-\beta_j)/\widehat{se(\hat{\beta}_j)}\)` + check if the obtained value is greater than (in magnitude) the critical value for the specified significance level under `\(t_{n-k-1}\)` -- .content-box-red[**But,**] Shouldn't we use `\(N(0,1)\)` when you find the critical value? --- class: middle <img src="data:image/png;base64,#asymptotics_x_files/figure-html/unnamed-chunk-16-1.png" width="80%" style="display: block; margin: auto;" /> --- class: middle .content-box-red[**Testing under large sample**] Since `\(t_{n-k-1}\)` and `\(N(0,1)\)` are almost identical when `\(n\)` is large, there is very little error in using `\(t_{n-k-1}\)` instead of `\(N(0,1)\)` to find the critical value. --- class: middle # When the homoskedasticity is violated (as almost always the case) .content-box-red[**Important**] The consistency of the estimation of `\(\widehat{Var(\hat{\beta})}\)` <span style = "color: red;"> DOES </span> require the homoskedasticity assumption (MLR.5)!! + the usual t-statistics and confidence intervals are invalid no matter how large the sample size is if error is heteroskedastic + so, we should use heteroskedasticity-robust or cluster-robust standard error estimators even when the sample size is large