class: center, middle, inverse, title-slide # Omitted Variable Bias and Multicollinearity ### AECN 396/896-002 --- <style type="text/css"> @media print { .has-continuation { display: block !important; } } .remark-slide-content.hljs-github h1 { margin-top: 5px; margin-bottom: 25px; } .remark-slide-content.hljs-github { padding-top: 10px; padding-left: 30px; padding-right: 30px; } .panel-tabs { <!-- color: #062A00; --> color: #841F27; margin-top: 0px; margin-bottom: 0px; margin-left: 0px; padding-bottom: 0px; } .panel-tab { margin-top: 0px; margin-bottom: 0px; margin-left: 3px; margin-right: 3px; padding-top: 0px; padding-bottom: 0px; } .panelset .panel-tabs .panel-tab { min-height: 40px; } .remark-slide th { border-bottom: 1px solid #ddd; } .remark-slide thead { border-bottom: 0px; } .gt_footnote { padding: 2px; } .remark-slide table { border-collapse: collapse; } .remark-slide tbody { border-bottom: 2px solid #666; } .important { background-color: lightpink; border: 2px solid blue; font-weight: bold; } .remark-code { display: block; overflow-x: auto; padding: .5em; background: #ffe7e7; } .hljs-github .hljs { background: #f2f2fd; } .remark-inline-code { padding-top: 0px; padding-bottom: 0px; background-color: #e6e6e6; } .r.hljs.remark-code.remark-inline-code{ font-size: 0.9em } .left-full { width: 80%; height: 92%; float: left; } .left-code { width: 38%; height: 92%; float: left; } .right-plot { width: 60%; float: right; padding-left: 1%; } .left5 { width: 49%; height: 92%; float: left; } .right5 { width: 49%; float: right; padding-left: 1%; } .left3 { width: 29%; height: 92%; float: left; } .right7 { width: 69%; float: right; padding-left: 1%; } .left4 { width: 38%; height: 92%; float: left; } .right6 { width: 60%; float: right; padding-left: 1%; } ul li{ margin: 7px; } ul, li{ margin-left: 15px; padding-left: 0px; } ol li{ margin: 7px; } ol, li{ margin-left: 15px; padding-left: 0px; } </style> <style type="text/css"> .content-box { box-sizing: border-box; 
background-color: #e2e2e2; } .content-box-blue, .content-box-gray, .content-box-grey, .content-box-army, .content-box-green, .content-box-purple, .content-box-red, .content-box-yellow { box-sizing: border-box; border-radius: 5px; margin: 0 0 10px; overflow: hidden; padding: 0px 5px 0px 5px; width: 100%; } .content-box-blue { background-color: #F0F8FF; } .content-box-gray { background-color: #e2e2e2; } .content-box-grey { background-color: #F5F5F5; } .content-box-army { background-color: #737a36; } .content-box-green { background-color: #d9edc2; } .content-box-purple { background-color: #e2e2f9; } .content-box-red { background-color: #ffcccc; } .content-box-yellow { background-color: #fef5c4; } .content-box-blue .remark-inline-code, .content-box-blue .remark-inline-code, .content-box-gray .remark-inline-code, .content-box-grey .remark-inline-code, .content-box-army .remark-inline-code, .content-box-green .remark-inline-code, .content-box-purple .remark-inline-code, .content-box-red .remark-inline-code, .content-box-yellow .remark-inline-code { background: none; } .full-width { display: flex; width: 100%; flex: 1 1 auto; } </style> <style type="text/css"> blockquote, .blockquote { display: block; margin-top: 0.1em; margin-bottom: 0.2em; margin-left: 5px; margin-right: 5px; border-left: solid 10px #0148A4; border-top: solid 2px #0148A4; border-bottom: solid 2px #0148A4; border-right: solid 2px #0148A4; box-shadow: 0 0 6px rgba(0,0,0,0.5); /* background-color: #e64626; */ color: #e64626; padding: 0.5em; -moz-border-radius: 5px; -webkit-border-radius: 5px; } .blockquote p { margin-top: 0px; margin-bottom: 5px; } .blockquote > h1:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h2:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h3:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h4:first-of-type { margin-top: 0px; margin-bottom: 5px; } .text-shadow { text-shadow: 0 0 4px #424242; } </style> <style 
type="text/css">
/******************
 * Slide scrolling
 * (non-functional)
 * not sure if it is a good idea anyway
slides > slide {
  overflow: scroll;
  padding: 5px 40px;
}
.scrollable-slide .remark-slide {
  height: 400px;
  overflow: scroll !important;
}
******************/
.scroll-box-8 { height:8em; overflow-y: scroll; }
.scroll-box-10 { height:10em; overflow-y: scroll; }
.scroll-box-12 { height:12em; overflow-y: scroll; }
.scroll-box-14 { height:14em; overflow-y: scroll; }
.scroll-box-16 { height:16em; overflow-y: scroll; }
.scroll-box-18 { height:18em; overflow-y: scroll; }
.scroll-box-20 { height:20em; overflow-y: scroll; }
.scroll-box-24 { height:24em; overflow-y: scroll; }
.scroll-box-30 { height:30em; overflow-y: scroll; }
.scroll-output { height: 90%; overflow-y: scroll; }
</style>

# What variables to include or not

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

You often

+ face the decision of whether or not to include a particular variable: <span style="color:red"> how do you make the right decision? </span>
+ miss a variable that you know is important because it is simply not available: <span style="color:red"> what are the consequences? </span>

--

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

Two important concepts you need to be aware of:

+ Multicollinearity
+ Omitted Variable Bias

---

# Multicollinearity and Omitted Variable Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

**Multicollinearity**: A phenomenon where two or more variables are highly correlated (negatively or positively) with each other (<span style="color:blue"> consequences?
</span>)

**Omitted Variable Bias**: Bias caused by not including (omitting) <span style="color:blue"> important </span> variables in the model

---

# Multicollinearity and Omitted Variable Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

Consider the following model,

`$$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i$$`

Your interest is in estimating the impact of `\(x_1\)` on `\(y\)`.

--

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

## Objectives:

Using this simple model, we investigate what happens to the coefficient estimate on `\(x_1\)` if you include/omit `\(x_2\)`

---

# Questions we tackle

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

The model:

`$$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i$$`

**Question 1**: What happens if `\(\beta_2=0\)`, but you <span style="color:blue">include</span> `\(x_2\)`, which is <span style="color:blue">not</span> correlated with `\(x_1\)`?

**Question 2**: What happens if `\(\beta_2=0\)`, but you <span style="color:blue">include</span> `\(x_2\)`, which is <span style="color:blue">highly</span> correlated with `\(x_1\)`?

**Question 3**: What happens if `\(\beta_2\ne 0\)`, but you <span style="color:blue">omit</span> `\(x_2\)`, which is <span style="color:blue">not</span> correlated with `\(x_1\)`?

**Question 4**: What happens if `\(\beta_2\ne 0\)`, but you <span style="color:blue">omit</span> `\(x_2\)`, which is <span style="color:blue">highly</span> correlated with `\(x_1\)`?

---

# Key consequences of interest

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

+ Is `\(\hat{\beta_1}\)` unbiased, that is, `\(E[\hat{\beta_1}]=\beta_1\)`?
+ `\(Var(\hat{\beta_1})\)`?
(how accurate the estimation of `\(\hat{\beta_1}\)` is)

---

class: inverse, center, middle
name: case-1

# Case 1

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

---

# Case 1

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) = 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<span style="color:blue"> Example: </span> `\(\mbox{corn yield} = \beta_0 + \beta_1 \times N + \beta_2 \times \mbox{farmers' height} + u\)`

--

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two estimating equations (EE) </span>

`\(EE_1\)`: `\(y_i=\beta_0 + \beta_1 x_{1,i} + v_i \;\; (\beta_2 x_{2,i} + u_i)\)`

`\(EE_2\)`: `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

--

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> What do you think is going to happen? Any guess? </span>

+ `\(E[\hat{\beta_1}]=\beta_1\)` in `\(EE_1\)`? (omitted variable bias?)
+ How does `\(Var(\hat{\beta_1})\)` in `\(EE_2\)` compare to its counterpart in `\(EE_1\)`?
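
---

# A single-sample preview

Before the full simulation, a quick single-sample sketch (an illustration added here, not part of the original deck; it uses base R's `lm()` rather than `feols()`): with `\(cor(x_1, x_2) = 0\)` and `\(\beta_2 = 0\)`, adding `\(x_2\)` barely changes the standard error of `\(\hat{\beta}_1\)`.

```r
set.seed(123) # arbitrary seed for this illustration

N <- 100
x1 <- rnorm(N) # independent variable
x2 <- rnorm(N) # irrelevant variable, uncorrelated with x1
y <- 1 + x1 + 0 * x2 + rnorm(N) # true beta2 = 0

#--- standard error of beta1-hat under each estimating equation ---#
se_ee1 <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]      # EE1
se_ee2 <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"] # EE2

c(se_ee1, se_ee2) # nearly identical
```

The Monte Carlo simulation on the next slide repeats this experiment many times to trace out the full sampling distributions.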
---

# Monte Carlo Simulation

```r
#* load packages
library(fixest)
library(data.table)
library(ggplot2)

#--------------------------
# Monte Carlo Simulation
#--------------------------
set.seed(37834)

N <- 100 # sample size
B <- 1000 # the number of iterations
estimates_storage <- matrix(0, B, 2)

for (i in 1:B) { # iterate the same process B times
  #--- data generation ---#
  x1 <- rnorm(N) # independent variable
  x2 <- rnorm(N) # independent variable
  u <- rnorm(N) # error
  y <- 1 + x1 + 0 * x2 + u # dependent variable
  data <- data.frame(y = y, x1 = x1, x2 = x2)

  #--- OLS ---#
  beta_ee1 <- feols(y ~ x1, data = data)$coefficients["x1"] # OLS with EE1
  beta_ee2 <- feols(y ~ x1 + x2, data = data)$coefficients["x1"] # OLS with EE2

  #--- store estimates ---#
  estimates_storage[i, 1] <- beta_ee1
  estimates_storage[i, 2] <- beta_ee2
}

#--------------------------
# Visualize the results
#--------------------------
b_ee1 <- data.table(
  bhat = estimates_storage[, 1],
  type = "EE 1"
)
b_ee2 <- data.table(
  bhat = estimates_storage[, 2],
  type = "EE 2"
)
plot_data <- rbind(b_ee1, b_ee2)

g_case_1 <- ggplot(data = plot_data) +
  geom_density(aes(x = bhat, fill = type), alpha = 0.5) +
  scale_fill_discrete(name = "Estimating Equation") +
  theme(legend.position = "bottom")
```

---

# MC Results

```r
g_case_1
```

<img src="data:image/png;base64,#OmittedMulticollinear_x_files/figure-html/unnamed-chunk-4-1.png" width="60%" style="display: block; margin: auto;" />

---

# Theoretical Insights: Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) = 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> The estimated model </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

<html><div style='float:left'></div><hr 
color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Question: </span> `\(E[v_i|x_{1,i}]=0?\)`

---

# Theoretical Insights: Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) = 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> The estimated model </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Question: </span> `\(E[v_i|x_{1,i}]=0?\)`

<span style="color:red"> Yes, because `\(x_1\)` is correlated with neither `\(x_2\)` nor `\(u\)`. </span>

---

# Theoretical Insights: Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) = 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> The estimated model </span>

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Question: </span> `\(E[u_i|x_{1,i},x_{2,i}]=0\)`?
--- # Theoretical Insights: Bias <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Question: </span> `\(E[u_i|x_{1,i},x_{2,i}]=0\)`? <span style="color:red"> Yes, because `\(x_1\)` and `\(x_2\)` are not correlated with `\(u\)` (by assumption). </span> --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? 
--- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? <span style="color:red"> 0 because there are no other variables included in the model.</span> --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. 
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? <span style="color:red"> 0 on average because `\(cor(x_1, x_2)=0\)` </span> --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. 
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two models: </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> In which of `\(EE_1\)` and `\(EE_2\)` is `\(\sigma^2\)` larger?

---

# Theoretical Insights: Variance of `\(\hat{\beta}_1\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) = 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)`

where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates.

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two models: </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> In which of `\(EE_1\)` and `\(EE_2\)` is `\(\sigma^2\)` larger?

<span style="color:red"> They are the same because `\(\beta_2 = 0\)`, meaning `\(u = v\)`.
</span>

---

class: middle

# Summary

+ If you include an irrelevant variable that has no explanatory power beyond `\(x_1\)` and is not correlated with `\(x_1\)` (EE2), then the variance of the OLS estimator on `\(x_1\)` will be essentially the same as when you do not include `\(x_2\)` as a covariate (EE1)
+ If you omit an irrelevant variable that has no explanatory power beyond `\(x_1\)` (EE1) and is not correlated with `\(x_1\)`, then the OLS estimator on `\(x_1\)` is still unbiased

---

class: inverse, center, middle
name: case-2

# Case 2

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

---

# Case 2

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) \ne 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<span style="color:blue"> Example: </span> `\(\mbox{Income} = \beta_0 + \beta_1 \times Age + \beta_2 \times \mbox{# of wrinkles} + u\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two estimating equations (EE) </span>

`\(EE_1\)`: `\(y_i=\beta_0 + \beta_1 x_{1,i} + v_i \;\; (\beta_2 x_{2,i} + u_i)\)`

`\(EE_2\)`: `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

--

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> What do you think is going to happen? Any guess? </span>

+ `\(E[\hat{\beta_1}]=\beta_1\)` in `\(EE_1\)`? (omitted variable bias?)
+ How does `\(Var(\hat{\beta_1})\)` in `\(EE_2\)` compare to its counterpart in `\(EE_1\)`?
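
---

# How correlated are `\(x_1\)` and `\(x_2\)` here?

A quick check (an illustration added here, using base R's `lm()`; the shared-component construction mirrors the simulation that follows): when `\(x_1\)` and `\(x_2\)` share a common term, the `\(R^2_1\)` from regressing `\(x_1\)` on `\(x_2\)` is close to 1, so `\(1/(1-R^2_1)\)` (the variance inflation factor) is large.

```r
set.seed(456) # arbitrary seed for this illustration

N <- 1000
mu <- rnorm(N) # common component shared by x1 and x2
x1 <- 0.1 * rnorm(N) + 0.9 * mu
x2 <- 0.1 * rnorm(N) + 0.9 * mu

cor(x1, x2) # close to 1

R2_1 <- summary(lm(x1 ~ x2))$r.squared # the R_1^2 in the variance formula
1 / (1 - R2_1) # variance inflation factor: far above 1
```

With so little independent variation in `\(x_1\)`, the denominator `\(SST_1(1-R^2_1)\)` shrinks and `\(Var(\hat{\beta}_1)\)` blows up once `\(x_2\)` is included.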
---

# Monte Carlo Simulation

```r
#--------------------------
# Monte Carlo Simulation
#--------------------------
set.seed(37834)

N <- 100 # sample size
B <- 1000 # the number of iterations
estimates_storage <- matrix(0, B, 2)

for (i in 1:B) { # iterate the same process B times
  #--- data generation ---#
  mu <- rnorm(N) # common term shared by x1 and x2
  x1 <- 0.1 * rnorm(N) + 0.9 * mu # independent variable
  x2 <- 0.1 * rnorm(N) + 0.9 * mu # independent variable
  u <- rnorm(N) # error
  y <- 1 + x1 + 0 * x2 + u # dependent variable
  data <- data.frame(y = y, x1 = x1, x2 = x2)

  #--- OLS ---#
  beta_ee1 <- feols(y ~ x1, data = data)$coefficients["x1"] # OLS with EE1
  beta_ee2 <- feols(y ~ x1 + x2, data = data)$coefficients["x1"] # OLS with EE2

  #--- store estimates ---#
  estimates_storage[i, 1] <- beta_ee1
  estimates_storage[i, 2] <- beta_ee2
}

#--------------------------
# Visualize the results
#--------------------------
b_ee1 <- data.table(
  bhat = estimates_storage[, 1],
  type = "EE 1"
)
b_ee2 <- data.table(
  bhat = estimates_storage[, 2],
  type = "EE 2"
)
plot_data <- rbind(b_ee1, b_ee2)

g_case_2 <- ggplot(data = plot_data) +
  geom_density(aes(x = bhat, fill = type), alpha = 0.5) +
  scale_fill_discrete(name = "Estimating Equation") +
  theme(legend.position = "bottom")
```

---

# MC Results

```r
g_case_2
```

<img src="data:image/png;base64,#OmittedMulticollinear_x_files/figure-html/unnamed-chunk-6-1.png" width="60%" style="display: block; margin: auto;" />

---

# Theoretical Insights: Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) \ne 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> The estimated model </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

<html><div 
style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Question: </span> `\(E[v_i|x_{1,i}]=0?\)`

---

# Theoretical Insights: Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) \ne 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> The estimated model </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Question: </span> `\(E[v_i|x_{1,i}]=0?\)`

<span style="color:red"> Yes, because `\(\beta_2 = 0\)`, meaning that `\(x_2\)` is actually not part of the error term `\(v_i\)`. </span>

---

# Theoretical Insights: Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) \ne 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> The estimated model </span>

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Question: </span> `\(E[u_i|x_{1,i},x_{2,i}]=0\)`?
--- # Theoretical Insights: Bias <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Question: </span> `\(E[u_i|x_{1,i},x_{2,i}]=0\)`? <span style="color:red"> Yes, because `\(x_1\)` and `\(x_2\)` are not correlated with `\(u\)` (by assumption). </span> --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? 
--- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? <span style="color:red"> 0 because there are no other variables included in the model.</span> --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. 
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? <span style="color:red"> Very high because `\(x_1\)` and `\(x_2\)` are highly correlated! 
</span> <span style="color:red"> So, the estimation accuracy of `\(\beta_1\)` in `\(EE_2\)` is much lower than in `\(EE_1\)`! </span>

---

# Theoretical Insights: Variance of `\(\hat{\beta}_1\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) \ne 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)`

where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates.

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two models: </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> In which of `\(EE_1\)` and `\(EE_2\)` is `\(\sigma^2\)` larger?

---

# Theoretical Insights: Variance of `\(\hat{\beta}_1\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) \ne 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)`

where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates.
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two models: </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> In which of `\(EE_1\)` and `\(EE_2\)` is `\(\sigma^2\)` larger?

<span style="color:red"> They are the same because `\(\beta_2 = 0\)`, meaning `\(u = v\)`. </span>

---

class: middle

# Summary

+ If you include an irrelevant variable that has no explanatory power beyond `\(x_1\)`, but is highly correlated with `\(x_1\)` (EE2), then the variance of the OLS estimator on `\(x_1\)` is larger than when you do not include `\(x_2\)` (EE1)
+ If you omit an irrelevant variable that has no explanatory power beyond `\(x_1\)` (EE1), but is highly correlated with `\(x_1\)`, then the OLS estimator on `\(x_1\)` is still unbiased

---

class: inverse, center, middle
name: case-3

# Case 3

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

---

# Case 3

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) = 0\)`
+ `\(\beta_2 \ne 0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<span style="color:blue"> Example: Randomized N trial</span> `\(\mbox{corn yield} = \beta_0 + \beta_1 \times N + \beta_2 \times \mbox{organic matter} + u\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two estimating equations (EE) </span>

`\(EE_1\)`: `\(y_i=\beta_0 + \beta_1 x_{1,i} + v_i \;\; (\beta_2 x_{2,i} + u_i)\)`

`\(EE_2\)`: `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

--

<html><div style='float:left'></div><hr color='#EB811B' 
size=1px width=440px></html>

<span style="color:blue"> What do you think is going to happen? Any guess? </span>

+ `\(E[\hat{\beta_1}]=\beta_1\)` in `\(EE_1\)`? (omitted variable bias?)
+ How does `\(Var(\hat{\beta_1})\)` in `\(EE_2\)` compare to its counterpart in `\(EE_1\)`?

---

# Monte Carlo Simulation

```r
#--------------------------
# Monte Carlo Simulation
#--------------------------
library(fixest) # feols()
library(data.table)
library(ggplot2)

set.seed(37834)

N <- 100 # sample size
B <- 1000 # the number of iterations
estimates_storage <- matrix(0, B, 2)

for (i in 1:B) { # iterate the same process B times

  #--- data generation ---#
  x1 <- rnorm(N) # independent variable
  x2 <- rnorm(N) # independent variable (uncorrelated with x1)
  u <- rnorm(N) # error
  y <- 1 + x1 + x2 + u # dependent variable
  data <- data.frame(y = y, x1 = x1, x2 = x2)

  #--- OLS ---#
  beta_ee1 <- feols(y ~ x1, data = data)$coefficient["x1"] # OLS with EE1
  beta_ee2 <- feols(y ~ x1 + x2, data = data)$coefficient["x1"] # OLS with EE2

  #--- store estimates ---#
  estimates_storage[i, 1] <- beta_ee1
  estimates_storage[i, 2] <- beta_ee2
}

#--------------------------
# Visualize the results
#--------------------------
b_ee1 <- data.table(
  bhat = estimates_storage[, 1],
  type = "EE 1"
)

b_ee2 <- data.table(
  bhat = estimates_storage[, 2],
  type = "EE 2"
)

plot_data <- rbind(b_ee1, b_ee2)

g_case_3 <- ggplot(data = plot_data) +
  geom_density(aes(x = bhat, fill = type), alpha = 0.5) +
  scale_fill_discrete(name = "Estimating Equation") +
  theme(legend.position = "bottom")
```

---

# MC Results

```r
g_case_3
```

<img src="data:image/png;base64,#OmittedMulticollinear_x_files/figure-html/unnamed-chunk-8-1.png" width="60%" style="display: block; margin: auto;" />

---

# Theoretical Insights: Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span>

`\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) = 0\)`
+ `\(\beta_2 \ne 0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr
color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Question: </span> `\(E[v_i|x_{1,i}]=0?\)` --- # Theoretical Insights: Bias <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Question: </span> `\(E[v_i|x_{1,i}]=0?\)` <span style="color:red"> Yes, because `\(x_1\)` and `\(x_2\)` are not correlated. </span> --- # Theoretical Insights: Bias <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Question: </span> `\(E[u_i|x_{1,i},x_{2,i}]=0\)`? 
--- # Theoretical Insights: Bias <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Question: </span> `\(E[u_i|x_{1,i},x_{2,i}]=0\)`? <span style="color:red"> Yes, because `\(x_1\)` and `\(x_2\)` are not correlated with `\(u\)` (by assumption). </span> --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? 
--- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? <span style="color:red"> 0 because there are no other variables included in the model.</span> --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. 
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> The estimated model </span>

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> `\(R_j^2\)`?

---

# Theoretical Insights: Variance of `\(\hat{\beta}_1\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span>

`\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) = 0\)`
+ `\(\beta_2 \ne 0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Variance: </span>

`\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)`

where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates.

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> The estimated model </span>

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> `\(R_j^2\)`?

<span style="color:red"> 0 on average, because `\(x_1\)` and `\(x_2\)` are not correlated.
</span>

---

# Theoretical Insights: Variance of `\(\hat{\beta}_1\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span>

`\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) = 0\)`
+ `\(\beta_2 \ne 0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Variance: </span>

`\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)`

where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates.

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two models: </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> In which of `\(EE_1\)` and `\(EE_2\)` is `\(\sigma^2\)` larger?

---

# Theoretical Insights: Variance of `\(\hat{\beta}_1\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span>

`\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) = 0\)`
+ `\(\beta_2 \ne 0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Variance: </span>

`\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)`

where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates.
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two models: </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> In which of `\(EE_1\)` and `\(EE_2\)` is `\(\sigma^2\)` larger?

<span style="color:red"> `\(Var(v_i) > Var(u_i)\)` because `\(\beta_2 x_{2}\)` (non-zero) is part of `\(v_i\)` on top of `\(u_i\)`.</span>

<span style="color:red"> So, the estimation of `\(\beta_1\)` is more efficient in `\(EE_2\)` than in `\(EE_1\)`.</span>

---

class: middle

# Summary

+ If you include a variable that has some explanatory power beyond `\(x_1\)`, but is not correlated with `\(x_1\)` (EE2), then the variance of the OLS estimator on `\(x_1\)` is smaller compared to when you do not include `\(x_2\)` (EE1)

+ If you omit a variable that has some explanatory power beyond `\(x_1\)` (EE1), but is not correlated with `\(x_1\)`, then the OLS estimator on `\(x_1\)` is still unbiased

---

class: inverse, center, middle
name: case-4

# Case 4

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

---

# Case 4

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span>

`\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) \ne 0\)`
+ `\(\beta_2 \ne 0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<span style="color:blue"> Example</span>

`\(\mbox{income} = \beta_0 + \beta_1 \times \mbox{education} + \beta_2 \times \mbox{ability} + u\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two estimating equations (EE) </span>

`\(EE_1\)`: `\(y_i=\beta_0 + \beta_1 x_{1,i} + v_i (\beta_2 x_{2,i} + u_i)\)`

`\(EE_2\)`:
`\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

--

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> What do you think is going to happen? Any guess? </span>

+ `\(E[\hat{\beta_1}]=\beta_1\)` in `\(EE_1\)`? (omitted variable bias?)
+ How does `\(Var(\hat{\beta_1})\)` in `\(EE_2\)` compare to its counterpart in `\(EE_1\)`?

---

# Monte Carlo Simulation

```r
#--------------------------
# Monte Carlo Simulation
#--------------------------
library(fixest) # feols()
library(data.table)
library(ggplot2)

set.seed(37834)

N <- 100 # sample size
B <- 1000 # the number of iterations
estimates_storage <- matrix(0, B, 2)

for (i in 1:B) { # iterate the same process B times

  #--- data generation ---#
  mu <- rnorm(N) # common term shared by x1 and x2
  x1 <- 0.1 * rnorm(N) + 0.9 * mu # independent variable
  x2 <- 0.1 * rnorm(N) + 0.9 * mu # independent variable
  u <- rnorm(N) # error
  y <- 1 + x1 + 1 * x2 + u # dependent variable
  data <- data.frame(y = y, x1 = x1, x2 = x2)

  #--- OLS ---#
  beta_ee1 <- feols(y ~ x1, data = data)$coefficient["x1"] # OLS with EE1
  beta_ee2 <- feols(y ~ x1 + x2, data = data)$coefficient["x1"] # OLS with EE2

  #--- store estimates ---#
  estimates_storage[i, 1] <- beta_ee1
  estimates_storage[i, 2] <- beta_ee2
}

#--------------------------
# Visualize the results
#--------------------------
b_ee1 <- data.table(
  bhat = estimates_storage[, 1],
  type = "EE 1"
)

b_ee2 <- data.table(
  bhat = estimates_storage[, 2],
  type = "EE 2"
)

plot_data <- rbind(b_ee1, b_ee2)

g_case_4 <- ggplot(data = plot_data) +
  geom_density(aes(x = bhat, fill = type), alpha = 0.5) +
  scale_fill_discrete(name = "Estimating Equation") +
  theme(legend.position = "bottom")
```

---

# MC Results

```r
g_case_4
```

<img src="data:image/png;base64,#OmittedMulticollinear_x_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" />

---

# Theoretical Insights: Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span>
`\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Question: </span> `\(E[v_i|x_{1,i}]=0?\)` --- # Theoretical Insights: Bias <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Question: </span> `\(E[v_i|x_{1,i}]=0?\)` <span style="color:red"> No, because `\(x_1\)` and `\(x_2\)` are correlated. </span> <span style="color:red"> So, the estimation of `\(\beta_1\)` in `\(EE_1\)` is biased! 
</span> --- # Theoretical Insights: Bias <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Question: </span> `\(E[u_i|x_{1,i},x_{2,i}]=0\)`? --- # Theoretical Insights: Bias <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Question: </span> `\(E[u_i|x_{1,i},x_{2,i}]=0\)`? <span style="color:red"> Yes, because `\(x_1\)` and `\(x_2\)` are not correlated with `\(u\)` (by assumption). 
</span> --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? 
<span style="color:red"> 0 because there are no other variables included in the model.</span> --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. 
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> The estimated model </span>

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> `\(R_j^2\)`?

<span style="color:red"> Very high because `\(x_1\)` and `\(x_2\)` are highly correlated! </span>

---

# Theoretical Insights: Variance of `\(\hat{\beta}_1\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span>

`\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) \ne 0\)`
+ `\(\beta_2 \ne 0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Variance: </span>

`\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)`

where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates.

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two models: </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> In which of `\(EE_1\)` and `\(EE_2\)` is `\(\sigma^2\)` larger?
---

# Theoretical Insights: Variance of `\(\hat{\beta}_1\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span>

`\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) \ne 0\)`
+ `\(\beta_2 \ne 0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Variance: </span>

`\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)`

where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates.

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two models: </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> In which of `\(EE_1\)` and `\(EE_2\)` is `\(\sigma^2\)` larger?

<span style="color:red"> `\(Var(v_i) > Var(u_i)\)` because `\(\beta_2 x_{2}\)` (non-zero) is part of `\(v_i\)` on top of `\(u_i\)`.</span>

---

# Estimation efficiency

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Variance: </span>

`\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)`

where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates.

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

Summarizing the results about the components of `\(Var(\hat{\beta}_j)\)`,

+ `\(R_j^2\)` is very high for `\(EE_2\)` because `\(x_1\)` and `\(x_2\)` are highly correlated, while it is `\(0\)` for `\(EE_1\)`.
+ `\(Var(v_i) > Var(u_i)\)` because `\(\beta_2 x_{2}\)` (non-zero) is part of `\(v_i\)` on top of `\(u_i\)`.
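These two forces pull in opposite directions, and which one wins depends on the data generating process. Here is a small simulation sketch (made-up DGP with a moderate correlation between `x1` and `x2` so that the flip is visible, using base `lm()`; `EE_1` is biased here, so the comparison is about variance only):

```r
# Sketch of the variance trade-off (made-up DGP): with x1 and x2 moderately
# correlated, whether EE1 (omit x2) or EE2 (include x2) yields the smaller
# sampling variance of the coefficient on x1 depends on the size of beta2.
sim_var <- function(beta2, B = 2000, N = 100) {
  est <- matrix(0, B, 2)
  for (i in 1:B) {
    x1 <- rnorm(N)
    x2 <- 0.5 * x1 + rnorm(N) # moderately correlated with x1
    y <- 1 + x1 + beta2 * x2 + rnorm(N)
    est[i, 1] <- coef(lm(y ~ x1))["x1"] # EE1
    est[i, 2] <- coef(lm(y ~ x1 + x2))["x1"] # EE2
  }
  c(EE1 = var(est[, 1]), EE2 = var(est[, 2]))
}

set.seed(789)
v_small <- sim_var(beta2 = 0.2) # small beta2: EE1 has the smaller variance here
v_large <- sim_var(beta2 = 3) # large beta2: EE2 has the smaller variance here
```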
--

So, whether `\(EE_1\)` is more efficient than `\(EE_2\)` or not is ambiguous. It depends on

+ the degree of the correlation between `\(x_1\)` and `\(x_2\)`
+ the magnitude of `\(\beta_2\)`

---

class: middle

# Summary

+ There exists a bias-variance trade-off when the independent variables are both relevant (non-zero coefficients) and correlated with each other

+ Economists tend to opt for unbiasedness

---

# Omitted Variable Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

<span style="color:blue"> True model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

--

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

<span style="color:blue"> EE1: </span> `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

Let `\(\tilde{\beta_1}\)` denote the estimator of `\(\beta_1\)` from this model

--

<span style="color:blue"> EE2: </span> `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

Let `\(\hat{\beta_1}\)` and `\(\hat{\beta_2}\)` denote the estimators of `\(\beta_1\)` and `\(\beta_2\)`

--

<span style="color:blue"> Relationship between `\(x_1\)` and `\(x_2\)` </span>

`\(x_{2,i} = \delta_0 + \delta_1 x_{1,i} + \mu_{i}\)`

Let `\(\tilde{\delta_1}\)` denote the estimator of `\(\delta_1\)` (from regressing the omitted `\(x_2\)` on the included `\(x_1\)`)

--

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

Then, `\(E[\tilde{\beta_1}] = \beta_1 + \beta_2 \tilde{\delta_1}\)` where `\(\beta_2 \tilde{\delta_1}\)` is the bias.

---

# Magnitude and direction of bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

Then, `\(E[\tilde{\beta_1}] = \beta_1 + \beta_2 \tilde{\delta_1}\)` where `\(\beta_2 \tilde{\delta_1}\)` is the bias.
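This decomposition holds exactly in-sample: the slope on `\(x_1\)` from the short regression equals `\(\hat{\beta}_1\)` plus `\(\hat{\beta}_2\)` times the slope from regressing the omitted `\(x_2\)` on the included `\(x_1\)`. A quick numeric check in R (made-up data; the coefficients are hypothetical):

```r
# Numeric check of the omitted variable bias identity:
# beta1-tilde (short regression) = beta1-hat + beta2-hat * delta1-tilde,
# where delta1-tilde comes from regressing the omitted x2 on the included x1.
set.seed(456)
n <- 500
x2 <- rnorm(n)
x1 <- 0.7 * x2 + rnorm(n) # x1 and x2 are correlated
y <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

b_short <- coef(lm(y ~ x1))["x1"] # beta1-tilde (x2 omitted)
b_long <- coef(lm(y ~ x1 + x2)) # beta1-hat and beta2-hat
d1 <- coef(lm(x2 ~ x1))["x1"] # delta1-tilde

# the identity holds to machine precision
all.equal(unname(b_short), unname(b_long["x1"] + b_long["x2"] * d1))
```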
--

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

<span style="color:blue"> Direction of bias </span>

+ `\(Cor(x_1, x_2) > 0\)` and `\(\beta_2 >0\)`, then `\(bias > 0\)`
+ `\(Cor(x_1, x_2) > 0\)` and `\(\beta_2 <0\)`, then `\(bias < 0\)`
+ `\(Cor(x_1, x_2) < 0\)` and `\(\beta_2 >0\)`, then `\(bias < 0\)`
+ `\(Cor(x_1, x_2) < 0\)` and `\(\beta_2 <0\)`, then `\(bias > 0\)`

--

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

<span style="color:blue"> Magnitude of bias </span>

+ The greater the correlation between `\(x_1\)` and `\(x_2\)`, the greater the bias
+ The greater the magnitude of `\(\beta_2\)`, the greater the bias

---

# Direction of bias: Practice

.content-box-green[**Example 1**]

$$
`\begin{aligned}
\mbox{corn yield} = \alpha + \beta \cdot N + (\gamma \cdot \mbox{soil erodibility} + \mu)
\end{aligned}`
$$

+ Farmers tend to apply more nitrogen to the field that is more erodible to compensate for the loss of nutrients due to erosion
+ Soil erodibility affects corn yield negatively `\((\gamma < 0)\)`

What is the direction of bias on `\(\hat{\beta}\)`?

--

<br>

.content-box-green[**Example 2**]

$$
`\begin{aligned}
\mbox{house price} = \alpha + \beta \cdot \mbox{dist to incinerators} + (\gamma \cdot \mbox{dist to city center} + \mu)
\end{aligned}`
$$

+ The city planner placed incinerators in the outskirts of a city to avoid their potentially negative health effects
+ Distance to city center has a negative impact on house price `\((\gamma < 0)\)`

What is the direction of bias on `\(\hat{\beta}\)`?
--

<br>

.content-box-green[**Example 3**]

$$
`\begin{aligned}
\mbox{groundwater use} = \alpha + \beta \cdot \mbox{precipitation} + (\gamma \cdot \mbox{center pivot} + \mu)
\end{aligned}`
$$

`\(\mbox{groundwater use}\)`: groundwater use by a farmer for irrigated production

`\(\mbox{center pivot}\)`: 1 if center pivot is used, 0 if flood irrigation (less effective) is used

+ Farmers who have relatively low precipitation during the growing season tend to adopt center pivot more
+ Center pivot applies water more efficiently than flood irrigation `\((\gamma < 0)\)`

What is the direction of bias on `\(\hat{\beta}\)`?

---

# So when does it help to know the direction of bias?

When the direction of the bias is the <span style = "color: red;"> opposite </span> of the expected coefficient on the variable of interest, you can claim that <span style = "color: blue;"> even after </span> suffering from the bias, you are still seeing the impact of the variable of interest. So, it is strong evidence that you would have had an even stronger estimated impact.

--

.content-box-green[**Example 1**]

$$
`\begin{aligned}
\mbox{groundwater use} = \alpha + \beta \cdot \mbox{precipitation} + (\gamma \cdot \mbox{center pivot} + \mu)
\end{aligned}`
$$

+ The true `\(\beta\)` is `\(-10\)` (<span style = "color: red;"> you do not observe this </span>)
+ The bias on `\(\hat{\beta}\)` is `\(5\)` (<span style = "color: red;"> you do not observe this </span>)
+ `\(\hat{\beta}\)` is `\(-5\)` (<span style = "color: red;"> you only observe this </span>)

You believe the direction of bias is positive (you need to provide reasoning behind your belief), and yet the estimated coefficient is still negative. So, you can be quite confident that the sign of the impact of precipitation is negative. You can say your estimate is a conservative estimate of the impact of precipitation on groundwater use.
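A simulated version of this example makes the logic concrete (all numbers are made up: the true `\(\beta = -10\)`, `\(\gamma = -20\)`, and the adoption rule are hypothetical). Because precipitation and center pivot adoption are negatively correlated and `\(\gamma < 0\)`, the bias is positive, so the short-regression estimate is pulled toward zero but remains negative:

```r
# Simulated version of the groundwater example (all numbers made up).
# Low precipitation raises center pivot adoption (negative correlation),
# and center pivot reduces water use (gamma < 0), so the bias is positive.
set.seed(2021)
n <- 2000
precip <- rnorm(n) # precipitation (standardized)
pivot <- as.numeric(precip + rnorm(n) < 0) # low precip -> more adoption
gw_use <- 100 - 10 * precip - 20 * pivot + rnorm(n) # true beta = -10

b_short <- coef(lm(gw_use ~ precip))["precip"] # pivot omitted
b_short # between -10 and 0: biased toward zero, but still negative
```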
.content-box-green[**Example 2**]

$$
`\begin{aligned}
\mbox{house price} = \alpha + \beta \cdot \mbox{dist to incinerators} + (\gamma \cdot \mbox{dist to city center} + \mu)
\end{aligned}`
$$

+ The true `\(\beta\)` is `\(-10\)` (<span style = "color: red;"> you do not observe this </span>)
+ The bias on `\(\hat{\beta}\)` is `\(-5\)` (<span style = "color: red;"> you do not observe this </span>)
+ `\(\hat{\beta}\)` is `\(-15\)` (<span style = "color: red;"> you only observe this </span>)

You believe the direction of bias is negative, and the estimated coefficient is negative. So, unlike the case above, you cannot be confident that `\(\hat{\beta}\)` would have been negative if it were not for the bias (that is, had you observed dist to city center and included it as a covariate). It is entirely possible for the bias to be so large that the estimated coefficient turns negative even when the true `\(\beta\)` is positive. In this case, there is nothing you can do.