class: center, middle, inverse, title-slide # Omitted Variable Bias and Multicollinearity ### AECN 396/896-002 --- <style type="text/css"> @media print { .has-continuation { display: block !important; } } .remark-slide-content.hljs-github h1 { margin-top: 5px; margin-bottom: 25px; } .remark-slide-content.hljs-github { padding-top: 10px; padding-left: 30px; padding-right: 30px; } .panel-tabs { <!-- color: #062A00; --> color: #841F27; margin-top: 0px; margin-bottom: 0px; margin-left: 0px; padding-bottom: 0px; } .panel-tab { margin-top: 0px; margin-bottom: 0px; margin-left: 3px; margin-right: 3px; padding-top: 0px; padding-bottom: 0px; } .panelset .panel-tabs .panel-tab { min-height: 40px; } .remark-slide th { border-bottom: 1px solid #ddd; } .remark-slide thead { border-bottom: 0px; } .gt_footnote { padding: 2px; } .remark-slide table { border-collapse: collapse; } .remark-slide tbody { border-bottom: 2px solid #666; } .important { background-color: lightpink; border: 2px solid blue; font-weight: bold; } .remark-code { display: block; overflow-x: auto; padding: .5em; background: #ffe7e7; } .hljs-github .hljs { background: #f2f2fd; } .remark-inline-code { padding-top: 0px; padding-bottom: 0px; background-color: #e6e6e6; } .r.hljs.remark-code.remark-inline-code{ font-size: 0.9em } .left-full { width: 80%; height: 92%; float: left; } .left-code { width: 38%; height: 92%; float: left; } .right-plot { width: 60%; float: right; padding-left: 1%; } .left5 { width: 49%; height: 92%; float: left; } .right5 { width: 49%; float: right; padding-left: 1%; } .left3 { width: 29%; height: 92%; float: left; } .right7 { width: 69%; float: right; padding-left: 1%; } .left4 { width: 38%; height: 92%; float: left; } .right6 { width: 60%; float: right; padding-left: 1%; } ul li{ margin: 7px; } ul, li{ margin-left: 15px; padding-left: 0px; } ol li{ margin: 7px; } ol, li{ margin-left: 15px; padding-left: 0px; } </style> <style type="text/css"> .content-box { box-sizing: border-box; 
background-color: #e2e2e2; } .content-box-blue, .content-box-gray, .content-box-grey, .content-box-army, .content-box-green, .content-box-purple, .content-box-red, .content-box-yellow { box-sizing: border-box; border-radius: 5px; margin: 0 0 10px; overflow: hidden; padding: 0px 5px 0px 5px; width: 100%; } .content-box-blue { background-color: #F0F8FF; } .content-box-gray { background-color: #e2e2e2; } .content-box-grey { background-color: #F5F5F5; } .content-box-army { background-color: #737a36; } .content-box-green { background-color: #d9edc2; } .content-box-purple { background-color: #e2e2f9; } .content-box-red { background-color: #ffcccc; } .content-box-yellow { background-color: #fef5c4; } .content-box-blue .remark-inline-code, .content-box-blue .remark-inline-code, .content-box-gray .remark-inline-code, .content-box-grey .remark-inline-code, .content-box-army .remark-inline-code, .content-box-green .remark-inline-code, .content-box-purple .remark-inline-code, .content-box-red .remark-inline-code, .content-box-yellow .remark-inline-code { background: none; } .full-width { display: flex; width: 100%; flex: 1 1 auto; } </style> <style type="text/css"> blockquote, .blockquote { display: block; margin-top: 0.1em; margin-bottom: 0.2em; margin-left: 5px; margin-right: 5px; border-left: solid 10px #0148A4; border-top: solid 2px #0148A4; border-bottom: solid 2px #0148A4; border-right: solid 2px #0148A4; box-shadow: 0 0 6px rgba(0,0,0,0.5); /* background-color: #e64626; */ color: #e64626; padding: 0.5em; -moz-border-radius: 5px; -webkit-border-radius: 5px; } .blockquote p { margin-top: 0px; margin-bottom: 5px; } .blockquote > h1:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h2:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h3:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h4:first-of-type { margin-top: 0px; margin-bottom: 5px; } .text-shadow { text-shadow: 0 0 4px #424242; } </style> <style 
type="text/css">
/******************
 * Slide scrolling
 * (non-functional)
 * not sure if it is a good idea anyway
slides > slide {
  overflow: scroll;
  padding: 5px 40px;
}
.scrollable-slide .remark-slide {
  height: 400px;
  overflow: scroll !important;
}
******************/
.scroll-box-8 { height:8em; overflow-y: scroll; }
.scroll-box-10 { height:10em; overflow-y: scroll; }
.scroll-box-12 { height:12em; overflow-y: scroll; }
.scroll-box-14 { height:14em; overflow-y: scroll; }
.scroll-box-16 { height:16em; overflow-y: scroll; }
.scroll-box-18 { height:18em; overflow-y: scroll; }
.scroll-box-20 { height:20em; overflow-y: scroll; }
.scroll-box-24 { height:24em; overflow-y: scroll; }
.scroll-box-30 { height:30em; overflow-y: scroll; }
.scroll-output { height: 90%; overflow-y: scroll; }
</style>

# What variables to include or not

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

You often

+ face the decision of whether or not to include a particular variable: <span style="color:red"> how do you make the right decision? </span>
+ miss a variable that you know is important because it is simply not available: <span style="color:red"> what are the consequences? </span>

--

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

Two important concepts you need to be aware of:

+ Multicollinearity
+ Omitted Variable Bias

---

# Multicollinearity and Omitted Variable Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

**Multicollinearity**: A phenomenon where two or more variables are highly correlated (negatively or positively) with each other (<span style="color:blue"> consequences?
</span>)

**Omitted Variable Bias**: Bias caused by not including (omitting) <span style="color:blue"> important </span> variables in the model

---

# Multicollinearity and Omitted Variable Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

Consider the following model,

`$$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i$$`

Your interest is in estimating the impact of `\(x_1\)` on `\(y\)`.

--

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

## Objectives:

Using this simple model, we investigate what happens to the coefficient estimate on `\(x_1\)` if you include/omit `\(x_2\)`

---

# Questions we tackle

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

The model:

`$$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i$$`

**Question 1**: What happens if `\(\beta_2=0\)`, but you <span style="color:blue">include</span> `\(x_2\)`, which is <span style="color:blue">not</span> correlated with `\(x_1\)`?

**Question 2**: What happens if `\(\beta_2=0\)`, but you <span style="color:blue">include</span> `\(x_2\)`, which is <span style="color:blue">highly</span> correlated with `\(x_1\)`?

**Question 3**: What happens if `\(\beta_2\ne 0\)`, but you <span style="color:blue">omit</span> `\(x_2\)`, which is <span style="color:blue">not</span> correlated with `\(x_1\)`?

**Question 4**: What happens if `\(\beta_2\ne 0\)`, but you <span style="color:blue">omit</span> `\(x_2\)`, which is <span style="color:blue">highly</span> correlated with `\(x_1\)`?

---

# Key consequences of interest

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

+ Is `\(\hat{\beta_1}\)` unbiased, that is, `\(E[\hat{\beta_1}]=\beta_1\)`?
+ `\(Var(\hat{\beta_1})\)`?
(how accurate the estimation of `\(\hat{\beta_1}\)` is)

---

class: inverse, center, middle
name: case-1

# Case 1

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

---

# Case 1

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) = 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<span style="color:blue"> Example: </span> `\(\mbox{corn yield} = \beta_0 + \beta_1 \times N + \beta_2 \times \mbox{farmers' height} + u\)`

--

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two estimating equations (EE) </span>

`\(EE_1\)`: `\(y_i=\beta_0 + \beta_1 x_{1,i} + v_i \;\; (\beta_2 x_{2,i} + u_i)\)`

`\(EE_2\)`: `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

--

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> What do you think is going to happen? Any guess? </span>

+ `\(E[\hat{\beta_1}]=\beta_1\)` in `\(EE_1\)`? (omitted variable bias?)
+ How does `\(Var(\hat{\beta_1})\)` in `\(EE_2\)` compare to its counterpart in `\(EE_1\)`?
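
---

# A single-sample preview

Before the full simulation, a quick single-sample sketch (an illustration added here, not part of the original deck; it uses base R's `lm()` rather than `feols()`): with `\(cor(x_1, x_2) = 0\)` and `\(\beta_2 = 0\)`, adding `\(x_2\)` barely changes the standard error of `\(\hat{\beta}_1\)`.

```r
set.seed(123) # arbitrary seed for this illustration

N <- 100
x1 <- rnorm(N) # independent variable
x2 <- rnorm(N) # irrelevant variable, uncorrelated with x1
y <- 1 + x1 + 0 * x2 + rnorm(N) # true beta2 = 0

#--- standard error of beta1-hat under each estimating equation ---#
se_ee1 <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]      # EE1
se_ee2 <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"] # EE2

c(se_ee1, se_ee2) # nearly identical
```

The Monte Carlo simulation on the next slide repeats this experiment many times to trace out the full sampling distributions.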
---

# Monte Carlo Simulation

```r
#* load packages
library(fixest)
library(data.table)
library(ggplot2)

#--------------------------
# Monte Carlo Simulation
#--------------------------
set.seed(37834)

N <- 100 # sample size
B <- 1000 # the number of iterations
estimates_storage <- matrix(0, B, 2)

for (i in 1:B) { # iterate the same process B times
  #--- data generation ---#
  x1 <- rnorm(N) # independent variable
  x2 <- rnorm(N) # independent variable
  u <- rnorm(N) # error
  y <- 1 + x1 + 0 * x2 + u # dependent variable
  data <- data.frame(y = y, x1 = x1, x2 = x2)

  #--- OLS ---#
  beta_ee1 <- feols(y ~ x1, data = data)$coefficients["x1"] # OLS with EE1
  beta_ee2 <- feols(y ~ x1 + x2, data = data)$coefficients["x1"] # OLS with EE2

  #--- store estimates ---#
  estimates_storage[i, 1] <- beta_ee1
  estimates_storage[i, 2] <- beta_ee2
}

#--------------------------
# Visualize the results
#--------------------------
b_ee1 <- data.table(
  bhat = estimates_storage[, 1],
  type = "EE 1"
)
b_ee2 <- data.table(
  bhat = estimates_storage[, 2],
  type = "EE 2"
)
plot_data <- rbind(b_ee1, b_ee2)

g_case_1 <- ggplot(data = plot_data) +
  geom_density(aes(x = bhat, fill = type), alpha = 0.5) +
  scale_fill_discrete(name = "Estimating Equation") +
  theme(legend.position = "bottom")
```

---

# MC Results

```r
g_case_1
```

<img src="data:image/png;base64,#OmittedMulticollinear_x_files/figure-html/unnamed-chunk-4-1.png" width="60%" style="display: block; margin: auto;" />

---

# Theoretical Insights: Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) = 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> The estimated model </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

<html><div style='float:left'></div><hr 
color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Question: </span> `\(E[v_i|x_{1,i}]=0?\)`

---

# Theoretical Insights: Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) = 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> The estimated model </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Question: </span> `\(E[v_i|x_{1,i}]=0?\)`

<span style="color:red"> Yes, because `\(x_1\)` is correlated with neither `\(x_2\)` nor `\(u\)`. </span>

---

# Theoretical Insights: Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) = 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> The estimated model </span>

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Question: </span> `\(E[u_i|x_{1,i},x_{2,i}]=0\)`?
--- # Theoretical Insights: Bias <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Question: </span> `\(E[u_i|x_{1,i},x_{2,i}]=0\)`? <span style="color:red"> Yes, because `\(x_1\)` and `\(x_2\)` are not correlated with `\(u\)` (by assumption). </span> --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? 
--- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? <span style="color:red"> 0 because there are no other variables included in the model.</span> --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. 
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? <span style="color:red"> 0 on average because `\(cor(x_1, x_2)=0\)` </span> --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. 
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two models: </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> In which of `\(EE_1\)` and `\(EE_2\)` is `\(\sigma^2\)` larger?

---

# Theoretical Insights: Variance of `\(\hat{\beta}_1\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) = 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)`

where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates.

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two models: </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> In which of `\(EE_1\)` and `\(EE_2\)` is `\(\sigma^2\)` larger?

<span style="color:red"> They are the same because `\(\beta_2 = 0\)`, meaning `\(u = v\)`.
</span>

---

class: middle

# Summary

+ If you include an irrelevant variable that has no explanatory power beyond `\(x_1\)` and is not correlated with `\(x_1\)` (EE2), then the variance of the OLS estimator on `\(x_1\)` will be essentially the same as when you do not include `\(x_2\)` as a covariate (EE1)
+ If you omit an irrelevant variable that has no explanatory power beyond `\(x_1\)` (EE1) and is not correlated with `\(x_1\)`, then the OLS estimator on `\(x_1\)` is still unbiased

---

class: inverse, center, middle
name: case-2

# Case 2

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

---

# Case 2

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) \ne 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<span style="color:blue"> Example: </span> `\(\mbox{Income} = \beta_0 + \beta_1 \times Age + \beta_2 \times \mbox{# of wrinkles} + u\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two estimating equations (EE) </span>

`\(EE_1\)`: `\(y_i=\beta_0 + \beta_1 x_{1,i} + v_i \;\; (\beta_2 x_{2,i} + u_i)\)`

`\(EE_2\)`: `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

--

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> What do you think is going to happen? Any guess? </span>

+ `\(E[\hat{\beta_1}]=\beta_1\)` in `\(EE_1\)`? (omitted variable bias?)
+ How does `\(Var(\hat{\beta_1})\)` in `\(EE_2\)` compare to its counterpart in `\(EE_1\)`?
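
---

# How correlated are `\(x_1\)` and `\(x_2\)` here?

A quick check (an illustration added here, using base R's `lm()`; the shared-component construction mirrors the simulation that follows): when `\(x_1\)` and `\(x_2\)` share a common term, the `\(R^2_1\)` from regressing `\(x_1\)` on `\(x_2\)` is close to 1, so `\(1/(1-R^2_1)\)` (the variance inflation factor) is large.

```r
set.seed(456) # arbitrary seed for this illustration

N <- 1000
mu <- rnorm(N) # common component shared by x1 and x2
x1 <- 0.1 * rnorm(N) + 0.9 * mu
x2 <- 0.1 * rnorm(N) + 0.9 * mu

cor(x1, x2) # close to 1

R2_1 <- summary(lm(x1 ~ x2))$r.squared # the R_1^2 in the variance formula
1 / (1 - R2_1) # variance inflation factor: far above 1
```

With so little independent variation in `\(x_1\)`, the denominator `\(SST_1(1-R^2_1)\)` shrinks and `\(Var(\hat{\beta}_1)\)` blows up once `\(x_2\)` is included.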
---

# Monte Carlo Simulation

```r
#--------------------------
# Monte Carlo Simulation
#--------------------------
set.seed(37834)

N <- 100 # sample size
B <- 1000 # the number of iterations
estimates_storage <- matrix(0, B, 2)

for (i in 1:B) { # iterate the same process B times
  #--- data generation ---#
  mu <- rnorm(N) # common term shared by x1 and x2
  x1 <- 0.1 * rnorm(N) + 0.9 * mu # independent variable
  x2 <- 0.1 * rnorm(N) + 0.9 * mu # independent variable
  u <- rnorm(N) # error
  y <- 1 + x1 + 0 * x2 + u # dependent variable
  data <- data.frame(y = y, x1 = x1, x2 = x2)

  #--- OLS ---#
  beta_ee1 <- feols(y ~ x1, data = data)$coefficients["x1"] # OLS with EE1
  beta_ee2 <- feols(y ~ x1 + x2, data = data)$coefficients["x1"] # OLS with EE2

  #--- store estimates ---#
  estimates_storage[i, 1] <- beta_ee1
  estimates_storage[i, 2] <- beta_ee2
}

#--------------------------
# Visualize the results
#--------------------------
b_ee1 <- data.table(
  bhat = estimates_storage[, 1],
  type = "EE 1"
)
b_ee2 <- data.table(
  bhat = estimates_storage[, 2],
  type = "EE 2"
)
plot_data <- rbind(b_ee1, b_ee2)

g_case_2 <- ggplot(data = plot_data) +
  geom_density(aes(x = bhat, fill = type), alpha = 0.5) +
  scale_fill_discrete(name = "Estimating Equation") +
  theme(legend.position = "bottom")
```

---

# MC Results

```r
g_case_2
```

<img src="data:image/png;base64,#OmittedMulticollinear_x_files/figure-html/unnamed-chunk-6-1.png" width="60%" style="display: block; margin: auto;" />

---

# Theoretical Insights: Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) \ne 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> The estimated model </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

<html><div 
style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Question: </span> `\(E[v_i|x_{1,i}]=0?\)`

---

# Theoretical Insights: Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) \ne 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> The estimated model </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Question: </span> `\(E[v_i|x_{1,i}]=0?\)`

<span style="color:red"> Yes, because `\(\beta_2 = 0\)`, meaning that `\(x_2\)` is actually not part of the error term `\(v_i\)`. </span>

---

# Theoretical Insights: Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) \ne 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> The estimated model </span>

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Question: </span> `\(E[u_i|x_{1,i},x_{2,i}]=0\)`?
--- # Theoretical Insights: Bias <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Question: </span> `\(E[u_i|x_{1,i},x_{2,i}]=0\)`? <span style="color:red"> Yes, because `\(x_1\)` and `\(x_2\)` are not correlated with `\(u\)` (by assumption). </span> --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? 
--- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? <span style="color:red"> 0 because there are no other variables included in the model.</span> --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. 
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2=0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? <span style="color:red"> Very high because `\(x_1\)` and `\(x_2\)` are highly correlated! 
</span> <span style="color:red"> So, the estimation accuracy of `\(\beta_1\)` in `\(EE_2\)` is much lower than in `\(EE_1\)`! </span>

---

# Theoretical Insights: Variance of `\(\hat{\beta}_1\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) \ne 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)`

where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates.

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two models: </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> In which of `\(EE_1\)` and `\(EE_2\)` is `\(\sigma^2\)` larger?

---

# Theoretical Insights: Variance of `\(\hat{\beta}_1\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) \ne 0\)`
+ `\(\beta_2=0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)`

where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates.
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two models: </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> In which of `\(EE_1\)` and `\(EE_2\)` is `\(\sigma^2\)` larger?

<span style="color:red"> They are the same because `\(\beta_2 = 0\)`, meaning `\(u = v\)`. </span>

---

class: middle

# Summary

+ If you include an irrelevant variable that has no explanatory power beyond `\(x_1\)`, but is highly correlated with `\(x_1\)` (EE2), then the variance of the OLS estimator on `\(x_1\)` is larger than when you do not include `\(x_2\)` (EE1)
+ If you omit an irrelevant variable that has no explanatory power beyond `\(x_1\)` (EE1), but is highly correlated with `\(x_1\)`, then the OLS estimator on `\(x_1\)` is still unbiased

---

class: inverse, center, middle
name: case-3

# Case 3

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

---

# Case 3

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) = 0\)`
+ `\(\beta_2 \ne 0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<span style="color:blue"> Example: Randomized N trial</span> `\(\mbox{corn yield} = \beta_0 + \beta_1 \times N + \beta_2 \times \mbox{organic matter} + u\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two estimating equations (EE) </span>

`\(EE_1\)`: `\(y_i=\beta_0 + \beta_1 x_{1,i} + v_i \;\; (\beta_2 x_{2,i} + u_i)\)`

`\(EE_2\)`: `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

--

<html><div style='float:left'></div><hr color='#EB811B' 
size=1px width=440px></html>

<span style="color:blue"> What do you think is going to happen? Any guess? </span>

+ `\(E[\hat{\beta_1}]=\beta_1\)` in `\(EE_1\)`? (omitted variable bias?)
+ How does `\(Var(\hat{\beta_1})\)` in `\(EE_2\)` compare to its counterpart in `\(EE_1\)`?

---

# Monte Carlo Simulation

```r
#--------------------------
# Monte Carlo Simulation
#--------------------------
library(fixest) # feols()
library(data.table)
library(ggplot2)

set.seed(37834)

N <- 100 # sample size
B <- 1000 # the number of iterations
estimates_storage <- matrix(0, B, 2)

for (i in 1:B) { # iterate the same process B times

  #--- data generation ---#
  x1 <- rnorm(N) # independent variable
  x2 <- rnorm(N) # independent variable (uncorrelated with x1)
  u <- rnorm(N) # error
  y <- 1 + x1 + x2 + u # dependent variable
  data <- data.frame(y = y, x1 = x1, x2 = x2)

  #--- OLS ---#
  beta_ee1 <- feols(y ~ x1, data = data)$coefficient["x1"] # OLS with EE1
  beta_ee2 <- feols(y ~ x1 + x2, data = data)$coefficient["x1"] # OLS with EE2

  #--- store estimates ---#
  estimates_storage[i, 1] <- beta_ee1
  estimates_storage[i, 2] <- beta_ee2
}

#--------------------------
# Visualize the results
#--------------------------
b_ee1 <- data.table(
  bhat = estimates_storage[, 1],
  type = "EE 1"
)

b_ee2 <- data.table(
  bhat = estimates_storage[, 2],
  type = "EE 2"
)

plot_data <- rbind(b_ee1, b_ee2)

g_case_3 <- ggplot(data = plot_data) +
  geom_density(aes(x = bhat, fill = type), alpha = 0.5) +
  scale_fill_discrete(name = "Estimating Equation") +
  theme(legend.position = "bottom")
```

---

# MC Results

```r
g_case_3
```

<img src="data:image/png;base64,#OmittedMulticollinear_x_files/figure-html/unnamed-chunk-8-1.png" width="60%" style="display: block; margin: auto;" />

---

# Theoretical Insights: Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span>

`\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) = 0\)`
+ `\(\beta_2 \ne 0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr
color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Question: </span> `\(E[v_i|x_{1,i}]=0?\)` --- # Theoretical Insights: Bias <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Question: </span> `\(E[v_i|x_{1,i}]=0?\)` <span style="color:red"> Yes, because `\(x_1\)` and `\(x_2\)` are not correlated. </span> --- # Theoretical Insights: Bias <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Question: </span> `\(E[u_i|x_{1,i},x_{2,i}]=0\)`? 
--- # Theoretical Insights: Bias <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Question: </span> `\(E[u_i|x_{1,i},x_{2,i}]=0\)`? <span style="color:red"> Yes, because `\(x_1\)` and `\(x_2\)` are not correlated with `\(u\)` (by assumption). </span> --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? 
--- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? <span style="color:red"> 0 because there are no other variables included in the model.</span> --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) = 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. 
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> The estimated model </span>

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> `\(R_j^2\)`?

---

# Theoretical Insights: Variance of `\(\hat{\beta}_1\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span>

`\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) = 0\)`
+ `\(\beta_2 \ne 0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Variance: </span>

`\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)`

where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates.

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> The estimated model </span>

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> `\(R_j^2\)`?

<span style="color:red"> 0 on average, because `\(x_1\)` and `\(x_2\)` are not correlated.
</span>

---

# Theoretical Insights: Variance of `\(\hat{\beta}_1\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span>

`\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) = 0\)`
+ `\(\beta_2 \ne 0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Variance: </span>

`\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)`

where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates.

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two models: </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> In which of `\(EE_1\)` and `\(EE_2\)` is `\(\sigma^2\)` larger?

---

# Theoretical Insights: Variance of `\(\hat{\beta}_1\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span>

`\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) = 0\)`
+ `\(\beta_2 \ne 0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Variance: </span>

`\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)`

where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates.
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two models: </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> In which of `\(EE_1\)` and `\(EE_2\)` is `\(\sigma^2\)` larger?

<span style="color:red"> `\(Var(v_i) > Var(u_i)\)` because `\(\beta_2 x_{2}\)` (non-zero) is part of `\(v_i\)` on top of `\(u_i\)`.</span>

<span style="color:red"> So, the estimation of `\(\beta_1\)` is more efficient in `\(EE_2\)` than in `\(EE_1\)`.</span>

---

class: middle

# Summary

+ If you include a variable that has some explanatory power beyond `\(x_1\)`, but is not correlated with `\(x_1\)` (EE2), then the variance of the OLS estimator on `\(x_1\)` is smaller compared to when you do not include `\(x_2\)` (EE1)

+ If you omit a variable that has some explanatory power beyond `\(x_1\)` (EE1), but is not correlated with `\(x_1\)`, then the OLS estimator on `\(x_1\)` is still unbiased

---

class: inverse, center, middle
name: case-4

# Case 4

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

---

# Case 4

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span>

`\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) \ne 0\)`
+ `\(\beta_2 \ne 0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<span style="color:blue"> Example</span>

`\(\mbox{income} = \beta_0 + \beta_1 \times \mbox{education} + \beta_2 \times \mbox{ability} + u\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two estimating equations (EE) </span>

`\(EE_1\)`: `\(y_i=\beta_0 + \beta_1 x_{1,i} + v_i (\beta_2 x_{2,i} + u_i)\)`

`\(EE_2\)`:
`\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

--

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> What do you think is going to happen? Any guess? </span>

+ `\(E[\hat{\beta_1}]=\beta_1\)` in `\(EE_1\)`? (omitted variable bias?)
+ How does `\(Var(\hat{\beta_1})\)` in `\(EE_2\)` compare to its counterpart in `\(EE_1\)`?

---

# Monte Carlo Simulation

```r
#--------------------------
# Monte Carlo Simulation
#--------------------------
library(fixest) # feols()
library(data.table)
library(ggplot2)

set.seed(37834)

N <- 100 # sample size
B <- 1000 # the number of iterations
estimates_storage <- matrix(0, B, 2)

for (i in 1:B) { # iterate the same process B times

  #--- data generation ---#
  mu <- rnorm(N) # common term shared by x1 and x2
  x1 <- 0.1 * rnorm(N) + 0.9 * mu # independent variable
  x2 <- 0.1 * rnorm(N) + 0.9 * mu # independent variable
  u <- rnorm(N) # error
  y <- 1 + x1 + 1 * x2 + u # dependent variable
  data <- data.frame(y = y, x1 = x1, x2 = x2)

  #--- OLS ---#
  beta_ee1 <- feols(y ~ x1, data = data)$coefficient["x1"] # OLS with EE1
  beta_ee2 <- feols(y ~ x1 + x2, data = data)$coefficient["x1"] # OLS with EE2

  #--- store estimates ---#
  estimates_storage[i, 1] <- beta_ee1
  estimates_storage[i, 2] <- beta_ee2
}

#--------------------------
# Visualize the results
#--------------------------
b_ee1 <- data.table(
  bhat = estimates_storage[, 1],
  type = "EE 1"
)

b_ee2 <- data.table(
  bhat = estimates_storage[, 2],
  type = "EE 2"
)

plot_data <- rbind(b_ee1, b_ee2)

g_case_4 <- ggplot(data = plot_data) +
  geom_density(aes(x = bhat, fill = type), alpha = 0.5) +
  scale_fill_discrete(name = "Estimating Equation") +
  theme(legend.position = "bottom")
```

---

# MC Results

```r
g_case_4
```

<img src="data:image/png;base64,#OmittedMulticollinear_x_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" />

---

# Theoretical Insights: Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span>
`\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Question: </span> `\(E[v_i|x_{1,i}]=0?\)` --- # Theoretical Insights: Bias <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Question: </span> `\(E[v_i|x_{1,i}]=0?\)` <span style="color:red"> No, because `\(x_1\)` and `\(x_2\)` are correlated. </span> <span style="color:red"> So, the estimation of `\(\beta_1\)` in `\(EE_1\)` is biased! 
</span> --- # Theoretical Insights: Bias <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Question: </span> `\(E[u_i|x_{1,i},x_{2,i}]=0\)`? --- # Theoretical Insights: Bias <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Question: </span> `\(E[u_i|x_{1,i},x_{2,i}]=0\)`? <span style="color:red"> Yes, because `\(x_1\)` and `\(x_2\)` are not correlated with `\(u\)` (by assumption). 
</span> --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? 
<span style="color:red"> 0 because there are no other variables included in the model.</span> --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> The estimated model </span> `\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:red"> Question: </span> `\(R_j^2\)`? --- # Theoretical Insights: Variance of `\(\hat{\beta}_1\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> True Model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)` + `\(cor(x_1,x_2) \ne 0\)` + `\(\beta_2 \ne 0\)` + `\(E[u_i|x_{1,i},x_{2,i}]=0\)` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html> <span style="color:blue"> Variance: </span> `\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)` where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates. 
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> The estimated model </span>

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> `\(R_j^2\)`?

<span style="color:red"> Very high because `\(x_1\)` and `\(x_2\)` are highly correlated! </span>

---

# Theoretical Insights: Variance of `\(\hat{\beta}_1\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span>

`\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) \ne 0\)`
+ `\(\beta_2 \ne 0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Variance: </span>

`\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)`

where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates.

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two models: </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> In which of `\(EE_1\)` and `\(EE_2\)` is `\(\sigma^2\)` larger?
---

# Theoretical Insights: Variance of `\(\hat{\beta}_1\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> True Model: </span>

`\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

+ `\(cor(x_1,x_2) \ne 0\)`
+ `\(\beta_2 \ne 0\)`
+ `\(E[u_i|x_{1,i},x_{2,i}]=0\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Variance: </span>

`\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)`

where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates.

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Two models: </span>

`\(EE_1\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

`\(EE_2\)`: `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:red"> Question: </span> In which of `\(EE_1\)` and `\(EE_2\)` is `\(\sigma^2\)` larger?

<span style="color:red"> `\(Var(v_i) > Var(u_i)\)` because `\(\beta_2 x_{2}\)` (non-zero) is part of `\(v_i\)` on top of `\(u_i\)`.</span>

---

# Estimation efficiency

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

<span style="color:blue"> Variance: </span>

`\(Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}\)`

where `\(R^2_j\)` is the `\(R^2\)` when you regress `\(x_j\)` on all the other covariates.

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=440px></html>

Summarizing the results about the components of `\(Var(\hat{\beta}_j)\)`,

+ `\(R_j^2\)` is very high for `\(EE_2\)` because `\(x_1\)` and `\(x_2\)` are highly correlated, while it is `\(0\)` for `\(EE_1\)`.
+ `\(Var(v_i) > Var(u_i)\)` because `\(\beta_2 x_{2}\)` (non-zero) is part of `\(v_i\)` on top of `\(u_i\)`.
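These two forces pull in opposite directions, and which one wins depends on the data generating process. Here is a small simulation sketch (made-up DGP with a moderate correlation between `x1` and `x2` so that the flip is visible, using base `lm()`; `EE_1` is biased here, so the comparison is about variance only):

```r
# Sketch of the variance trade-off (made-up DGP): with x1 and x2 moderately
# correlated, whether EE1 (omit x2) or EE2 (include x2) yields the smaller
# sampling variance of the coefficient on x1 depends on the size of beta2.
sim_var <- function(beta2, B = 2000, N = 100) {
  est <- matrix(0, B, 2)
  for (i in 1:B) {
    x1 <- rnorm(N)
    x2 <- 0.5 * x1 + rnorm(N) # moderately correlated with x1
    y <- 1 + x1 + beta2 * x2 + rnorm(N)
    est[i, 1] <- coef(lm(y ~ x1))["x1"] # EE1
    est[i, 2] <- coef(lm(y ~ x1 + x2))["x1"] # EE2
  }
  c(EE1 = var(est[, 1]), EE2 = var(est[, 2]))
}

set.seed(789)
v_small <- sim_var(beta2 = 0.2) # small beta2: EE1 has the smaller variance here
v_large <- sim_var(beta2 = 3) # large beta2: EE2 has the smaller variance here
```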
--

So, whether `\(EE_1\)` is more efficient than `\(EE_2\)` or not is ambiguous. It depends on

+ the degree of the correlation between `\(x_1\)` and `\(x_2\)`
+ the magnitude of `\(\beta_2\)`

---

class: middle

# Summary

+ There exists a bias-variance trade-off when the independent variables are both relevant (non-zero coefficients) and correlated with each other

+ Economists tend to opt for unbiasedness

---

# Omitted Variable Bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

<span style="color:blue"> True model: </span> `\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)`

--

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

<span style="color:blue"> EE1: </span> `\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (\beta_2 x_{2,i} + u_{i})\)`

Let `\(\tilde{\beta_1}\)` denote the estimator of `\(\beta_1\)` from this model

--

<span style="color:blue"> EE2: </span> `\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)`

Let `\(\hat{\beta_1}\)` and `\(\hat{\beta_2}\)` denote the estimators of `\(\beta_1\)` and `\(\beta_2\)`

--

<span style="color:blue"> Relationship between `\(x_1\)` and `\(x_2\)` </span>

`\(x_{2,i} = \delta_0 + \delta_1 x_{1,i} + \mu_{i}\)`

Let `\(\tilde{\delta_1}\)` denote the estimator of `\(\delta_1\)` (from regressing the omitted `\(x_2\)` on the included `\(x_1\)`)

--

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

Then, `\(E[\tilde{\beta_1}] = \beta_1 + \beta_2 \tilde{\delta_1}\)` where `\(\beta_2 \tilde{\delta_1}\)` is the bias.

---

# Magnitude and direction of bias

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

Then, `\(E[\tilde{\beta_1}] = \beta_1 + \beta_2 \tilde{\delta_1}\)` where `\(\beta_2 \tilde{\delta_1}\)` is the bias.
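This decomposition holds exactly in-sample: the slope on `\(x_1\)` from the short regression equals `\(\hat{\beta}_1\)` plus `\(\hat{\beta}_2\)` times the slope from regressing the omitted `\(x_2\)` on the included `\(x_1\)`. A quick numeric check in R (made-up data; the coefficients are hypothetical):

```r
# Numeric check of the omitted variable bias identity:
# beta1-tilde (short regression) = beta1-hat + beta2-hat * delta1-tilde,
# where delta1-tilde comes from regressing the omitted x2 on the included x1.
set.seed(456)
n <- 500
x2 <- rnorm(n)
x1 <- 0.7 * x2 + rnorm(n) # x1 and x2 are correlated
y <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

b_short <- coef(lm(y ~ x1))["x1"] # beta1-tilde (x2 omitted)
b_long <- coef(lm(y ~ x1 + x2)) # beta1-hat and beta2-hat
d1 <- coef(lm(x2 ~ x1))["x1"] # delta1-tilde

# the identity holds to machine precision
all.equal(unname(b_short), unname(b_long["x1"] + b_long["x2"] * d1))
```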
--

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

<span style="color:blue"> Direction of bias </span>

+ `\(Cor(x_1, x_2) > 0\)` and `\(\beta_2 >0\)`, then `\(bias > 0\)`
+ `\(Cor(x_1, x_2) > 0\)` and `\(\beta_2 <0\)`, then `\(bias < 0\)`
+ `\(Cor(x_1, x_2) < 0\)` and `\(\beta_2 >0\)`, then `\(bias < 0\)`
+ `\(Cor(x_1, x_2) < 0\)` and `\(\beta_2 <0\)`, then `\(bias > 0\)`

--

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

<span style="color:blue"> Magnitude of bias </span>

+ The greater the correlation between `\(x_1\)` and `\(x_2\)`, the greater the bias
+ The greater the magnitude of `\(\beta_2\)`, the greater the bias

---

# Direction of bias: Practice

.content-box-green[**Example 1**]

$$
`\begin{aligned}
\mbox{corn yield} = \alpha + \beta \cdot N + (\gamma \cdot \mbox{soil erodibility} + \mu)
\end{aligned}`
$$

+ Farmers tend to apply more nitrogen to the field that is more erodible to compensate for the loss of nutrients due to erosion
+ Soil erodibility affects corn yield negatively `\((\gamma < 0)\)`

What is the direction of bias on `\(\hat{\beta}\)`?

--

<br>

.content-box-green[**Example 2**]

$$
`\begin{aligned}
\mbox{house price} = \alpha + \beta \cdot \mbox{dist to incinerators} + (\gamma \cdot \mbox{dist to city center} + \mu)
\end{aligned}`
$$

+ The city planner placed incinerators in the outskirts of a city to avoid their potentially negative health effects
+ Distance to city center has a negative impact on house price `\((\gamma < 0)\)`

What is the direction of bias on `\(\hat{\beta}\)`?
--

<br>

.content-box-green[**Example 3**]

$$
`\begin{aligned}
\mbox{groundwater use} = \alpha + \beta \cdot \mbox{precipitation} + (\gamma \cdot \mbox{center pivot} + \mu)
\end{aligned}`
$$

`\(\mbox{groundwater use}\)`: groundwater use by a farmer for irrigated production

`\(\mbox{center pivot}\)`: 1 if center pivot is used, 0 if flood irrigation (less effective) is used

+ Farmers who have relatively low precipitation during the growing season tend to adopt center pivot more
+ Center pivot applies water more efficiently than flood irrigation `\((\gamma < 0)\)`

What is the direction of bias on `\(\hat{\beta}\)`?

---

# So when does it help to know the direction of bias?

When the direction of the bias is the <span style = "color: red;"> opposite </span> of the expected coefficient on the variable of interest, you can claim that <span style = "color: blue;"> even after </span> suffering from the bias, you are still seeing the impact of the variable of interest. So, it is strong evidence that you would have had an even stronger estimated impact.

--

.content-box-green[**Example 1**]

$$
`\begin{aligned}
\mbox{groundwater use} = \alpha + \beta \cdot \mbox{precipitation} + (\gamma \cdot \mbox{center pivot} + \mu)
\end{aligned}`
$$

+ The true `\(\beta\)` is `\(-10\)` (<span style = "color: red;"> you do not observe this </span>)
+ The bias on `\(\hat{\beta}\)` is `\(5\)` (<span style = "color: red;"> you do not observe this </span>)
+ `\(\hat{\beta}\)` is `\(-5\)` (<span style = "color: red;"> you only observe this </span>)

You believe the direction of bias is positive (you need to provide reasoning behind your belief), and yet the estimated coefficient is still negative. So, you can be quite confident that the sign of the impact of precipitation is negative. You can say your estimate is a conservative estimate of the impact of precipitation on groundwater use.
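A simulated version of this example makes the logic concrete (all numbers are made up: the true `\(\beta = -10\)`, `\(\gamma = -20\)`, and the adoption rule are hypothetical). Because precipitation and center pivot adoption are negatively correlated and `\(\gamma < 0\)`, the bias is positive, so the short-regression estimate is pulled toward zero but remains negative:

```r
# Simulated version of the groundwater example (all numbers made up).
# Low precipitation raises center pivot adoption (negative correlation),
# and center pivot reduces water use (gamma < 0), so the bias is positive.
set.seed(2021)
n <- 2000
precip <- rnorm(n) # precipitation (standardized)
pivot <- as.numeric(precip + rnorm(n) < 0) # low precip -> more adoption
gw_use <- 100 - 10 * precip - 20 * pivot + rnorm(n) # true beta = -10

b_short <- coef(lm(gw_use ~ precip))["precip"] # pivot omitted
b_short # between -10 and 0: biased toward zero, but still negative
```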
.content-box-green[**Example 2**]

$$
`\begin{aligned}
\mbox{house price} = \alpha + \beta \cdot \mbox{dist to incinerators} + (\gamma \cdot \mbox{dist to city center} + \mu)
\end{aligned}`
$$

+ The true `\(\beta\)` is `\(-10\)` (<span style = "color: red;"> you do not observe this </span>)
+ The bias on `\(\hat{\beta}\)` is `\(-5\)` (<span style = "color: red;"> you do not observe this </span>)
+ `\(\hat{\beta}\)` is `\(-15\)` (<span style = "color: red;"> you only observe this </span>)

You believe the direction of bias is negative, and the estimated coefficient is negative. So, unlike the case above, you cannot be confident that `\(\hat{\beta}\)` would have been negative if it were not for the bias (that is, had you observed dist to city center and included it as a covariate). It is entirely possible for the bias to be so large that the estimated coefficient turns negative even when the true `\(\beta\)` is positive. In this case, there is nothing you can do.