class: center, middle, inverse, title-slide # Dealing with Endogeneity: Instrumental Variable ### AECN 396/896-002 --- <style type="text/css"> @media print { .has-continuation { display: block !important; } } .remark-slide-content.hljs-github h1 { margin-top: 5px; margin-bottom: 25px; } .remark-slide-content.hljs-github { padding-top: 10px; padding-left: 30px; padding-right: 30px; } .panel-tabs { <!-- color: #062A00; --> color: #841F27; margin-top: 0px; margin-bottom: 0px; margin-left: 0px; padding-bottom: 0px; } .panel-tab { margin-top: 0px; margin-bottom: 0px; margin-left: 3px; margin-right: 3px; padding-top: 0px; padding-bottom: 0px; } .panelset .panel-tabs .panel-tab { min-height: 40px; } .remark-slide th { border-bottom: 1px solid #ddd; } .remark-slide thead { border-bottom: 0px; } .gt_footnote { padding: 2px; } .remark-slide table { border-collapse: collapse; } .remark-slide tbody { border-bottom: 2px solid #666; } .important { background-color: lightpink; border: 2px solid blue; font-weight: bold; } .remark-code { display: block; overflow-x: auto; padding: .5em; background: #ffe7e7; } .hljs-github .hljs { background: #f2f2fd; } .remark-inline-code { padding-top: 0px; padding-bottom: 0px; background-color: #e6e6e6; } .r.hljs.remark-code.remark-inline-code{ font-size: 0.9em } .left-full { width: 80%; height: 92%; float: left; } .left-code { width: 38%; height: 92%; float: left; } .right-plot { width: 60%; float: right; padding-left: 1%; } .left5 { width: 49%; height: 92%; float: left; } .right5 { width: 49%; float: right; padding-left: 1%; } .left3 { width: 29%; height: 92%; float: left; } .right7 { width: 69%; float: right; padding-left: 1%; } .left4 { width: 38%; height: 92%; float: left; } .right6 { width: 60%; float: right; padding-left: 1%; } ul li{ margin: 7px; } ul, li{ margin-left: 15px; padding-left: 0px; } ol li{ margin: 7px; } ol, li{ margin-left: 15px; padding-left: 0px; } </style> <style type="text/css"> .content-box { box-sizing: border-box; 
background-color: #e2e2e2; } .content-box-blue, .content-box-gray, .content-box-grey, .content-box-army, .content-box-green, .content-box-purple, .content-box-red, .content-box-yellow { box-sizing: border-box; border-radius: 5px; margin: 0 0 10px; overflow: hidden; padding: 0px 5px 0px 5px; width: 100%; } .content-box-blue { background-color: #F0F8FF; } .content-box-gray { background-color: #e2e2e2; } .content-box-grey { background-color: #F5F5F5; } .content-box-army { background-color: #737a36; } .content-box-green { background-color: #d9edc2; } .content-box-purple { background-color: #e2e2f9; } .content-box-red { background-color: #ffcccc; } .content-box-yellow { background-color: #fef5c4; } .content-box-blue .remark-inline-code, .content-box-blue .remark-inline-code, .content-box-gray .remark-inline-code, .content-box-grey .remark-inline-code, .content-box-army .remark-inline-code, .content-box-green .remark-inline-code, .content-box-purple .remark-inline-code, .content-box-red .remark-inline-code, .content-box-yellow .remark-inline-code { background: none; } .full-width { display: flex; width: 100%; flex: 1 1 auto; } </style> <style type="text/css"> blockquote, .blockquote { display: block; margin-top: 0.1em; margin-bottom: 0.2em; margin-left: 5px; margin-right: 5px; border-left: solid 10px #0148A4; border-top: solid 2px #0148A4; border-bottom: solid 2px #0148A4; border-right: solid 2px #0148A4; box-shadow: 0 0 6px rgba(0,0,0,0.5); /* background-color: #e64626; */ color: #e64626; padding: 0.5em; -moz-border-radius: 5px; -webkit-border-radius: 5px; } .blockquote p { margin-top: 0px; margin-bottom: 5px; } .blockquote > h1:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h2:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h3:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h4:first-of-type { margin-top: 0px; margin-bottom: 5px; } .text-shadow { text-shadow: 0 0 4px #424242; } </style> <style 
type="text/css"> /****************** * Slide scrolling * (non-functional) * not sure if it is a good idea anyway slides > slide { overflow: scroll; padding: 5px 40px; } .scrollable-slide .remark-slide { height: 400px; overflow: scroll !important; } ******************/ .scroll-box-8 { height:8em; overflow-y: scroll; } .scroll-box-10 { height:10em; overflow-y: scroll; } .scroll-box-12 { height:12em; overflow-y: scroll; } .scroll-box-14 { height:14em; overflow-y: scroll; } .scroll-box-16 { height:16em; overflow-y: scroll; } .scroll-box-18 { height:18em; overflow-y: scroll; } .scroll-box-20 { height:20em; overflow-y: scroll; } .scroll-box-24 { height:24em; overflow-y: scroll; } .scroll-box-30 { height:30em; overflow-y: scroll; } .scroll-output { height: 90%; overflow-y: scroll; } </style> # Before we start ## Learning objectives Understand how instrumental variable (IV) estimation works. ## Table of contents 1. [Instrumental Variable (IV) Approach](#inst) 2. [IV in R](#iv-r) --- class: middle # Endogeneity .content-box-red[**Endogeneity**] `\(E[u|x_k] \ne 0\)` (the error term is not correlated with any of the independent variables) -- .content-box-red[**Endogenous independent variable**] If the error term is, .red[for whatever reason], correlated with the independent variable `\(x_k\)`, then we say that `\(x_k\)` is an endogenous independent variable. + Omitted variable + Selection + Reverse causality + Measurement error --- class: inverse, center, middle name: inst # Instrumental Variable (IV) Approach <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html> --- class: middle # Causal Diagram You want to estimate the causal impact of education on income. + Variable of interest: Education + Dependent variable: Income
--- class: middle # Rough Idea of IV Approach Find a variable like `\(Z\)` in the diagram below:
+ `\(Z\)` does <span style = "color: blue;"> NOT </span> affect income <span style = "color: blue;"> directly </span> + `\(Z\)` is correlated with the variable of interest (education) - does not matter which causes which (association is enough) + `\(Z\)` is <span style = "color: blue;"> NOT </span> correlated with <span style = "color: red;"> any </span> of the unobservable variables in the error term (including ability) that are making the variable of interest (education) endogenous. - `\(Z\)` does not affect ability - ability does not affect `\(Z\)` --- class: middle .content-box-green[**The Model**] `\(y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + u\)` + `\(x_1\)` is endogenous: `\(E[u|x_1] \ne 0\)` (or `\(Cov(u,x_1)\ne 0\)`) + `\(x_2\)` is exogenous: `\(E[u|x_2] = 0\)` (or `\(Cov(u,x_2) = 0\)`) --- class: middle .content-box-green[**Idea (very loosely put)**] Bring in variable(s) (.blue[Instrumental variable(s)]) that does .red[NOT] belong to the model, but .red[IS] related to the endogenous variable: + Using the instrumental variable(s) (which we denote by `\(Z\)`), make the endogenous variable exogenous, which we call .blue[instrumented] variable(s) + Use the variation in the instrumented variable instead of the original endogenous variable to estimate the impact of the original variable --- class: middle # IV estimation procedure .content-box-green[**Step 1**] Using the instrumental variables, make the endogenous variable exogenous, which we call .blue[instrumented] variable -- <br> .content-box-green[**Step 1: mathematically**] + Regress the endogenous variable `\((x_1)\)` on the instrumental variable(s) `\((Z=\{z_1,z_2\}\)`, two instruments here) and all the other exogenous variables `\((x_2\)` here) `\(x_1 = \alpha_0 + \sigma_2 x_2 + \alpha_1 z_1 +\alpha_2 z_2 + v\)` -- + obtain the predicted value of `\(x_1\)` from the regression `\(\hat{x}_1 = \hat{\alpha}_0 + \hat{\sigma}_2 x_2 + \hat{\alpha}_1 z_1 + \hat{\alpha}_2 z_2\)` --- class: middle # IV estimation
procedure .content-box-green[**Step 2**] Use the variation in the instrumented variable instead of the original endogenous variable to estimate the impact of the original variable -- <br> .content-box-green[**Step 2: Mathematically**] Regress the dependent variable `\((y)\)` on the instrumented variable `\((\hat{x}_1)\)`, `\(y= \beta_0 + \beta_1 \hat{x_1}+ \beta_2 x_2 + \varepsilon\)` to estimate the coefficient on `\(x_1\)` in the original model --- class: middle # Example .content-box-green[**Model of interest**] `\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + (\beta_3 ability + v)\)` + Regress `\(log(wage)\)` on `\(educ\)` and `\(exper\)` `\((ability\)` not included because you do not observe it) + `\((\beta_3 ability + v)\)` is the error term + `\(educ\)` is considered endogenous (correlated with `\(ability\)`) + `\(exper\)` is considered exogenous (not correlated with `\(ability\)`) -- <br> .content-box-green[**Instruments (Z)**] Suppose you selected the following variables as instruments: + IQ test score `\((IQ)\)` + number of siblings `\((sibs)\)` --- class: middle .left4[ .content-box-green[**Step 1:**] Regress `\(educ\)` on `\(exper\)`, `\(IQ\)`, and `\(sibs\)`: `\(educ = \alpha_0 + \alpha_1 exper + \alpha_2 IQ + \alpha_3 sibs + u\)` Use the coefficient estimates of `\(\alpha_0\)`, `\(\alpha_1\)`, `\(\alpha_2\)`, and `\(\alpha_3\)` to predict `\(educ\)` as a function of `\(exper\)`, `\(IQ\)`, and `\(sibs\)`.
`\(\hat{educ} = \hat{\alpha_0} + \hat{\alpha_1} exper + \hat{\alpha_2} IQ + \hat{\alpha_3} sibs\)` ] .right6[ ```r library(wooldridge) library(fixest) library(dplyr) data("wage2") #* regress educ on exper, IQ, and sibs first_reg <- feols(educ ~ exper + IQ + sibs, data = wage2) #* predict educ as a function of exper, IQ, and sibs wage2 <- mutate(wage2, educ_hat = first_reg$fitted.values) #* see the predicted values wage2 %>% relocate(educ_hat) %>% head() ``` ``` ## educ_hat wage hours IQ KWW educ exper tenure age married black south urban sibs brthord meduc feduc lwage ## 1 13.26398 769 40 93 35 12 11 2 31 1 0 0 1 1 2 8 8 6.645091 ## 2 14.80686 808 50 119 41 18 11 16 37 1 0 0 1 1 NA 14 14 6.694562 ## 3 14.15410 825 40 108 46 14 11 9 33 1 0 0 1 1 2 14 14 6.715384 ## 4 12.79569 650 40 96 32 12 13 7 32 1 0 0 1 4 3 12 12 6.476973 ## 5 10.73631 562 40 74 27 11 14 5 34 1 0 0 1 10 6 6 11 6.331502 ## 6 14.09006 1400 40 116 43 16 14 2 35 1 1 0 1 1 2 8 NA 7.244227 ``` ] --- class: middle .left4[ .content-box-green[**Step 2:**] Use `\(\hat{educ}\)` in place of `\(educ\)` to estimate the model of interest: `\(log(wage) = \beta_0 + \beta_1 \hat{educ} + \beta_2 exper + u\)` ] .right6[ ```r #* regression with educ_hat in place of educ second_reg <- feols(wage ~ educ_hat + exper, data = wage2) #* see the results second_reg ``` ``` ## OLS estimation, Dep. Var.: wage ## Observations: 935 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1269.7880 214.12821 -5.93004 4.2632e-09 *** ## educ_hat 138.1051 13.10586 10.53766 < 2.2e-16 *** ## exper 31.7955 4.14489 7.67101 4.2899e-14 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 382.0 Adj. R2: 0.104547 ``` ] --- class: middle # When does IV work? Just like OLS needs to satisfy some conditions for it to consistently estimate the coefficients, the IV approach needs to satisfy some conditions for it to work.
.content-box-green[**Estimation Procedure**] + Step 1: `\(\hat{x}_1 = \hat{\alpha}_0 +\hat{\sigma}_2 x_2 + \hat{\alpha}_1 z_1 + \hat{\alpha}_2 z_2\)` + Step 2: `\(y = \beta_0 + \beta_1 \hat{x_1}+ \beta_2 x_2 + \varepsilon\)` <br> .content-box-green[**Important question**] What are the conditions under which IV estimation is consistent? The instruments `\((Z)\)` need to satisfy two conditions, which we will discuss. --- class: middle # Condition 1 --- class: middle .content-box-green[**Estimation Procedure**] + Step 1: `\(\hat{x}_1 = \hat{\alpha}_0 +\hat{\sigma}_2 x_2 + \hat{\alpha}_1 z_1 + \hat{\alpha}_2 z_2\)` + Step 2: `\(y = \beta_0 + \beta_1 \hat{x_1}+ \beta_2 x_2 + \varepsilon\)` <br> .content-box-green[**Question**] What happens if the instruments `\(Z\)` have no power to explain `\(x_1\)` `\((\alpha_1=0\)` and `\(\alpha_2=0)\)`? -- <br> .content-box-green[**Answer**] + `\(\hat{x}_1=\hat{\alpha}_0+\hat{\sigma}_2 x_2\)` + What happens to `\(\hat{\beta}_1\)`? It cannot be estimated: `\(\hat{x}_1\)` is now an exact linear function of `\(x_2\)`, so the second-stage regression suffers from perfect collinearity -- That is, `\(\hat{x_1}\)` has no information beyond the information `\(x_2\)` possesses. --- class: middle .content-box-red[**Condition 1**] The instrument(s) `\(Z\)` have jointly significant explanatory power on the endogenous variable `\(x_1\)` .red[after] you control for all the other exogenous variables (here `\(x_2)\)` --- class: middle center # Condition 2 --- class: middle .content-box-green[**Model of interest**] `\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u\)` <br> .content-box-green[**Estimation Procedure**] + Step 1: `\(\hat{x}_1 = \hat{\alpha}_0 +\hat{\sigma}_2 x_2 + \hat{\alpha}_1 z_1 + \hat{\alpha}_2 z_2\)` + Step 2: `\(y = \beta_0 + \beta_1 \hat{x_1}+ \beta_2 x_2 + \varepsilon\)` -- Remember you can break `\(x_1\)` into the predicted part and the residuals. `\(x_1 = \hat{x}_1 + \hat{\varepsilon}\)` where `\(\hat{\varepsilon}\)` is the residual of the first stage estimation.
-- Plugging `\(x_1 = \hat{x}_1 + \hat{\varepsilon}\)` into the model of interest, `\(y = \beta_0 + \beta_1 (\hat{x}_1 + \hat{\varepsilon}) + \beta_2 x_2+ u\)` `\(\;\;\; = \beta_0 + \beta_1 \hat{x}_1 + \beta_2 x_2+ (\beta_1\hat{\varepsilon} + u)\)` So, if you regress `\(y\)` on `\(\hat{x}_1\)` and `\(x_2\)`, then the error term is `\((\beta_1\hat{\varepsilon} + u)\)`. --- class: middle .content-box-green[**Second stage regression**] `\(y = \beta_0 + \beta_1 \hat{x}_1 + \beta_2 x_2+ (\beta_1\hat{\varepsilon} + u)\)` -- <br> .content-box-green[**Question**] What is the condition under which the OLS estimation of `\(\beta_1\)` in the main model is unbiased? -- <br> .content-box-green[**Answer**] `\(\hat{x}_1\)` is not correlated with `\((\beta_1\hat{\varepsilon} + u)\)` -- This in turn means that `\(x_2\)`, `\(z_1\)`, and `\(z_2\)` are not correlated with `\(u\)` (the error term of the true model). Note that `\(\hat{x}_1\)` is by construction uncorrelated (orthogonal) with `\(\hat{\varepsilon}\)`. --- class: middle .content-box-red[**Condition 2**] + `\(z_1\)` and `\(z_2\)` do not belong in the main model, meaning they do not have any explanatory power beyond `\(x_2\)` (if they did, they should have been included in the model as independent variables in the first place) + `\(z_1\)` and `\(z_2\)` are not correlated with the error term (there are no unobserved factors in the error term that are correlated with `\(Z)\)` --- class: middle .content-box-green[**Question**] Do you think we can test condition 2? <br> -- .content-box-green[**Answer**] No, because we never observe the error term. <br> -- .content-box-red[**Important**] + All we can do is to .red[argue] that the instruments are not correlated with the error term. -- + Journal articles that use the IV method make careful arguments as to why their choice of instruments is not correlated with the error term.
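--- class: middle .content-box-green[**A quick numerical check**]

The orthogonality claim above is a mechanical property of OLS: the fitted values and residuals from any OLS regression are uncorrelated by construction. A minimal sketch with made-up data (all variable names and coefficients below are hypothetical, not from the wage example):

```r
# Verify that OLS fitted values are orthogonal to the residuals.
set.seed(2837) # arbitrary seed
n <- 200
x2 <- runif(n)                         # an exogenous variable
z1 <- runif(n)                         # an instrument
x1 <- 0.5 * x2 + 0.8 * z1 + rnorm(n)   # the "endogenous" variable

first_stage <- lm(x1 ~ x2 + z1)
x1_hat <- fitted(first_stage)  # predicted part
eps_hat <- resid(first_stage)  # residual part

# x1 decomposes exactly into the predicted part plus the residual
all.equal(x1, as.numeric(x1_hat + eps_hat))

# the two parts are uncorrelated (zero up to machine precision)
cor(x1_hat, eps_hat)
```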
--- class: middle .content-box-red[**Condition 1**] + The instrument(s) `\(Z\)` have jointly significant explanatory power on the endogenous variable `\(x_1\)` .red[after] you control for all the other exogenous variables (here `\(x_2)\)` <br> .content-box-red[**Condition 2**] + `\(z_1\)` and `\(z_2\)` do not belong in the main model, meaning they do not have any explanatory power beyond `\(x_2\)` (if they did, they should have been included in the model as independent variables in the first place) + `\(z_1\)` and `\(z_2\)` are not correlated with the error term (there are no unobserved factors in the error term that are correlated with `\(Z)\)` -- <br> .content-box-red[**Important**] + Condition 1 is always testable + Condition 2 is NOT testable (unless you have more instruments than endogenous variables) --- class: middle .content-box-green[**Two-Stage Least Squares (2SLS)**] The IV estimator is also called the two-stage least squares (2SLS) estimator because it involves two stages of OLS. + Step 1: `\(\hat{x}_1 = \hat{\alpha}_0 +\hat{\sigma}_2 x_2 + \hat{\alpha}_1 z_1 + \hat{\alpha}_2 z_2\)` + Step 2: `\(y = \beta_0 + \beta_1 \hat{x_1}+ \beta_2 x_2 + \varepsilon\)` -- + The 2SLS framework is a good way to understand conceptually why and how instrumental variable estimation works + In practice, however, IV estimation is done in one step (running the two stages manually yields incorrect standard errors in the second stage) --- class: middle # Instrumental variable validity --- class: middle .content-box-green[**The model**] `\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + v \;\; ( = \beta_3 ability + u)\)` `educ` is endogenous because of its correlation with `ability`. <br> -- .content-box-green[**Question**] What conditions would a good instrument `\((z)\)` satisfy?
<br> -- .content-box-green[**Answer**] + `\(z\)` has explanatory power on `\(educ\)` .blue[after] you control for the impact of `\(exper\)` on `\(educ\)` + `\(z\)` is uncorrelated with `\(v\)` `\((ability\)` and all the other important unobservables) --- class: middle .content-box-green[**The model**] `\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + v \;\; ( = \beta_3 ability + u)\)` <br> .content-box-green[**An example of instruments**] The last digit of an individual's Social Security Number? (this has actually been used in some journal articles) <br> .content-box-green[**Question**] + Is it uncorrelated with `\(v\)` `\((ability\)` and all the other important unobservables)? -- + Does it have explanatory power on `\(educ\)` .blue[after] you control for the impact of `\(exper\)` on `\(educ\)`? --- class: middle .content-box-green[**The model**] `\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + v \;\; ( = \beta_3 ability + u)\)` <br> .content-box-green[**An example of instruments**] IQ test score <br> .content-box-green[**Question**] + Is it uncorrelated with `\(v\)` `\((ability\)` and all the other important unobservables)? -- + Does it have explanatory power on `\(educ\)` .blue[after] you control for the impact of `\(exper\)` on `\(educ\)`? --- class: middle .content-box-green[**The model**] `\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + v \;\; ( = \beta_3 ability + u)\)` <br> .content-box-green[**An example of instruments**] Mother's education <br> .content-box-green[**Question**] + Is it uncorrelated with `\(v\)` `\((ability\)` and all the other important unobservables)? -- + Does it have explanatory power on `\(educ\)` .blue[after] you control for the impact of `\(exper\)` on `\(educ\)`?
--- class: middle .content-box-green[**The model**] `\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + v \;\; ( = \beta_3 ability + u)\)` <br> .content-box-green[**An example of instruments**] Number of siblings <br> .content-box-green[**Question**] + Is it uncorrelated with `\(v\)` `\((ability\)` and all the other important unobservables)? -- + Does it have explanatory power on `\(educ\)` .blue[after] you control for the impact of `\(exper\)` on `\(educ\)`? --- class: inverse, center, middle name: iv-r # Implementation of Instrumental Variable (IV) Estimation in R <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html> --- class: middle .content-box-green[**Model**] `\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + v \;\; (=\beta_3 ability + u)\)` We believe + `\(educ\)` is endogenous `\((x_1)\)` + `\(exper\)` is exogenous `\((x_2)\)` + we use the number of siblings `\((sibs)\)` and father's education `\((feduc)\)` as the instruments `\((Z)\)` -- <br> .content-box-green[**Terminology**] + exogenous variables included in the model (here, `\(exper\)`) are also called .blue[included instruments] + instruments that do not belong to the main model (here, `\(sibs\)` and `\(feduc\)`) are also called .blue[excluded instruments] + we refer to the collection of included and excluded instruments as .blue[instruments] --- class: middle .content-box-green[**Dataset**] ```r #--- take a look at the data ---# wage2 %>% select(wage, educ, sibs, feduc) %>% head() ``` ``` ## wage educ sibs feduc ## 1 769 12 1 8 ## 2 808 18 1 14 ## 3 825 14 1 14 ## 4 650 12 4 12 ## 5 562 11 10 11 ## 6 1400 16 1 NA ``` --- class: middle We can continue to use the `fixest` package to run the IV estimation.
```r library(fixest) ``` .content-box-green[**Syntax**] ```r feols(dep var ~ included instruments | first stage formula, data = dataset) ``` + `included instruments`: exogenous included variables (do not include endogenous variables here) -- <br> .content-box-green[**first stage formula**] ```r (endogenous vars ~ excluded instruments) ``` -- <br> .content-box-green[**Example**] ```r iv_res <- feols(log(wage) ~ exper | educ ~ sibs + feduc, data = wage2) ``` + `included variables`: * exogenous included variables: `exper` * endogenous included variables: `educ` + `instruments`: * included instruments: `exper` * excluded instruments: `sibs` and `feduc` --- class: middle .content-box-red[**IV regression results**] ```r iv_res ``` ``` ## TSLS estimation, Dep. Var.: log(wage), Endo.: educ, Instr.: sibs, feduc ## Second stage: Dep. Var.: log(wage) ## Observations: 741 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 4.507316 0.315735 14.27564 < 2.2e-16 *** ## fit_educ 0.137405 0.019215 7.15104 2.0766e-12 *** ## exper 0.037029 0.005694 6.50306 1.4502e-10 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 0.406208 Adj. R2: 0.049979 ## F-test (1st stage), educ: stat = 65.6 , p < 2.2e-16 , on 2 and 737 DoF. ## Wu-Hausman: stat = 13.2 , p = 3.051e-4, on 1 and 737 DoF. ## Sargan: stat = 0.230925, p = 0.630838, on 1 DoF. ``` <br> .content-box-green[**Note**] + When variable `x` is the endogenous variable, `fixest` renames `x` to `fit_x` in the displayed results. + Here, `educ` has become `fit_educ`.
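--- class: middle .content-box-green[**2SLS by hand: a simulated example**]

To see the mechanics end to end, here is a sketch on simulated data (every coefficient and variable name below is made up for illustration). The true effect of the endogenous variable is 1; OLS is biased because an unobserved "ability" term drives both the regressor and the error, while the manual two-step procedure recovers a value close to 1:

```r
set.seed(4862) # arbitrary seed
n <- 5000
ability <- rnorm(n)                            # unobservable; ends up in the error term
z  <- rnorm(n)                                 # excluded instrument (independent of ability)
x2 <- rnorm(n)                                 # exogenous control (included instrument)
x1 <- 0.7 * z + 0.5 * x2 + ability + rnorm(n)  # endogenous variable
y  <- 1 * x1 + 0.3 * x2 + ability + rnorm(n)   # true coefficient on x1 is 1

#--- OLS: biased away from 1 ---#
coef(lm(y ~ x1 + x2))["x1"]

#--- manual 2SLS: close to 1 ---#
x1_hat <- fitted(lm(x1 ~ z + x2))    # first stage
coef(lm(y ~ x1_hat + x2))["x1_hat"]  # second stage
```

Remember that the standard errors from the manual second stage are not valid; in practice you would let `feols()` do the estimation in one step as above.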
--- class: middle .content-box-green[**Comparison of OLS and IV Estimation Results**] <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> Model 1 </th> <th style="text-align:left;"> Model 2 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:left;"> 5.503*** </td> <td style="text-align:left;"> 4.507*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> (0.112) </td> <td style="text-align:left;"> (0.316) </td> </tr> <tr> <td style="text-align:left;"> educ </td> <td style="text-align:left;"> 0.078*** </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> (0.007) </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> exper </td> <td style="text-align:left;"> 0.020*** </td> <td style="text-align:left;"> 0.037*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> (0.003) </td> <td style="text-align:left;"> (0.006) </td> </tr> <tr> <td style="text-align:left;"> fit_educ </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 0.137*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> (0.019) </td> </tr> <tr> <td style="text-align:left;"> Num.Obs. </td> <td style="text-align:left;"> 935 </td> <td style="text-align:left;"> 741 </td> </tr> <tr> <td style="text-align:left;"> R2 </td> <td style="text-align:left;"> 0.131 </td> <td style="text-align:left;"> 0.053 </td> </tr> <tr> <td style="text-align:left;"> Std. 
errors </td> <td style="text-align:left;"> IID </td> <td style="text-align:left;"> IID </td> </tr> </tbody> <tfoot> <tr> <td style="padding: 0; border:0;" colspan="100%"> <sup></sup> * p < 0.1, ** p < 0.05, *** p < 0.01</td> </tr> </tfoot> </table> -- .content-box-green[**Question**] Do you think `\(sibs\)` and `\(feduc\)` are good instruments? + Condition 1: weak instruments? + Condition 2: uncorrelated with the error term? --- class: middle .content-box-green[**Weak Instrument Test**] We can always test if the excluded instruments are weak or not (test of condition 1). -- <br> .content-box-green[**How**] + Run the 1st stage regression `\(educ = \alpha_0 + \alpha_1 exper + \alpha_2 sibs + \alpha_3 feduc + v\)` -- + test the joint significance of `\(\alpha_2\)` and `\(\alpha_3\)` `\((F\)`-test) If excluded instruments `\((sibs\)` and `\(feduc\)`) are jointly significant, then it would mean that `\(sibs\)` and `\(feduc\)` are not weak instruments, satisfying condition 1. --- class: middle When we ran the IV estimation using `fixest::feols()` earlier, it automatically calculated the F-statistic for the weak instrument test. -- ```r iv_res ``` ``` ## TSLS estimation, Dep. Var.: log(wage), Endo.: educ, Instr.: sibs, feduc ## Second stage: Dep. Var.: log(wage) ## Observations: 741 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 4.507316 0.315735 14.27564 < 2.2e-16 *** ## fit_educ 0.137405 0.019215 7.15104 2.0766e-12 *** ## exper 0.037029 0.005694 6.50306 1.4502e-10 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 0.406208 Adj. R2: 0.049979 *## F-test (1st stage), educ: stat = 65.6 , p < 2.2e-16 , on 2 and 737 DoF. ## Wu-Hausman: stat = 13.2 , p = 3.051e-4, on 1 and 737 DoF. ## Sargan: stat = 0.230925, p = 0.630838, on 1 DoF. 
``` Here, the null hypothesis that the excluded instruments (`sibs` and `feduc`) have no explanatory power on the endogenous variable (`educ`) beyond the included instrument (`exper`) is rejected. --- class: middle Alternatively, you can access the `iv_first_stage` component of the regression results. ```r iv_res$iv_first_stage ``` ``` ## $educ ## TSLS estimation, Dep. Var.: educ, Endo.: educ, Instr.: sibs, feduc ## First stage: Dep. Var.: educ ## Observations: 741 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 14.075273 0.358595 39.25116 < 2.2e-16 *** ## sibs -0.131009 0.030800 -4.25357 2.3749e-05 *** ## feduc 0.205169 0.021909 9.36459 < 2.2e-16 *** ## exper -0.191535 0.016373 -11.69819 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 1.84505 Adj. R2: 0.319802 ## F-test (1st stage): stat = 65.6, p < 2.2e-16, on 2 and 737 DoF. ``` --- class: middle .content-box-green[**Notes**] + It is generally recommended that you have an `\(F\)`-stat of over `\(10\)` (this is not a clear-cut criterion that applies to all empirical cases) + Even if you reject the null, a small `\(F\)`-stat can still signal a problem + Passing this test tells you nothing about whether your excluded instruments satisfy Condition 2 + If you cannot reject the null, it is a strong indication that your instruments are weak. Look for other instruments. + Always, always report this test. There is no reason not to.
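--- class: middle .content-box-green[**Computing the first-stage F-test by hand**]

The joint F-test that `fixest` reports can also be reproduced with base R's `anova()`, by comparing the first stage with and without the excluded instruments. A self-contained sketch on simulated data (all names and coefficients here are made up):

```r
set.seed(5193) # arbitrary seed
n <- 1000
x2 <- rnorm(n)                                   # included instrument
z1 <- rnorm(n)                                   # excluded instrument 1
z2 <- rnorm(n)                                   # excluded instrument 2
x1 <- 0.4 * z1 + 0.3 * z2 + 0.5 * x2 + rnorm(n)  # endogenous variable

full <- lm(x1 ~ x2 + z1 + z2)  # unrestricted first stage
restricted <- lm(x1 ~ x2)      # first stage without the excluded instruments

#--- F-test of H0: the coefficients on z1 and z2 are jointly zero ---#
anova(restricted, full)
```

A large F-statistic (conventionally above 10) lets you reject the null that the excluded instruments have no explanatory power beyond `x2`.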
--- class: middle # Consequences of weak instruments .content-box-green[**Data generation**] ```r set.seed(73289) N <- 500 # number of observations u_common <- runif(N) # the term shared by the endogenous variable and the error term z_common <- runif(N) # the term shared by the endogenous variable and instruments x_end <- u_common + z_common + runif(N) # the endogenous variable z_strong <- z_common + runif(N) # strong instrument z_weak <- 0.01 * z_common + 0.99995 * runif(N) # weak instrument u <- u_common + runif(N) # error term y <- x_end + u # dependent variable data <- data.frame(y, x_end, z_strong, z_weak) ``` --- class: middle .content-box-green[**Correlation**] ```r cor(data) ``` ``` ## y x_end z_strong z_weak ## y 1.0000000 0.86492868 0.298704509 -0.108007146 ## x_end 0.8649287 1.00000000 0.419011491 -0.074224622 ## z_strong 0.2987045 0.41901149 1.000000000 0.003839565 ## z_weak -0.1080071 -0.07422462 0.003839565 1.000000000 ``` --- class: middle .content-box-green[**Estimation with the strong instrumental variable**] ```r #--- IV estimation (strong) ---# iv_strong <- feols(y ~ 1 | x_end ~ z_strong, data = data) ``` <br> .content-box-green[**Estimation with the weak instrumental variable**] ```r #--- IV estimation (weak) ---# iv_weak <- feols(y ~ 1 | x_end ~ z_weak, data = data) ``` --- class: middle ```r #--- coefs (strong) ---# tidy(iv_strong) ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 0.883 0.133 6.64 8.20e-11 ## 2 fit_x_end 1.09 0.0856 12.7 2.96e-32 ``` ```r #--- coefs (weak) ---# tidy(iv_weak) ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -0.862 1.10 -0.784 0.434 ## 2 fit_x_end 2.22 0.714 3.11 0.00197 ``` .content-box-green[**Question**] Any notable differences? -- The coefficient estimate on `\(x\_end\)` is far away from the true value in the weak instrument case. 
--- class: middle .content-box-green[**Comparison of the weak instrument tests**] .scroll-box-10[ ```r #--- diagnostics (strong) ---# iv_strong$iv_first_stage ``` ``` ## $x_end ## TSLS estimation, Dep. Var.: x_end, Endo.: x_end, Instr.: z_strong ## First stage: Dep. Var.: x_end ## Observations: 500 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.020267 0.054304 18.7881 < 2.2e-16 *** ## z_strong 0.507831 0.049312 10.2983 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 0.441428 Adj. R2: 0.173915 ## F-test (1st stage): stat = 106.1, p < 2.2e-16, on 1 and 498 DoF. ``` ] .scroll-box-10[ ```r #--- diagnostics (weak) ---# iv_weak$iv_first_stage ``` ``` ## $x_end ## TSLS estimation, Dep. Var.: x_end, Endo.: x_end, Instr.: z_weak ## First stage: Dep. Var.: x_end ## Observations: 500 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.602495 0.042885 37.36745 < 2.2e-16 *** ## z_weak -0.124495 0.074953 -1.66097 0.097348 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 0.484824 Adj. R2: 0.003512 ## F-test (1st stage): stat = 2.75883, p = 0.097348, on 1 and 498 DoF. ``` ] <br> .content-box-green[**Question**] Any notable differences? -- You cannot reject the null hypothesis of weak instrument in the weak instrument case. 
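--- class: middle .content-box-green[**Why weak instruments inflate the variance**]

A standard textbook result helps interpret the simulation results (stated here for the simple case of one regressor and one instrument, with no other covariates): for the model `\(y = \beta_0 + \beta_1 x_1 + u\)` estimated by IV with instrument `\(z\)`, the approximate variance of the IV estimator is `\(Var(\hat{\beta}_{1,IV}) = \frac{\sigma^2}{SST_{x_1}\, \rho_{x_1,z}^2}\)` where `\(\sigma^2\)` is the error variance, `\(SST_{x_1}\)` is the total variation in `\(x_1\)`, and `\(\rho_{x_1,z}\)` is the correlation between `\(x_1\)` and `\(z\)`. The weaker the instrument (the smaller `\(\rho_{x_1,z}^2\)`), the larger the variance of `\(\hat{\beta}_{1,IV}\)`.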
--- class: middle .content-box-green[**MC simulation**] ```r B <- 1000 # the number of experiments beta_hat_store <- matrix(0, B, 2) # storage of beta hat for (i in 1:B) { #--- data generation ---# u_common <- runif(N) z_common <- runif(N) x_end <- u_common + z_common + runif(N) z_strong <- z_common + runif(N) z_weak <- 0.01 * z_common + 0.99995 * runif(N) u <- u_common + runif(N) y <- x_end + u data <- data.frame(y, x_end, z_strong, z_weak) #--- IV estimation with a strong instrument ---# iv_strong <- feols(y ~ 1 | x_end ~ z_strong, data = data) beta_hat_store[i, 1] <- iv_strong$coefficients[2] #--- IV estimation with a weak instrument ---# iv_weak <- feols(y ~ 1 | x_end ~ z_weak, data = data) beta_hat_store[i, 2] <- iv_weak$coefficients[2] } ``` --- class: middle .content-box-green[**Visualization of the MC Results**] <img src="data:image/png;base64,#iv_x_files/figure-html/unnamed-chunk-24-1.png" width="70%" style="display: block; margin: auto;" /> --- class: middle .content-box-green[**Visualization of the MC Results**] <img src="data:image/png;base64,#iv_x_files/figure-html/unnamed-chunk-25-1.png" width="70%" style="display: block; margin: auto;" /> --- class: middle .content-box-green[**Flow of IV Estimation in Practice**] + Identify endogenous variable(s) and included instrument(s) + Identify potential excluded instrument(s) + .red[Argue] why the excluded instrument(s) you pick is uncorrelated with the error term (.content-box-red[**condition 2**]) + Once you decide what variable(s) to use as excluded instruments, .red[test] whether the excluded instrument(s) is weak or not ( .content-box-red[**condition 1**]) + Implement IV estimation and report the results --- class: middle You can include fixed effects in your IV estimation. .content-box-green[**Syntax**] ```r feols(dep var ~ included instruments | FE | 1st stage formula, data = dataset) ``` .content-box-green[**Example**] Include `married` and `south` as fixed effects.
```r feols(log(wage) ~ exper | married + south | educ ~ feduc + sibs, data = wage2) ``` ``` ## TSLS estimation, Dep. Var.: log(wage), Endo.: educ, Instr.: feduc, sibs ## Second stage: Dep. Var.: log(wage) ## Observations: 741 ## Fixed-effects: married: 2, south: 2 ## Standard-errors: Clustered (married) ## Estimate Std. Error t value Pr(>|t|) ## fit_educ 0.124355 0.003627 34.2906 0.018560 * ## exper 0.032128 0.002260 14.2144 0.044713 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 0.391178 Adj. R2: 0.116588 ## Within R2: 0.069595 ## F-test (1st stage), educ: stat = 61.1 , p < 2.2e-16 , on 2 and 736 DoF. ## Wu-Hausman: stat = 8.98498 , p = 0.002814, on 1 and 735 DoF. ## Sargan: stat = 0.169226, p = 0.6808 , on 1 DoF. ``` --- class: middle Clustered SE? You can just add `cluster = ` option just like we previously did. ```r feols(log(wage) ~ exper | married + south | educ ~ feduc + sibs, cluster = ~black, data = wage2) ``` ``` ## TSLS estimation, Dep. Var.: log(wage), Endo.: educ, Instr.: feduc, sibs ## Second stage: Dep. Var.: log(wage) ## Observations: 741 ## Fixed-effects: married: 2, south: 2 ## Standard-errors: Clustered (black) ## Estimate Std. Error t value Pr(>|t|) ## fit_educ 0.124355 0.005258 23.6526 0.026899 * ## exper 0.032128 0.002798 11.4842 0.055295 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 0.391178 Adj. R2: 0.116588 ## Within R2: 0.069595 ## F-test (1st stage), educ: stat = 61.9 , p < 2.2e-16 , on 2 and 735 DoF. ## Wu-Hausman: stat = 8.98498 , p = 0.002814, on 1 and 735 DoF. ## Sargan: stat = 0.169226, p = 0.6808 , on 1 DoF. ```