Applied Econometrics (AECN 896-004) – 11: Dealing with Endogeneity: Instrumental Variable

Instrumental Variable (IV) Approach

Review of Endogeneity
Causal Diagram
Rough idea
Estimation procedure
Example

Endogeneity

\(E[u|x_k] \ne 0\) (the error term is not correlated with any of the independent variables)

Endogenous independent variable

If the error term is, for whatever reason, correlated with the independent variable \(x_k\), then we say that \(x_k\) is an endogenous independent variable.

Omitted variable
Selection
Reverse causality
Measurement error

Causal diagram of the model
Causal diagram with an instrument

You want to estimate the causal impact of education on income.

Variable of interest: Education
Dependent variable: Income

Causal diagram

We want to find a variable like \(Z\) in the diagram below:

\(Z\) does NOT affect income directly
\(Z\) is correlated with the variable of interest (education)
- does not matter which causes which (associattion is enough)
\(Z\) is NOT correlated with any of the unobservable variables in the error term (including ability) that is making the vairable of interest (education) endogeneous.
- \(Z\) does not affect ability
- abiliyt does not affect \(Z\)

The Model

\(y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + u\)

\(x_1\) is endogenous: \(E[u|x_1] \ne 0\) (or \(Cov(u,x_1)\ne 0\))
\(x_2\) is exogenous: \(E[u|x_1] = 0\) (or \(Cov(u,x_1) = 0\))

Idea (very loosely put)

Bring in variable(s) ( Instrumental variable(s) ) that does NOT belong to the model, but IS related with the endogenous variable,

Using the instrumental variable(s) (which we denote by \(Z\)), make the endogenous variable exogenous, which we call instrumented variable(s)
Use the variation in the instrumented variable instead of the original endogenous variable to estimate the impact of the original variable

Step 1
Step 2

Idea

Using the instrumental variables, make the endogenous variable exogenous, which we call instrumented variable.

Step 1: mathematically

Regress the endogenous variable \((x_1)\) on the instrumental variable(s) \((Z=\{z_1,z_2\}\), two instruments here) and all the other exogenous variables \((x_2\) here)

\(x_1 = \alpha_0 + \sigma_2 x_2 + \alpha_1 z_1 +\alpha_2 z_2 + v\)

obtain the predicted value of \(x\) from the regression

\(\widehat{x}_1 = \widehat{\alpha}_0 + \widehat{\sigma}_2 x_2 + \widehat{\alpha}_1 z_1 + \widehat{\alpha}_2 z_2\)

Idea

Use the variation in the instrumented variable instead of the original endogenous variable to estimate the impact of the original variable

Step 2: Mathematically

Regress the dependent variable \((y)\) on the instrumented variable \((\widehat{x}_1)\),

\(y= \beta_0 + \beta_1 \widehat{x}_1+ \beta_2 x_2 + \varepsilon\)

to estimate the coefficient on \(x\) in the original model

Model of interest
Step 1
Step 2

Model

\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + (\beta_3 ability + v)\)

Regress \(log(wage)\) on \(educ\) and \(exper\) \((ability\) not included because you do not observe it)
\((\beta_3 ability + v)\) is the error term
\(educ\) is considered endogenous (correlated with \(ability\))
\(exper\) is considered exogenous (not correlated with \(ability\))

Instruments (Z)

Suppose you selected the following variables as instruments:

IQ test score \((IQ)\)
number of siblings \((sibs)\)

Regress \(educ\) on \(exper\), \(IQ\), and \(sibs\):

\[\begin{align*} educ = \alpha_0 + \alpha_1 exper + \alpha_2 IQ + \alpha_3 sibs + u \end{align*}\]

Use the coefficient estimates on \(\alpha_0\), \(\alpha_1\), \(\alpha_2\), and \(\alpha_3\) to predict \(educ\) as a function of \(exper\), \(IQ\), and \(sibs\).

\[\begin{align*} \widehat{educ} = \widehat{\alpha_0} + \widehat{\alpha_1} exper + \widehat{\alpha_2} IQ + \widehat{\alpha_3} sibs \end{align*}\]

Use \(\widehat{educ}\) in place of \(educ\) to estimate the model of interest:

\[\begin{align*} log(wage) = \beta_0 + \beta_1 \widehat{educ} + \beta_2 exper + u \end{align*}\]

Just like OLS needs to satisy some conditions for it to consistently estimate the coefficients, IV approach needs to satisy some conditions for it to work.

Estimation Procedure

Step 1: \(\widehat{x}_1 = \widehat{\alpha}_0 +\widehat{\sigma}_2 x_2 + \widehat{\alpha}_1 z_1 + \widehat{\alpha}_2 z_2\)
Step 2: \(y = \beta_0 + \beta_1 \widehat{x}_1+ \beta_2 x_2 + \varepsilon\)

Important question

What are the conditions under which IV estimation is consistent?

The instruments \((Z)\) need to satisfy two conditions, which we will discuss.

Estimation Procedure

Step 1: \(\widehat{x}_1 = \widehat{\alpha}_0 +\widehat{\sigma}_2 x_2 + \widehat{\alpha}_1 z_1 + \widehat{\alpha}_2 z_2\)
Step 2: \(y = \beta_0 + \beta_1 \widehat{x}_1+ \beta_2 x_2 + \varepsilon\)

Question

What happens if \(Z\) have no power to explain \(x_1\) \((\alpha_1=0\) and \(\alpha_2=0)\)?

Answer

\(\widehat{x}_1=\widehat{\alpha}_0+\widehat{\sigma}^2 x_2\)
\(\widehat{\beta}_1?\)

That is, \(\widehat{x}_1\) has no information beyond the information \(x_2\) possesses.

The instrument(s) \(Z\) have jointly significant explanatory power on the endogenous variable \(x_1\) after you control for all the other exogenous variables (here \(x_2\))

Model of interest

\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u\)

Estimation Procedure

Step 1: \(\widehat{x}_1 = \widehat{\alpha}_0 +\widehat{\sigma}_2 x_2 + \widehat{\alpha}_1 z_1 + \widehat{\alpha}_2 z_2\)
Step 2: \(y = \beta_0 + \beta_1 \widehat{x}_1+ \beta_2 x_2 + \varepsilon\)

Investigation of the model
When is it unbiased?
So,

Remember you can break \(x_1\) into the predicted part and the residuals.

\[\begin{align*} x_1 = \widehat{x}_1 + \widehat{\varepsilon} \end{align*}\]

where \(\widehat{\varepsilon}\) is the residual of the first stage estimation.

Plugging in \(x_1 = \widehat{x}_1 + \widehat{\varepsilon}\) into the model of interest,

\[\begin{align*} y & = \beta_0 + \beta_1 (\widehat{x}_1 + \widehat{\varepsilon}) + \beta_2 x_2+ u\\ & = \beta_0 + \beta_1 \widehat{x}_1 + \beta_2 x_2+ (\beta_1\widehat{\varepsilon} + u) \end{align*}\]

So, if you regress \(y\) on \(\widehat{x}_1\) and \(x_2\), then the error term is \((\beta_1\widehat{\varepsilon} + u)\).

Question

What is the condition under which the OLS estimation of \(\beta_1\) in the main model is unbiased?

Answer

\(\widehat{x}_1\) is not correlated with \((\beta_1\widehat{\varepsilon} + u)\)

We confirmed that we need the following condition to be satisfied:

Note

\(\widehat{x}_1\) is not correlated with \((\beta_1\widehat{\varepsilon} + u)\)

This in turn means that \(x_2\), \(z_1\), and \(z_2\) are not correlated with \(u\) (the error term of the true model.)

\((\widehat{x}_1\) is always not correlated (orthogonal) with \(\varepsilon)\)

Condition 2

\(z_1\) and \(z_2\) do not belong in the main model, meaning they do not have any explanatory power beyond \(x_2\) (they should have been included in the model in the first place as independent variables)
\(z_1\) and \(z_2\) are not correlated with the error term (there are no unobserved factors in the error term that are correlated with \(Z\))

Question

Do you think we can test condition 2?

Answer

No, because we never observe the error term.

Important

All we can do is to argue that the instruments are not correlated with the error term.
In journal articles that use IV method, they make careful arguments as to why their choice of instruments are not correlated with the error term.

Condition 1

The instrument(s) \(Z\) have jointly significant explanatory power on the endogenous variable \(x_1\) after you control for all the other exogenous variables (here \(x_2\))

Condition 2

\(z_1\) and \(z_2\) do not belong in the main model, meaning they do not have any explanatory power beyond \(x_2\) (they should have been included in the model in the first place as independent variables)
\(z_1\) and \(z_2\) are not correlated with the error term (there are no unobserved factors in the error term that are correlated with \(Z)\)

Important

Condition 1 is always testable
Condition 2 is NOT testable (unless you have more instruments than endogenous variables)

IV estimator is also called two-stage least squares estimator (2SLS) because it involves two stages of OLS.

Step 1: \(\widehat{x}_1 = \widehat{\alpha}_0 +\widehat{\sigma}_2 x_2 + \widehat{\alpha}_1 z_1 + \widehat{\alpha}_2 z_2\)
Step 2: \(y = \beta_0 + \beta_1 \widehat{x}_1+ \beta_2 x_2 + \varepsilon\)

2SLS framework is a good way to understand conceptually why and how instrumental variable estimation works. But, IV estimation is done in one-step

Instrumental variable validity

Setup
Instrument example 1
Instrument example 2
Instrument example 3
Instrument example 4

The model

\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + v \;\; ( = \beta_3 ability + u)\)

educ is endogenous because of its correlation with ability.

Question

What conditions would a good instrument \((z)\) satisfy?

Answer

\(z\) has explanatory power on \(educ\) after you control for the impact of \(epxer\) on \(educ\)
\(z\) is uncorrelated with \(v\) (\(ability\) and all the other important unobservables)

The model

\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + v \;\; ( = \beta_3 ability + u)\)

Instrument

The last digit of an individual’s Social Security Number? (this has been actually used in some journal articles)

Question

Is it uncorrelated with \(v\) (\(ability\) and all the other important unobservables)?
does it have explanatory power on \(educ\) after you control for the impact of \(epxer\) on \(educ\)?

The model

\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + v \;\; ( = \beta_3 ability + u)\)

Instrument

IQ test score

Question

Is it uncorrelated with \(v\) (\(ability\) and all the other important unobservables)?
does it have explanatory power on \(educ\) after you control for the impact of \(epxer\) on \(educ\)?

The model

\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + v \;\; ( = \beta_3 ability + u)\)

Instrument

Mother’s education

Question

Is it uncorrelated with \(v\) (\(ability\) and all the other important unobservables)?
does it have explanatory power on \(educ\) after you control for the impact of \(epxer\) on \(educ\)?

The model

\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + v \;\; ( = \beta_3 ability + u)\)

Instrument

Number of siblings

Question

Is it uncorrelated with \(v\) (\(ability\) and all the other important unobservables)?
does it have explanatory power on \(educ\) after you control for the impact of \(epxer\) on \(educ\)?

Implementation of Instrumental Variable (IV) Estimation in R

Introduction
Dataset
How
Results
Comparison
Fixed effects
Clustered SE

Model

\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + v \;\; (=\beta_3 ability + u)\)

We believe

\(educ\) is endogenous \((x_1)\)
\(exper\) is exogenous \((x_2)\)
we use the number of siblings \((sibs)\) and father’s education \((feduc)\) as the instruments (\(Z\))

Terminology

exogenous variable included in the model (here, \(exper\)) is also called included instruments
instruments that do not belong to the main model (here, \(sibs\) and \(feduc\)) are also called excluded instruments
we refer to the collection of included and excluded instruments as instruments

We can continue to use the fixest package to run IV estimation method.

Syntax

fixest::feols(dep var ~ included instruments | first stage formula, data = dataset)

included instruments: exogenous included variables (do not include endogenous variables here)

first stage formula

(endogenous vars ~ excluded instruments)

Example

included variable:
- exogenous included variables: exper
- endogenous included variables: educ
instruments:
- included instruments: exper
- excluded instruments: sibs and feduc

IV regression results

Note

When variable x is the endogenous variable, fixest changes the name of x to x(fit).
Here, educ has become educ(fit).

Comparison of OLS and IV Estimation Results

tinytable_hpaju05ipsig5mqe4qz4

	(1)	(2)
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001
(Intercept)	5.503***	4.507***
	(0.112)	(0.316)
educ	0.078***
	(0.007)
exper	0.020***	0.037***
	(0.003)	(0.006)
fit_educ		0.137***
		(0.019)
Num.Obs.	935	741
R2	0.131	0.053
RMSE	0.39	0.41
Std.Errors	IID	IID

Question

Do you think \(sibs\) and \(feduc\) are good instruments?

Condition 1: weak instruments?
Condition 2: uncorrelated with the error term?

You can include fixed effects in your IV estimation.

Syntax

fixest::feols(dep var ~ included instruments | FE | 1st stage formula, data = dataset)

Example

Include married and south as fixed effects.

fixest::feols(log(wage) ~ exper | married + south | educ ~ feduc + sibs, data = wage2)

TSLS estimation, Dep. Var.: log(wage), Endo.: educ, Instr.: feduc, sibs
Second stage: Dep. Var.: log(wage)
Observations: 741 
Fixed-effects: married: 2,  south: 2
Standard-errors: Clustered (married) 
         Estimate Std. Error t value Pr(>|t|)    
fit_educ 0.124355   0.003627 34.2906 0.018560 *  
exper    0.032128   0.002260 14.2144 0.044713 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.391178     Adj. R2: 0.116588
                 Within R2: 0.069595
F-test (1st stage), educ: stat = 62.0     , p < 2.2e-16 , on 2 and 736 DoF.
              Wu-Hausman: stat =  8.98498 , p = 0.002814, on 1 and 735 DoF.
                  Sargan: stat =  0.169226, p = 0.6808  , on 1 DoF.

You can just add cluster = option just like we previously did.

fixest::feols(log(wage) ~ exper | married + south | educ ~ feduc + sibs, cluster = ~black, data = wage2)

TSLS estimation, Dep. Var.: log(wage), Endo.: educ, Instr.: feduc, sibs
Second stage: Dep. Var.: log(wage)
Observations: 741 
Fixed-effects: married: 2,  south: 2
Standard-errors: Clustered (black) 
         Estimate Std. Error t value Pr(>|t|)    
fit_educ 0.124355   0.005258 23.6526 0.026899 *  
exper    0.032128   0.002798 11.4842 0.055295 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.391178     Adj. R2: 0.116588
                 Within R2: 0.069595
F-test (1st stage), educ: stat = 61.9     , p < 2.2e-16 , on 2 and 735 DoF.
              Wu-Hausman: stat =  8.98498 , p = 0.002814, on 1 and 735 DoF.
                  Sargan: stat =  0.169226, p = 0.6808  , on 1 DoF.

Weak instrument

What is it?
How?
Consequences of weak instruments

Definition: weak instrument

The external instrument(s) does not have enough explanatory power on the instrumented (endogenous) variabl beyond the other controls.

Important

We can always test if the excluded instruments are weak or not!

Steps
R implementation

Run the 1st stage regression

\[\begin{align*} educ = \alpha_0 + \alpha_1 exper + \alpha_2 sibs + \alpha_3 feduc + v \end{align*}\]

Then, test the joint significance of \(\alpha_2\) and \(\alpha_3\) (\(F\)-test)

If excluded instruments \((sibs\) and \(feduc\), here) are jointly significant, then it would mean that \(sibs\) and \(feduc\) are not weak instruments, satisfying condition 1.

When we ran the IV estimation using fixest::feols() earlier, it automatically calculated the F-statistic for the weak instrument test.

Here, F-test for the null hypothesis of the excluded instruments (sibs and feduc) do not have any explanatory power on the endogenous variable (educ) beyond the included instrument (exper) is rejected.

Alternatively

You can access the iv_first_stage component of the regression results.

Note

It is generally recommended that you have \(F\)-stat of over \(10\) (this is not a clear-cut criteria that applied to all the empirical cases)
Even if you reject the null if \(F\)-stat is small, you may have a problem
You know nothing about if your excluded instruments satisfy Condition 2.
If you cannot reject the null, it is a strong indication that your instruments are weak. Look for other instruments.
Always, always report this test. There is no reason not to.

Data generating process
Estimation
Comparison of the weak instrument tests
MC simulation

Data generation

Correlation

Estimation with the strong instrumental variable

Estimation with the weak instrumental variable

Question

Any notable differences?

Answer

The coefficient estimate on \(x\_end\) is far away from the true value in the weak instrument case.

diagnostics (strong instrument)

diagnostics (weak instrument)

Note

You cannot reject the null hypothesis of weak instrument in the weak instrument case.

Run
Visualization of the results

set.seed(238354)
B <- 1000 # the number of experiments
N <- 500 # number of observations
beta_hat_store <- matrix(0, B, 2) # storage of beta hat

for (i in 1:B) {

  #--- data generation ---#
  u_common <- runif(N)
  z_common <- runif(N)
  x_end <- u_common + z_common + runif(N)
  z_strong <- z_common + runif(N)
  z_weak <- 0.01 * z_common + 0.99995 * runif(N)
  u <- u_common + runif(N)
  y <- x_end + u
  data <- data.table(y, x_end, z_strong, z_weak)

  #--- IV estimation with a strong instrument ---#
  iv_strong <- fixest::feols(y ~ 1 | x_end ~ z_strong, data = data)
  beta_hat_store[i, 1] <- iv_strong$coefficients[2]

  #--- IV estimation with a weak instrument ---#
  iv_weak <- fixest::feols(y ~ 1 | x_end ~ z_weak, data = data)
  beta_hat_store[i, 2] <- iv_weak$coefficients[2]
}

Code

melted <- melt(data.table(beta_hat_store))
melted[variable == "V1", variable := "Strong"]
melted[variable == "V2", variable := "Weak"]

ggplot(data = melted[abs(value) < 5, ]) +
  geom_density(aes(x = value, fill = variable), alpha = 0.3) +
  geom_vline(xintercept = 1, color = "red") +
  scale_fill_discrete(name = "") +
  theme(
    legend.position = "bottom"
  )

Flow of IV Estimation in Practice

Flow

Identify endogenous variable(s) and included instrument(s)
Identify potential excluded instrument(s)
Argue why the excluded instrument(s) you pick is uncorrelated with the error term (condition 2)
Once you decide what variable(s) to use as excluded instruments, test whether the excluded instrument(s) is weak or not ( condition 1)
Implement IV estimation and report the results