04: Omitted Variable Bias and Multicollinearity

Which variables to include or not

You often

  • face the decision of whether or not to include a particular variable: how do you make the right decision?

  • miss a variable that you know is important because it is simply not available: what are the consequences?

Two important concepts you need to be aware of:

  • Multicollinearity
  • Omitted Variable Bias

Multicollinearity and Omitted Variable Bias

Definition: Multicollinearity

A phenomenon where two or more variables are highly correlated (negatively or positively) with each other (consequences?)


Definition: Omitted Variable Bias

Bias caused by not including (omitting) important variables in the model

Consider the following model,

\[ y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i \]

Your interest is in estimating the impact of \(x_1\) on \(y\).

Objective

Using this simple model, we investigate what happens to the coefficient estimate on \(x_1\) if you include/omit \(x_2\).

The model: \[y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\]

Case 1:

What happens if \(\beta_2=0\), but you include \(x_2\), which is not correlated with \(x_1\)?

Case 2:

What happens if \(\beta_2=0\), but you include \(x_2\), which is highly correlated with \(x_1\)?

Case 3:

What happens if \(\beta_2\ne 0\), but you omit \(x_2\), which is not correlated with \(x_1\)?

Case 4:

What happens if \(\beta_2\ne 0\), but you omit \(x_2\), which is highly correlated with \(x_1\)?

  • Is \(\widehat{\beta}_1\) unbiased, that is \(E[\widehat{\beta}_1]=\beta_1\)?

  • \(Var(\widehat{\beta}_1)\)? (how precise the estimate of \(\beta_1\) is)

Case 1

True Model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

  • \(cor(x_1,x_2) = 0\)
  • \(\beta_2=0\)
  • \(E[u_i|x_{1,i},x_{2,i}]=0\)

Example

\(\mbox{corn yield} = \beta_0 + \beta_1 \times N + \beta_2 \mbox{farmers' height} + u\)

We will estimate the following models:


\(EE_1\): \(y_i=\beta_0 + \beta_1 x_{1,i} + v_i\), where \(v_i = \beta_2 x_{2,i} + u_i\)

\(EE_2\): \(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

(Only \(x_1\) is included in \(EE_1\), while \(x_1\) and \(x_2\) are included in \(EE_2\).)


Question

What do you think is going to happen? Any guesses?

  • \(E[\widehat{\beta}_1]=\beta_1\) in \(EE_1\)? (bias?)
  • \(E[\widehat{\beta}_1]=\beta_1\) in \(EE_2\)? (bias?)
  • \(Var(\widehat{\beta}_1)\) in \(EE_1\) \(\gtreqqless\) \(Var(\widehat{\beta}_1)\) in \(EE_2\)?

Set up simulations:
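Here is a minimal sketch of what this chunk might contain (the actual code is not shown; the seed, N, and B below are illustrative assumptions, not from the original):

set.seed(47823) # illustrative seed
N <- 100 # sample size per iteration (assumed)
B <- 1000 # number of MC iterations (assumed)
b1_ee1 <- b1_ee2 <- rep(0, B) # storage for the estimates of beta_1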


Run MC simulations:
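A sketch of the MC loop under Case 1's conditions (\(cor(x_1,x_2)=0\) and \(\beta_2=0\)), assuming \(\beta_0=\beta_1=1\) to match the coefficient values used later in this deck:

for (i in 1:B) {
  x1 <- rnorm(N) # variable of interest
  x2 <- rnorm(N) # drawn independently of x1, so cor(x1, x2) = 0
  u <- rnorm(N) # error term
  y <- 1 + x1 + 0 * x2 + u # beta_2 = 0: x2 is irrelevant
  b1_ee1[i] <- coef(lm(y ~ x1))[2] # EE_1: omit x2
  b1_ee2[i] <- coef(lm(y ~ x1 + x2))[2] # EE_2: include x2
}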


Visualize the results:
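One way to compare the sampling distributions of \(\widehat{\beta}_1\) (the original figure is not shown; this sketch uses ggplot2). In Case 1 the two densities should lie almost on top of each other, centered at the true \(\beta_1=1\):

library(ggplot2)
plot_df <- data.frame(
  estimate = c(b1_ee1, b1_ee2),
  model = rep(c("EE_1", "EE_2"), each = B)
)
ggplot(plot_df, aes(x = estimate, fill = model)) +
  geom_density(alpha = 0.4) +
  geom_vline(xintercept = 1, linetype = "dashed") # true beta_1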

True model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

  • \(cor(x_1,x_2) = 0\)
  • \(\beta_2=0\)
  • \(E[u_i|x_{1,i},x_{2,i}]=0\)

Estimated model

\(EE_1\): \(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (v_i = \beta_2 x_{2,i} + u_{i})\)


\(E[v_i|x_{1,i}]=0?\)


Answer Yes, because \(x_1\) is correlated with neither \(x_2\) nor \(u\). So, no bias.

True model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

  • \(cor(x_1,x_2) = 0\)
  • \(\beta_2=0\)
  • \(E[u_i|x_{1,i},x_{2,i}]=0\)

Estimated model

\(EE_2\): \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)


\(E[u_i|x_{1,i},x_{2,i}]=0\)?


Answer Yes, because \(x_1\) and \(x_2\) are not correlated with \(u\) (by assumption). So, no bias.

True model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

  • \(cor(x_1,x_2) = 0\)
  • \(\beta_2=0\)
  • \(E[u_i|x_{1,i},x_{2,i}]=0\)

Variance

\(Var(\widehat{\beta}_j)= \frac{\sigma_v^2}{SST_j(1-R^2_j)}\)

where \(R^2_j\) is the \(R^2\) when you regress \(x_j\) on all the other covariates.

The estimated model

\(EE_1\): \(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (v_i = \beta_2 x_{2,i} + u_{i})\)


\(R_j^2\)?


Answer 0 because there are no other variables included in the model.

\(Var(v_i) = Var(\beta_2 x_{2,i} + u_i)\)?


Answer

\[ Var(v_i) = Var(\beta_2 x_{2,i} + u_i) = \sigma_u^2 \] because \(\beta_2 = 0\).

True model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

  • \(cor(x_1,x_2) = 0\)
  • \(\beta_2=0\)
  • \(E[u_i|x_{1,i},x_{2,i}]=0\)

Variance

\(Var(\widehat{\beta}_j)= \frac{\sigma_u^2}{SST_j(1-R^2_j)}\)

where \(R^2_j\) is the \(R^2\) when you regress \(x_j\) on all the other covariates.

The estimated model

\(EE_2\): \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)


\(R_j^2\)?


Answer 0 on average because \(cor(x_1, x_2)=0\)

\(Var(u_i)\)?


Answer \[ Var(u_i) = \sigma_u^2 \]

Question

\(Var(\widehat{\beta}_1)\) in \(EE_1\) \(\gtreqqless\) \(Var(\widehat{\beta}_1)\) in \(EE_2\)?


\(EE_1\)

  • \(R_j^2 = 0\)
  • \(Var(error) = Var(v_i) = \sigma_u^2\)

\(EE_2\)

  • \(R_j^2 = 0\)
  • \(Var(error) = Var(u_i) = \sigma_u^2\)

Variance formula

\(Var(\widehat{\beta}_j)= \frac{Var(error)}{SST_j(1-R^2_j)}\)



Answer They are the same because all the components are the same.

  • If you include an irrelevant variable that has no explanatory power beyond \(x_1\) and is not correlated with \(x_1\) (\(EE_2\)), then the variance of the OLS estimator on \(x_1\) is the same as when you do not include \(x_2\) as a covariate (\(EE_1\)).

  • If you omit an irrelevant variable that has no explanatory power beyond \(x_1\) and is not correlated with \(x_1\) (\(EE_1\)), then the OLS estimator on \(x_1\) is still unbiased.

Case 2

True Model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

  • \(cor(x_1,x_2) \ne 0\)
  • \(\beta_2=0\)
  • \(E[u_i|x_{1,i},x_{2,i}]=0\)

Example

\(\mbox{corn yield} = \beta_0 + \beta_1 \times N + \beta_2 \mbox{farmers' height} + u\)

We will estimate the following models:


\(EE_1\): \(y_i=\beta_0 + \beta_1 x_{1,i} + v_i\), where \(v_i = \beta_2 x_{2,i} + u_i\)

\(EE_2\): \(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

(Only \(x_1\) is included in \(EE_1\), while \(x_1\) and \(x_2\) are included in \(EE_2\))


Question

What do you think is going to happen? Any guesses?

  • \(E[\widehat{\beta}_1]=\beta_1\) in \(EE_1\)? (bias?)
  • \(E[\widehat{\beta}_1]=\beta_1\) in \(EE_2\)? (bias?)
  • \(Var(\widehat{\beta}_1)\) in \(EE_1\) \(\gtreqqless\) \(Var(\widehat{\beta}_1)\) in \(EE_2\)?

Set up simulations:
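The setup can stay the same as in Case 1 (illustrative values, not from the original):

set.seed(27193) # illustrative
N <- 100
B <- 1000
b1_ee1 <- b1_ee2 <- rep(0, B)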


Run MC simulations:
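A sketch of the loop for Case 2 (\(cor(x_1,x_2)\ne 0\), \(\beta_2=0\)); the common component mu mirrors the construction this deck itself uses in Case 4:

for (i in 1:B) {
  mu <- rnorm(N) # common component shared by x1 and x2
  x1 <- 0.1 * rnorm(N) + 0.9 * mu # highly correlated with x2
  x2 <- 0.1 * rnorm(N) + 0.9 * mu
  u <- rnorm(N)
  y <- 1 + x1 + 0 * x2 + u # beta_2 = 0: x2 is still irrelevant
  b1_ee1[i] <- coef(lm(y ~ x1))[2] # EE_1: omit x2
  b1_ee2[i] <- coef(lm(y ~ x1 + x2))[2] # EE_2: include x2
}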


Visualize the results:
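Plotting as in Case 1 (ggplot2 already loaded): both densities should be centered at the true \(\beta_1=1\), with \(EE_2\)'s visibly more spread out:

plot_df <- data.frame(
  estimate = c(b1_ee1, b1_ee2),
  model = rep(c("EE_1", "EE_2"), each = B)
)
ggplot(plot_df, aes(x = estimate, fill = model)) +
  geom_density(alpha = 0.4) +
  geom_vline(xintercept = 1, linetype = "dashed")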

True model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

  • \(cor(x_1,x_2) \ne 0\)
  • \(\beta_2=0\)
  • \(E[u_i|x_{1,i},x_{2,i}]=0\)

Estimated model

\(EE_1\): \(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (v_i = \beta_2 x_{2,i} + u_{i})\)


\(E[v_i|x_{1,i}]=0?\)


Answer

Yes, because

  • \(x_1\) is correlated with \(x_2\), but \(\beta_2 = 0\).
  • \(x_1\) is not correlated with \(u\).

So, no bias.

True model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

  • \(cor(x_1,x_2) \ne 0\)
  • \(\beta_2=0\)
  • \(E[u_i|x_{1,i},x_{2,i}]=0\)

Estimated model

\(EE_2\): \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)


\(E[u_i|x_{1,i},x_{2,i}]=0\)?


Answer Yes, because \(x_1\) and \(x_2\) are not correlated with \(u\) (by assumption). So, no bias.

True model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

  • \(cor(x_1,x_2) \ne 0\)
  • \(\beta_2=0\)
  • \(E[u_i|x_{1,i},x_{2,i}]=0\)

Variance

\(Var(\widehat{\beta}_j)= \frac{\sigma_v^2}{SST_j(1-R^2_j)}\)

where \(R^2_j\) is the \(R^2\) when you regress \(x_j\) on all the other covariates.

The estimated model

\(EE_1\): \(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (v_i = \beta_2 x_{2,i} + u_{i})\)


\(R_j^2\)?


Answer 0 because there are no other variables included in the model.

\(Var(v_i) = Var(\beta_2 x_{2,i} + u_i)\)?


Answer

\[ Var(v_i) = Var(\beta_2 x_{2,i} + u_i) = \sigma_u^2 \] because \(\beta_2 = 0\).

True model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

  • \(cor(x_1,x_2) \ne 0\)
  • \(\beta_2=0\)
  • \(E[u_i|x_{1,i},x_{2,i}]=0\)

Variance

\(Var(\widehat{\beta}_j)= \frac{\sigma_u^2}{SST_j(1-R^2_j)}\)

where \(R^2_j\) is the \(R^2\) when you regress \(x_j\) on all the other covariates.

The estimated model

\(EE_2\): \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)


\(R_j^2\)?


Answer

\(R_j^2\) is non-zero because \(x_1\) and \(x_2\) are correlated. If you regress \(x_1\) on \(x_2\), then its \(R^2\) is non-zero.

\(Var(u_i)\)?


Answer \[ Var(u_i) = \sigma_u^2 \]

Question

\(Var(\widehat{\beta}_1)\) in \(EE_1\) \(\gtreqqless\) \(Var(\widehat{\beta}_1)\) in \(EE_2\)?


\(EE_1\)

  • \(R_j^2 = 0\)
  • \(Var(error) = Var(v_i) = \sigma_u^2\)

\(EE_2\)

  • \(R_j^2 > 0\)
  • \(Var(error) = Var(u_i) = \sigma_u^2\)

Variance formula

\(Var(\widehat{\beta}_j)= \frac{Var(error)}{SST_j(1-R^2_j)}\)



Answer So, \(Var(\widehat{\beta}_1)\) in \(EE_1\) \(<\) \(Var(\widehat{\beta}_1)\) in \(EE_2\).

  • If you include an irrelevant variable that has no explanatory power beyond \(x_1\) but is highly correlated with \(x_1\) (\(EE_2\)), then the variance of the OLS estimator on \(x_1\) is larger than when you do not include \(x_2\) (\(EE_1\)).

  • If you omit an irrelevant variable that has no explanatory power beyond \(x_1\) but is highly correlated with \(x_1\) (\(EE_1\)), then the OLS estimator on \(x_1\) is still unbiased.

Case 3

True Model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

  • \(cor(x_1,x_2) = 0\)
  • \(\beta_2 \ne 0\)
  • \(E[u_i|x_{1,i},x_{2,i}]=0\)

Example

\(\mbox{corn yield} = \beta_0 + \beta_1 \times N + \beta_2 \mbox{farmers' height} + u\)

We will estimate the following models:


\(EE_1\): \(y_i=\beta_0 + \beta_1 x_{1,i} + v_i\), where \(v_i = \beta_2 x_{2,i} + u_i\)

\(EE_2\): \(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

(Only \(x_1\) is included in \(EE_1\), while \(x_1\) and \(x_2\) are included in \(EE_2\))


Question

What do you think is going to happen? Any guesses?

  • \(E[\widehat{\beta}_1]=\beta_1\) in \(EE_1\)? (bias?)
  • \(E[\widehat{\beta}_1]=\beta_1\) in \(EE_2\)? (bias?)
  • \(Var(\widehat{\beta}_1)\) in \(EE_1\) \(\gtreqqless\) \(Var(\widehat{\beta}_1)\) in \(EE_2\)?

Set up simulations:
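Setup as before (illustrative values, not from the original):

set.seed(90314) # illustrative
N <- 100
B <- 1000
b1_ee1 <- b1_ee2 <- rep(0, B)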


Run MC simulations:
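A sketch of the loop for Case 3 (\(cor(x_1,x_2)=0\), \(\beta_2\ne 0\)), assuming \(\beta_2=1\) as in the Case 4 code shown later:

for (i in 1:B) {
  x1 <- rnorm(N) # variable of interest
  x2 <- rnorm(N) # independent of x1
  u <- rnorm(N)
  y <- 1 + x1 + 1 * x2 + u # beta_2 = 1: x2 now matters
  b1_ee1[i] <- coef(lm(y ~ x1))[2] # EE_1: omit x2
  b1_ee2[i] <- coef(lm(y ~ x1 + x2))[2] # EE_2: include x2
}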


Visualize the results:
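Plotting as before: both densities remain centered at the true \(\beta_1=1\), but this time \(EE_1\)'s is the wider one:

plot_df <- data.frame(
  estimate = c(b1_ee1, b1_ee2),
  model = rep(c("EE_1", "EE_2"), each = B)
)
ggplot(plot_df, aes(x = estimate, fill = model)) +
  geom_density(alpha = 0.4) +
  geom_vline(xintercept = 1, linetype = "dashed")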

True model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

  • \(cor(x_1,x_2) = 0\)
  • \(\beta_2 \ne 0\)
  • \(E[u_i|x_{1,i},x_{2,i}]=0\)

Estimated model

\(EE_1\): \(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (v_i = \beta_2 x_{2,i} + u_{i})\)


\(E[v_i|x_{1,i}]=0?\)


Answer

Yes, because \(x_1\) is not correlated with either \(x_2\) or \(u\).

So, no bias.

True model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

  • \(cor(x_1,x_2) = 0\)
  • \(\beta_2 \ne 0\)
  • \(E[u_i|x_{1,i},x_{2,i}]=0\)

Estimated model

\(EE_2\): \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)


\(E[u_i|x_{1,i},x_{2,i}]=0\)?


Answer Yes, because \(x_1\) and \(x_2\) are not correlated with \(u\) (by assumption). So, no bias.

True model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

  • \(cor(x_1,x_2) = 0\)
  • \(\beta_2 \ne 0\)
  • \(E[u_i|x_{1,i},x_{2,i}]=0\)

Variance

\(Var(\widehat{\beta}_j)= \frac{\sigma_v^2}{SST_j(1-R^2_j)}\)

where \(R^2_j\) is the \(R^2\) when you regress \(x_j\) on all the other covariates.

The estimated model

\(EE_1\): \(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (v_i = \beta_2 x_{2,i} + u_{i})\)


\(R_j^2\)?


Answer 0 because there are no other variables included in the model.

\(Var(v_i) = Var(\beta_2 x_{2,i} + u_i)\)?


Answer \[\begin{align} Var(error) & = Var(v_i) \\ & = Var(\beta_2 x_{2,i} + u_i) \\ & = \beta_2^2\cdot Var(x_{2,i}) + \sigma_u^2 \end{align}\] (the cross term drops out because \(x_2\) and \(u\) are uncorrelated)

True model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

  • \(cor(x_1,x_2) = 0\)
  • \(\beta_2 \ne 0\)
  • \(E[u_i|x_{1,i},x_{2,i}]=0\)

Variance

\(Var(\widehat{\beta}_j)= \frac{\sigma_u^2}{SST_j(1-R^2_j)}\)

where \(R^2_j\) is the \(R^2\) when you regress \(x_j\) on all the other covariates.

The estimated model

\(EE_2\): \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)


\(R_j^2\)?


Answer

\(R_j^2\) is (on average) zero because \(x_1\) and \(x_2\) are not correlated. If you regress \(x_1\) on \(x_2\), then its \(R^2\) is zero (on average).

\(Var(u_i)\)?


Answer \[ Var(error) = Var(u_i) = \sigma_u^2 \]

Question

\(Var(\widehat{\beta}_1)\) in \(EE_1\) \(\gtreqqless\) \(Var(\widehat{\beta}_1)\) in \(EE_2\)?


\(EE_1\)

  • \(R_j^2 = 0\)
  • \(Var(error) = Var(v_i) = \beta_2^2\cdot Var(x_{2,i}) + \sigma_u^2\)

\(EE_2\)

  • \(R_j^2 = 0\)
  • \(Var(error) = Var(u_i) = \sigma_u^2\)

Variance formula

\(Var(\widehat{\beta}_j)= \frac{Var(error)}{SST_j(1-R^2_j)}\)



Answer So, \(Var(\widehat{\beta}_1)\) in \(EE_1\) \(>\) \(Var(\widehat{\beta}_1)\) in \(EE_2\).

  • If you include a variable that has some explanatory power beyond \(x_1\) but is not correlated with \(x_1\) (\(EE_2\)), then the variance of the OLS estimator on \(x_1\) is smaller than when you do not include \(x_2\) (\(EE_1\)).

  • If you omit a variable that has some explanatory power beyond \(x_1\) but is not correlated with \(x_1\) (\(EE_1\)), then the OLS estimator on \(x_1\) is still unbiased.

Case 4

True Model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

  • \(cor(x_1,x_2) \ne 0\)
  • \(\beta_2 \ne 0\)
  • \(E[u_i|x_{1,i},x_{2,i}]=0\)

Example

\(\mbox{corn yield} = \beta_0 + \beta_1 \times N + \beta_2 \mbox{farmers' height} + u\)

We will estimate the following models:


\(EE_1\): \(y_i=\beta_0 + \beta_1 x_{1,i} + v_i\), where \(v_i = \beta_2 x_{2,i} + u_i\)

\(EE_2\): \(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

(Only \(x_1\) is included in \(EE_1\), while \(x_1\) and \(x_2\) are included in \(EE_2\))


Question

What do you think is going to happen? Any guesses?

  • \(E[\widehat{\beta}_1]=\beta_1\) in \(EE_1\)? (bias?)
  • \(E[\widehat{\beta}_1]=\beta_1\) in \(EE_2\)? (bias?)
  • \(Var(\widehat{\beta}_1)\) in \(EE_1\) \(\gtreqqless\) \(Var(\widehat{\beta}_1)\) in \(EE_2\)?

Set up simulations:
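Setup as before (illustrative values, not from the original):

set.seed(66023) # illustrative
N <- 100
B <- 1000
b1_ee1 <- b1_ee2 <- rep(0, B)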


Run MC simulations:
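A sketch of the loop for Case 4 (\(cor(x_1,x_2)\ne 0\) and \(\beta_2 \ne 0\)); the DGP lines match the code shown later in this section:

for (i in 1:B) {
  mu <- rnorm(N) # common component shared by x1 and x2
  x1 <- 0.1 * rnorm(N) + 0.9 * mu # highly correlated with x2
  x2 <- 0.1 * rnorm(N) + 0.9 * mu
  u <- rnorm(N)
  y <- 1 + x1 + 1 * x2 + u # beta_2 = 1: x2 matters and is correlated with x1
  b1_ee1[i] <- coef(lm(y ~ x1))[2] # biased: omits a relevant, correlated variable
  b1_ee2[i] <- coef(lm(y ~ x1 + x2))[2] # unbiased
}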


Visualize the results:
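Plotting as before: \(EE_1\)'s density is now centered away from the true \(\beta_1=1\) (bias), while \(EE_2\)'s is centered correctly but more spread out:

plot_df <- data.frame(
  estimate = c(b1_ee1, b1_ee2),
  model = rep(c("EE_1", "EE_2"), each = B)
)
ggplot(plot_df, aes(x = estimate, fill = model)) +
  geom_density(alpha = 0.4) +
  geom_vline(xintercept = 1, linetype = "dashed")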

True model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

  • \(cor(x_1,x_2) \ne 0\)
  • \(\beta_2 \ne 0\)
  • \(E[u_i|x_{1,i},x_{2,i}]=0\)

Estimated model

\(EE_1\): \(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (v_i = \beta_2 x_{2,i} + u_{i})\)


\(E[v_i|x_{1,i}]=0?\)


Answer

No, because \(x_1\) is correlated with \(x_2\) and \(\beta_2 \ne 0\).

So, there will be bias.

True model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

  • \(cor(x_1,x_2) \ne 0\)
  • \(\beta_2 \ne 0\)
  • \(E[u_i|x_{1,i},x_{2,i}]=0\)

Estimated model

\(EE_2\): \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)


\(E[u_i|x_{1,i},x_{2,i}]=0\)?


Answer Yes, because \(x_1\) and \(x_2\) are not correlated with \(u\) (by assumption). So, no bias.

True model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

  • \(cor(x_1,x_2) \ne 0\)
  • \(\beta_2 \ne 0\)
  • \(E[u_i|x_{1,i},x_{2,i}]=0\)

Variance

\(Var(\widehat{\beta}_j)= \frac{\sigma_v^2}{SST_j(1-R^2_j)}\)

where \(R^2_j\) is the \(R^2\) when you regress \(x_j\) on all the other covariates.

The estimated model

\(EE_1\): \(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (v_i = \beta_2 x_{2,i} + u_{i})\)


\(R_j^2\)?


Answer 0 because there are no other variables included in the model.

\(Var(v_i) = Var(\beta_2 x_{2,i} + u_i)\)?


Answer \[\begin{align} Var(error) & = Var(v_i) \\ & = Var(\beta_2 x_{2,i} + u_i) \\ & = \beta_2^2\cdot Var(x_{2,i}) + \sigma_u^2 \end{align}\] (the cross term drops out because \(x_2\) and \(u\) are uncorrelated)

True model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)

  • \(cor(x_1,x_2) \ne 0\)
  • \(\beta_2 \ne 0\)
  • \(E[u_i|x_{1,i},x_{2,i}]=0\)

Variance

\(Var(\widehat{\beta}_j)= \frac{\sigma_u^2}{SST_j(1-R^2_j)}\)

where \(R^2_j\) is the \(R^2\) when you regress \(x_j\) on all the other covariates.

The estimated model

\(EE_2\): \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)


\(R_j^2\)?


Answer

\(R_j^2\) is non-zero because \(x_1\) and \(x_2\) are correlated. If you regress \(x_1\) on \(x_2\), then its \(R^2\) is non-zero.

\(Var(u_i)\)?


Answer \[ Var(error) = Var(u_i) = \sigma_u^2 \]

Question

\(Var(\widehat{\beta}_1)\) in \(EE_1\) \(\gtreqqless\) \(Var(\widehat{\beta}_1)\) in \(EE_2\)?


\(EE_1\)

  • \(R_j^2 = 0\)
  • \(Var(error) = Var(v_i) = \beta_2^2\cdot Var(x_{2,i}) + \sigma_u^2\)

\(EE_2\)

  • \(R_j^2 \ne 0\)
  • \(Var(error) = Var(u_i) = \sigma_u^2\)

Variance formula

\(Var(\widehat{\beta}_j) = \frac{Var(error)}{SST_j(1-R^2_j)}\)



Answer It depends.

In the MC simulations we saw,

  • \(x_1\) and \(x_2\) are highly correlated, so \(R_j^2\) is very high for \(EE_2\)
x1 <- 0.1 * rnorm(N) + 0.9 * mu # shares the common component mu with x2
x2 <- 0.1 * rnorm(N) + 0.9 * mu # so cor(x1, x2) is very high


  • The impact of \(x_2\) is small (\(\beta_2 = 1\)), and the variance of \(x_2\) is small (approximately 1).
y <- 1 + x1 + 1 * x2 + u


These conditions led to a lower \(Var(\widehat{\beta}_1)\) in \(EE_1\) compared to \(Var(\widehat{\beta}_1)\) in \(EE_2\).

Now, let’s reverse the current conditions. We now have:

  • \(x_1\) and \(x_2\) are NOT highly correlated, so \(R_j^2\) is small for \(EE_2\)
  • The impact of \(x_2\) is large (\(\beta_2 = 5\)), and the variance of \(x_2\) is large (approximately 5).
x1 <- 0.9 * rnorm(N) + 0.1 * mu # now only weakly correlated with x2
x2 <- 2.23 * rnorm(N) + 0.1 * mu # Var(x2) is roughly 2.23^2, about 5
cor(x1, x2) # should be close to zero


y <- 1 + x1 + 5 * x2 + u


Let’s rerun MC simulations with this updated data generating process.
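A sketch of the rerun, reusing the same loop structure as before with the updated DGP:

for (i in 1:B) {
  mu <- rnorm(N)
  x1 <- 0.9 * rnorm(N) + 0.1 * mu # only weakly correlated with x2
  x2 <- 2.23 * rnorm(N) + 0.1 * mu # Var(x2) is roughly 5
  u <- rnorm(N)
  y <- 1 + x1 + 5 * x2 + u # beta_2 = 5
  b1_ee1[i] <- coef(lm(y ~ x1))[2]
  b1_ee2[i] <- coef(lm(y ~ x1 + x2))[2]
}
sapply(list(EE_1 = b1_ee1, EE_2 = b1_ee2), var) # EE_1 now has the much larger variance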

  • There exists a bias-variance trade-off when the independent variables are both important (their coefficients are non-zero) and correlated with each other

  • Economists tend to opt for unbiasedness

Omitted Variable Bias (Theory)

True model

\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)


\(EE_1\)

\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (v_i = \beta_2 x_{2,i} + u_{i})\)

Let \(\tilde{\beta}_1\) denote the estimator of \(\beta_1\) from this model


\(EE_2\)

\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)

Let \(\widehat{\beta}_1\) and \(\widehat{\beta}_2\) denote the estimators of \(\beta_1\) and \(\beta_2\) from this model


Relationship between \(x_1\) and \(x_2\)

\(x_{2,i} = \sigma_0 + \sigma_1 x_{1,i} + \mu_{i}\) (the auxiliary regression of the omitted \(x_2\) on the included \(x_1\))


Important

Then, \(E[\tilde{\beta}_1] = \beta_1 + \beta_2 \cdot \sigma_1\), where \(\beta_2 \cdot \sigma_1\) is the bias.

That is, if you omit \(x_2\) and regress \(y\) only on \(x_1\), then the bias is the product of the impact of \(x_2\) on \(y\) (\(\beta_2\)) and the slope from regressing \(x_2\) on \(x_1\) (\(\sigma_1\)).
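For completeness, here is a sketch of the standard derivation (conditional on the \(x\)'s). Substitute the true model for \(y_i\) into the OLS slope formula for \(EE_1\):

\[ \begin{aligned} \tilde{\beta}_1 & = \frac{\sum_i (x_{1,i}-\bar{x}_1) y_i}{\sum_i (x_{1,i}-\bar{x}_1)^2} \\ & = \beta_1 + \beta_2 \cdot \frac{\sum_i (x_{1,i}-\bar{x}_1) x_{2,i}}{\sum_i (x_{1,i}-\bar{x}_1)^2} + \frac{\sum_i (x_{1,i}-\bar{x}_1) u_i}{\sum_i (x_{1,i}-\bar{x}_1)^2} \end{aligned} \]

The last term has expectation zero because \(E[u_i|x_{1,i},x_{2,i}]=0\), and the middle ratio is exactly the OLS slope from regressing \(x_2\) on \(x_1\) (\(\sigma_1\)), which gives \(E[\tilde{\beta}_1] = \beta_1 + \beta_2 \cdot \sigma_1\).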

Direction of bias

  • \(Cor(x_1, x_2) > 0\) and \(\beta_2 >0\), then \(bias > 0\)
  • \(Cor(x_1, x_2) > 0\) and \(\beta_2 <0\), then \(bias < 0\)
  • \(Cor(x_1, x_2) < 0\) and \(\beta_2 >0\), then \(bias < 0\)
  • \(Cor(x_1, x_2) < 0\) and \(\beta_2 <0\), then \(bias > 0\)


Magnitude of bias

  • The greater the correlation between \(x_1\) and \(x_2\), the greater the bias

  • The greater \(\beta_2\) is in absolute value, the greater the bias

\[ \begin{aligned} \mbox{corn yield} = \alpha + \beta \cdot N + (\gamma \cdot \mbox{soil erodibility} + \mu) \end{aligned} \]

  • Farmers tend to apply more nitrogen to fields that are more erodible, to compensate for the loss of nutrients due to erosion
  • Soil erodibility affects corn yield negatively \((\gamma < 0)\)

What is the direction of bias on \(\hat{\beta}\)?

\[ \begin{aligned} \mbox{house price} = \alpha + \beta \cdot \mbox{dist to incinerators} + (\gamma \cdot \mbox{dist to city center} + \mu) \end{aligned} \]

  • The city planner placed incinerators on the outskirts of the city to avoid their potentially negative health effects
  • Distance to city center has a negative impact on house price \((\gamma < 0)\)

What is the direction of bias on \(\hat{\beta}\)?

\[ \begin{aligned} \mbox{groundwater use} = \alpha + \beta \cdot \mbox{precipitation} + (\gamma \cdot \mbox{center pivot} + \mu) \end{aligned} \]

\(\mbox{groundwater use}\): groundwater use by a farmer for irrigated production

\(\mbox{center pivot}\): 1 if a center pivot is used, 0 if flood irrigation (less efficient) is used

  • Farmers who receive relatively low precipitation during the growing season are more likely to adopt center pivots
  • Center pivots apply water more efficiently than flood irrigation \((\gamma < 0)\)

What is the direction of bias on \(\hat{\beta}\)?

When the direction of the bias is opposite to the expected sign of the coefficient on the variable of interest, you can claim that, even after suffering from the bias, you are still seeing the impact of the variable of interest. Indeed, it is strong evidence that, absent the bias, the estimated impact would have been even stronger.

\[ \begin{aligned} \mbox{groundwater use} = \alpha + \beta \cdot \mbox{precipitation} + (\gamma \cdot \mbox{center pivot} + \mu) \end{aligned} \]

  • The true \(\beta\) is \(-10\) ( you do not observe this )
  • The bias on \(\widehat{\beta}\) is \(5\) ( you do not observe this )
  • \(\widehat{\beta}\) is \(-5\) ( you only observe this )

You believe the direction of the bias is positive (you need to provide the reasoning behind your belief), and yet the estimated coefficient is still negative. So, you can be quite confident that the sign of the impact of precipitation is negative. You can say your estimate is a conservative estimate of the impact of precipitation on groundwater use.

\[ \begin{aligned} \mbox{house price} = \alpha + \beta \cdot \mbox{dist to incinerators} + (\gamma \cdot \mbox{dist to city center} + \mu) \end{aligned} \]

  • The true \(\beta\) is \(-10\) ( you do not observe this )
  • The bias on \(\widehat{\beta}\) is \(-5\) ( you do not observe this )
  • \(\widehat{\beta}\) is \(-15\) ( you only observe this )

You believe the direction of the bias is negative, and the estimated coefficient is negative. So, unlike the case above, you cannot be confident that \(\widehat{\beta}\) would have been negative if it were not for the bias (that is, had you observed dist to city center and included it as a covariate). It is entirely possible that the bias is so large that the estimated coefficient turns negative even though the true \(\beta\) is positive. In this case, there is nothing you can do.