You often:
- face the decision of whether or not to include a particular variable: how do you make the right decision?
- miss a variable that you know is important simply because it is not available: what are the consequences?
Two important concepts you need to be aware of:
Definition: Multicollinearity
A phenomenon where two or more variables are highly correlated (negatively or positively) with each other (consequences?)
Definition: Omitted Variable Bias
Bias caused by not including (omitting) important variables in the model
Consider the following model,
\[ y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i \]
Your interest is in estimating the impact of \(x_1\) on \(y\).
Objective
Using this simple model, we investigate what happens to the coefficient estimate on \(x_1\) if you include/omit \(x_2\).
The model: \[y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\]
Case 1:
What happens if \(\beta_2=0\), but you include \(x_2\), which is not correlated with \(x_1\)?
Case 2:
What happens if \(\beta_2=0\), but you include \(x_2\), which is highly correlated with \(x_1\)?
Case 3:
What happens if \(\beta_2\ne 0\), but you omit \(x_2\), which is not correlated with \(x_1\)?
Case 4:
What happens if \(\beta_2\ne 0\), but you omit \(x_2\), which is highly correlated with \(x_1\)?
Is \(\widehat{\beta}_1\) unbiased, that is \(E[\widehat{\beta}_1]=\beta_1\)?
\(Var(\widehat{\beta}_1)\)? (how precise the estimate \(\widehat{\beta}_1\) is)
True Model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
Example
\(\mbox{corn yield} = \beta_0 + \beta_1 \times N + \beta_2 \times \mbox{farmers' height} + u\)
We will estimate the following models:
\(EE_1\): \(y_i=\beta_0 + \beta_1 x_{1,i} + v_i\), where \(v_i = \beta_2 x_{2,i} + u_i\)
\(EE_2\): \(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
(Only \(x_1\) is included in \(EE_1\), while \(x_1\) and \(x_2\) are included in \(EE_2\).)
Question
What do you think is going to happen? Any guesses?
Set up simulations:
Run MC simulations:
Visualize the results:
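The original simulation code chunks are not reproduced here. As a stand-in, below is a minimal Monte Carlo sketch in Python; the sample size, number of replications, and all parameter values (\(\beta_0=1\), \(\beta_1=2\), \(\beta_2=0\)) are assumptions for illustration, and summary statistics are printed in place of plots.

```python
# Case 1 sketch: beta_2 = 0 and x_2 is NOT correlated with x_1.
# All parameter values below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(42)
n, reps = 100, 1000
beta0, beta1, beta2 = 1.0, 2.0, 0.0  # beta_2 = 0: x_2 is irrelevant

b1_ee1 = np.empty(reps)  # beta_1-hat from EE_1 (x_2 omitted)
b1_ee2 = np.empty(reps)  # beta_1-hat from EE_2 (x_2 included)
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)  # drawn independently of x1
    y = beta0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    b1_ee1[r] = np.linalg.lstsq(np.column_stack([np.ones(n), x1]), y, rcond=None)[0][1]
    b1_ee2[r] = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2]), y, rcond=None)[0][1]

print("EE_1: mean =", b1_ee1.mean().round(3), " var =", b1_ee1.var().round(5))
print("EE_2: mean =", b1_ee2.mean().round(3), " var =", b1_ee2.var().round(5))
# Expected: both means close to beta_1 = 2 (no bias),
# and the two variances nearly identical.
```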
True model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
Estimated model
\(EE_1\): \(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (v_i = \beta_2 x_{2,i} + u_{i})\)
\(E[v_i|x_{1,i}]=0?\)
Yes, because \(\beta_2 = 0\) (so \(v_i = u_i\)) and \(x_1\) is not correlated with \(u\). So, no bias.
True model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
Estimated model
\(EE_2\): \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)
\(E[u_i|x_{1,i},x_{2,i}]=0\)?
True model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
Variance
\(Var(\widehat{\beta}_j)= \frac{\sigma_v^2}{SST_j(1-R^2_j)}\)
where \(R^2_j\) is the \(R^2\) when you regress \(x_j\) on all the other covariates.
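As a quick sanity check of this formula (not part of the original notes), the sketch below verifies numerically, with assumed illustrative values, that \(\sigma^2 / (SST_1(1-R_1^2))\) equals the corresponding diagonal element of \(\sigma^2 (X'X)^{-1}\):

```python
# Numerical check of Var(beta_j-hat) = sigma^2 / (SST_j * (1 - R_j^2)).
# All values here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 500, 1.5
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)  # make x1 and x2 correlated
X = np.column_stack([np.ones(n), x1, x2])

sst1 = ((x1 - x1.mean()) ** 2).sum()
# R_1^2: regress x1 on the other covariates (a constant and x2)
Z = np.column_stack([np.ones(n), x2])
resid = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
r2_1 = 1 - (resid ** 2).sum() / sst1

print("formula:", sigma**2 / (sst1 * (1 - r2_1)))
print("direct: ", sigma**2 * np.linalg.inv(X.T @ X)[1, 1])  # same number
```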
The estimated model
\(EE_1\): \(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (v_i = \beta_2 x_{2,i} + u_{i})\)
\(R_j^2\)?
\(Var(v_i) = Var(\beta_2 x_{2,i} + u_i)\)?
\[ Var(v_i) = Var(\beta_2 x_{2,i} + u_i) = \sigma_u^2 \] because \(\beta_2 = 0\).
True model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
Variance
\(Var(\widehat{\beta}_j)= \frac{\sigma_u^2}{SST_j(1-R^2_j)}\)
where \(R^2_j\) is the \(R^2\) when you regress \(x_j\) on all the other covariates.
The estimated model
\(EE_2\): \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)
\(R_j^2\)?
\(Var(u_i)\)?
Question
\(Var(\widehat{\beta}_1)\) in \(EE_1\) \(\gtreqqless\) \(Var(\widehat{\beta}_1)\) in \(EE_2\)?
\(EE_1\)
\(EE_2\)
Variance formula
\(Var(\widehat{\beta}_j)= \frac{Var(error)}{SST_j(1-R^2_j)}\)
If you include an irrelevant variable that has no explanatory power beyond \(x_1\) and is not correlated with \(x_1\) (\(EE_2\)), then the variance of the OLS estimator on \(x_1\) will be the same as when you do not include \(x_2\) as a covariate (\(EE_1\)).
If you omit an irrelevant variable that has no explanatory power beyond \(x_1\) and is not correlated with \(x_1\) (\(EE_1\)), then the OLS estimator on \(x_1\) is still unbiased.
True Model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
Example
\(\mbox{corn yield} = \beta_0 + \beta_1 \times N + \beta_2 \times \mbox{farmers' height} + u\)
We will estimate the following models:
\(EE_1\): \(y_i=\beta_0 + \beta_1 x_{1,i} + v_i\), where \(v_i = \beta_2 x_{2,i} + u_i\)
\(EE_2\): \(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
(Only \(x_1\) is included in \(EE_1\), while \(x_1\) and \(x_2\) are included in \(EE_2\).)
Question
What do you think is going to happen? Any guesses?
Set up simulations:
Run MC simulations:
Visualize the results:
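Again, the original code chunks are not shown; here is a hedged Python sketch, where the correlation level \(\rho = 0.9\) and all other values are assumptions:

```python
# Case 2 sketch: beta_2 = 0, but x_2 IS highly correlated with x_1.
import numpy as np

rng = np.random.default_rng(42)
n, reps, rho = 100, 1000, 0.9  # rho: assumed Cor(x1, x2)
beta0, beta1, beta2 = 1.0, 2.0, 0.0

b1_ee1, b1_ee2 = np.empty(reps), np.empty(reps)
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)  # Cor(x1, x2) = rho
    y = beta0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    b1_ee1[r] = np.linalg.lstsq(np.column_stack([np.ones(n), x1]), y, rcond=None)[0][1]
    b1_ee2[r] = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2]), y, rcond=None)[0][1]

print("EE_1: mean =", b1_ee1.mean().round(3), " var =", b1_ee1.var().round(5))
print("EE_2: mean =", b1_ee2.mean().round(3), " var =", b1_ee2.var().round(5))
# Expected: both unbiased, but Var(beta_1-hat) in EE_2 is inflated
# by roughly 1 / (1 - rho^2) relative to EE_1 (multicollinearity).
```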
True model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
Estimated model
\(EE_1\): \(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (v_i = \beta_2 x_{2,i} + u_{i})\)
\(E[v_i|x_{1,i}]=0?\)
Yes, because \(\beta_2 = 0\) means \(v_i = u_i\), and \(x_1\) is not correlated with \(u\). So, no bias.
True model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
Estimated model
\(EE_2\): \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)
\(E[u_i|x_{1,i},x_{2,i}]=0\)?
True model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
Variance
\(Var(\widehat{\beta}_j)= \frac{\sigma_v^2}{SST_j(1-R^2_j)}\)
where \(R^2_j\) is the \(R^2\) when you regress \(x_j\) on all the other covariates.
The estimated model
\(EE_1\): \(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (v_i = \beta_2 x_{2,i} + u_{i})\)
\(R_j^2\)?
\(Var(v_i) = Var(\beta_2 x_{2,i} + u_i)\)?
\[ Var(v_i) = Var(\beta_2 x_{2,i} + u_i) = \sigma_u^2 \] because \(\beta_2 = 0\).
True model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
Variance
\(Var(\widehat{\beta}_j)= \frac{\sigma_u^2}{SST_j(1-R^2_j)}\)
where \(R^2_j\) is the \(R^2\) when you regress \(x_j\) on all the other covariates.
The estimated model
\(EE_2\): \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)
\(R_j^2\)?
\(R_j^2\) is non-zero because \(x_1\) and \(x_2\) are correlated. If you regress \(x_1\) on \(x_2\), then its \(R^2\) is non-zero.
\(Var(u_i)\)?
Question
\(Var(\widehat{\beta}_1)\) in \(EE_1\) \(\gtreqqless\) \(Var(\widehat{\beta}_1)\) in \(EE_2\)?
\(EE_1\)
\(EE_2\)
Variance formula
\(Var(\widehat{\beta}_j)= \frac{Var(error)}{SST_j(1-R^2_j)}\)
If you include an irrelevant variable that has no explanatory power beyond \(x_1\) but is highly correlated with \(x_1\) (\(EE_2\)), then the variance of the OLS estimator on \(x_1\) is larger than when you do not include \(x_2\) (\(EE_1\)).
If you omit an irrelevant variable that has no explanatory power beyond \(x_1\) but is highly correlated with \(x_1\) (\(EE_1\)), then the OLS estimator on \(x_1\) is still unbiased.
True Model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
Example
\(\mbox{corn yield} = \beta_0 + \beta_1 \times N + \beta_2 \times \mbox{farmers' height} + u\)
We will estimate the following models:
\(EE_1\): \(y_i=\beta_0 + \beta_1 x_{1,i} + v_i\), where \(v_i = \beta_2 x_{2,i} + u_i\)
\(EE_2\): \(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
(Only \(x_1\) is included in \(EE_1\), while \(x_1\) and \(x_2\) are included in \(EE_2\).)
Question
What do you think is going to happen? Any guesses?
Set up simulations:
Run MC simulations:
Visualize the results:
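As before, a minimal Python sketch with assumed values (here \(\beta_2 = 1.5\), and \(x_2\) drawn independently of \(x_1\)):

```python
# Case 3 sketch: beta_2 != 0, but x_2 is NOT correlated with x_1.
import numpy as np

rng = np.random.default_rng(42)
n, reps = 100, 1000
beta0, beta1, beta2 = 1.0, 2.0, 1.5  # beta_2 != 0: x_2 matters for y

b1_ee1, b1_ee2 = np.empty(reps), np.empty(reps)
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)  # independent of x1
    y = beta0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    b1_ee1[r] = np.linalg.lstsq(np.column_stack([np.ones(n), x1]), y, rcond=None)[0][1]
    b1_ee2[r] = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2]), y, rcond=None)[0][1]

print("EE_1: mean =", b1_ee1.mean().round(3), " var =", b1_ee1.var().round(5))
print("EE_2: mean =", b1_ee2.mean().round(3), " var =", b1_ee2.var().round(5))
# Expected: both unbiased, but EE_1's error term absorbs beta_2 * x_2,
# so Var(beta_1-hat) is larger in EE_1 than in EE_2.
```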
True model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
Estimated model
\(EE_1\): \(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (v_i = \beta_2 x_{2,i} + u_{i})\)
\(E[v_i|x_{1,i}]=0?\)
Yes, because \(x_1\) is not correlated with either \(x_2\) or \(u\).
So, no bias.
True model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
Estimated model
\(EE_2\): \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)
\(E[u_i|x_{1,i},x_{2,i}]=0\)?
True model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
Variance
\(Var(\widehat{\beta}_j)= \frac{\sigma_v^2}{SST_j(1-R^2_j)}\)
where \(R^2_j\) is the \(R^2\) when you regress \(x_j\) on all the other covariates.
The estimated model
\(EE_1\): \(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (v_i = \beta_2 x_{2,i} + u_{i})\)
\(R_j^2\)?
\(Var(v_i) = Var(\beta_2 x_{2,i} + u_i)\)?
True model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
Variance
\(Var(\widehat{\beta}_j)= \frac{\sigma_u^2}{SST_j(1-R^2_j)}\)
where \(R^2_j\) is the \(R^2\) when you regress \(x_j\) on all the other covariates.
The estimated model
\(EE_2\): \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)
\(R_j^2\)?
\(R_j^2\) is (on average) zero because \(x_1\) and \(x_2\) are not correlated. If you regress \(x_1\) on \(x_2\), then its \(R^2\) is zero (on average).
\(Var(u_i)\)?
Question
\(Var(\widehat{\beta}_1)\) in \(EE_1\) \(\gtreqqless\) \(Var(\widehat{\beta}_1)\) in \(EE_2\)?
\(EE_1\)
\(EE_2\)
Variance formula
\(Var(\widehat{\beta}_j)= \frac{Var(error)}{SST_j(1-R^2_j)}\)
If you include a variable that has some explanatory power beyond \(x_1\) but is not correlated with \(x_1\) (\(EE_2\)), then the variance of the OLS estimator on \(x_1\) is smaller than when you do not include \(x_2\) (\(EE_1\)).
If you omit a variable that has some explanatory power beyond \(x_1\) but is not correlated with \(x_1\) (\(EE_1\)), then the OLS estimator on \(x_1\) is still unbiased.
True Model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
Example
\(\mbox{corn yield} = \beta_0 + \beta_1 \times N + \beta_2 \times \mbox{farmers' height} + u\)
We will estimate the following models:
\(EE_1\): \(y_i=\beta_0 + \beta_1 x_{1,i} + v_i\), where \(v_i = \beta_2 x_{2,i} + u_i\)
\(EE_2\): \(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
(Only \(x_1\) is included in \(EE_1\), while \(x_1\) and \(x_2\) are included in \(EE_2\).)
Question
What do you think is going to happen? Any guesses?
Set up simulations:
Run MC simulations:
Visualize the results:
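One more hedged Python sketch, with \(\beta_2 = 1.5\) and \(\rho = 0.9\) assumed:

```python
# Case 4 sketch: beta_2 != 0 AND x_2 is highly correlated with x_1.
import numpy as np

rng = np.random.default_rng(42)
n, reps, rho = 100, 1000, 0.9
beta0, beta1, beta2 = 1.0, 2.0, 1.5

b1_ee1, b1_ee2 = np.empty(reps), np.empty(reps)
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    y = beta0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    b1_ee1[r] = np.linalg.lstsq(np.column_stack([np.ones(n), x1]), y, rcond=None)[0][1]
    b1_ee2[r] = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2]), y, rcond=None)[0][1]

print("EE_1: mean =", b1_ee1.mean().round(3), " var =", b1_ee1.var().round(5))
print("EE_2: mean =", b1_ee2.mean().round(3), " var =", b1_ee2.var().round(5))
# Expected: EE_1 is biased (mean near beta_1 + beta_2 * rho = 3.35 here),
# while EE_2 is centered on beta_1 = 2 but with a larger variance than
# EE_1 under these particular parameter values.
```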
True model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
Estimated model
\(EE_1\): \(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (v_i = \beta_2 x_{2,i} + u_{i})\)
\(E[v_i|x_{1,i}]=0?\)
No, because \(x_1\) is correlated with \(x_2\) and \(\beta_2 \ne 0\).
So, there will be bias.
True model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
Estimated model
\(EE_2\): \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)
\(E[u_i|x_{1,i},x_{2,i}]=0\)?
True model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
Variance
\(Var(\widehat{\beta}_j)= \frac{\sigma_v^2}{SST_j(1-R^2_j)}\)
where \(R^2_j\) is the \(R^2\) when you regress \(x_j\) on all the other covariates.
The estimated model
\(EE_1\): \(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (v_i = \beta_2 x_{2,i} + u_{i})\)
\(R_j^2\)?
\(Var(v_i) = Var(\beta_2 x_{2,i} + u_i)\)?
True model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
Variance
\(Var(\widehat{\beta}_j)= \frac{\sigma_u^2}{SST_j(1-R^2_j)}\)
where \(R^2_j\) is the \(R^2\) when you regress \(x_j\) on all the other covariates.
The estimated model
\(EE_2\): \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)
\(R_j^2\)?
\(R_j^2\) is non-zero because \(x_1\) and \(x_2\) are correlated. If you regress \(x_1\) on \(x_2\), then its \(R^2\) is non-zero.
\(Var(u_i)\)?
Question
\(Var(\widehat{\beta}_1)\) in \(EE_1\) \(\gtreqqless\) \(Var(\widehat{\beta}_1)\) in \(EE_2\)?
\(EE_1\)
\(EE_2\)
Variance formula
\(Var(\widehat{\beta}_j) = \frac{Var(error)}{SST_j(1-R^2_j)}\)
In the MC simulations we just saw, the data generating process (the value of \(\beta_2\) and the degree of correlation between \(x_1\) and \(x_2\)) led to a lower \(Var(\widehat{\beta}_1)\) in \(EE_1\) than in \(EE_2\).
Now, let's reverse the current conditions and rerun the MC simulations with this updated data generating process.
There exists a bias-variance trade-off when the independent variables are both important (their coefficients are non-zero) and correlated with each other.
Economists tend to opt for unbiasedness.
True model
\(y_i=\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i\)
\(EE_1\)
\(y_i = \beta_0 + \beta_1 x_{1,i} + v_{i} \;\; (v_i = \beta_2 x_{2,i} + u_{i})\)
Let \(\tilde{\beta}_1\) denote the estimator of \(\beta_1\) from this model
\(EE_2\)
\(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_{i}\)
Let \(\widehat{\beta}_1\) and \(\widehat{\beta}_2\) denote the estimators of \(\beta_1\) and \(\beta_2\), respectively
Relationship between \(x_1\) and \(x_2\)
\(x_{2,i} = \sigma_0 + \sigma_1 x_{1,i} + \mu_{i}\)
Important
Then, \(E[\tilde{\beta}_1] = \beta_1 + \beta_2 \cdot \sigma_1\), where \(\beta_2 \cdot \sigma_1\) is the bias.
That is, if you omit \(x_2\) and regress \(y\) only on \(x_1\), then the bias is the product of the impact of \(x_2\) on \(y\) (\(\beta_2\)) and the coefficient on \(x_1\) when you regress \(x_2\) on \(x_1\) (\(\sigma_1\)).
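This formula is easy to confirm by simulation. Below is an illustrative Python sketch in which all parameter values (\(\beta_1 = 2\), \(\beta_2 = 1.5\), \(\sigma_1 = 0.7\)) are assumptions:

```python
# Check E[beta_1-tilde] = beta_1 + beta_2 * sigma_1 when x_2 is omitted.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 1000, 2000
beta1, beta2, sigma1 = 2.0, 1.5, 0.7  # sigma1 links x_2 to x_1

b1_tilde = np.empty(reps)
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = sigma1 * x1 + rng.normal(size=n)  # x_2 = sigma_0 + sigma_1 * x_1 + mu, sigma_0 = 0
    y = 1 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1])  # x_2 omitted on purpose
    b1_tilde[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]

print("mean of beta_1-tilde:   ", b1_tilde.mean().round(3))
print("beta_1 + beta_2*sigma_1:", beta1 + beta2 * sigma1)  # 3.05
```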
Direction of bias
Magnitude of bias
The greater the correlation between \(x_1\) and \(x_2\), the greater the bias
The greater \(\beta_2\) is in absolute value, the greater the bias
\[ \begin{aligned} \mbox{corn yield} = \alpha + \beta \cdot N + (\gamma \cdot \mbox{soil erodability} + \mu) \end{aligned} \]
What is the direction of bias on \(\hat{\beta}\)?
\[ \begin{aligned} \mbox{house price} = \alpha + \beta \cdot \mbox{dist to incinerators} + (\gamma \cdot \mbox{dist to city center} + \mu) \end{aligned} \]
What is the direction of bias on \(\hat{\beta}\)?
\[ \begin{aligned} \mbox{groundwater use} = \alpha + \beta \cdot \mbox{precipitation} + (\gamma \cdot \mbox{center pivot} + \mu) \end{aligned} \]
\(\mbox{groundwater use}\): groundwater use by a farmer for irrigated production
\(\mbox{center pivot}\): 1 if center pivot is used, 0 if flood irrigation (less effective) is used
What is the direction of bias on \(\hat{\beta}\)?
When the direction of the bias is the opposite of the expected sign of the coefficient on the variable of interest, you can claim that, even after suffering from the bias, you are still seeing the impact of the variable of interest. So, it is strong evidence that the estimated impact would have been even stronger had the variable not been omitted.
\[ \begin{aligned} \mbox{groundwater use} = \alpha + \beta \cdot \mbox{precipitation} + (\gamma \cdot \mbox{center pivot} + \mu) \end{aligned} \]
You believe the direction of the bias is positive (you need to provide reasoning behind your belief), and yet the estimated coefficient is still negative. So, you can be quite confident that the sign of the impact of precipitation is negative. You can say your estimate is a conservative estimate of the impact of precipitation on groundwater use.
\[ \begin{aligned} \mbox{house price} = \alpha + \beta \cdot \mbox{dist to incinerators} + (\gamma \cdot \mbox{dist to city center} + \mu) \end{aligned} \]
You believe the direction of the bias is negative, and the estimated coefficient is negative. So, unlike the case above, you cannot be confident that \(\widehat{\beta}\) would have been negative if it were not for the bias (that is, if you could observe dist to city center and include it as a covariate). It is very much possible that the degree of bias is so large that the estimated coefficient turns negative even though the true sign of \(\beta\) is positive. In this case, there is nothing you can do.