class: center, middle, inverse, title-slide

# Endogeneity
### AECN 396/896-002

---

# Before we start

## Learning objectives

Understand how endogeneity problems arise

## Table of contents

1. [Selection Bias](#select)
2. [Reverse Causality](#reverse)
3. [Measurement Error](#ME)

---
class: middle

# Endogeneity

.content-box-red[**Endogeneity**]
  
`$E[u|x_k] \ne 0$` for some `$k$` (the error term is correlated with at least one of the independent variables)

.content-box-red[**Endogenous independent variable**]

If the error term is, .red[for whatever reason], correlated with the independent variable `$x_k$`, then we say that `$x_k$` is an endogenous independent variable.

+ Omitted variable 
+ Selection 
+ Reverse causality 
+ Measurement error

---

# Omitted Variable

.content-box-red[**True Model**]

`$log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 ability + u$`

<br>

.content-box-red[**Incorrectly specified (your) model**]

`$log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper  + v \;\;(= u + \beta_3 ability)$`
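
A small simulation can make the consequence concrete. This is only a sketch under made-up assumptions: the coefficient values, the positive link between `ability` and `educ`, and the omission of `exper` (to keep the code short) are all hypothetical and not part of the model above.

```python
import random

def slope(x, y):
    """OLS slope from regressing y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(396)
n = 20000
beta_educ, beta_ability = 0.08, 0.05  # hypothetical true coefficients

ability = [rng.gauss(0, 1) for _ in range(n)]
# Assumption: more able people get more schooling, so educ and ability are correlated
educ = [12 + 2 * a + rng.gauss(0, 2) for a in ability]
log_wage = [1.5 + beta_educ * e + beta_ability * a + rng.gauss(0, 0.3)
            for e, a in zip(educ, ability)]

# Misspecified model: ability is left in the error term
b_short = slope(educ, log_wage)

# Correct model, computed by partialling ability out of educ first
# (regressing log_wage on the part of educ that ability does not explain)
g = slope(ability, educ)
m_educ, m_ability = sum(educ) / n, sum(ability) / n
educ_resid = [e - m_educ - g * (a - m_ability) for e, a in zip(educ, ability)]
b_long = slope(educ_resid, log_wage)

print("coefficient on educ, ability omitted: ", round(b_short, 4))
print("coefficient on educ, ability included:", round(b_long, 4))
```

With these made-up numbers, the coefficient from the misspecified model sits above the true 0.08 because `educ` picks up part of the effect of `ability`.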

---
class: inverse, center, middle
name: select

# Selection Bias

---
class: middle

.content-box-green[**Research Question**]

Does a soil moisture sensor reduce water use for farmers?

.content-box-green[**Data**]

Observational (non-experimental) data on soil moisture sensor adoption and irrigation amount

.content-box-green[**Model of interest**]

`$irrigation  = \beta_0 + \beta_1 sensor + u$`

+ `$irrigation$`: the amount of irrigation by the farmer
+ `$sensor$`: dummy variable that indicates whether the farmer has adopted a soil moisture sensor or not

.content-box-green[**Question**]

Is `$sensor$` endogenous (is `$sensor$` correlated with the error term)?

---
class: middle

Farmers do not adopt a soil moisture sensor at random. They use the information available to them to decide whether adopting it would be beneficial.

.content-box-green[**Adoption (selection) equation**]

`$sensor  = \alpha_0 + \alpha_1 x_1 + \dots + \alpha_k x_k + v$`

.content-box-green[**Question**]

Which variables do farmers look at when they decide whether or not to get a soil moisture sensor?

.content-box-green[**Question**]

Do any of the variables listed above also affect irrigation demand?

---
class: middle

.content-box-green[**Example**]

Soil quality/type (hard to accurately measure)

+ farmers whose fields are sandy are more likely to adopt a soil moisture sensor (this is just a conjecture)
+ farmers whose fields are sandy are likely to use more water

<br>

.content-box-green[**Key**]

Soil quality/type affects .red[both] the decision to adopt a soil moisture sensor and irrigation.

+ `$sensor$` is a function of soil quality/type
+ `$irrigation$` is a function of soil quality/type, which is not controlled for and therefore sits in the error term

---
class: middle

$$
`\begin{aligned}
irrigation  = \beta_0 + \beta_1 sensor(\mbox{soil type}) + u\;\;(= \beta_s \mbox{soil type} + v)
\end{aligned}`
$$
where `$v$` includes all the unobservable variables except soil type.

So, `$sensor$` and the error term in the irrigation model are correlated through soil type, leading to biased estimation of the impact of a sensor.
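
To see what the bias looks like, take expectations of the irrigation equation by adoption status. As a sketch, assuming `$v$` is mean-independent of `$sensor$` (an assumption added here for illustration), the simple difference in mean irrigation between adopters and non-adopters is

$$
`\begin{aligned}
E[irrigation|sensor=1] - E[irrigation|sensor=0] = \beta_1 + \beta_s \big(E[\mbox{soil type}|sensor=1] - E[\mbox{soil type}|sensor=0]\big)
\end{aligned}`
$$

The second term is the bias. It is non-zero whenever adopters and non-adopters differ in soil type and soil type affects irrigation `$(\beta_s \ne 0)$`.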

---
class: middle

.content-box-green[**Important**]

Selection bias is a form of omitted variable bias

.content-box-green[**Reasoning**]

If you can accurately measure the factors that are common to the two equations, you can simply include them explicitly in the main model.

For example,

$$
`\begin{aligned}
irrigation  = \beta_0 + \beta_1 sensor(\mbox{soil type}) + \beta_s \mbox{soil type} + u
\end{aligned}`
$$

This will get the common factor (soil type) out of the error in the main model, which means the adoption variable and the error term are no longer correlated in the main model.

.content-box-green[**Note**]

We call this kind of omitted variable bias ".red[selection bias]" because of the underlying mechanism through which the variable of interest (here, adoption of a technology) is made endogenous.

+ In this example, farmers self-selected into the adoption of a soil moisture sensor
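
The sensor example can be simulated to check this reasoning. This is a minimal sketch with made-up numbers: the true effects, the adoption probabilities, and a binary `sandy` indicator standing in for soil type are all hypothetical.

```python
import random

rng = random.Random(2024)
n = 20000
beta_sensor, beta_soil = -2.0, 3.0  # hypothetical: sensors save 2 units, sandy soil adds 3

rows = []
for _ in range(n):
    sandy = 1 if rng.random() < 0.5 else 0
    # Assumption: sandy fields adopt with probability 0.7, other fields with 0.2
    sensor = 1 if rng.random() < (0.7 if sandy else 0.2) else 0
    irrigation = 10 + beta_sensor * sensor + beta_soil * sandy + rng.gauss(0, 1)
    rows.append((sandy, sensor, irrigation))

def mean_irrigation(adopted, sandy=None):
    """Mean irrigation for a given adoption status, optionally within one soil type."""
    vals = [irr for s, d, irr in rows
            if d == adopted and (sandy is None or s == sandy)]
    return sum(vals) / len(vals)

# Soil type ignored: adopters vs. non-adopters
naive = mean_irrigation(1) - mean_irrigation(0)

# Soil type held fixed: compare adopters and non-adopters within each soil type
within = [mean_irrigation(1, s) - mean_irrigation(0, s) for s in (0, 1)]
controlled = sum(within) / len(within)

print("soil type ignored:   ", round(naive, 2))
print("soil type held fixed:", round(controlled, 2))
```

Ignoring soil type understates the water savings because adopters are disproportionately on sandy fields that need more water. Comparing within soil type, which mimics including soil type in the main model, gets close to the assumed effect of -2.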

---
class: inverse, center, middle
name: reverse

# Reverse Causality

---
class: middle

.content-box-green[**Research Question**]

Does a particular type of medical treatment improve health?

.content-box-green[**Data**]

Observational (non-experimental) .red[cross-sectional] data on a particular type of medical treatment and health

.content-box-green[**Model**]

`$health  = \beta_0 + \beta_1 treatment + u$`

+ health: indicator of the health of patients
+ treatment: dummy variable that indicates whether the patient is treated or not

This model simply compares the health of patients who have and have not had the treatment (there is no before-after comparison; yes, this is dumb).

.content-box-green[**Question**]

Is `$treatment$` endogenous? (Is `$treatment$` correlated with the error term?)

---
class: middle

.content-box-green[**Key**]

Whether patients get the treatment or not is not randomized; rather, it is determined by doctors (as in the real world).

.content-box-green[**Question**]

How do doctors decide whether to put their patients under a medical treatment?

.content-box-green[**Answer**]

Patients' health condition!!!

---
class: middle

.content-box-green[**Selection (treatment decision) model**]

`$\mbox{treatment}  = \alpha_0 + \alpha_1 \mbox{health} + v$`

+ `treatment` is affected by `health`
+ `health` is affected by `treatment`

.content-box-green[**Consequence**]

`treatment` is endogenous because it is a function of health itself!

.content-box-green[**Reverse Causality**]

This type of endogeneity problem is called .red[reverse causality] because the independent variable of interest is causally affected by the dependent variable even though your interest is in the estimation of the impact of the independent variable on the dependent variable.
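
A quick simulation shows how misleading the cross-sectional comparison can be. This is a sketch only: it assumes doctors decide on the basis of a noisy reading of pre-treatment health (one concrete way health can feed back into treatment), and the effect size and noise levels are invented.

```python
import random

rng = random.Random(7)
n = 20000
true_effect = 1.0  # hypothetical: treatment raises the health index by 1

treated, untreated = [], []
for _ in range(n):
    baseline = rng.gauss(0, 1)  # health before any treatment decision
    # Assumption: doctors treat patients whose (noisily observed) health is poor
    treat = 1 if baseline + rng.gauss(0, 0.5) < 0 else 0
    health = baseline + true_effect * treat
    (treated if treat else untreated).append(health)

# Cross-sectional comparison of treated and untreated patients
naive = sum(treated) / len(treated) - sum(untreated) / len(untreated)

print("true effect:           ", true_effect)
print("treated minus untreated:", round(naive, 2))
```

Even though the treatment helps every patient in this setup, treated patients look less healthy than untreated ones because poor health is what triggered the treatment.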

---
class: middle

# Another reverse causality example

.content-box-green[**Context**]

+ Under the Clean Water Act, some entities that discharge waste into water (e.g., oil refineries) must meet the water quality criteria that the law sets for their discharges.

+ The EPA (Environmental Protection Agency) can take enforcement actions (e.g., financial penalties) against those who violate the requirements.

.content-box-green[**Research Question**]

Are enforcement actions effective in improving the water quality of waste discharges?

.content-box-green[**Data**]

Annual data on

+ water quality measures of waste discharges by individual firms 
+ enforcement actions taken on firms by EPA

---
class: middle

.content-box-green[**Model of Interest**]

`$\mbox{water quality}  = \beta_0 + \beta_1 \mbox{enforcement actions} + u$`

.content-box-green[**Selection (enforcement decision) model**]

`$\mbox{enforcement actions}  = \alpha_0 + \alpha_1 \mbox{water quality} + v$`

+ `water quality` is affected by `enforcement actions`
+ `enforcement actions` is affected by `water quality`

.content-box-green[**Consequence**]

`enforcement actions` is endogenous because it is a function of water quality itself!
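
One way to see why, as a sketch: write the enforcement equation with its own coefficients `$\alpha_0, \alpha_1$` and error `$v$` (notation used here only to keep it distinct from the model of interest), assume `$Cov(u, v) = 0$` and `$\alpha_1\beta_1 \ne 1$`, and substitute the model of interest into it:

$$
`\begin{aligned}
\mbox{enforcement actions} & = \alpha_0 + \alpha_1 (\beta_0 + \beta_1 \mbox{enforcement actions} + u) + v \\
 & = \frac{\alpha_0 + \alpha_1\beta_0 + \alpha_1 u + v}{1 - \alpha_1\beta_1} \\
Cov(\mbox{enforcement actions}, u) & = \frac{\alpha_1}{1 - \alpha_1\beta_1}\sigma_u^2
\end{aligned}`
$$

So whenever water quality influences enforcement `$(\alpha_1 \ne 0)$`, `enforcement actions` is correlated with `$u$`.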

---

class: inverse, center, middle
name: ME

# Measurement Error

---
class: middle

# Measurement Error (ME)

.content-box-green[**Definition**]

Inaccuracy in the values observed as opposed to the actual values

<br>

.content-box-green[**Examples**]

+ reporting errors (any kind of survey has the potential of mis-reporting)
  * household survey on income and savings
  * survey on rice yield by farmers in developing countries
+ the use of estimated values
  + spatially interpolated weather conditions (precipitation)
  + imputed irrigation costs

<br>

.content-box-green[**Question**]

What are the consequences of having measurement errors in variables you use in regression?

---
class: middle

# ME in the Dependent Variable

.content-box-green[**True Model**]

`$y^*=  \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + u$`

with MLR.1 through MLR.6 satisfied (`$u$` is not correlated with any of the independent variables).

<br>

.content-box-green[**Measurement Errors**]

The difference between the observed `$(y)$` and actual `$(y^*)$` values

`$e = y-y^*$`

<br>

.content-box-green[**Estimable Model**]

Plugging the second equation into the first equation, your model is

`$y =  \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + v, \;\;\mbox{where}\;\; v = (u + e)$`

---
class: middle

.content-box-green[**Estimable Model**]

Plugging the second equation into the first equation, your model is

`$y =  \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + v, \;\;\mbox{where}\;\; v = (u + e)$`

<br>

.content-box-green[**Question**]

What are the conditions under which OLS estimators are unbiased?

<br>

.content-box-green[**Answer**]

`$E[e|x_1, \dots, x_k] = 0$`

So, as long as the measurement error is uncorrelated with the independent variables, OLS estimators are still unbiased.
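
The condition can be checked with a small simulation. This is a sketch with a single regressor and invented parameter values. The second case, in which the measurement error grows with `x`, is a hypothetical violation added for contrast.

```python
import random

def slope(x, y):
    """OLS slope from regressing y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(11)
n = 20000
beta_1 = 2.0  # hypothetical true slope

x = [rng.gauss(0, 1) for _ in range(n)]
y_star = [1 + beta_1 * xi + rng.gauss(0, 1) for xi in x]

# Case 1: measurement error in y that is unrelated to x
e_ok = [rng.gauss(0, 2) for _ in range(n)]
y_ok = [ys + e for ys, e in zip(y_star, e_ok)]

# Case 2: measurement error in y that grows with x (violates E[e|x] = 0)
e_bad = [0.5 * xi + rng.gauss(0, 2) for xi in x]
y_bad = [ys + e for ys, e in zip(y_star, e_bad)]

b_ok = slope(x, y_ok)
b_bad = slope(x, y_bad)

print("error unrelated to x:", round(b_ok, 2))
print("error related to x:  ", round(b_bad, 2))
```

In the first case the estimate stays close to the assumed slope of 2; the noise only makes it less precise. In the second case the estimate absorbs the extra 0.5 that the measurement error contributes per unit of `x`.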

---
class: middle

# ME in Independent Variables

.content-box-green[**True Model**]

Consider the following general model

`$y =  \beta_0 + \beta_1 x_1^* + u$`

with MLR.1 through MLR.6 satisfied.

.content-box-green[**Measurement Errors**]

The difference between the observed `$(x_1)$` and actual values `$(x_1^*)$`

`$e_1 = x_1-x_1^*$`

.content-box-green[**Estimable Model**]

Plugging the second equation into the first equation,

`$y =  \beta_0 + \beta_1 x_1 + v, \;\;\mbox{where}\;\; v = (u - \beta_1 e_1)$`

---
class: middle

.content-box-green[**Estimable Model**]

Plugging the second equation into the first equation,

`$y =  \beta_0 + \beta_1 x_1 + v, \;\;\mbox{where}\;\; v = (u - \beta_1 e_1)$`

.content-box-green[**Question**]

What are the conditions under which OLS estimators are unbiased?

.content-box-green[**Answer**]

`$E[e_1|x_1] = 0$`

Unfortunately, this does not hold under the classical errors-in-variables (CEV) assumption introduced next.

---
class: middle

.content-box-green[**Classical errors-in-variables (CEV)**]
  
The correctly observed variable `$(x_1^*)$` is uncorrelated with the measurement error `$(e_1)$`:

`$Cov(x_1^*, e_1) = 0$`

.content-box-green[**Under CEV**]

The incorrectly observed variable `$(x_1)$` must be correlated with the measurement error `$(e_1)$`:

`$Cov(x_1,e_1) = E[x_1 e_1]-E[x_1]E[e_1]$`

`$\quad\quad\quad\quad\quad = E[(x_1^*+e_1)e_1]-E[x_1^*+e_1]E[e_1]$`

`$\quad\quad\quad\quad\quad = E[x_1^*e_1]-E[x_1^*]E[e_1] + E[e_1^2]-E[e_1]^2$`

`$\quad\quad\quad\quad\quad = Cov(x_1^*, e_1) + Var(e_1) = \sigma_{e_1}^2$`

So, the mis-measured variable `$(x_1)$` is always correlated with the measurement error `$(e_1)$`.
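
This result is easy to check numerically. The sketch below draws a true variable and an independent measurement error (so CEV holds by construction). The distributions and the value of `sigma_e` are arbitrary choices for illustration.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (n - 1)

rng = random.Random(3)
n = 50000
sigma_e = 1.5  # hypothetical standard deviation of the measurement error

x_star = [rng.gauss(5, 2) for _ in range(n)]    # true values
e1 = [rng.gauss(0, sigma_e) for _ in range(n)]  # error drawn independently (CEV)
x1 = [xs + e for xs, e in zip(x_star, e1)]      # observed values

cov_true = cov(x_star, e1)  # should be near 0
cov_obs = cov(x1, e1)       # should be near sigma_e ** 2

print("Cov(x1*, e1):", round(cov_true, 3))
print("Cov(x1, e1): ", round(cov_obs, 3), "vs sigma_e^2 =", sigma_e ** 2)
```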

---
class: middle

.content-box-green[**Question**]

So, what is the direction of the bias?

.content-box-green[**Note**]

The sign of the bias in the coefficient on `$x_1$` is the sign of the correlation between `$x_1$` and `$v = (u - \beta_1 e_1)$`.

.content-box-green[**Bias?**]

+ Correlation between `$x_1$` and `$u$` is zero

+ The sign of the correlation between `$x_1$` and `$e_1$` is positive (see the previous slide), which means that the sign of the correlation between `$x_1$` and `$- \beta_1 e_1$` is the sign of `$- \beta_1$`.
  * if `$\beta_1 > 0$`, then the sign of the bias is negative 
  * if `$\beta_1 < 0$`, then the sign of the bias is positive
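
The size of the bias can also be made explicit. As a sketch, assume CEV together with `$Cov(x_1^*, u) = 0$` and `$Cov(e_1, u) = 0$`, and write `$\sigma_{x_1^*}^2$` for the variance of the true variable and `$\beta_1$` for the coefficient on `$x_1^*$` in the true model:

$$
`\begin{aligned}
\mbox{plim}\; \hat{\beta}_1 & = \frac{Cov(x_1, y)}{Var(x_1)} = \frac{Cov(x_1^* + e_1, \beta_0 + \beta_1 x_1^* + u)}{Var(x_1^* + e_1)} \\
 & = \beta_1 \frac{\sigma_{x_1^*}^2}{\sigma_{x_1^*}^2 + \sigma_{e_1}^2}
\end{aligned}`
$$

The multiplier lies between 0 and 1, so the estimate keeps the sign of the true coefficient but is smaller in magnitude.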

---
class: middle

.content-box-green[**Bias?**]

+ Correlation between `$x_1$` and `$u$` is zero

.content-box-red[**Attenuation Bias**]

+ So, the bias is such that your estimate of the coefficient on `$x_1$` is biased toward 0.

+ In other words, the estimated impact of a mis-measured independent variable will look smaller than it actually is.

(Imagine you mislabeled the treatment status of your experiment)
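
Attenuation can be seen in a short simulation. This is a sketch under CEV with invented values: the true variable and the measurement error both have variance 1, so the estimate using the mis-measured variable should be roughly half the true coefficient, whichever sign that coefficient has.

```python
import random

def slope(x, y):
    """OLS slope from regressing y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def estimate(beta_1, rng, n=20000):
    """Return (slope using the true x, slope using the mis-measured x)."""
    x_star = [rng.gauss(0, 1) for _ in range(n)]
    e1 = [rng.gauss(0, 1) for _ in range(n)]  # CEV: independent of x_star
    x1 = [xs + e for xs, e in zip(x_star, e1)]
    y = [1 + beta_1 * xs + rng.gauss(0, 1) for xs in x_star]
    return slope(x_star, y), slope(x1, y)

rng = random.Random(896)
results = {}
for beta_1 in (2.0, -2.0):  # hypothetical true coefficients
    results[beta_1] = estimate(beta_1, rng)
    b_true_x, b_noisy_x = results[beta_1]
    print("true beta_1:", beta_1,
          "| with true x:", round(b_true_x, 2),
          "| with mis-measured x:", round(b_noisy_x, 2))
```

In both cases the estimate based on the mis-measured variable is pulled toward zero rather than pushed in one fixed direction.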