# Data Generating Process, Variation, and Identification
### AECN 396/896-002

---


# Before we start

## Learning objectives

Understand

+ what a data generating process is
+ what variation in a variable is
+ what identification of the impact of a variable means

## Table of contents

1. [Data Generating Process and Clean Variations to Use](#dgp)
2. [Identification](#identify)

## Reference

The contents of this lecture borrow heavily from "The Effect" by Nick Huntington-Klein ([book available for free here](https://theeffectbook.net/)).

Huntington-Klein, N. (2021). The Effect: An Introduction to Research Design and Causality. Chapman and Hall/CRC.

---

# Data Generating Process and Clean Variations

---
class: middle

# Data Generating Process

The set of underlying laws that determine how the data we observe are created

<br>

+ We cannot see these laws directly (at least for economic phenomena)
+ But we do get to observe the data they generate

---
class: middle

# Example (Non-economic)

$$
`\begin{aligned}
F = \frac{G\times m_1 \times m_2}{r^2}
\end{aligned}`
$$

+ `$G$`: gravitational constant
+ `$m_1$`: mass of object 1
+ `$m_2$`: mass of object 2
+ `$r$`: distance between the two objects
+ `$F$`: force pulling the two objects together

+ This is the physical law (data generating process) that governs how an object (say, a ball) falls to the ground when you let go of it (ignoring wind). The observation that an object has fallen is data.

<br>

We did not know this data generating process until Newton discovered it. By looking at the data, he learned that the underlying process has to be the one above. We are trying to do the same.
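
To make "learning the process from data" concrete, here is a minimal sketch in base R. It assumes we can measure the force with a small multiplicative error (the error, the sample size, and the ranges of masses and distances are all made up for illustration), and then recovers the exponents of the law with a log-log regression.

```r
set.seed(123)

n <- 1000
G <- 6.674e-11 # gravitational constant (N m^2 / kg^2)

# Hypothetical experiments: random masses (kg) and distances (m)
m1 <- runif(n, 1, 100)
m2 <- runif(n, 1, 100)
r <- runif(n, 1, 10)

# Observed force = Newton's law x assumed multiplicative measurement error
force <- G * m1 * m2 / r^2 * exp(rnorm(n, sd = 0.05))

# Taking logs makes the law linear:
# log(F) = log(G) + 1 * log(m1) + 1 * log(m2) - 2 * log(r) + error
fit <- lm(log(force) ~ log(m1) + log(m2) + log(r))
coef(fit)
```

The estimated coefficients on `log(m1)`, `log(m2)`, and `log(r)` should come out close to 1, 1, and -2, and the exponential of the intercept should be close to `G`. We only see the data, but the data carry the footprint of the law that generated them.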

---
class: middle

# A toy example

+ Income is log-normally distributed
+ Being brown-haired gives you a 10% income boost 
+ 20% of people are naturally brown-haired 
+ Having a college degree gives you a 20% income boost 
+ 30% of people have college degrees 
+ 40% of people who don't have brown hair or a college degree will choose to dye their hair brown

<br>

You are interested in learning the impact of having brown-hair on income.

---
class: middle

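```r
library(dplyr) # tibble(), mutate(), filter(), and the %>% pipe
library(fixest) # feols() and i(), used in the regressions that follow
library(broom) # tidy()

set.seed(89403)

N <- 10000 #* number of observations

data <-
  tibble(
    brown_haired = runif(N) < 0.2, # TRUE if naturally brown-haired
    college = runif(N) < 0.3, # TRUE if the person has a college degree
    error = 0.1 * rnorm(N) #* error term
  ) %>%
  mutate(
    dye_to_brown = runif(N) < 0.4, # whether the person would dye their hair brown
    # only people without a college degree act on the choice to dye
    brown_haired = ifelse(
      dye_to_brown == TRUE & college == FALSE,
      TRUE,
      brown_haired
    )
  ) %>%
  mutate(
    # income is log-normal: log(income) is linear in the two dummies
    income = exp(0.1 * brown_haired + 0.2 * college + error)
  )
```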

---
class: middle

$$
`\begin{aligned}
\log(income) = \alpha + \beta \mbox{brown-haired} + v
\end{aligned}`
$$

<br>

`$\beta$` represents the approximate proportional difference in income between brown-haired and non-brown-haired people (that is, about `$100\beta$` percent, with non-brown-haired people as the baseline).
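
A short worked step, holding `$v$` fixed, shows where this reading of `$\beta$` comes from (the subscripts are labels introduced here for illustration):

$$
`\begin{aligned}
\frac{income_{brown} - income_{non}}{income_{non}} = e^{\beta} - 1 \approx \beta \quad \mbox{(for small } \beta \mbox{)}
\end{aligned}`
$$

For `$\beta = 0.1$`, `$e^{0.1} - 1 \approx 0.105$`, so calling it a 10% boost is a close approximation rather than an exact figure.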

<br>

`$E[v|\mbox{brown-haired}] = 0$`?

<br>

+ Correlation between `brown-haired` and `college`?
+ The sign of the impact of `college`?
+ So, the bias is (negative or positive)?

---
class: middle

```r
feols(log(income) ~ i(brown_haired), data = data) %>% tidy()
```

```
## # A tibble: 2 × 5
##   term               estimate std.error statistic  p.value
##   <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)          0.0831   0.00176      47.3 0       
## 2 brown_haired::TRUE   0.0497   0.00272      18.3 1.75e-73
```

Okay, as we expected, we are severely underestimating the impact of `brown_haired`: the estimate is about 0.05, roughly half of the true effect of 0.1.
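
The size of this bias can be worked out from the stated data generating process. The sketch below is plain arithmetic in base R; it uses the true coefficients (0.1 and 0.2), which we would not know in practice, purely to check the logic. All variable names are made up for this illustration.

```r
# Ingredients of the toy data generating process
p_college <- 0.3 # share with a college degree
p_natural <- 0.2 # share naturally brown-haired
p_dye <- 0.4 # share of non-college, non-brown people who dye
beta_brown <- 0.1 # true effect of brown hair on log(income)
beta_college <- 0.2 # true effect of college on log(income)

# Share observed as brown-haired, by college status
p_brown_college <- p_natural
p_brown_nocollege <- 1 - (1 - p_natural) * (1 - p_dye)
p_brown <- p_college * p_brown_college +
  (1 - p_college) * p_brown_nocollege

# Share with a college degree, by observed hair color (Bayes' rule)
p_college_brown <- p_college * p_brown_college / p_brown
p_college_not <- p_college * (1 - p_brown_college) / (1 - p_brown)

# Omitted variable bias: effect of college x difference in college shares
bias <- beta_college * (p_college_brown - p_college_not)
naive <- beta_brown + bias

bias # about -0.055
naive # about 0.045
```

Brown-haired people are less likely to have a college degree (about 14% versus 42%), and college raises income, so the bias is negative. The implied naive coefficient of about 0.045 is in the same ballpark as the estimate above; the gap is sampling noise.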

<br>

What should we do to get the estimate right?

---
class: middle

```r
feols(log(income) ~ i(brown_haired) + i(college), data = data) %>% tidy()
```

```
## # A tibble: 3 × 5
##   term               estimate std.error statistic p.value
##   <chr>                 <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)        -0.00201   0.00163     -1.23   0.217
## 2 brown_haired::TRUE  0.104     0.00213     49.1    0    
## 3 college::TRUE       0.201     0.00227     88.6    0
```

Okay, as we expected, we are good now: the estimate for `brown_haired` (0.104) is close to the true effect of 0.1.

---
class: middle

But let's think of another way to recover a good estimate of the impact of being `brown-haired`, using the information we have about the data generating process while still using the naive model that just regresses `log(income)` on `brown-haired`. Any ideas?

+ Income is log-normally distributed
+ Being brown-haired gives you a 10% income boost (<span style = "color: blue;"> pretend you do not know this, as this is the objective </span>)
+ 20% of people are naturally brown-haired 
+ Having a college degree gives you a 20% income boost 
+ 30% of people have college degrees 
+ 40% of people who don't have brown hair or a college degree will choose to dye their hair brown

---
class: middle

Notice that those with college degrees do not dye their hair brown. So, if we use only the observations for those people, we can cleanly identify the impact of `brown-haired`.

```r
feols(
  log(income) ~ i(brown_haired),
  data = filter(data, college == TRUE)
) %>%
  tidy()
```

```
## # A tibble: 2 × 5
##   term               estimate std.error statistic   p.value
##   <chr>                 <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)           0.199   0.00202      98.4 0        
## 2 brown_haired::TRUE    0.104   0.00448      23.1 1.84e-109
```

---
class: middle

# Variation

How a variable changes from observation to observation

<br>

+ Bad variations: variations in `brown_haired` across the entire sample

`brown_haired` is correlated with `college`, so the naive model confounds (mixes) the impacts of `brown_haired` and `college`.

+ Clean variations: variations in `brown_haired` only among those with college degrees

No one with a college degree dyes their hair brown. So, if we focus on (limit ourselves to) those people, `college` does not vary and variations in `brown_haired` cannot be correlated with it. That is why we were able to estimate the impact of `brown_haired` even with the simple model.

<br>

+ This is just a toy example, and we could simply have included `college` as a covariate.
+ But the point is to get you to start thinking about the different types of variation in a dataset.
+ There are "clean" and "dirty" variations. Limiting ourselves to only the "clean" variation is very important.
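
The contrast between the two kinds of variation can be sketched with a base-R re-simulation of the toy example. This is an illustration under the same stated rules, not a rerun of the code above: the seed and object names are arbitrary, so the numbers will differ slightly from the earlier output.

```r
set.seed(2024)

n <- 10000
college <- as.integer(runif(n) < 0.3)
natural_brown <- runif(n) < 0.2
dyes <- runif(n) < 0.4
# Observed hair color: natural brown, or dyed brown if no college degree
brown_haired <- as.integer(natural_brown | (dyes & college == 0))
log_income <- 0.1 * brown_haired + 0.2 * college + rnorm(n, sd = 0.1)
sim <- data.frame(log_income, brown_haired, college)

# Dirty variation: in the full sample, hair color moves with college
cor_full <- cor(sim$brown_haired, sim$college)
beta_full <- coef(lm(log_income ~ brown_haired, data = sim))["brown_haired"]

# Clean variation: among college graduates, college is constant
sim_college <- subset(sim, college == 1)
beta_clean <- coef(
  lm(log_income ~ brown_haired, data = sim_college)
)["brown_haired"]

c(cor_full = cor_full, beta_full = unname(beta_full),
  beta_clean = unname(beta_clean))
```

The full-sample correlation should be clearly negative and the full-sample coefficient well below 0.1, while the coefficient from the college-only subsample should be close to 0.1, even though both regressions use the same naive model.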

---
class: middle

# Key message through this toy example

By understanding the data generating process, we know why we cannot trust the naive estimation of the impact of `brown_haired` on `income`.

<br>

We make use of our knowledge about a part of the data generating process to identify the impact of `brown_haired` credibly:

"40% of people who don't have brown hair or a college degree will choose to dye their hair brown"

Of course, in real-world applications, we almost never have such clean and crucial information. But knowing the context of your study lets you make credible "assumptions" that help you find clean "variations", which you can then harness to estimate the impact of your variable of interest credibly.

---
class: middle

# A further example of looking for clean variations

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#identification_x_files/figure-html/unnamed-chunk-7-1.png" alt="Weekly Sales of Avocados in California, Jan 2015 - March 2018" width="60%" />
<p class="caption">Weekly Sales of Avocados in California, Jan 2015 - March 2018</p>
</div>

---
class: middle

.left4[
You are interested in understanding the impact of avocado price on its consumption.

<br>

Can you answer the research question from this figure?

]

.right6[
<br>
<br>
<br>
<br>
<br>
<img src="data:image/png;base64,#identification_x_files/figure-html/unnamed-chunk-8-1.png" width="80%" style="display: block; margin: auto;" />
]

---
class: middle

.left4[
+ They are negatively associated with each other
  - Avocado sales tend to be lower in weeks where the price of avocados is high. 
  - Prices tend to be higher in weeks where fewer avocados are sold

+ You cannot make a causal statement like this:

"An increase in avocado price makes consumers buy fewer avocados."

+ Reverse causality
  - price affects demand
  - demand affects price
]

.right6[
<br>
<br>
<br>
<br>
<br>
<img src="data:image/png;base64,#identification_x_files/figure-html/unnamed-chunk-9-1.png" width="80%" style="display: block; margin: auto;" />
]

---
class: middle

Reverse Causality: Price affects demand and demand affects price.

<br>

Suppose you can run an experiment on the avocado market (the ideal situation). If you want to identify the impact of price on demand without confusing it with the impact of demand on price, what would you do?

<br>

Now, suppose you learned the following fact after studying the supply and purchasing mechanism on the avocado market:

<span style = "color: blue;"> At the beginning of each month, avocado suppliers make a plan for what avocado prices will be each week in that month, and never change their plans until the next month. </span>

This means that, within a month, week-to-week changes in avocado price are not a function of how many avocados were bought in previous weeks, which effectively shuts down the causal effect of demand on price.

So, our estimation strategy is to use only the variations in demand and price <span style = "color: blue;"> within </span> individual months, and to ignore variations in price <span style = "color: blue;"> between</span> months.

---
class: middle

An example of clean variations in price and its impact on demand.

We will talk about how to use only the within-month variations in avocado price, leaving out the between-month variations, econometrically in R.
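
As a preview, here is a minimal sketch in base R of what "using only within-month variation" can look like. The data are simulated, not the avocado data from the figure, and every number in the simulation (200 months of 4 weeks, a true price effect of -1, the size of the demand shocks) is an assumption made for illustration.

```r
set.seed(396)

n_months <- 200
weeks <- 4
month <- rep(seq_len(n_months), each = weeks)
n <- n_months * weeks

# Month-level demand shock: raises quantity AND the prices planned
# for that month (this is the demand -> price channel, assumed)
demand_shock <- rnorm(n_months)[month]

# Weekly price: month-level response to demand plus pre-planned
# week-to-week changes that do not depend on demand
price <- 1.2 + 0.1 * demand_shock + rnorm(n, sd = 0.15)

# Log quantity: the true causal effect of price is -1 (assumed)
log_q <- 10 - 1 * price + 0.2 * demand_shock + rnorm(n, sd = 0.05)

# Naive: mixes within-month and between-month variation
beta_naive <- coef(lm(log_q ~ price))["price"]

# Within-month only, version 1: month dummies (fixed effects)
beta_fe <- coef(lm(log_q ~ price + factor(month)))["price"]

# Within-month only, version 2: subtract month means by hand
price_within <- price - ave(price, month)
q_within <- log_q - ave(log_q, month)
beta_demeaned <- coef(lm(q_within ~ price_within))["price_within"]

c(naive = unname(beta_naive), fe = unname(beta_fe),
  demeaned = unname(beta_demeaned))
```

In this setup the naive slope should be pulled toward zero, because months with high demand also have high prices. The two within-month versions give the same slope as each other, and it should be close to the assumed true effect of -1. Month dummies and manual demeaning are two ways of doing the same thing: throwing away the between-month variation.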

---
class: middle

# Key message through this example

By understanding the data generating process (knowing how an economic market works), we recognize the problem with simply looking at the relationship between avocado price and demand to draw conclusions about the causal impact of price on demand (reverse causality).

We study the context carefully, including how the avocado market works in California (of course, this is not how the CA avocado market actually works), and use that information to isolate the "clean" variations in avocado price that identify its impact on demand.