class: center, middle, inverse, title-slide # Data Generating Process, Variation, and Identification ### AECN 396/896-002 --- class: middle <style type="text/css"> @media print { .has-continuation { display: block !important; } } .remark-slide-content.hljs-github h1 { margin-top: 5px; margin-bottom: 25px; } .remark-slide-content.hljs-github { padding-top: 10px; padding-left: 30px; padding-right: 30px; } .panel-tabs { <!-- color: #062A00; --> color: #841F27; margin-top: 0px; margin-bottom: 0px; margin-left: 0px; padding-bottom: 0px; } .panel-tab { margin-top: 0px; margin-bottom: 0px; margin-left: 3px; margin-right: 3px; padding-top: 0px; padding-bottom: 0px; } .panelset .panel-tabs .panel-tab { min-height: 40px; } .remark-slide th { border-bottom: 1px solid #ddd; } .remark-slide thead { border-bottom: 0px; } .gt_footnote { padding: 2px; } .remark-slide table { border-collapse: collapse; } .remark-slide tbody { border-bottom: 2px solid #666; } .important { background-color: lightpink; border: 2px solid blue; font-weight: bold; } .remark-code { display: block; overflow-x: auto; padding: .5em; background: #ffe7e7; } .hljs-github .hljs { background: #f2f2fd; } .remark-inline-code { padding-top: 0px; padding-bottom: 0px; background-color: #e6e6e6; } .r.hljs.remark-code.remark-inline-code{ font-size: 0.9em } .left-full { width: 80%; height: 92%; float: left; } .left-code { width: 38%; height: 92%; float: left; } .right-plot { width: 60%; float: right; padding-left: 1%; } .left5 { width: 49%; height: 92%; float: left; } .right5 { width: 49%; float: right; padding-left: 1%; } .left3 { width: 29%; height: 92%; float: left; } .right7 { width: 69%; float: right; padding-left: 1%; } .left4 { width: 38%; height: 92%; float: left; } .right6 { width: 60%; float: right; padding-left: 1%; } ul li{ margin: 7px; } ul, li{ margin-left: 15px; padding-left: 0px; } ol li{ margin: 7px; } ol, li{ margin-left: 15px; padding-left: 0px; } </style> <style type="text/css"> .content-box { box-sizing: border-box; background-color: #e2e2e2; } .content-box-blue, .content-box-gray, .content-box-grey, .content-box-army, .content-box-green, .content-box-purple, .content-box-red, .content-box-yellow { box-sizing: border-box; border-radius: 5px; margin: 0 0 10px; overflow: hidden; padding: 0px 5px 0px 5px; width: 100%; } .content-box-blue { background-color: #F0F8FF; } .content-box-gray { background-color: #e2e2e2; } .content-box-grey { background-color: #F5F5F5; } .content-box-army { background-color: #737a36; } .content-box-green { background-color: #d9edc2; } .content-box-purple { background-color: #e2e2f9; } .content-box-red { background-color: #ffcccc; } .content-box-yellow { background-color: #fef5c4; } .content-box-blue .remark-inline-code, .content-box-blue .remark-inline-code, .content-box-gray .remark-inline-code, .content-box-grey .remark-inline-code, .content-box-army .remark-inline-code, .content-box-green .remark-inline-code, .content-box-purple .remark-inline-code, .content-box-red .remark-inline-code, .content-box-yellow .remark-inline-code { background: none; } .full-width { display: flex; width: 100%; flex: 1 1 auto; } </style> <style type="text/css"> blockquote, .blockquote { display: block; margin-top: 0.1em; margin-bottom: 0.2em; margin-left: 5px; margin-right: 5px; border-left: solid 10px #0148A4; border-top: solid 2px #0148A4; border-bottom: solid 2px #0148A4; border-right: solid 2px #0148A4; box-shadow: 0 0 6px rgba(0,0,0,0.5); /* background-color: #e64626; */ color: #e64626; padding: 0.5em; -moz-border-radius: 5px; -webkit-border-radius: 5px; } .blockquote p { margin-top: 0px; margin-bottom: 5px; } .blockquote > h1:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h2:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h3:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h4:first-of-type { margin-top: 0px; margin-bottom: 5px; } .text-shadow { text-shadow: 0 0 4px #424242; } </style> <style type="text/css"> /****************** * Slide scrolling * (non-functional) * not sure if it is a good idea anyway slides > slide { overflow: scroll; padding: 5px 40px; } .scrollable-slide .remark-slide { height: 400px; overflow: scroll !important; } ******************/ .scroll-box-8 { height:8em; overflow-y: scroll; } .scroll-box-10 { height:10em; overflow-y: scroll; } .scroll-box-12 { height:12em; overflow-y: scroll; } .scroll-box-14 { height:14em; overflow-y: scroll; } .scroll-box-16 { height:16em; overflow-y: scroll; } .scroll-box-18 { height:18em; overflow-y: scroll; } .scroll-box-20 { height:20em; overflow-y: scroll; } .scroll-box-24 { height:24em; overflow-y: scroll; } .scroll-box-30 { height:30em; overflow-y: scroll; } .scroll-output { height: 90%; overflow-y: scroll; } </style> # Before we start ## Learning objectives Understand + what data generating process is + variation in a variable + identification of the impact of a variable ## Table of contents 1. [Data Generating Process and Clean Variations to Use](#dgp) 3. [Identification](#identify) ## Reference The contents of this lecture borrow heavily from "The Effect" by Nick Huntington-Klein ([book available for free here](https://theeffectbook.net/)). "Huntington-Klein, N. (2021). The effect: An introduction to research design and causality. Chapman and Hall/CRC." --- class: inverse, center, middle name: dgp # Data Generating Process and Clean Variations <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html> --- class: middle # Data Generating Process .content-box-red[**Definition**] The set of underlying laws that determine how the data we observed is created <br> .content-box-green[**Features**] + We cannot see them directly (at least for economic phenomenon) + But, we get to observe data generated from it --- class: middle # Example (Non-economic) $$ `\begin{aligned} F = \frac{G\times m_1 \times m_2}{r^2} \end{aligned}` $$ + `\(G\)`: gravitational constant + `\(m_1\)`: mass of object 1 + `\(m_2\)`: mass of object 2 + `\(r\)`: distance between the two objects + `\(F\)`: force pullinf the two objects together + This is the physical law (data generating process) that governs how an object (say a ball) fall to the ground when it is let go of your hand (ignoring wind). The observation that an object has fallen is data. <br> .content-box-green[**Key**] We did not know the data generating process until Newton discovers it. By looking at the data, he learned that the underlying process has to be the one above. We are trying to do the same. --- class: middle # A toy example .content-box-green[**Data Generating Process**] + Income is log-normally distributed + Being brown-haired gives you a 10% income boost + 20% of people are naturally brown-haired + Having a college degree gives you a 20% income boost + 30% of people have college degrees + 40% of people who don't have brown hair or a college degree will choose to dye their hair brown <br> .content-box-green[**Research Goal**] You are interested in learning the impact of having brown-hair on income. --- class: middle .content-box-green[**The data generating process in code**] ```r set.seed(89403) N <- 10000 #* number of observations data <- tibble( brown_haired = runif(N) < 0.2, # 1 if naturally brown haired college = runif(N) < 0.3, # 1 if have college degrees error = 0.1 * rnorm(N), #* error term ) %>% mutate( dye_to_brown = runif(N) < 0.4, # whether to dye hair to brown or not brown_haired = ifelse( dye_to_brown == TRUE & college == FALSE, TRUE, brown_haired ) ) %>% mutate( income = exp(0.1 * brown_haired + 0.2 * college + error) ) ``` --- class: middle .content-box-green[**Your Model**] $$ `\begin{aligned} log(income) = \alpha + \beta \mbox{brown-haired} + v \end{aligned}` $$ <br> -- .content-box-green[**Interpretation**] `\(\beta\)` represents the percentage difference in income between brown-hared and non-brown-haired people (baseline is the non-brown-haired people). <br> -- .content-box-green[**Diagnostics of the zero conditional mean assumption**] `\(E[v|\mbox{brown-haired}] = 0\)`? -- <br> .content-box-green[**Direction of bias?**] + Correlation between `brown-haired` and `college`? + The sign of the impact of `college`? + So, the bias is (negative or positive)? --- class: middle .content-box-green[**Naive model**] ```r feols(log(income) ~ i(brown_haired), data = data) %>% tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 0.0831 0.00176 47.3 0 ## 2 brown_haired::TRUE 0.0497 0.00272 18.3 1.75e-73 ``` Okay, as we expected, we are severely underestimating the impact of `brown_haired`. <br> .content-box-green[**Question**] What should we do to get the estimation right? --- class: middle .content-box-green[**Sensible model**] ```r feols(log(income) ~ i(brown_haired) + i(college), data = data) %>% tidy() ``` ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -0.00201 0.00163 -1.23 0.217 ## 2 brown_haired::TRUE 0.104 0.00213 49.1 0 ## 3 college::TRUE 0.201 0.00227 88.6 0 ``` Okay, as we expected, we are good now. --- class: middle But, let's think of another way to recover a good estimate of the impact of being `brown-haired` using the information we have about the data generating process while still using the naive model of just regressing `log(income)` on `brown-haired`. Any idea? + Income is log-normally distributed + Being brown-haired gives you a 10% income boost (<span style = "color: blue;"> pretend you do not know this, as this is the objective </span>) + 20% of people are naturally brown-haired + Having a college degree gives you a 20% income boost + 30% of people have college degrees + 40% of people who don't have brown hair or a college degree will choose to dye their hair brown --- class: middle .content-box-green[**Solution**] Notice that those with college degrees do not dye their hair to brown. So, if we just use the observations for those people, we can cleanly identify the impact of `brown-haired`. ```r feols( log(income) ~ i(brown_haired), data = filter(data, college == TRUE) ) %>% tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 0.199 0.00202 98.4 0 ## 2 brown_haired::TRUE 0.104 0.00448 23.1 1.84e-109 ``` --- class: middle .content-box-green[**Variation (Definition)**] How a variable changes from observation to observation <br> -- .content-box-green[**clean and bad variations**] + Bad variations: variations in `brown_haired` for the entire sample `brown_haired` is correlated with `college`, which made us confound (mix) the impact of `brown_haired` and `college`, when the naive model is used. + Clean variations: variations in `brown_haired` only for the samples with college degrees No ones with college degrees do not dye their to brown. So, if we just focus on (limit ourselves to) those people, variations in `brown_haired` is not correlated with `college`. So, we were able to estimate the impact of `brown_haired` even with the simple model. <br> -- .content-box-green[**Note**] + This is just a toy example and we could have just included `college` as a covariate. + But, this is just to get you start thinking about different types of variations there are in the dataset. + There are "clean" and "dirty" variations. Limiting ourselves to only the "clean" variation is very important. --- class: middle # Key message through this toy example .content-box-green[**Message 1**] By understanding the data generating process, we know why we cannot trust the naive estimation of the impact of `brown_haired` on `income`. <br> -- .content-box-green[**Message 2**] We make gguse of our knowledge about a part of the data generating process to identify the imapact of `brown_haired` credibly: "40% of people who don't have brown hair or a college degree will choose to dye their hair brown" Of course, in real world applications, we almost always do not have such a clean and crucial information. But, knowing the context of your study lets you make credible "assumptions" that will let you find clean "variations" that we can harness to estimated the impact of our variable of interest credibly. --- class: middle # A further example of looking for clean variations <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#identification_x_files/figure-html/unnamed-chunk-7-1.png" alt="Weekly Sales of Avocados in California, Jan 2015 - Match 2018" width="60%" /> <p class="caption">Weekly Sales of Avocados in California, Jan 2015 - Match 2018</p> </div> --- class: middle .left4[ <br> <br> You are interested in understanding the impact of avocado price on its consumption. <br> .content-box-green[**Questions**] Can you answer the research question from this figure? ] .right6[ <br> <br> <br> <br> <br> <img src="data:image/png;base64,#identification_x_files/figure-html/unnamed-chunk-8-1.png" width="80%" style="display: block; margin: auto;" /> ] --- class: middle .left4[ <br> <br> + They are negatively associated with each other - Avocado sales tend to be lower in weeks where the price of avocados is high. - Prices tend to be higher in weeks where fewer avocados are sold + You cannot make a causal statement like this: "An increase in avocado price make consumers buy less avocado." + Reverse causality - price affects demand - demand affects price ] .right6[ <br> <br> <br> <br> <br> <img src="data:image/png;base64,#identification_x_files/figure-html/unnamed-chunk-9-1.png" width="80%" style="display: block; margin: auto;" /> ] --- class: middle .content-box-green[**Problem**] Reverse Causality: Price affects demand and demand affects price. <br> .content-box-green[**Question**] Suppose your can run an experiment on the avocado market (ideal situation). If we want to identify the impact of price on demand free of confusing with the impact of demand on price, what would you do? -- <br> .content-box-green[**Special contextual knowledge**] Now, suppose you learned the following fact after studying the supply and purchasing mechanism on the avocado market: <span style = "color: blue;"> At the beginning of each month, avocado suppliers make a plan for what avocado prices will be each week in that month, and never change their plans until the next month. </span> -- This means that within the same month changes in avocado price every week is not a function of how much avocado has been bought in the previous weeks, effectively breaking the causal effect of demand on price. So, our estimation strategy would be to just look at the variations in demand and price <span style = "color: blue;"> within </span> individual months, but ignore variations in price <span style = "color: blue;"> between</span> months. --- class: middle An example of clean variations in price and its impact on demand. <img src="data:image/png;base64,#identification_x_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" /> We will talk about how we can use only the within-month variations in avocado price, but leave out the between-month variations in avocado price econometrically using R. --- class: middle # Key message through this example .content-box-green[**Message 1**] By understanding the data generating process (knowing how any economic market works), we recognize the problem of simply looking at the relationship between the avocado price and demand to conclude the causal impact of price on demand (reverse causality). .content-box-green[**Message 2**] We study the context very well and how the avocado market works in California (of course it is not really how CA avocado market works in reality) and make use of the information to identify the "clean" variations in avocado price to identify its impact on demand.