class: center, middle, inverse, title-slide # Dealing with Endogeneity: Instrumental Variable ### AECN 396/896-002 --- <style type="text/css"> @media print { .has-continuation { display: block !important; } } .remark-slide-content.hljs-github h1 { margin-top: 5px; margin-bottom: 25px; } .remark-slide-content.hljs-github { padding-top: 10px; padding-left: 30px; padding-right: 30px; } .panel-tabs { <!-- color: #062A00; --> color: #841F27; margin-top: 0px; margin-bottom: 0px; margin-left: 0px; padding-bottom: 0px; } .panel-tab { margin-top: 0px; margin-bottom: 0px; margin-left: 3px; margin-right: 3px; padding-top: 0px; padding-bottom: 0px; } .panelset .panel-tabs .panel-tab { min-height: 40px; } .remark-slide th { border-bottom: 1px solid #ddd; } .remark-slide thead { border-bottom: 0px; } .gt_footnote { padding: 2px; } .remark-slide table { border-collapse: collapse; } .remark-slide tbody { border-bottom: 2px solid #666; } .important { background-color: lightpink; border: 2px solid blue; font-weight: bold; } .remark-code { display: block; overflow-x: auto; padding: .5em; background: #ffe7e7; } .hljs-github .hljs { background: #f2f2fd; } .remark-inline-code { padding-top: 0px; padding-bottom: 0px; background-color: #e6e6e6; } .r.hljs.remark-code.remark-inline-code{ font-size: 0.9em } .left-full { width: 80%; height: 92%; float: left; } .left-code { width: 38%; height: 92%; float: left; } .right-plot { width: 60%; float: right; padding-left: 1%; } .left5 { width: 49%; height: 92%; float: left; } .right5 { width: 49%; float: right; padding-left: 1%; } .left3 { width: 29%; height: 92%; float: left; } .right7 { width: 69%; float: right; padding-left: 1%; } .left4 { width: 38%; height: 92%; float: left; } .right6 { width: 60%; float: right; padding-left: 1%; } ul li{ margin: 7px; } ul, li{ margin-left: 15px; padding-left: 0px; } ol li{ margin: 7px; } ol, li{ margin-left: 15px; padding-left: 0px; } </style> <style type="text/css"> .content-box { box-sizing: border-box; 
background-color: #e2e2e2; } .content-box-blue, .content-box-gray, .content-box-grey, .content-box-army, .content-box-green, .content-box-purple, .content-box-red, .content-box-yellow { box-sizing: border-box; border-radius: 5px; margin: 0 0 10px; overflow: hidden; padding: 0px 5px 0px 5px; width: 100%; } .content-box-blue { background-color: #F0F8FF; } .content-box-gray { background-color: #e2e2e2; } .content-box-grey { background-color: #F5F5F5; } .content-box-army { background-color: #737a36; } .content-box-green { background-color: #d9edc2; } .content-box-purple { background-color: #e2e2f9; } .content-box-red { background-color: #ffcccc; } .content-box-yellow { background-color: #fef5c4; } .content-box-blue .remark-inline-code, .content-box-blue .remark-inline-code, .content-box-gray .remark-inline-code, .content-box-grey .remark-inline-code, .content-box-army .remark-inline-code, .content-box-green .remark-inline-code, .content-box-purple .remark-inline-code, .content-box-red .remark-inline-code, .content-box-yellow .remark-inline-code { background: none; } .full-width { display: flex; width: 100%; flex: 1 1 auto; } </style> <style type="text/css"> blockquote, .blockquote { display: block; margin-top: 0.1em; margin-bottom: 0.2em; margin-left: 5px; margin-right: 5px; border-left: solid 10px #0148A4; border-top: solid 2px #0148A4; border-bottom: solid 2px #0148A4; border-right: solid 2px #0148A4; box-shadow: 0 0 6px rgba(0,0,0,0.5); /* background-color: #e64626; */ color: #e64626; padding: 0.5em; -moz-border-radius: 5px; -webkit-border-radius: 5px; } .blockquote p { margin-top: 0px; margin-bottom: 5px; } .blockquote > h1:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h2:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h3:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h4:first-of-type { margin-top: 0px; margin-bottom: 5px; } .text-shadow { text-shadow: 0 0 4px #424242; } </style> <style 
type="text/css"> /****************** * Slide scrolling * (non-functional) * not sure if it is a good idea anyway slides > slide { overflow: scroll; padding: 5px 40px; } .scrollable-slide .remark-slide { height: 400px; overflow: scroll !important; } ******************/ .scroll-box-8 { height:8em; overflow-y: scroll; } .scroll-box-10 { height:10em; overflow-y: scroll; } .scroll-box-12 { height:12em; overflow-y: scroll; } .scroll-box-14 { height:14em; overflow-y: scroll; } .scroll-box-16 { height:16em; overflow-y: scroll; } .scroll-box-18 { height:18em; overflow-y: scroll; } .scroll-box-20 { height:20em; overflow-y: scroll; } .scroll-box-24 { height:24em; overflow-y: scroll; } .scroll-box-30 { height:30em; overflow-y: scroll; } .scroll-output { height: 90%; overflow-y: scroll; } </style> # Before we start ## Learning objectives Understand how instrumental variable (IV) estimation works. ## Table of contents 1. [Instrumental Variable (IV) Approach](#inst) 2. [IV in R](#iv-r) --- class: middle # Endogeneity .content-box-red[**Endogeneity**] `\(E[u|x_k] \ne 0\)` (the error term is not correlated with any of the independent variables) -- .content-box-red[**Endogenous independent variable**] If the error term is, .red[for whatever reason], correlated with the independent variable `\(x_k\)`, then we say that `\(x_k\)` is an endogenous independent variable. + Omitted variable + Selection + Reverse causality + Measurement error --- class: inverse, center, middle name: inst # Instrumental Variable (IV) Approach <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html> --- class: middle # Causal Diagram You want to estimate the causal impact of education on income. + Variable of interest: Education + Dependent variable: Income
--- class: middle # Rough Idea of IV Approach Find a variable like `\(Z\)` in the diagram below:
+ `\(Z\)` does <span style = "color: blue;"> NOT </span> affect income <span style = "color: blue;"> directly </span> + `\(Z\)` is correlated with the variable of interest (education) - does not matter which causes which (association is enough) + `\(Z\)` is <span style = "color: blue;"> NOT </span> correlated with <span style = "color: red;"> any </span> of the unobservable variables in the error term (including ability) that are making the variable of interest (education) endogenous. - `\(Z\)` does not affect ability - ability does not affect `\(Z\)` --- class: middle .content-box-green[**The Model**] `\(y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + u\)` + `\(x_1\)` is endogenous: `\(E[u|x_1] \ne 0\)` (or `\(Cov(u,x_1)\ne 0\)`) + `\(x_2\)` is exogenous: `\(E[u|x_2] = 0\)` (or `\(Cov(u,x_2) = 0\)`) --- class: middle .content-box-green[**Idea (very loosely put)**] Bring in variable(s) (.blue[Instrumental variable(s)]) that does .red[NOT] belong to the model, but .red[IS] related to the endogenous variable: + Using the instrumental variable(s) (which we denote by `\(Z\)`), make the endogenous variable exogenous, which we call .blue[instrumented] variable(s) + Use the variation in the instrumented variable instead of the original endogenous variable to estimate the impact of the original variable --- class: middle # IV estimation procedure .content-box-green[**Step 1**] Using the instrumental variables, make the endogenous variable exogenous, which we call .blue[instrumented] variable -- <br> .content-box-green[**Step 1: mathematically**] + Regress the endogenous variable `\((x_1)\)` on the instrumental variable(s) `\((Z=\{z_1,z_2\}\)`, two instruments here) and all the other exogenous variables `\((x_2\)` here) `\(x_1 = \alpha_0 + \sigma_2 x_2 + \alpha_1 z_1 +\alpha_2 z_2 + v\)` -- + obtain the predicted value of `\(x_1\)` from the regression `\(\hat{x}_1 = \hat{\alpha}_0 + \hat{\sigma}_2 x_2 + \hat{\alpha}_1 z_1 + \hat{\alpha}_2 z_2\)` --- class: middle # IV estimation
procedure .content-box-green[**Step 2**] Use the variation in the instrumented variable instead of the original endogenous variable to estimate the impact of the original variable -- <br> .content-box-green[**Step 2: Mathematically**] Regress the dependent variable `\((y)\)` on the instrumented variable `\((\hat{x}_1)\)`, `\(y= \beta_0 + \beta_1 \hat{x_1}+ \beta_2 x_2 + \varepsilon\)` to estimate the coefficient on `\(x_1\)` in the original model --- class: middle # Example .content-box-green[**Model of interest**] `\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + (\beta_3 ability + v)\)` + Regress `\(log(wage)\)` on `\(educ\)` and `\(exper\)` `\((ability\)` not included because you do not observe it) + `\((\beta_3 ability + v)\)` is the error term + `\(educ\)` is considered endogenous (correlated with `\(ability\)`) + `\(exper\)` is considered exogenous (not correlated with `\(ability\)`) -- <br> .content-box-green[**Instruments (Z)**] Suppose you selected the following variables as instruments: + IQ test score `\((IQ)\)` + number of siblings `\((sibs)\)` --- class: middle .left4[ .content-box-green[**Step 1:**] Regress `\(educ\)` on `\(exper\)`, `\(IQ\)`, and `\(sibs\)`: `\(educ = \alpha_0 + \alpha_1 exper + \alpha_2 IQ + \alpha_3 sibs + u\)` Use the coefficient estimates of `\(\alpha_0\)`, `\(\alpha_1\)`, `\(\alpha_2\)`, and `\(\alpha_3\)` to predict `\(educ\)` as a function of `\(exper\)`, `\(IQ\)`, and `\(sibs\)`.
`\(\hat{educ} = \hat{\alpha_0} + \hat{\alpha_1} exper + \hat{\alpha_2} IQ + \hat{\alpha_3} sibs\)` ] .right6[ ```r library(wooldridge) library(fixest) library(dplyr) data("wage2") #* regress educ on exper, IQ, and sibs first_reg <- feols(educ ~ exper + IQ + sibs, data = wage2) #* predict educ as a function of exper, IQ, and sibs wage2 <- mutate(wage2, educ_hat = first_reg$fitted.values) #* see the predicted values wage2 %>% relocate(educ_hat) %>% head() ``` ``` ## educ_hat wage hours IQ KWW educ exper tenure age married black south urban sibs brthord meduc feduc lwage ## 1 13.26398 769 40 93 35 12 11 2 31 1 0 0 1 1 2 8 8 6.645091 ## 2 14.80686 808 50 119 41 18 11 16 37 1 0 0 1 1 NA 14 14 6.694562 ## 3 14.15410 825 40 108 46 14 11 9 33 1 0 0 1 1 2 14 14 6.715384 ## 4 12.79569 650 40 96 32 12 13 7 32 1 0 0 1 4 3 12 12 6.476973 ## 5 10.73631 562 40 74 27 11 14 5 34 1 0 0 1 10 6 6 11 6.331502 ## 6 14.09006 1400 40 116 43 16 14 2 35 1 1 0 1 1 2 8 NA 7.244227 ``` ] --- class: middle .left4[ .content-box-green[**Step 2:**] Use `\(\hat{educ}\)` in place of `\(educ\)` to estimate the model of interest: `\(log(wage) = \beta_0 + \beta_1 \hat{educ} + \beta_2 exper + u\)` ] .right6[ ```r #* regression with educ_hat in place of educ second_reg <- feols(wage ~ educ_hat + exper, data = wage2) #* see the results second_reg ``` ``` ## OLS estimation, Dep. Var.: wage ## Observations: 935 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1269.7880 214.12821 -5.93004 4.2632e-09 *** ## educ_hat 138.1051 13.10586 10.53766 < 2.2e-16 *** ## exper 31.7955 4.14489 7.67101 4.2899e-14 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 382.0 Adj. R2: 0.104547 ``` ] --- class: middle # When does IV work? Just like OLS needs to satisfy some conditions for it to consistently estimate the coefficients, the IV approach needs to satisfy some conditions for it to work.
.content-box-green[**Estimation Procedure**] + Step 1: `\(\hat{x}_1 = \hat{\alpha}_0 +\hat{\sigma}_2 x_2 + \hat{\alpha}_1 z_1 + \hat{\alpha}_2 z_2\)` + Step 2: `\(y = \beta_0 + \beta_1 \hat{x_1}+ \beta_2 x_2 + \varepsilon\)` <br> .content-box-green[**Important question**] What are the conditions under which IV estimation is consistent? The instruments `\((Z)\)` need to satisfy two conditions, which we will discuss. --- class: middle # Condition 1 --- class: middle .content-box-green[**Estimation Procedure**] + Step 1: `\(\hat{x}_1 = \hat{\alpha}_0 +\hat{\sigma}_2 x_2 + \hat{\alpha}_1 z_1 + \hat{\alpha}_2 z_2\)` + Step 2: `\(y = \beta_0 + \beta_1 \hat{x_1}+ \beta_2 x_2 + \varepsilon\)` <br> .content-box-green[**Question**] What happens if the instruments `\(Z\)` have no power to explain `\(x_1\)` `\((\alpha_1=0\)` and `\(\alpha_2=0)\)`? -- <br> .content-box-green[**Answer**] + `\(\hat{x}_1=\hat{\alpha}_0+\hat{\sigma}_2 x_2\)` + What happens to `\(\hat{\beta}_1\)`? It cannot be estimated: `\(\hat{x}_1\)` is now an exact linear function of `\(x_2\)`, so the second-stage regression suffers from perfect collinearity -- That is, `\(\hat{x_1}\)` has no information beyond the information `\(x_2\)` possesses. --- class: middle .content-box-red[**Condition 1**] The instrument(s) `\(Z\)` have jointly significant explanatory power on the endogenous variable `\(x_1\)` .red[after] you control for all the other exogenous variables (here `\(x_2)\)` --- class: middle center # Condition 2 --- class: middle .content-box-green[**Model of interest**] `\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u\)` <br> .content-box-green[**Estimation Procedure**] + Step 1: `\(\hat{x}_1 = \hat{\alpha}_0 +\hat{\sigma}_2 x_2 + \hat{\alpha}_1 z_1 + \hat{\alpha}_2 z_2\)` + Step 2: `\(y = \beta_0 + \beta_1 \hat{x_1}+ \beta_2 x_2 + \varepsilon\)` -- Remember you can break `\(x_1\)` into the predicted part and the residuals. `\(x_1 = \hat{x}_1 + \hat{\varepsilon}\)` where `\(\hat{\varepsilon}\)` is the residual of the first stage estimation.
-- Plugging `\(x_1 = \hat{x}_1 + \hat{\varepsilon}\)` into the model of interest, `\(y = \beta_0 + \beta_1 (\hat{x}_1 + \hat{\varepsilon}) + \beta_2 x_2+ u\)` `\(\;\;\; = \beta_0 + \beta_1 \hat{x}_1 + \beta_2 x_2+ (\beta_1\hat{\varepsilon} + u)\)` So, if you regress `\(y\)` on `\(\hat{x}_1\)` and `\(x_2\)`, then the error term is `\((\beta_1\hat{\varepsilon} + u)\)`. --- class: middle .content-box-green[**Second stage regression**] `\(y = \beta_0 + \beta_1 \hat{x}_1 + \beta_2 x_2+ (\beta_1\hat{\varepsilon} + u)\)` -- <br> .content-box-green[**Question**] What is the condition under which the OLS estimation of `\(\beta_1\)` in the main model is unbiased? -- <br> .content-box-green[**Answer**] `\(\hat{x}_1\)` is not correlated with `\((\beta_1\hat{\varepsilon} + u)\)` -- This in turn means that `\(x_2\)`, `\(z_1\)`, and `\(z_2\)` are not correlated with `\(u\)` (the error term of the true model). Note that `\(\hat{x}_1\)` is by construction uncorrelated (orthogonal) with `\(\hat{\varepsilon}\)`. --- class: middle .content-box-red[**Condition 2**] + `\(z_1\)` and `\(z_2\)` do not belong in the main model, meaning they do not have any explanatory power beyond `\(x_2\)` (if they did, they should have been included in the model as independent variables in the first place) + `\(z_1\)` and `\(z_2\)` are not correlated with the error term (there are no unobserved factors in the error term that are correlated with `\(Z)\)` --- class: middle .content-box-green[**Question**] Do you think we can test condition 2? <br> -- .content-box-green[**Answer**] No, because we never observe the error term. <br> -- .content-box-red[**Important**] + All we can do is to .red[argue] that the instruments are not correlated with the error term. -- + Journal articles that use the IV method make careful arguments as to why their choice of instruments is not correlated with the error term.
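--- class: middle .content-box-green[**A quick numerical check**]

The orthogonality claim above is a mechanical property of OLS: the fitted values and residuals from any OLS regression are uncorrelated by construction. A minimal sketch with made-up data (all variable names and coefficients below are hypothetical, not from the wage example):

```r
# Verify that OLS fitted values are orthogonal to the residuals.
set.seed(2837) # arbitrary seed
n <- 200
x2 <- runif(n)                         # an exogenous variable
z1 <- runif(n)                         # an instrument
x1 <- 0.5 * x2 + 0.8 * z1 + rnorm(n)   # the "endogenous" variable

first_stage <- lm(x1 ~ x2 + z1)
x1_hat <- fitted(first_stage)  # predicted part
eps_hat <- resid(first_stage)  # residual part

# x1 decomposes exactly into the predicted part plus the residual
all.equal(x1, as.numeric(x1_hat + eps_hat))

# the two parts are uncorrelated (zero up to machine precision)
cor(x1_hat, eps_hat)
```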
--- class: middle .content-box-red[**Condition 1**] + The instrument(s) `\(Z\)` have jointly significant explanatory power on the endogenous variable `\(x_1\)` .red[after] you control for all the other exogenous variables (here `\(x_2)\)` <br> .content-box-red[**Condition 2**] + `\(z_1\)` and `\(z_2\)` do not belong in the main model, meaning they do not have any explanatory power beyond `\(x_2\)` (if they did, they should have been included in the model as independent variables in the first place) + `\(z_1\)` and `\(z_2\)` are not correlated with the error term (there are no unobserved factors in the error term that are correlated with `\(Z)\)` -- <br> .content-box-red[**Important**] + Condition 1 is always testable + Condition 2 is NOT testable (unless you have more instruments than endogenous variables) --- class: middle .content-box-green[**Two-Stage Least Squares (2SLS)**] The IV estimator is also called the two-stage least squares (2SLS) estimator because it involves two stages of OLS. + Step 1: `\(\hat{x}_1 = \hat{\alpha}_0 +\hat{\sigma}_2 x_2 + \hat{\alpha}_1 z_1 + \hat{\alpha}_2 z_2\)` + Step 2: `\(y = \beta_0 + \beta_1 \hat{x_1}+ \beta_2 x_2 + \varepsilon\)` -- + The 2SLS framework is a good way to understand conceptually why and how instrumental variable estimation works + In practice, however, IV estimation is done in one step (running the two stages manually yields incorrect standard errors in the second stage) --- class: middle # Instrumental variable validity --- class: middle .content-box-green[**The model**] `\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + v \;\; ( = \beta_3 ability + u)\)` `educ` is endogenous because of its correlation with `ability`. <br> -- .content-box-green[**Question**] What conditions would a good instrument `\((z)\)` satisfy?
<br> -- .content-box-green[**Answer**] + `\(z\)` has explanatory power on `\(educ\)` .blue[after] you control for the impact of `\(exper\)` on `\(educ\)` + `\(z\)` is uncorrelated with `\(v\)` `\((ability\)` and all the other important unobservables) --- class: middle .content-box-green[**The model**] `\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + v \;\; ( = \beta_3 ability + u)\)` <br> .content-box-green[**An example of instruments**] The last digit of an individual's Social Security Number? (this has actually been used in some journal articles) <br> .content-box-green[**Question**] + Is it uncorrelated with `\(v\)` `\((ability\)` and all the other important unobservables)? -- + Does it have explanatory power on `\(educ\)` .blue[after] you control for the impact of `\(exper\)` on `\(educ\)`? --- class: middle .content-box-green[**The model**] `\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + v \;\; ( = \beta_3 ability + u)\)` <br> .content-box-green[**An example of instruments**] IQ test score <br> .content-box-green[**Question**] + Is it uncorrelated with `\(v\)` `\((ability\)` and all the other important unobservables)? -- + Does it have explanatory power on `\(educ\)` .blue[after] you control for the impact of `\(exper\)` on `\(educ\)`? --- class: middle .content-box-green[**The model**] `\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + v \;\; ( = \beta_3 ability + u)\)` <br> .content-box-green[**An example of instruments**] Mother's education <br> .content-box-green[**Question**] + Is it uncorrelated with `\(v\)` `\((ability\)` and all the other important unobservables)? -- + Does it have explanatory power on `\(educ\)` .blue[after] you control for the impact of `\(exper\)` on `\(educ\)`?
--- class: middle .content-box-green[**The model**] `\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + v \;\; ( = \beta_3 ability + u)\)` <br> .content-box-green[**An example of instruments**] Number of siblings <br> .content-box-green[**Question**] + Is it uncorrelated with `\(v\)` `\((ability\)` and all the other important unobservables)? -- + Does it have explanatory power on `\(educ\)` .blue[after] you control for the impact of `\(exper\)` on `\(educ\)`? --- class: inverse, center, middle name: iv-r # Implementation of Instrumental Variable (IV) Estimation in R <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html> --- class: middle .content-box-green[**Model**] `\(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + v \;\; (=\beta_3 ability + u)\)` We believe + `\(educ\)` is endogenous `\((x_1)\)` + `\(exper\)` is exogenous `\((x_2)\)` + we use the number of siblings `\((sibs)\)` and father's education `\((feduc)\)` as the instruments `\((Z)\)` -- <br> .content-box-green[**Terminology**] + exogenous variables included in the model (here, `\(exper\)`) are also called .blue[included instruments] + instruments that do not belong to the main model (here, `\(sibs\)` and `\(feduc\)`) are also called .blue[excluded instruments] + we refer to the collection of included and excluded instruments as .blue[instruments] --- class: middle .content-box-green[**Dataset**] ```r #--- take a look at the data ---# wage2 %>% select(wage, educ, sibs, feduc) %>% head() ``` ``` ## wage educ sibs feduc ## 1 769 12 1 8 ## 2 808 18 1 14 ## 3 825 14 1 14 ## 4 650 12 4 12 ## 5 562 11 10 11 ## 6 1400 16 1 NA ``` --- class: middle We can continue to use the `fixest` package to run the IV estimation.
```r library(fixest) ``` .content-box-green[**Syntax**] ```r feols(dep var ~ included instruments | first stage formula, data = dataset) ``` + `included instruments`: exogenous included variables (do not include endogenous variables here) -- <br> .content-box-green[**first stage formula**] ```r (endogenous vars ~ excluded instruments) ``` -- <br> .content-box-green[**Example**] ```r iv_res <- feols(log(wage) ~ exper | educ ~ sibs + feduc, data = wage2) ``` + `included variables`: * exogenous included variables: `exper` * endogenous included variables: `educ` + `instruments`: * included instruments: `exper` * excluded instruments: `sibs` and `feduc` --- class: middle .content-box-red[**IV regression results**] ```r iv_res ``` ``` ## TSLS estimation, Dep. Var.: log(wage), Endo.: educ, Instr.: sibs, feduc ## Second stage: Dep. Var.: log(wage) ## Observations: 741 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 4.507316 0.315735 14.27564 < 2.2e-16 *** ## fit_educ 0.137405 0.019215 7.15104 2.0766e-12 *** ## exper 0.037029 0.005694 6.50306 1.4502e-10 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 0.406208 Adj. R2: 0.049979 ## F-test (1st stage), educ: stat = 65.6 , p < 2.2e-16 , on 2 and 737 DoF. ## Wu-Hausman: stat = 13.2 , p = 3.051e-4, on 1 and 737 DoF. ## Sargan: stat = 0.230925, p = 0.630838, on 1 DoF. ``` <br> .content-box-green[**Note**] + When variable `x` is the endogenous variable, `fixest` renames `x` to `fit_x` in the displayed results. + Here, `educ` has become `fit_educ`.
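--- class: middle .content-box-green[**2SLS by hand: a simulated example**]

To see the mechanics end to end, here is a sketch on simulated data (every coefficient and variable name below is made up for illustration). The true effect of the endogenous variable is 1; OLS is biased because an unobserved "ability" term drives both the regressor and the error, while the manual two-step procedure recovers a value close to 1:

```r
set.seed(4862) # arbitrary seed
n <- 5000
ability <- rnorm(n)                            # unobservable; ends up in the error term
z  <- rnorm(n)                                 # excluded instrument (independent of ability)
x2 <- rnorm(n)                                 # exogenous control (included instrument)
x1 <- 0.7 * z + 0.5 * x2 + ability + rnorm(n)  # endogenous variable
y  <- 1 * x1 + 0.3 * x2 + ability + rnorm(n)   # true coefficient on x1 is 1

#--- OLS: biased away from 1 ---#
coef(lm(y ~ x1 + x2))["x1"]

#--- manual 2SLS: close to 1 ---#
x1_hat <- fitted(lm(x1 ~ z + x2))    # first stage
coef(lm(y ~ x1_hat + x2))["x1_hat"]  # second stage
```

Remember that the standard errors from the manual second stage are not valid; in practice you would let `feols()` do the estimation in one step as above.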
--- class: middle .content-box-green[**Comparison of OLS and IV Estimation Results**] <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> Model 1 </th> <th style="text-align:left;"> Model 2 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:left;"> 5.503*** </td> <td style="text-align:left;"> 4.507*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> (0.112) </td> <td style="text-align:left;"> (0.316) </td> </tr> <tr> <td style="text-align:left;"> educ </td> <td style="text-align:left;"> 0.078*** </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> (0.007) </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> exper </td> <td style="text-align:left;"> 0.020*** </td> <td style="text-align:left;"> 0.037*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> (0.003) </td> <td style="text-align:left;"> (0.006) </td> </tr> <tr> <td style="text-align:left;"> fit_educ </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 0.137*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> (0.019) </td> </tr> <tr> <td style="text-align:left;"> Num.Obs. </td> <td style="text-align:left;"> 935 </td> <td style="text-align:left;"> 741 </td> </tr> <tr> <td style="text-align:left;"> R2 </td> <td style="text-align:left;"> 0.131 </td> <td style="text-align:left;"> 0.053 </td> </tr> <tr> <td style="text-align:left;"> Std. 
errors </td> <td style="text-align:left;"> IID </td> <td style="text-align:left;"> IID </td> </tr> </tbody> <tfoot> <tr> <td style="padding: 0; border:0;" colspan="100%"> <sup></sup> * p < 0.1, ** p < 0.05, *** p < 0.01</td> </tr> </tfoot> </table> -- .content-box-green[**Question**] Do you think `\(sibs\)` and `\(feduc\)` are good instruments? + Condition 1: weak instruments? + Condition 2: uncorrelated with the error term? --- class: middle .content-box-green[**Weak Instrument Test**] We can always test if the excluded instruments are weak or not (test of condition 1). -- <br> .content-box-green[**How**] + Run the 1st stage regression `\(educ = \alpha_0 + \alpha_1 exper + \alpha_2 sibs + \alpha_3 feduc + v\)` -- + test the joint significance of `\(\alpha_2\)` and `\(\alpha_3\)` `\((F\)`-test) If excluded instruments `\((sibs\)` and `\(feduc\)`) are jointly significant, then it would mean that `\(sibs\)` and `\(feduc\)` are not weak instruments, satisfying condition 1. --- class: middle When we ran the IV estimation using `fixest::feols()` earlier, it automatically calculated the F-statistic for the weak instrument test. -- ```r iv_res ``` ``` ## TSLS estimation, Dep. Var.: log(wage), Endo.: educ, Instr.: sibs, feduc ## Second stage: Dep. Var.: log(wage) ## Observations: 741 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 4.507316 0.315735 14.27564 < 2.2e-16 *** ## fit_educ 0.137405 0.019215 7.15104 2.0766e-12 *** ## exper 0.037029 0.005694 6.50306 1.4502e-10 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 0.406208 Adj. R2: 0.049979 *## F-test (1st stage), educ: stat = 65.6 , p < 2.2e-16 , on 2 and 737 DoF. ## Wu-Hausman: stat = 13.2 , p = 3.051e-4, on 1 and 737 DoF. ## Sargan: stat = 0.230925, p = 0.630838, on 1 DoF. 
``` Here, the null hypothesis that the excluded instruments (`sibs` and `feduc`) have no explanatory power on the endogenous variable (`educ`) beyond the included instrument (`exper`) is rejected. --- class: middle Alternatively, you can access the `iv_first_stage` component of the regression results. ```r iv_res$iv_first_stage ``` ``` ## $educ ## TSLS estimation, Dep. Var.: educ, Endo.: educ, Instr.: sibs, feduc ## First stage: Dep. Var.: educ ## Observations: 741 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 14.075273 0.358595 39.25116 < 2.2e-16 *** ## sibs -0.131009 0.030800 -4.25357 2.3749e-05 *** ## feduc 0.205169 0.021909 9.36459 < 2.2e-16 *** ## exper -0.191535 0.016373 -11.69819 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 1.84505 Adj. R2: 0.319802 ## F-test (1st stage): stat = 65.6, p < 2.2e-16, on 2 and 737 DoF. ``` --- class: middle .content-box-green[**Notes**] + It is generally recommended that you have an `\(F\)`-stat of over `\(10\)` (this is not a clear-cut criterion that applies to all empirical cases) + Even if you reject the null, a small `\(F\)`-stat can still signal a problem + Passing this test tells you nothing about whether your excluded instruments satisfy Condition 2 + If you cannot reject the null, it is a strong indication that your instruments are weak. Look for other instruments. + Always, always report this test. There is no reason not to.
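--- class: middle .content-box-green[**Computing the first-stage F-test by hand**]

The joint F-test that `fixest` reports can also be reproduced with base R's `anova()`, by comparing the first stage with and without the excluded instruments. A self-contained sketch on simulated data (all names and coefficients here are made up):

```r
set.seed(5193) # arbitrary seed
n <- 1000
x2 <- rnorm(n)                                   # included instrument
z1 <- rnorm(n)                                   # excluded instrument 1
z2 <- rnorm(n)                                   # excluded instrument 2
x1 <- 0.4 * z1 + 0.3 * z2 + 0.5 * x2 + rnorm(n)  # endogenous variable

full <- lm(x1 ~ x2 + z1 + z2)  # unrestricted first stage
restricted <- lm(x1 ~ x2)      # first stage without the excluded instruments

#--- F-test of H0: the coefficients on z1 and z2 are jointly zero ---#
anova(restricted, full)
```

A large F-statistic (conventionally above 10) lets you reject the null that the excluded instruments have no explanatory power beyond `x2`.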
--- class: middle # Consequences of weak instruments .content-box-green[**Data generation**] ```r set.seed(73289) N <- 500 # number of observations u_common <- runif(N) # the term shared by the endogenous variable and the error term z_common <- runif(N) # the term shared by the endogenous variable and instruments x_end <- u_common + z_common + runif(N) # the endogenous variable z_strong <- z_common + runif(N) # strong instrument z_weak <- 0.01 * z_common + 0.99995 * runif(N) # weak instrument u <- u_common + runif(N) # error term y <- x_end + u # dependent variable data <- data.frame(y, x_end, z_strong, z_weak) ``` --- class: middle .content-box-green[**Correlation**] ```r cor(data) ``` ``` ## y x_end z_strong z_weak ## y 1.0000000 0.86492868 0.298704509 -0.108007146 ## x_end 0.8649287 1.00000000 0.419011491 -0.074224622 ## z_strong 0.2987045 0.41901149 1.000000000 0.003839565 ## z_weak -0.1080071 -0.07422462 0.003839565 1.000000000 ``` --- class: middle .content-box-green[**Estimation with the strong instrumental variable**] ```r #--- IV estimation (strong) ---# iv_strong <- feols(y ~ 1 | x_end ~ z_strong, data = data) ``` <br> .content-box-green[**Estimation with the weak instrumental variable**] ```r #--- IV estimation (weak) ---# iv_weak <- feols(y ~ 1 | x_end ~ z_weak, data = data) ``` --- class: middle ```r #--- coefs (strong) ---# tidy(iv_strong) ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 0.883 0.133 6.64 8.20e-11 ## 2 fit_x_end 1.09 0.0856 12.7 2.96e-32 ``` ```r #--- coefs (weak) ---# tidy(iv_weak) ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -0.862 1.10 -0.784 0.434 ## 2 fit_x_end 2.22 0.714 3.11 0.00197 ``` .content-box-green[**Question**] Any notable differences? -- The coefficient estimate on `\(x\_end\)` is far away from the true value in the weak instrument case. 
--- class: middle .content-box-green[**Comparison of the weak instrument tests**] .scroll-box-10[ ```r #--- diagnostics (strong) ---# iv_strong$iv_first_stage ``` ``` ## $x_end ## TSLS estimation, Dep. Var.: x_end, Endo.: x_end, Instr.: z_strong ## First stage: Dep. Var.: x_end ## Observations: 500 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.020267 0.054304 18.7881 < 2.2e-16 *** ## z_strong 0.507831 0.049312 10.2983 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 0.441428 Adj. R2: 0.173915 ## F-test (1st stage): stat = 106.1, p < 2.2e-16, on 1 and 498 DoF. ``` ] .scroll-box-10[ ```r #--- diagnostics (weak) ---# iv_weak$iv_first_stage ``` ``` ## $x_end ## TSLS estimation, Dep. Var.: x_end, Endo.: x_end, Instr.: z_weak ## First stage: Dep. Var.: x_end ## Observations: 500 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.602495 0.042885 37.36745 < 2.2e-16 *** ## z_weak -0.124495 0.074953 -1.66097 0.097348 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 0.484824 Adj. R2: 0.003512 ## F-test (1st stage): stat = 2.75883, p = 0.097348, on 1 and 498 DoF. ``` ] <br> .content-box-green[**Question**] Any notable differences? -- You cannot reject the null hypothesis of weak instrument in the weak instrument case. 
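--- class: middle .content-box-green[**Why weak instruments inflate the variance**]

A standard textbook result helps interpret the simulation results (stated here for the simple case of one regressor and one instrument, with no other covariates): for the model `\(y = \beta_0 + \beta_1 x_1 + u\)` estimated by IV with instrument `\(z\)`, the approximate variance of the IV estimator is `\(Var(\hat{\beta}_{1,IV}) = \frac{\sigma^2}{SST_{x_1}\, \rho_{x_1,z}^2}\)` where `\(\sigma^2\)` is the error variance, `\(SST_{x_1}\)` is the total variation in `\(x_1\)`, and `\(\rho_{x_1,z}\)` is the correlation between `\(x_1\)` and `\(z\)`. The weaker the instrument (the smaller `\(\rho_{x_1,z}^2\)`), the larger the variance of `\(\hat{\beta}_{1,IV}\)`.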
--- class: middle .content-box-green[**MC simulation**] ```r B <- 1000 # the number of experiments beta_hat_store <- matrix(0, B, 2) # storage of beta hat for (i in 1:B) { #--- data generation ---# u_common <- runif(N) z_common <- runif(N) x_end <- u_common + z_common + runif(N) z_strong <- z_common + runif(N) z_weak <- 0.01 * z_common + 0.99995 * runif(N) u <- u_common + runif(N) y <- x_end + u data <- data.frame(y, x_end, z_strong, z_weak) #--- IV estimation with a strong instrument ---# iv_strong <- feols(y ~ 1 | x_end ~ z_strong, data = data) beta_hat_store[i, 1] <- iv_strong$coefficients[2] #--- IV estimation with a weak instrument ---# iv_weak <- feols(y ~ 1 | x_end ~ z_weak, data = data) beta_hat_store[i, 2] <- iv_weak$coefficients[2] } ``` --- class: middle .content-box-green[**Visualization of the MC Results**] <img src="data:image/png;base64,#iv_x_files/figure-html/unnamed-chunk-24-1.png" width="70%" style="display: block; margin: auto;" /> --- class: middle .content-box-green[**Visualization of the MC Results**] <img src="data:image/png;base64,#iv_x_files/figure-html/unnamed-chunk-25-1.png" width="70%" style="display: block; margin: auto;" /> --- class: middle .content-box-green[**Flow of IV Estimation in Practice**] + Identify endogenous variable(s) and included instrument(s) + Identify potential excluded instrument(s) + .red[Argue] why the excluded instrument(s) you pick is uncorrelated with the error term (.content-box-red[**condition 2**]) + Once you decide what variable(s) to use as excluded instruments, .red[test] whether the excluded instrument(s) is weak or not ( .content-box-red[**condition 1**]) + Implement IV estimation and report the results --- class: middle You can include fixed effects in your IV estimation. .content-box-green[**Syntax**] ```r feols(dep var ~ included instruments | FE | 1st stage formula, data = dataset) ``` .content-box-green[**Example**] Include `married` and `south` as fixed effects.
```r feols(log(wage) ~ exper | married + south | educ ~ feduc + sibs, data = wage2) ``` ``` ## TSLS estimation, Dep. Var.: log(wage), Endo.: educ, Instr.: feduc, sibs ## Second stage: Dep. Var.: log(wage) ## Observations: 741 ## Fixed-effects: married: 2, south: 2 ## Standard-errors: Clustered (married) ## Estimate Std. Error t value Pr(>|t|) ## fit_educ 0.124355 0.003627 34.2906 0.018560 * ## exper 0.032128 0.002260 14.2144 0.044713 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 0.391178 Adj. R2: 0.116588 ## Within R2: 0.069595 ## F-test (1st stage), educ: stat = 61.1 , p < 2.2e-16 , on 2 and 736 DoF. ## Wu-Hausman: stat = 8.98498 , p = 0.002814, on 1 and 735 DoF. ## Sargan: stat = 0.169226, p = 0.6808 , on 1 DoF. ``` --- class: middle Clustered SE? You can just add `cluster = ` option just like we previously did. ```r feols(log(wage) ~ exper | married + south | educ ~ feduc + sibs, cluster = ~black, data = wage2) ``` ``` ## TSLS estimation, Dep. Var.: log(wage), Endo.: educ, Instr.: feduc, sibs ## Second stage: Dep. Var.: log(wage) ## Observations: 741 ## Fixed-effects: married: 2, south: 2 ## Standard-errors: Clustered (black) ## Estimate Std. Error t value Pr(>|t|) ## fit_educ 0.124355 0.005258 23.6526 0.026899 * ## exper 0.032128 0.002798 11.4842 0.055295 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 0.391178 Adj. R2: 0.116588 ## Within R2: 0.069595 ## F-test (1st stage), educ: stat = 61.9 , p < 2.2e-16 , on 2 and 735 DoF. ## Wu-Hausman: stat = 8.98498 , p = 0.002814, on 1 and 735 DoF. ## Sargan: stat = 0.169226, p = 0.6808 , on 1 DoF. ```