class: center, middle, inverse, title-slide

# Multivariate Regression

### AECN 896-002

---
class: middle

<style type="text/css">
@media print {
  .has-continuation { display: block !important; }
}
.remark-slide-content.hljs-github h1 { margin-top: 5px; margin-bottom: 25px; }
.remark-slide-content.hljs-github { padding-top: 10px; padding-left: 30px; padding-right: 30px; }
.panel-tabs { /* color: #062A00; */ color: #841F27; margin-top: 0px; margin-bottom: 0px; margin-left: 0px; padding-bottom: 0px; }
.panel-tab { margin-top: 0px; margin-bottom: 0px; margin-left: 3px; margin-right: 3px; padding-top: 0px; padding-bottom: 0px; }
.panelset .panel-tabs .panel-tab { min-height: 40px; }
.remark-slide th { border-bottom: 1px solid #ddd; }
.remark-slide thead { border-bottom: 0px; }
.gt_footnote { padding: 2px; }
.remark-slide table { border-collapse: collapse; }
.remark-slide tbody { border-bottom: 2px solid #666; }
.important { background-color: lightpink; border: 2px solid blue; font-weight: bold; }
.remark-code { display: block; overflow-x: auto; padding: .5em; background: #ffe7e7; }
.hljs-github .hljs { background: #f2f2fd; }
.remark-inline-code { padding-top: 0px; padding-bottom: 0px; background-color: #e6e6e6; }
.r.hljs.remark-code.remark-inline-code { font-size: 0.9em; }
.left-full { width: 80%; height: 92%; float: left; }
.left-code { width: 38%; height: 92%; float: left; }
.right-plot { width: 60%; float: right; padding-left: 1%; }
.left5 { width: 49%; height: 92%; float: left; }
.right5 { width: 49%; float: right; padding-left: 1%; }
.left3 { width: 29%; height: 92%; float: left; }
.right7 { width: 69%; float: right; padding-left: 1%; }
.left4 { width: 38%; height: 92%; float: left; }
.right6 { width: 60%; float: right; padding-left: 1%; }
ul li { margin: 7px; }
ul, li { margin-left: 15px; padding-left: 0px; }
ol li { margin: 7px; }
ol, li { margin-left: 15px; padding-left: 0px; }
</style>

<style type="text/css">
.content-box { box-sizing: border-box; background-color: #e2e2e2; }
.content-box-blue, .content-box-gray, .content-box-grey, .content-box-army, .content-box-green, .content-box-purple, .content-box-red, .content-box-yellow { box-sizing: border-box; border-radius: 5px; margin: 0 0 10px; overflow: hidden; padding: 0px 5px 0px 5px; width: 100%; }
.content-box-blue { background-color: #F0F8FF; }
.content-box-gray { background-color: #e2e2e2; }
.content-box-grey { background-color: #F5F5F5; }
.content-box-army { background-color: #737a36; }
.content-box-green { background-color: #d9edc2; }
.content-box-purple { background-color: #e2e2f9; }
.content-box-red { background-color: #ffcccc; }
.content-box-yellow { background-color: #fef5c4; }
.content-box-blue .remark-inline-code, .content-box-gray .remark-inline-code, .content-box-grey .remark-inline-code, .content-box-army .remark-inline-code, .content-box-green .remark-inline-code, .content-box-purple .remark-inline-code, .content-box-red .remark-inline-code, .content-box-yellow .remark-inline-code { background: none; }
.full-width { display: flex; width: 100%; flex: 1 1 auto; }
</style>

<style type="text/css">
blockquote, .blockquote {
  display: block;
  margin-top: 0.1em;
  margin-bottom: 0.2em;
  margin-left: 5px;
  margin-right: 5px;
  border-left: solid 10px #0148A4;
  border-top: solid 2px #0148A4;
  border-bottom: solid 2px #0148A4;
  border-right: solid 2px #0148A4;
  box-shadow: 0 0 6px rgba(0,0,0,0.5);
  /* background-color: #e64626; */
  color: #e64626;
  padding: 0.5em;
  -moz-border-radius: 5px;
  -webkit-border-radius: 5px;
}
.blockquote p { margin-top: 0px; margin-bottom: 5px; }
.blockquote > h1:first-of-type { margin-top: 0px; margin-bottom: 5px; }
.blockquote > h2:first-of-type { margin-top: 0px; margin-bottom: 5px; }
.blockquote > h3:first-of-type { margin-top: 0px; margin-bottom: 5px; }
.blockquote > h4:first-of-type { margin-top: 0px; margin-bottom: 5px; }
.text-shadow { text-shadow: 0 0 4px #424242; }
</style>

<style type="text/css">
/******************
 * Slide scrolling
 * (non-functional; not sure if it is a good idea anyway)
slides > slide { overflow: scroll; padding: 5px 40px; }
.scrollable-slide .remark-slide { height: 400px; overflow: scroll !important; }
******************/
.scroll-box-8 { height: 8em; overflow-y: scroll; }
.scroll-box-10 { height: 10em; overflow-y: scroll; }
.scroll-box-12 { height: 12em; overflow-y: scroll; }
.scroll-box-14 { height: 14em; overflow-y: scroll; }
.scroll-box-16 { height: 16em; overflow-y: scroll; }
.scroll-box-18 { height: 18em; overflow-y: scroll; }
.scroll-box-20 { height: 20em; overflow-y: scroll; }
.scroll-box-24 { height: 24em; overflow-y: scroll; }
.scroll-box-30 { height: 30em; overflow-y: scroll; }
.scroll-output { height: 90%; overflow-y: scroll; }
</style>

$$
\def\sumten{\sum_{i=1}^{10}}
$$

$$
\def\sumn{\sum_{i=1}^{n}}
$$

# Outline

1. [Introduction](#mvr)
2. [FWL theorem](#fwl)
3. [Small Sample Properties](#ssp)

---
class: inverse, center, middle
name: mvr

# Multivariate Regression: Introduction

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html>

---
class: middle

# Univariate vs Multivariate Regression Models

.content-box-green[**Univariate**]

The most important assumption `\(E[u|x] = 0\)` (zero conditional mean) is almost always violated (unless your data come from randomized experiments) because all the other variables are sitting in the error term, which can be correlated with `\(x\)`.

.content-box-green[**Multivariate**]

More independent variables mean fewer factors left in the error term, which makes the endogeneity problem <span style = "color: blue;">less</span> severe.

---
class: middle

.content-box-green[**Bi-variate vs. Uni-variate**]

`\begin{aligned} \mbox{Bi-variate:}\;\; wage = & \beta_0 + \beta_1 educ + \beta_2 exper + u_2 \\ \mbox{Uni-variate:}\;\; wage = & \beta_0 + \beta_1 educ + u_1 \;(=u_2+\beta_2 exper) \end{aligned}`

.content-box-green[**What's different?**]

+ **bi-variate**: able to measure the effect of education on wage, <span style = "color: blue;">holding experience fixed</span>, because experience is modeled explicitly (<span style = "color: red;">we say `\(exper\)` is controlled for</span>)
+ **uni-variate**: `\(\hat{\beta_1}\)` is biased unless experience is uncorrelated with education, because experience is left in the error term (see the simulation on the next slide)
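---
class: middle

To see this bias concretely, here is a minimal simulation sketch. All the numbers (the coefficient values, the correlation between `educ` and `exper`, the sample size, the seed) are made up for illustration:

```r
#--- generate data where educ and exper are correlated ---#
library(fixest)
set.seed(123) # hypothetical seed for reproducibility

N <- 10000 # sample size
educ <- rnorm(N) # education
exper <- 0.6 * educ + rnorm(N) # experience, correlated with education
u <- rnorm(N) # error, independent of educ and exper
wage <- 1 + 0.5 * educ + 0.3 * exper + u # true effect of educ is 0.5

data <- data.frame(wage = wage, educ = educ, exper = exper)

#--- bi-variate: coefficient on educ is close to the true 0.5 ---#
feols(wage ~ educ + exper, data = data)

#--- uni-variate: coefficient on educ absorbs part of the exper effect ---#
feols(wage ~ educ, data = data)
```

Under this data generating process, the uni-variate estimate centers around `\(0.5 + 0.3 \times 0.6 = 0.68\)` rather than the true `\(0.5\)`.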
---
class: middle

.content-box-green[**Another example**]

The impact of per-student spending (`expend`) on standardized test scores (`avgscore`) at the high school level:

`\begin{aligned} \mbox{Uni-variate:}\;\; avgscore= & \beta_0+\beta_1 expend + u_1 \;(=u_2+\beta_2 avginc) \notag \\ \mbox{Bi-variate:}\;\; avgscore= & \beta_0+\beta_1 expend +\beta_2 avginc + u_2 \notag \end{aligned}`

---
class: middle

# Model with two independent variables

More generally,

`\begin{aligned} y=\beta_0+\beta_1 x_1 + \beta_2 x_2 + u \end{aligned}`

+ `\(\beta_0\)`: intercept
+ `\(\beta_1\)`: measures the change in `\(y\)` with respect to `\(x_1\)`, holding other factors fixed
+ `\(\beta_2\)`: measures the change in `\(y\)` with respect to `\(x_2\)`, holding other factors fixed

---
class: middle

# The Crucial Condition (Assumption) for Unbiasedness of the OLS Estimator

.content-box-green[**Uni-variate**]

For `\(y = \beta_0 + \beta_1x + u\)`, `\(E[u|x]=0\)`

<br>

.content-box-green[**Bi-variate**]

For `\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u\)`,

+ Mathematically: `\(E[u|x_1,x_2]=0\)`
+ Verbally: for any values of `\(x_1\)` and `\(x_2\)`, the expected value of the unobservables is zero

---
class: middle

.content-box-green[**Mean independence condition: example**]

In the following wage model,

`\begin{aligned} wage = & \beta_0 + \beta_1 educ + \beta_2 exper + u \end{aligned}`

the mean independence condition is

`\begin{aligned} E[u|educ,exper]=0 \end{aligned}`

**Verbally**: this condition would be satisfied if the innate ability of workers is on average unrelated to education level and experience.

---
class: middle

# The model with `\(k\)` independent variables

.content-box-green[**Model**]

`\begin{aligned} y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u \end{aligned}`

.content-box-green[**Mean independence assumption?**]

`\(\beta_{OLS}\)` (the OLS estimators of the `\(\beta\)`s) are unbiased if

`\begin{aligned} E[u|x_1,x_2,\dots,x_k]=0 \end{aligned}`

**Verbally**: this condition would be satisfied if the error term is, on average, unrelated to any of the independent variables, `\(x_1,x_2,\dots,x_k\)` (the sketch on the next slide illustrates a violation).
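---
class: middle

Inside a simulation we observe `\(u\)` directly, so we can check whether the mean independence condition holds. A minimal sketch (the `ability` variable, all coefficient values, and the seed are invented for illustration):

```r
#--- wage model where unobserved ability sits in the error term ---#
set.seed(456) # hypothetical seed

N <- 10000
ability <- rnorm(N) # unobserved ability
educ <- 0.7 * ability + rnorm(N) # education is correlated with ability
u <- 0.5 * ability + rnorm(N) # ability is part of the error term

#--- E[u | educ] is not zero for all values of educ ---#
mean(u[educ > 1]) # positive on average among the highly educated
mean(u[educ < -1]) # negative on average among the less educated
```

Because the conditional mean of `\(u\)` moves with `\(educ\)`, `\(E[u|educ,exper]=0\)` fails here, and the OLS estimator of the coefficient on `\(educ\)` would be biased.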
---
class: middle

# Deriving OLS estimators

.content-box-green[**OLS**]

Find the combination of `\(\beta\)`s that minimizes the sum of squared residuals

.content-box-green[**So,**]

Denoting the collection of `\(\hat{\beta}\)`s as `\(\hat{\theta} (=\{\hat{\beta_0},\hat{\beta_1},\dots,\hat{\beta_k}\})\)`,

`\begin{aligned} Min_{\hat{\theta}} \sum_{i=1}^n \Big[ y_i-(\hat{\beta_0}+\hat{\beta_1} x_{1,i} + \hat{\beta_2} x_{2,i} + \dots + \hat{\beta_k} x_{k,i}) \Big]^2 \end{aligned}`

---
class: middle

Find the FOCs by partially differentiating the objective function (the sum of squared residuals) with respect to each element of `\(\hat{\theta} (=\{\hat{\beta_0},\hat{\beta_1},\dots,\hat{\beta_k}\})\)`:

`\begin{aligned} \sum_{i=1}^n \Big[ y_i-(\hat{\beta_0}+\hat{\beta_1} x_{1,i} + \hat{\beta_2} x_{2,i} + \dots + \hat{\beta_k} x_{k,i}) \Big] = & 0 \;\; (\hat{\beta}_0) \\ \sum_{i=1}^n x_{1,i}\Big[ y_i-(\hat{\beta_0}+\hat{\beta_1} x_{1,i} + \hat{\beta_2} x_{2,i} + \dots + \hat{\beta_k} x_{k,i}) \Big]= & 0 \;\; (\hat{\beta}_1) \\ \sum_{i=1}^n x_{2,i}\Big[ y_i-(\hat{\beta_0}+\hat{\beta_1} x_{1,i} + \hat{\beta_2} x_{2,i} + \dots + \hat{\beta_k} x_{k,i}) \Big]= & 0 \;\; (\hat{\beta}_2) \\ \vdots \\ \sum_{i=1}^n x_{k,i}\Big[ y_i-(\hat{\beta_0}+\hat{\beta_1} x_{1,i} + \hat{\beta_2} x_{2,i} + \dots + \hat{\beta_k} x_{k,i}) \Big]= & 0 \;\; (\hat{\beta}_k) \end{aligned}`

---
class: middle

Or more succinctly,

`\begin{aligned} \sum_{i=1}^n \hat{u}_i = & 0 \;\; (\hat{\beta}_0) \\ \sum_{i=1}^n x_{1,i}\hat{u}_i = & 0 \;\; (\hat{\beta}_1) \\ \sum_{i=1}^n x_{2,i}\hat{u}_i = & 0 \;\; (\hat{\beta}_2) \\ \vdots \\ \sum_{i=1}^n x_{k,i}\hat{u}_i = & 0 \;\; (\hat{\beta}_k) \end{aligned}`

---
class: middle

# Implementation of multivariate OLS

.content-box-green[**R code: Implementation in R**]

```r
#--- load the fixest package ---#
library(fixest)

#--- generate data ---#
N <- 100 # sample size
x1 <- rnorm(N) # independent variable
x2 <- rnorm(N) # independent variable
u <- rnorm(N) # error
y <- 1 + x1 + x2 + u # dependent variable
data <- data.frame(y = y, x1 = x1, x2 = x2)

#--- OLS ---#
reg <- feols(y ~ x1 + x2, data = data)

#* print the results
reg
```

```
## OLS estimation, Dep. Var.: y
## Observations: 100
## Standard-errors: IID
##             Estimate Std. Error  t value   Pr(>|t|)
## (Intercept) 1.006640   0.096153 10.46911  < 2.2e-16 ***
## x1          0.931657   0.095026  9.80420 3.5494e-16 ***
## x2          1.177465   0.097780 12.04197  < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.946416   Adj. R2: 0.713892
```
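---
class: middle

The succinct FOCs we just derived say that the residuals sum to zero and are orthogonal to every regressor. We can confirm this numerically with the `reg` and `data` objects from the implementation slide:

```r
#--- residuals from the regression above ---#
u_hat <- resid(reg)

#--- each FOC holds up to numerical precision (all sums are essentially 0) ---#
sum(u_hat) # FOC for the intercept
sum(data$x1 * u_hat) # FOC for beta_1
sum(data$x2 * u_hat) # FOC for beta_2
```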
---
class: middle

<!-- # Presenting regression results -->

When you are asked to present regression results in assignments or your final paper, use the `msummary()` function from the `modelsummary` package.

.left5[
.content-box-green[**Example**]

```r
#* load the package (install it if you have not)
library(modelsummary)

#* run regression
reg_results <- feols(speed ~ dist, data = cars)

#* report regression table
msummary(
  reg_results,
  # keep these options as they are
  stars = TRUE,
  gof_omit = "IC|Log|Adj|F|Pseudo|Within"
)
```
]

.right5[
<table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table">
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:center;"> Model 1 </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:center;"> 8.284*** </td>
  </tr>
  <tr>
   <td style="text-align:left;"> </td>
   <td style="text-align:center;"> (0.874) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> dist </td>
   <td style="text-align:center;"> 0.166*** </td>
  </tr>
  <tr>
   <td style="text-align:left;box-shadow: 0px 1px"> </td>
   <td style="text-align:center;box-shadow: 0px 1px"> (0.017) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Num.Obs. </td>
   <td style="text-align:center;"> 50 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> R2 </td>
   <td style="text-align:center;"> 0.651 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Std.Errors </td>
   <td style="text-align:center;"> IID </td>
  </tr>
 </tbody>
 <tfoot><tr><td style="padding: 0; " colspan="100%">
 <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot>
</table>
]

---
class: inverse, center, middle
name: fwl

# Frisch–Waugh–Lovell Theorem

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html>

---
class: middle

Consider the following simple model,

`\begin{aligned} y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \beta_3 x_{3,i} + u_i \end{aligned}`

Suppose you are interested in estimating only `\(\beta_1\)`.

---
class: middle

Let's consider the following two methods.

.content-box-green[**Method 1: Regular OLS**]

Regress `\(y\)` on `\(x_1\)`, `\(x_2\)`, and `\(x_3\)` with an intercept to estimate `\(\beta_0\)`, `\(\beta_1\)`, `\(\beta_2\)`, and `\(\beta_3\)` at the same time (just like you normally do)

.content-box-green[**Method 2: 3-step**]

+ regress `\(y\)` on `\(x_2\)` and `\(x_3\)` with an intercept and get the residuals, which we call `\(\hat{u}_y\)`
+ regress `\(x_1\)` on `\(x_2\)` and `\(x_3\)` with an intercept and get the residuals, which we call `\(\hat{u}_{x_1}\)`
+ regress `\(\hat{u}_y\)` on `\(\hat{u}_{x_1}\)` `\((\hat{u}_y=\alpha_1 \hat{u}_{x_1}+v)\)`

.content-box-green[**Frisch–Waugh–Lovell theorem**]

Methods 1 and 2 produce the same coefficient estimate on `\(x_1\)`:

`$$\hat{\beta_1} = \hat{\alpha_1}$$`
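---
class: middle

Here is a minimal sketch verifying the theorem on simulated data (the data generating process, coefficient values, and seed are invented for illustration):

```r
#--- simulate data ---#
library(fixest)
set.seed(789) # hypothetical seed

N <- 100
x1 <- rnorm(N)
x2 <- rnorm(N)
x3 <- rnorm(N)
y <- 1 + x1 + x2 + x3 + rnorm(N)
data_fwl <- data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)

#--- Method 1: regular OLS ---#
coef(feols(y ~ x1 + x2 + x3, data = data_fwl))["x1"]

#--- Method 2: 3-step ---#
u_y <- resid(feols(y ~ x2 + x3, data = data_fwl)) # step 1
u_x1 <- resid(feols(x1 ~ x2 + x3, data = data_fwl)) # step 2
coef(feols(u_y ~ u_x1, data = data.frame(u_y = u_y, u_x1 = u_x1)))["u_x1"] # step 3
```

The two `x1` coefficients are identical. (Keeping the intercept in step 3 is harmless here: both sets of residuals have mean zero, so the estimated intercept is zero and the slope is unchanged.)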
---
class: middle

# Partialing out Interpretation from Method 2

.content-box-green[**Step 1**]

Regress `\(y\)` on `\(x_2\)` and `\(x_3\)` with an intercept and get the residuals, which we call `\(\hat{u}_y\)`

+ `\(\hat{u}_y\)` is void of the impact of `\(x_2\)` and `\(x_3\)` on `\(y\)`

.content-box-green[**Step 2**]

Regress `\(x_1\)` on `\(x_2\)` and `\(x_3\)` with an intercept and get the residuals, which we call `\(\hat{u}_{x_1}\)`

+ `\(\hat{u}_{x_1}\)` is void of the impact of `\(x_2\)` and `\(x_3\)` on `\(x_1\)`

.content-box-green[**Step 3**]

Regress `\(\hat{u}_y\)` on `\(\hat{u}_{x_1}\)`, which produces an estimate of `\(\beta_1\)` that is identical to the one you get from regressing `\(y\)` on `\(x_1\)`, `\(x_2\)`, and `\(x_3\)`

---
class: middle

# Interpretation

+ Regressing `\(y\)` on all the explanatory variables `\((x_1\)`, `\(x_2\)`, and `\(x_3)\)` in a multivariate regression is as if you are looking at the impact of a single explanatory variable with the effects of all the other variables partialed out
+ In other words, including variables beyond your variable of interest lets you <span style = "color: red;">control for (remove the effect of)</span> other variables, avoiding confusing the impact of the variable of interest with the impact of other variables.

---
class: inverse, center, middle
name: ssp

# Small Sample Properties of OLS Estimators

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html>

---
class: middle

.content-box-green[**Unbiasedness of OLS Estimator**]

OLS estimators of multivariate models are unbiased under <span style = "color: blue;">certain</span> conditions

---
class: middle

.content-box-green[**Condition 1**]

Your model is correct (Assumption `\(MLR.1\)`)

<br>

.content-box-green[**Condition 2**]

Random sampling (Assumption `\(MLR.2\)`)

<br>

.content-box-green[**Condition 3**]

No perfect collinearity (Assumption `\(MLR.3\)`)

---
class: middle

# Perfect Collinearity

.content-box-green[**No Perfect Collinearity**]

No independent variable can be written as a linear function of the other independent variables

.content-box-green[**Example (silly)**]

`\begin{aligned} wage = \beta_0 + \beta_1 educ + \beta_2 (3\times educ) + u \end{aligned}`

(<span style = "color: blue;">More on this later when we talk about dummy variables</span>)

---
class: middle

.content-box-red[**Zero Conditional Mean**]

`\begin{aligned} E[u|x_1,x_2,\dots,x_k]=0 \;\;\mbox{(Assumption MLR.4)} \end{aligned}`

---
class: middle

.content-box-green[**Unbiasedness of OLS estimators**]

If all the conditions `\(MLR.1\)` through `\(MLR.4\)` are satisfied, the OLS estimators are unbiased:

`\begin{aligned} E[\hat{\beta}_j]=\beta_j \;\; \forall j=0,1,\dots,k \end{aligned}`
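---
class: middle

Unbiasedness is a statement about the average of `\(\hat{\beta}_j\)` across repeated samples, which a small Monte Carlo sketch can illustrate (the data generating process is invented and satisfies `\(MLR.1\)` through `\(MLR.4\)`):

```r
set.seed(246) # hypothetical seed

#--- estimate beta_1 (true value 0.5) on 1000 independent samples ---#
b1_hat <- replicate(1000, {
  N <- 100
  x1 <- rnorm(N)
  x2 <- rnorm(N)
  y <- 1 + 0.5 * x1 + 2 * x2 + rnorm(N)
  coef(lm(y ~ x1 + x2))["x1"]
})

mean(b1_hat) # very close to the true 0.5
```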
---
class: middle

.content-box-green[**Endogeneity (Definition)**]

`$$E[u|x_1,x_2,\dots,x_k] = f(x_1,x_2,\dots,x_k) \ne 0$$`

.content-box-green[**What could cause an endogeneity problem?**]

+ functional form misspecification

`\begin{aligned} wage = & \beta_0 + \beta_1 log(x_1) + \beta_2 x_2 + u_1 \;\;\mbox{(true)}\\ wage = & \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u_2 \;(=u_1+\beta_1[log(x_1)-x_1]) \;\; \mbox{(yours)} \end{aligned}`

+ omission of variables that are correlated with any of `\(x_1,x_2,\dots,x_k\)` (<span style = "color: blue;">more on this soon</span>)
+ <span style = "color: blue;">other sources of endogeneity later</span>

---
class: middle

# Variance of the OLS estimators

.content-box-green[**Homoskedasticity**]

`\begin{aligned} Var(u|x_1,\dots,x_k)=\sigma^2 \;\;\mbox{(Assumption MLR.5)} \end{aligned}`

<br>

.content-box-green[**Variance of the OLS estimator**]

Under conditions `\(MLR.1\)` through `\(MLR.5\)`, conditional on the sample values of the independent variables,

`\begin{aligned} Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}, \end{aligned}`

where `\(SST_j= \sum_{i=1}^n (x_{j,i}-\bar{x}_j)^2\)` and `\(R_j^2\)` is the R-squared from regressing `\(x_j\)` on all the other independent variables including an intercept. (<span style = "color: blue;">We will revisit this equation</span>)

---
class: middle

# Estimating `\(\sigma^2\)`

Just like uni-variate regression, you need to estimate `\(\sigma^2\)` if you want to estimate the variance (and standard deviation) of the OLS estimators.

.content-box-green[**uni-variate regression**]

`\begin{aligned} \hat{\sigma}^2=\sum_{i=1}^n \frac{\hat{u}_i^2}{n-2} \end{aligned}`

.content-box-green[**multi-variate regression**]

For a model with `\(k\)` independent variables and an intercept,

`\begin{aligned} \hat{\sigma}^2=\sum_{i=1}^n \frac{\hat{u}_i^2}{n-(k+1)} \end{aligned}`

You solved `\(k+1\)` simultaneous equations (the FOCs) to get `\(\hat{\beta}_j\)` `\((j=0,\dots,k)\)`. So, once you know the value of `\(n-k-1\)` of the residuals, you know the rest.

---
class: middle

The <span style = "color: red;">estimator</span> of the variance of the OLS estimator is therefore

`\begin{aligned} \widehat{Var(\hat{\beta}_j)} = \frac{\hat{\sigma}^2}{SST_j(1-R^2_j)} \end{aligned}`
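---
class: middle

These two formulas can be computed by hand and checked against the standard errors that `feols()` reports. A minimal self-contained sketch with simulated data and `\(k = 2\)` regressors (all values invented for illustration):

```r
library(fixest)
set.seed(135) # hypothetical seed

#--- simulate and estimate a model with k = 2 regressors ---#
N <- 100
x1 <- rnorm(N)
x2 <- rnorm(N)
y <- 1 + x1 + x2 + rnorm(N)
data <- data.frame(y = y, x1 = x1, x2 = x2)
reg <- feols(y ~ x1 + x2, data = data)

#--- sigma^2 hat: sum of squared residuals over n - (k + 1) ---#
sigma2_hat <- sum(resid(reg)^2) / (N - 3)

#--- by-hand variance of beta_1 hat: sigma^2 hat / (SST_1 * (1 - R^2_1)) ---#
SST_1 <- sum((x1 - mean(x1))^2)
R2_1 <- r2(feols(x1 ~ x2, data = data), "r2") # R-squared from regressing x1 on x2
sqrt(sigma2_hat / (SST_1 * (1 - R2_1))) # should match se(reg)["x1"]
```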