class: center, middle, inverse, title-slide

# Multivariate Regression

### AECN 896-002

---
class: middle

<style type="text/css">
@media print {
  .has-continuation { display: block !important; }
}
.remark-slide-content.hljs-github h1 { margin-top: 5px; margin-bottom: 25px; }
.remark-slide-content.hljs-github { padding-top: 10px; padding-left: 30px; padding-right: 30px; }
.panel-tabs { /* color: #062A00; */ color: #841F27; margin-top: 0px; margin-bottom: 0px; margin-left: 0px; padding-bottom: 0px; }
.panel-tab { margin-top: 0px; margin-bottom: 0px; margin-left: 3px; margin-right: 3px; padding-top: 0px; padding-bottom: 0px; }
.panelset .panel-tabs .panel-tab { min-height: 40px; }
.remark-slide th { border-bottom: 1px solid #ddd; }
.remark-slide thead { border-bottom: 0px; }
.gt_footnote { padding: 2px; }
.remark-slide table { border-collapse: collapse; }
.remark-slide tbody { border-bottom: 2px solid #666; }
.important { background-color: lightpink; border: 2px solid blue; font-weight: bold; }
.remark-code { display: block; overflow-x: auto; padding: .5em; background: #ffe7e7; }
.hljs-github .hljs { background: #f2f2fd; }
.remark-inline-code { padding-top: 0px; padding-bottom: 0px; background-color: #e6e6e6; }
.r.hljs.remark-code.remark-inline-code { font-size: 0.9em; }
.left-full { width: 80%; height: 92%; float: left; }
.left-code { width: 38%; height: 92%; float: left; }
.right-plot { width: 60%; float: right; padding-left: 1%; }
.left5 { width: 49%; height: 92%; float: left; }
.right5 { width: 49%; float: right; padding-left: 1%; }
.left3 { width: 29%; height: 92%; float: left; }
.right7 { width: 69%; float: right; padding-left: 1%; }
.left4 { width: 38%; height: 92%; float: left; }
.right6 { width: 60%; float: right; padding-left: 1%; }
ul li { margin: 7px; }
ul, li { margin-left: 15px; padding-left: 0px; }
ol li { margin: 7px; }
ol, li { margin-left: 15px; padding-left: 0px; }
</style>

<style type="text/css">
.content-box { box-sizing: border-box; background-color: #e2e2e2; }
.content-box-blue, .content-box-gray, .content-box-grey, .content-box-army, .content-box-green, .content-box-purple, .content-box-red, .content-box-yellow { box-sizing: border-box; border-radius: 5px; margin: 0 0 10px; overflow: hidden; padding: 0px 5px 0px 5px; width: 100%; }
.content-box-blue { background-color: #F0F8FF; }
.content-box-gray { background-color: #e2e2e2; }
.content-box-grey { background-color: #F5F5F5; }
.content-box-army { background-color: #737a36; }
.content-box-green { background-color: #d9edc2; }
.content-box-purple { background-color: #e2e2f9; }
.content-box-red { background-color: #ffcccc; }
.content-box-yellow { background-color: #fef5c4; }
.content-box-blue .remark-inline-code, .content-box-gray .remark-inline-code, .content-box-grey .remark-inline-code, .content-box-army .remark-inline-code, .content-box-green .remark-inline-code, .content-box-purple .remark-inline-code, .content-box-red .remark-inline-code, .content-box-yellow .remark-inline-code { background: none; }
.full-width { display: flex; width: 100%; flex: 1 1 auto; }
</style>

<style type="text/css">
blockquote, .blockquote {
  display: block;
  margin-top: 0.1em;
  margin-bottom: 0.2em;
  margin-left: 5px;
  margin-right: 5px;
  border-left: solid 10px #0148A4;
  border-top: solid 2px #0148A4;
  border-bottom: solid 2px #0148A4;
  border-right: solid 2px #0148A4;
  box-shadow: 0 0 6px rgba(0,0,0,0.5);
  /* background-color: #e64626; */
  color: #e64626;
  padding: 0.5em;
  -moz-border-radius: 5px;
  -webkit-border-radius: 5px;
}
.blockquote p { margin-top: 0px; margin-bottom: 5px; }
.blockquote > h1:first-of-type { margin-top: 0px; margin-bottom: 5px; }
.blockquote > h2:first-of-type { margin-top: 0px; margin-bottom: 5px; }
.blockquote > h3:first-of-type { margin-top: 0px; margin-bottom: 5px; }
.blockquote > h4:first-of-type { margin-top: 0px; margin-bottom: 5px; }
.text-shadow { text-shadow: 0 0 4px #424242; }
</style>

<style type="text/css">
/******************
 * Slide scrolling
 * (non-functional; not sure if it is a good idea anyway)
slides > slide { overflow: scroll; padding: 5px 40px; }
.scrollable-slide .remark-slide { height: 400px; overflow: scroll !important; }
******************/
.scroll-box-8 { height: 8em; overflow-y: scroll; }
.scroll-box-10 { height: 10em; overflow-y: scroll; }
.scroll-box-12 { height: 12em; overflow-y: scroll; }
.scroll-box-14 { height: 14em; overflow-y: scroll; }
.scroll-box-16 { height: 16em; overflow-y: scroll; }
.scroll-box-18 { height: 18em; overflow-y: scroll; }
.scroll-box-20 { height: 20em; overflow-y: scroll; }
.scroll-box-24 { height: 24em; overflow-y: scroll; }
.scroll-box-30 { height: 30em; overflow-y: scroll; }
.scroll-output { height: 90%; overflow-y: scroll; }
</style>

$$
\def\sumten{\sum_{i=1}^{10}}
$$

$$
\def\sumn{\sum_{i=1}^{n}}
$$

# Outline

1. [Introduction](#mvr)
2. [FWL theorem](#fwl)
3. [Small Sample Properties](#ssp)

---
class: inverse, center, middle
name: mvr

# Multivariate Regression: Introduction

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html>

---
class: middle

# Univariate vs Multivariate Regression Models

.content-box-green[**Univariate**]

The most important assumption `\(E[u|x] = 0\)` (zero conditional mean) is almost always violated (unless your data come from randomized experiments) because all the other variables are sitting in the error term, which can be correlated with `\(x\)`.

.content-box-green[**Multivariate**]

More independent variables mean fewer factors left in the error term, which makes the endogeneity problem <span style = "color: blue;">less</span> severe.

---
class: middle

.content-box-green[**Bi-variate vs. Uni-variate**]

`\begin{aligned} \mbox{Bi-variate:}\;\; wage = & \beta_0 + \beta_1 educ + \beta_2 exper + u_2 \\ \mbox{Uni-variate:}\;\; wage = & \beta_0 + \beta_1 educ + u_1 \;(=u_2+\beta_2 exper) \end{aligned}`

.content-box-green[**What's different?**]

+ **bi-variate**: able to measure the effect of education on wage, <span style = "color: blue;">holding experience fixed</span>, because experience is modeled explicitly (<span style = "color: red;">we say `\(exper\)` is controlled for</span>)
+ **uni-variate**: `\(\hat{\beta_1}\)` is biased unless experience is uncorrelated with education, because experience is left in the error term (see the simulation on the next slide)
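---
class: middle

To see this bias concretely, here is a minimal simulation sketch. All the numbers (the coefficient values, the correlation between `educ` and `exper`, the sample size, the seed) are made up for illustration:

```r
#--- generate data where educ and exper are correlated ---#
library(fixest)
set.seed(123) # hypothetical seed for reproducibility

N <- 10000 # sample size
educ <- rnorm(N) # education
exper <- 0.6 * educ + rnorm(N) # experience, correlated with education
u <- rnorm(N) # error, independent of educ and exper
wage <- 1 + 0.5 * educ + 0.3 * exper + u # true effect of educ is 0.5

data <- data.frame(wage = wage, educ = educ, exper = exper)

#--- bi-variate: coefficient on educ is close to the true 0.5 ---#
feols(wage ~ educ + exper, data = data)

#--- uni-variate: coefficient on educ absorbs part of the exper effect ---#
feols(wage ~ educ, data = data)
```

Under this data generating process, the uni-variate estimate centers around `\(0.5 + 0.3 \times 0.6 = 0.68\)` rather than the true `\(0.5\)`.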
---
class: middle

.content-box-green[**Another example**]

The impact of per-student spending (`expend`) on standardized test scores (`avgscore`) at the high school level:

`\begin{aligned} \mbox{Uni-variate:}\;\; avgscore= & \beta_0+\beta_1 expend + u_1 \;(=u_2+\beta_2 avginc) \notag \\ \mbox{Bi-variate:}\;\; avgscore= & \beta_0+\beta_1 expend +\beta_2 avginc + u_2 \notag \end{aligned}`

---
class: middle

# Model with two independent variables

More generally,

`\begin{aligned} y=\beta_0+\beta_1 x_1 + \beta_2 x_2 + u \end{aligned}`

+ `\(\beta_0\)`: intercept
+ `\(\beta_1\)`: measures the change in `\(y\)` with respect to `\(x_1\)`, holding other factors fixed
+ `\(\beta_2\)`: measures the change in `\(y\)` with respect to `\(x_2\)`, holding other factors fixed

---
class: middle

# The Crucial Condition (Assumption) for Unbiasedness of the OLS Estimator

.content-box-green[**Uni-variate**]

For `\(y = \beta_0 + \beta_1x + u\)`, `\(E[u|x]=0\)`

<br>

.content-box-green[**Bi-variate**]

For `\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u\)`,

+ Mathematically: `\(E[u|x_1,x_2]=0\)`
+ Verbally: for any values of `\(x_1\)` and `\(x_2\)`, the expected value of the unobservables is zero

---
class: middle

.content-box-green[**Mean independence condition: example**]

In the following wage model,

`\begin{aligned} wage = & \beta_0 + \beta_1 educ + \beta_2 exper + u \end{aligned}`

the mean independence condition is

`\begin{aligned} E[u|educ,exper]=0 \end{aligned}`

**Verbally**: this condition would be satisfied if the innate ability of workers is on average unrelated to education level and experience.

---
class: middle

# The model with `\(k\)` independent variables

.content-box-green[**Model**]

`\begin{aligned} y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u \end{aligned}`

.content-box-green[**Mean independence assumption?**]

`\(\beta_{OLS}\)` (the OLS estimators of the `\(\beta\)`s) are unbiased if

`\begin{aligned} E[u|x_1,x_2,\dots,x_k]=0 \end{aligned}`

**Verbally**: this condition would be satisfied if the error term is, on average, unrelated to any of the independent variables, `\(x_1,x_2,\dots,x_k\)` (the sketch on the next slide illustrates a violation).
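---
class: middle

Inside a simulation we observe `\(u\)` directly, so we can check whether the mean independence condition holds. A minimal sketch (the `ability` variable, all coefficient values, and the seed are invented for illustration):

```r
#--- wage model where unobserved ability sits in the error term ---#
set.seed(456) # hypothetical seed

N <- 10000
ability <- rnorm(N) # unobserved ability
educ <- 0.7 * ability + rnorm(N) # education is correlated with ability
u <- 0.5 * ability + rnorm(N) # ability is part of the error term

#--- E[u | educ] is not zero for all values of educ ---#
mean(u[educ > 1]) # positive on average among the highly educated
mean(u[educ < -1]) # negative on average among the less educated
```

Because the conditional mean of `\(u\)` moves with `\(educ\)`, `\(E[u|educ,exper]=0\)` fails here, and the OLS estimator of the coefficient on `\(educ\)` would be biased.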
---
class: middle

# Deriving OLS estimators

.content-box-green[**OLS**]

Find the combination of `\(\beta\)`s that minimizes the sum of squared residuals

.content-box-green[**So,**]

Denoting the collection of `\(\hat{\beta}\)`s as `\(\hat{\theta} (=\{\hat{\beta_0},\hat{\beta_1},\dots,\hat{\beta_k}\})\)`,

`\begin{aligned} Min_{\hat{\theta}} \sum_{i=1}^n \Big[ y_i-(\hat{\beta_0}+\hat{\beta_1} x_{1,i} + \hat{\beta_2} x_{2,i} + \dots + \hat{\beta_k} x_{k,i}) \Big]^2 \end{aligned}`

---
class: middle

Find the FOCs by partially differentiating the objective function (the sum of squared residuals) with respect to each element of `\(\hat{\theta} (=\{\hat{\beta_0},\hat{\beta_1},\dots,\hat{\beta_k}\})\)`:

`\begin{aligned} \sum_{i=1}^n \Big[ y_i-(\hat{\beta_0}+\hat{\beta_1} x_{1,i} + \hat{\beta_2} x_{2,i} + \dots + \hat{\beta_k} x_{k,i}) \Big] = & 0 \;\; (\hat{\beta}_0) \\ \sum_{i=1}^n x_{1,i}\Big[ y_i-(\hat{\beta_0}+\hat{\beta_1} x_{1,i} + \hat{\beta_2} x_{2,i} + \dots + \hat{\beta_k} x_{k,i}) \Big]= & 0 \;\; (\hat{\beta}_1) \\ \sum_{i=1}^n x_{2,i}\Big[ y_i-(\hat{\beta_0}+\hat{\beta_1} x_{1,i} + \hat{\beta_2} x_{2,i} + \dots + \hat{\beta_k} x_{k,i}) \Big]= & 0 \;\; (\hat{\beta}_2) \\ \vdots \\ \sum_{i=1}^n x_{k,i}\Big[ y_i-(\hat{\beta_0}+\hat{\beta_1} x_{1,i} + \hat{\beta_2} x_{2,i} + \dots + \hat{\beta_k} x_{k,i}) \Big]= & 0 \;\; (\hat{\beta}_k) \end{aligned}`

---
class: middle

Or more succinctly,

`\begin{aligned} \sum_{i=1}^n \hat{u}_i = & 0 \;\; (\hat{\beta}_0) \\ \sum_{i=1}^n x_{1,i}\hat{u}_i = & 0 \;\; (\hat{\beta}_1) \\ \sum_{i=1}^n x_{2,i}\hat{u}_i = & 0 \;\; (\hat{\beta}_2) \\ \vdots \\ \sum_{i=1}^n x_{k,i}\hat{u}_i = & 0 \;\; (\hat{\beta}_k) \end{aligned}`

---
class: middle

# Implementation of multivariate OLS

.content-box-green[**R code: Implementation in R**]

```r
#--- load the fixest package ---#
library(fixest)

#--- generate data ---#
N <- 100 # sample size
x1 <- rnorm(N) # independent variable
x2 <- rnorm(N) # independent variable
u <- rnorm(N) # error
y <- 1 + x1 + x2 + u # dependent variable
data <- data.frame(y = y, x1 = x1, x2 = x2)

#--- OLS ---#
reg <- feols(y ~ x1 + x2, data = data)

#* print the results
reg
```

```
## OLS estimation, Dep. Var.: y
## Observations: 100
## Standard-errors: IID
##             Estimate Std. Error  t value   Pr(>|t|)
## (Intercept) 1.006640   0.096153 10.46911  < 2.2e-16 ***
## x1          0.931657   0.095026  9.80420 3.5494e-16 ***
## x2          1.177465   0.097780 12.04197  < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.946416   Adj. R2: 0.713892
```
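---
class: middle

The succinct FOCs we just derived say that the residuals sum to zero and are orthogonal to every regressor. We can confirm this numerically with the `reg` and `data` objects from the implementation slide:

```r
#--- residuals from the regression above ---#
u_hat <- resid(reg)

#--- each FOC holds up to numerical precision (all sums are essentially 0) ---#
sum(u_hat) # FOC for the intercept
sum(data$x1 * u_hat) # FOC for beta_1
sum(data$x2 * u_hat) # FOC for beta_2
```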
---
class: middle

<!-- # Presenting regression results -->

When you are asked to present regression results in assignments or your final paper, use the `msummary()` function from the `modelsummary` package.

.left5[
.content-box-green[**Example**]

```r
#* load the package (install it if you have not)
library(modelsummary)

#* run regression
reg_results <- feols(speed ~ dist, data = cars)

#* report regression table
msummary(
  reg_results,
  # keep these options as they are
  stars = TRUE,
  gof_omit = "IC|Log|Adj|F|Pseudo|Within"
)
```
]

.right5[
<table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table">
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:center;"> Model 1 </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:center;"> 8.284*** </td>
  </tr>
  <tr>
   <td style="text-align:left;"> </td>
   <td style="text-align:center;"> (0.874) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> dist </td>
   <td style="text-align:center;"> 0.166*** </td>
  </tr>
  <tr>
   <td style="text-align:left;box-shadow: 0px 1px"> </td>
   <td style="text-align:center;box-shadow: 0px 1px"> (0.017) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Num.Obs. </td>
   <td style="text-align:center;"> 50 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> R2 </td>
   <td style="text-align:center;"> 0.651 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Std.Errors </td>
   <td style="text-align:center;"> IID </td>
  </tr>
 </tbody>
 <tfoot><tr><td style="padding: 0; " colspan="100%">
 <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot>
</table>
]

---
class: inverse, center, middle
name: fwl

# Frisch–Waugh–Lovell Theorem

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html>

---
class: middle

Consider the following simple model,

`\begin{aligned} y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \beta_3 x_{3,i} + u_i \end{aligned}`

Suppose you are interested in estimating only `\(\beta_1\)`.

---
class: middle

Let's consider the following two methods.

.content-box-green[**Method 1: Regular OLS**]

Regress `\(y\)` on `\(x_1\)`, `\(x_2\)`, and `\(x_3\)` with an intercept to estimate `\(\beta_0\)`, `\(\beta_1\)`, `\(\beta_2\)`, and `\(\beta_3\)` at the same time (just like you normally do)

.content-box-green[**Method 2: 3-step**]

+ regress `\(y\)` on `\(x_2\)` and `\(x_3\)` with an intercept and get the residuals, which we call `\(\hat{u}_y\)`
+ regress `\(x_1\)` on `\(x_2\)` and `\(x_3\)` with an intercept and get the residuals, which we call `\(\hat{u}_{x_1}\)`
+ regress `\(\hat{u}_y\)` on `\(\hat{u}_{x_1}\)` `\((\hat{u}_y=\alpha_1 \hat{u}_{x_1}+v)\)`

.content-box-green[**Frisch–Waugh–Lovell theorem**]

Methods 1 and 2 produce the same coefficient estimate on `\(x_1\)`:

`$$\hat{\beta_1} = \hat{\alpha_1}$$`
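---
class: middle

Here is a minimal sketch verifying the theorem on simulated data (the data generating process, coefficient values, and seed are invented for illustration):

```r
#--- simulate data ---#
library(fixest)
set.seed(789) # hypothetical seed

N <- 100
x1 <- rnorm(N)
x2 <- rnorm(N)
x3 <- rnorm(N)
y <- 1 + x1 + x2 + x3 + rnorm(N)
data_fwl <- data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)

#--- Method 1: regular OLS ---#
coef(feols(y ~ x1 + x2 + x3, data = data_fwl))["x1"]

#--- Method 2: 3-step ---#
u_y <- resid(feols(y ~ x2 + x3, data = data_fwl)) # step 1
u_x1 <- resid(feols(x1 ~ x2 + x3, data = data_fwl)) # step 2
coef(feols(u_y ~ u_x1, data = data.frame(u_y = u_y, u_x1 = u_x1)))["u_x1"] # step 3
```

The two `x1` coefficients are identical. (Keeping the intercept in step 3 is harmless here: both sets of residuals have mean zero, so the estimated intercept is zero and the slope is unchanged.)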
---
class: middle

# Partialing out Interpretation from Method 2

.content-box-green[**Step 1**]

Regress `\(y\)` on `\(x_2\)` and `\(x_3\)` with an intercept and get the residuals, which we call `\(\hat{u}_y\)`

+ `\(\hat{u}_y\)` is void of the impact of `\(x_2\)` and `\(x_3\)` on `\(y\)`

.content-box-green[**Step 2**]

Regress `\(x_1\)` on `\(x_2\)` and `\(x_3\)` with an intercept and get the residuals, which we call `\(\hat{u}_{x_1}\)`

+ `\(\hat{u}_{x_1}\)` is void of the impact of `\(x_2\)` and `\(x_3\)` on `\(x_1\)`

.content-box-green[**Step 3**]

Regress `\(\hat{u}_y\)` on `\(\hat{u}_{x_1}\)`, which produces an estimate of `\(\beta_1\)` that is identical to the one you get from regressing `\(y\)` on `\(x_1\)`, `\(x_2\)`, and `\(x_3\)`

---
class: middle

# Interpretation

+ Regressing `\(y\)` on all the explanatory variables `\((x_1\)`, `\(x_2\)`, and `\(x_3)\)` in a multivariate regression is as if you are looking at the impact of a single explanatory variable with the effects of all the other variables partialed out
+ In other words, including variables beyond your variable of interest lets you <span style = "color: red;">control for (remove the effect of)</span> other variables, avoiding confusing the impact of the variable of interest with the impact of other variables.

---
class: inverse, center, middle
name: ssp

# Small Sample Properties of OLS Estimators

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html>

---
class: middle

.content-box-green[**Unbiasedness of OLS Estimator**]

OLS estimators of multivariate models are unbiased under <span style = "color: blue;">certain</span> conditions

---
class: middle

.content-box-green[**Condition 1**]

Your model is correct (Assumption `\(MLR.1\)`)

<br>

.content-box-green[**Condition 2**]

Random sampling (Assumption `\(MLR.2\)`)

<br>

.content-box-green[**Condition 3**]

No perfect collinearity (Assumption `\(MLR.3\)`)

---
class: middle

# Perfect Collinearity

.content-box-green[**No Perfect Collinearity**]

No independent variable can be written as a linear function of the other independent variables

.content-box-green[**Example (silly)**]

`\begin{aligned} wage = \beta_0 + \beta_1 educ + \beta_2 (3\times educ) + u \end{aligned}`

(<span style = "color: blue;">More on this later when we talk about dummy variables</span>)

---
class: middle

.content-box-red[**Zero Conditional Mean**]

`\begin{aligned} E[u|x_1,x_2,\dots,x_k]=0 \;\;\mbox{(Assumption MLR.4)} \end{aligned}`

---
class: middle

.content-box-green[**Unbiasedness of OLS estimators**]

If all the conditions `\(MLR.1\)` through `\(MLR.4\)` are satisfied, the OLS estimators are unbiased:

`\begin{aligned} E[\hat{\beta}_j]=\beta_j \;\; \forall j=0,1,\dots,k \end{aligned}`
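---
class: middle

Unbiasedness is a statement about the average of `\(\hat{\beta}_j\)` across repeated samples, which a small Monte Carlo sketch can illustrate (the data generating process is invented and satisfies `\(MLR.1\)` through `\(MLR.4\)`):

```r
set.seed(246) # hypothetical seed

#--- estimate beta_1 (true value 0.5) on 1000 independent samples ---#
b1_hat <- replicate(1000, {
  N <- 100
  x1 <- rnorm(N)
  x2 <- rnorm(N)
  y <- 1 + 0.5 * x1 + 2 * x2 + rnorm(N)
  coef(lm(y ~ x1 + x2))["x1"]
})

mean(b1_hat) # very close to the true 0.5
```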
---
class: middle

.content-box-green[**Endogeneity (Definition)**]

`$$E[u|x_1,x_2,\dots,x_k] = f(x_1,x_2,\dots,x_k) \ne 0$$`

.content-box-green[**What could cause an endogeneity problem?**]

+ functional form misspecification

`\begin{aligned} wage = & \beta_0 + \beta_1 log(x_1) + \beta_2 x_2 + u_1 \;\;\mbox{(true)}\\ wage = & \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u_2 \;(=u_1+\beta_1[log(x_1)-x_1]) \;\; \mbox{(yours)} \end{aligned}`

+ omission of variables that are correlated with any of `\(x_1,x_2,\dots,x_k\)` (<span style = "color: blue;">more on this soon</span>)
+ <span style = "color: blue;">other sources of endogeneity later</span>

---
class: middle

# Variance of the OLS estimators

.content-box-green[**Homoskedasticity**]

`\begin{aligned} Var(u|x_1,\dots,x_k)=\sigma^2 \;\;\mbox{(Assumption MLR.5)} \end{aligned}`

<br>

.content-box-green[**Variance of the OLS estimator**]

Under conditions `\(MLR.1\)` through `\(MLR.5\)`, conditional on the sample values of the independent variables,

`\begin{aligned} Var(\hat{\beta}_j)= \frac{\sigma^2}{SST_j(1-R^2_j)}, \end{aligned}`

where `\(SST_j= \sum_{i=1}^n (x_{j,i}-\bar{x}_j)^2\)` and `\(R_j^2\)` is the R-squared from regressing `\(x_j\)` on all the other independent variables including an intercept. (<span style = "color: blue;">We will revisit this equation</span>)

---
class: middle

# Estimating `\(\sigma^2\)`

Just like uni-variate regression, you need to estimate `\(\sigma^2\)` if you want to estimate the variance (and standard deviation) of the OLS estimators.

.content-box-green[**uni-variate regression**]

`\begin{aligned} \hat{\sigma}^2=\sum_{i=1}^n \frac{\hat{u}_i^2}{n-2} \end{aligned}`

.content-box-green[**multi-variate regression**]

For a model with `\(k\)` independent variables and an intercept,

`\begin{aligned} \hat{\sigma}^2=\sum_{i=1}^n \frac{\hat{u}_i^2}{n-(k+1)} \end{aligned}`

You solved `\(k+1\)` simultaneous equations (the FOCs) to get `\(\hat{\beta}_j\)` `\((j=0,\dots,k)\)`. So, once you know the value of `\(n-k-1\)` of the residuals, you know the rest.

---
class: middle

The <span style = "color: red;">estimator</span> of the variance of the OLS estimator is therefore

`\begin{aligned} \widehat{Var(\hat{\beta}_j)} = \frac{\hat{\sigma}^2}{SST_j(1-R^2_j)} \end{aligned}`
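---
class: middle

These two formulas can be computed by hand and checked against the standard errors that `feols()` reports. A minimal self-contained sketch with simulated data and `\(k = 2\)` regressors (all values invented for illustration):

```r
library(fixest)
set.seed(135) # hypothetical seed

#--- simulate and estimate a model with k = 2 regressors ---#
N <- 100
x1 <- rnorm(N)
x2 <- rnorm(N)
y <- 1 + x1 + x2 + rnorm(N)
data <- data.frame(y = y, x1 = x1, x2 = x2)
reg <- feols(y ~ x1 + x2, data = data)

#--- sigma^2 hat: sum of squared residuals over n - (k + 1) ---#
sigma2_hat <- sum(resid(reg)^2) / (N - 3)

#--- by-hand variance of beta_1 hat: sigma^2 hat / (SST_1 * (1 - R^2_1)) ---#
SST_1 <- sum((x1 - mean(x1))^2)
R2_1 <- r2(feols(x1 ~ x2, data = data), "r2") # R-squared from regressing x1 on x2
sqrt(sigma2_hat / (SST_1 * (1 - R2_1))) # should match se(reg)["x1"]
```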