class: center, middle, inverse, title-slide .title[ # Econometric Modeling ] .author[ ### AECN 396/896-002 ] --- class: middle layout: true --- <style type="text/css"> @media print { .has-continuation { display: block !important; } } .remark-slide-content.hljs-github h1 { margin-top: 5px; margin-bottom: 25px; } .remark-slide-content.hljs-github { padding-top: 10px; padding-left: 30px; padding-right: 30px; } .panel-tabs { <!-- color: #062A00; --> color: #841F27; margin-top: 0px; margin-bottom: 0px; margin-left: 0px; padding-bottom: 0px; } .panel-tab { margin-top: 0px; margin-bottom: 0px; margin-left: 3px; margin-right: 3px; padding-top: 0px; padding-bottom: 0px; } .panelset .panel-tabs .panel-tab { min-height: 40px; } .remark-slide th { border-bottom: 1px solid #ddd; } .remark-slide thead { border-bottom: 0px; } .gt_footnote { padding: 2px; } .remark-slide table { border-collapse: collapse; } .remark-slide tbody { border-bottom: 2px solid #666; } .important { background-color: lightpink; border: 2px solid blue; font-weight: bold; } .remark-code { display: block; overflow-x: auto; padding: .5em; background: #ffe7e7; } .hljs-github .hljs { background: #f2f2fd; } .remark-inline-code { padding-top: 0px; padding-bottom: 0px; background-color: #e6e6e6; } .r.hljs.remark-code.remark-inline-code{ font-size: 0.9em } .left-full { width: 80%; height: 92%; float: left; } .left-code { width: 38%; height: 92%; float: left; } .right-plot { width: 60%; float: right; padding-left: 1%; } .left5 { width: 49%; height: 92%; float: left; } .right5 { width: 49%; float: right; padding-left: 1%; } .left3 { width: 29%; height: 92%; float: left; } .right7 { width: 69%; float: right; padding-left: 1%; } .left4 { width: 38%; height: 92%; float: left; } .right6 { width: 60%; float: right; padding-left: 1%; } ul li{ margin: 7px; } ul, li{ margin-left: 15px; padding-left: 0px; } ol li{ margin: 7px; } ol, li{ margin-left: 15px; padding-left: 0px; } </style> <style type="text/css"> .content-box { 
box-sizing: border-box; background-color: #e2e2e2; } .content-box-blue, .content-box-gray, .content-box-grey, .content-box-army, .content-box-green, .content-box-purple, .content-box-red, .content-box-yellow { box-sizing: border-box; border-radius: 5px; margin: 0 0 10px; overflow: hidden; padding: 0px 5px 0px 5px; width: 100%; } .content-box-blue { background-color: #F0F8FF; } .content-box-gray { background-color: #e2e2e2; } .content-box-grey { background-color: #F5F5F5; } .content-box-army { background-color: #737a36; } .content-box-green { background-color: #d9edc2; } .content-box-purple { background-color: #e2e2f9; } .content-box-red { background-color: #ffcccc; } .content-box-yellow { background-color: #fef5c4; } .content-box-blue .remark-inline-code, .content-box-blue .remark-inline-code, .content-box-gray .remark-inline-code, .content-box-grey .remark-inline-code, .content-box-army .remark-inline-code, .content-box-green .remark-inline-code, .content-box-purple .remark-inline-code, .content-box-red .remark-inline-code, .content-box-yellow .remark-inline-code { background: none; } .full-width { display: flex; width: 100%; flex: 1 1 auto; } </style> <style type="text/css"> blockquote, .blockquote { display: block; margin-top: 0.1em; margin-bottom: 0.2em; margin-left: 5px; margin-right: 5px; border-left: solid 10px #0148A4; border-top: solid 2px #0148A4; border-bottom: solid 2px #0148A4; border-right: solid 2px #0148A4; box-shadow: 0 0 6px rgba(0,0,0,0.5); /* background-color: #e64626; */ color: #e64626; padding: 0.5em; -moz-border-radius: 5px; -webkit-border-radius: 5px; } .blockquote p { margin-top: 0px; margin-bottom: 5px; } .blockquote > h1:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h2:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h3:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h4:first-of-type { margin-top: 0px; margin-bottom: 5px; } .text-shadow { text-shadow: 0 0 4px #424242; } 
</style> <style type="text/css"> /****************** * Slide scrolling * (non-functional) * not sure if it is a good idea anyway slides > slide { overflow: scroll; padding: 5px 40px; } .scrollable-slide .remark-slide { height: 400px; overflow: scroll !important; } ******************/ .scroll-box-8 { height:8em; overflow-y: scroll; } .scroll-box-10 { height:10em; overflow-y: scroll; } .scroll-box-12 { height:12em; overflow-y: scroll; } .scroll-box-14 { height:14em; overflow-y: scroll; } .scroll-box-16 { height:16em; overflow-y: scroll; } .scroll-box-18 { height:18em; overflow-y: scroll; } .scroll-box-20 { height:20em; overflow-y: scroll; } .scroll-box-24 { height:24em; overflow-y: scroll; } .scroll-box-30 { height:30em; overflow-y: scroll; } .scroll-output { height: 90%; overflow-y: scroll; } </style> # Before we start ## Learning objectives 1. Enhance the understanding of the interpretation of various models 1. Post-estimation simulation ## Table of contents 1. [Expanding on Simple Models](#various-models) 2. [Interaction terms](#interaction) 3. [Categorical variable](#qualitative) 4. [R coding tips: categorical variables and interaction terms](#use-i) 5. 
[Other miscellaneous topics](#misc) --- class: inverse, center, middle name: various-models # More on functional forms <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html> --- # Various econometric models .content-box-red[**log-linear**] `\(log(y_i)= \beta_0+\beta_1 x_i + u_i\)` <br> .content-box-red[**linear-log**] `\(y_i= \beta_0+\beta_1 log(x_i) + u_i\)` <br> .content-box-red[**log-log**] `\(log(y_i)= \beta_0+\beta_1 log(x_i) + u_i\)` -- <br> .content-box-red[**quadratic**] `\(y_i= \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + u_i\)` --- # Quadratic .content-box-red[**Model**] `\(y_i= \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + u_i\)` -- <br> .content-box-red[**Calculus**] Differentiating both sides with respect to `\(x_i\)`, `\(\frac{\partial y_i}{\partial x_i} = \beta_1 + 2*\beta_2 x_i\Rightarrow \Delta y_i = (\beta_1 + 2*\beta_2 x_i)\Delta x_i\)` <br> .content-box-red[**Interpretation**] When `\(x\)` increases by 1 unit `\((\Delta x_i=1)\)`, `\(y\)` changes by `\(\beta_1 + 2*\beta_2 x_i\)` --- # Visualization The quadratic functional form is quite flexible. .left5[ `\(y = x + x^2\)` `\((\beta_1 = 1, \beta_2 = 1)\)` <img src="data:image/png;base64,#modeling_x_files/figure-html/unnamed-chunk-4-1.png" width="90%" height="80%" style="display: block; margin: auto;" /> ] .right5[ `\(y = 3x-2x^2\)` `\((\beta_1 = 3, \beta_2 = -2)\)` <img src="data:image/png;base64,#modeling_x_files/figure-html/unnamed-chunk-5-1.png" width="90%" height="80%" style="display: block; margin: auto;" /> ] --- # Example .content-box-red[**Impact of education on income**] The marginal impact of education (the impact of a small change in education on income) may differ depending on what level of education you have had: + How much does it help to have two more years of education when you have only completed elementary school? + How much does it help to have two more years of education when you have graduated from college?
+ How much does it help to spend two more years as a Ph.D. student if you have already spent six years in a Ph.D. program? -- <br> .content-box-green[**Observation**] The marginal impact of education does not seem to be constant. --- # Implementation in R When you include a variable that is a transformation of an existing variable, use the `I()` function, in which you write the mathematical expression of the desired transformation. ```r #--- prepare a dataset ---# data("wage1", package = "wooldridge") #--- run a regression ---# quad_reg <- fixest::feols(wage ~ female + educ + I(educ^2), data = wage1) #--- look at the results ---# broom::tidy(quad_reg) ``` ``` ## # A tibble: 4 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 5.61 1.38 4.05 5.91e- 5 ## 2 female -2.13 0.277 -7.67 8.50e-14 ## 3 educ -0.416 0.231 -1.81 7.14e- 2 ## 4 I(educ^2) 0.0395 0.00964 4.10 4.80e- 5 ``` --- .content-box-red[**Estimated Model**] `\(wage = 5.61 - 2.13\times female -0.416\times educ + 0.039\times educ^2\)` <img src="data:image/png;base64,#modeling_x_files/figure-html/unnamed-chunk-7-1.png" width="60%" style="display: block; margin: auto;" /> --- .content-box-red[**Estimated Model**] `\(wage = 5.61 - 2.13\times female -0.416\times educ + 0.039\times educ^2\)` <br> -- .content-box-green[**Problem**] What is the marginal impact of `\(educ\)`?
`\(\frac{\partial wage}{\partial educ} = ?\)` -- <br> .content-box-green[**Answer**] `\(\frac{\partial wage}{\partial educ} = -0.416+0.039\times 2\times educ\)` -- + When `\(educ = 4\)`, an additional year of education changes hourly wage by `\(-0.104\)` on average (a decrease) -- + When `\(educ = 10\)`, an additional year of education increases hourly wage by `\(0.364\)` on average --- # Statistical significance of the marginal impact The marginal impact of `\(educ\)` is: `\(\frac{\partial wage}{\partial educ} = -0.416+0.039\times 2\times educ\)` + `\(educ\)`: `\(-0.416\)` `\((t\)`-stat `\(= -1.81)\)` + `\(educ^2\)`: `\(0.039\)` `\((t\)`-stat `\(= 4.10)\)` -- <br> .content-box-green[**Question**] So, is the marginal impact of `\(educ\)` statistically significantly different from `\(0\)`? --- # In the linear case ```r linear_reg <- fixest::feols(wage ~ female + educ, data = wage1) broom::tidy(linear_reg) ``` ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 0.623 0.673 0.926 3.55e- 1 ## 2 female -2.27 0.279 -8.15 2.76e-15 ## 3 educ 0.506 0.0504 10.1 7.56e-22 ``` -- <br> .content-box-red[**Estimated model**] `\(wage = 0.62 - 2.27\times female + 0.51 \times educ\)` --- class: middle .content-box-red[**Estimated model**] `\(wage = 0.62 - 2.27\times female + 0.51 \times educ\)` -- <br> .content-box-green[**Question**] + What is the marginal impact of `\(educ\)`? -- 0.51 -- + Does the marginal impact of education vary depending on the level of education? -- No, the model we estimated assumes that the marginal impact of education is constant. -- <br> .content-box-red[**Testing**] You can just test if `\(\hat{\beta}_{educ}\)` (the marginal impact of education) is statistically significantly different from `\(0\)`, which is just a t-test.
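The quadratic marginal-effect arithmetic shown earlier (`\(-0.104\)` at `\(educ = 4\)`, `\(0.364\)` at `\(educ = 10\)`) is easy to check directly. A minimal base-R sketch, with the two coefficients hard-coded from the `quad_reg` output above (in practice you would pull them with `coef(quad_reg)`):

```r
# marginal effect of educ in the quadratic model: b_educ + 2 * b_educ2 * educ
# (coefficients hard-coded from the quad_reg output above)
b_educ <- -0.416
b_educ2 <- 0.039
marginal_educ <- function(educ) b_educ + 2 * b_educ2 * educ

marginal_educ(4)  # -0.104, matching the educ = 4 case above
marginal_educ(10) # 0.364, matching the educ = 10 case above
```

The function also shows where the marginal effect turns positive: solving `\(\beta_1 + 2\beta_2 educ = 0\)` gives `\(educ \approx 5.3\)`.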
--- # Going back to the quadratic case With the quadratic specification + The marginal impact of education varies depending on your education level -- + There is no single test that tells you whether the marginal impact of education is statistically significant universally -- + Indeed, you need different tests for different education levels --- # Example 1 .content-box-red[**Marginal impact of education**] `\(\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 2 \times educ\)` -- <br> .content-box-red[**Hypothesis testing**] Does an additional year of education have a statistically significant impact (positive or negative) if your current education level is 4? + `\(H_0\)`: `\(\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 2 \times 4 =0\)` + `\(H_1\)`: `\(\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 2 \times 4 \ne 0\)` -- <br> .content-box-green[**Question**] Is this + test of a single coefficient? (t-test) + test of a single equation with multiple coefficients? (t-test) + test of multiple equations with multiple coefficients? (F-test) --- .content-box-red[**t-statistic**] `\(t = \frac{\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 2 \times 4}{se(\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 2 \times 4)} = \frac{\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 8}{se(\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 8)}\)` -- <br> .content-box-red[**R implementation**] Remember, a trick to do this test using R is to take advantage of the fact that `\(F_{1, n-k-1} \sim t_{n-k-1}^2\)`. ```r car::linearHypothesis(quad_reg, "educ + 8*I(educ^2)=0") ``` ``` ## Linear hypothesis test ## ## Hypothesis: ## educ + 8 I(educ^2) = 0 ## ## Model 1: restricted model ## Model 2: wage ~ female + educ + I(educ^2) ## ## Df Chisq Pr(>Chisq) ## 1 ## 2 1 0.4126 0.5207 ``` -- Since the p-value is 0.5207, we do not reject the null.
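The `\(F_{1, n-k-1} \sim t_{n-k-1}^2\)` trick mentioned above can be verified numerically with base R's distribution functions. A small sketch (the `\(t\)` value is illustrative; `\(0.64^2\)` is roughly the `\(0.41\)` statistic reported above):

```r
# the two-sided p-value of a t(df) statistic equals the upper-tail
# p-value of its square under F(1, df)
t_val <- 0.64 # illustrative t value
df <- 522     # n - k - 1 for the wage1 regression: 526 - 3 - 1
p_t <- 2 * pt(-abs(t_val), df)
p_f <- pf(t_val^2, df1 = 1, df2 = df, lower.tail = FALSE)
all.equal(p_t, p_f) # TRUE
```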
--- class: middle # Example 2 .content-box-red[**Marginal impact of education**] `\(\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 2 \times educ\)` -- <br> .content-box-red[**Hypothesis testing**] Does an additional year of education have a statistically significant impact (positive or negative) if your current education level is 10? + `\(H_0\)`: `\(\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 2 \times 10 =0\)` + `\(H_1\)`: `\(\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 2 \times 10 \ne 0\)` -- <br> .content-box-green[**Question**] Is this + test of a single coefficient? (t-test) + test of a single equation with multiple coefficients? (t-test) + test of multiple equations with multiple coefficients? (F-test) --- .content-box-red[**t-statistic**] `\(t = \frac{\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 2 \times 10}{se(\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 2 \times 10)} = \frac{\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 20}{se(\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 20)}\)` -- <br> .content-box-red[**R implementation**] ```r car::linearHypothesis(quad_reg, "educ + 20*I(educ^2)=0") ``` ``` ## Linear hypothesis test ## ## Hypothesis: ## educ + 20 I(educ^2) = 0 ## ## Model 1: restricted model ## Model 2: wage ~ female + educ + I(educ^2) ## ## Df Chisq Pr(>Chisq) ## 1 ## 2 1 39.831 2.769e-10 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` -- Since the p-value is much lower than 0.01, we can reject the null at the 1% level.
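Behind `car::linearHypothesis()` is the variance of a linear combination, `\(Var(c'\hat{\beta}) = c'Vc\)`. A base-R sketch of the t-statistic computation: the standard errors below are taken from the regression output above, but the covariance term is a made-up illustrative number (in practice all three pieces would come from `vcov(quad_reg)`):

```r
# t-statistic for b_educ + 20 * b_educ2 = 0 via t = c'beta / sqrt(c'Vc)
beta <- c(-0.416, 0.039)            # estimates from quad_reg above
V <- matrix(c(0.231^2,   -0.0021,   # -0.0021 is a HYPOTHETICAL covariance,
              -0.0021, 0.00964^2),  # not the actual vcov(quad_reg) entry
            nrow = 2)
cvec <- c(1, 20)                    # coefficients of the linear combination
t_stat <- sum(cvec * beta) / sqrt(drop(t(cvec) %*% V %*% cvec))
```

Because the covariance entry here is hypothetical, `t_stat` will not reproduce the chi-square statistic above; the point is only the `\(c'Vc\)` mechanics.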
--- class: inverse, center, middle name: interaction # Interaction terms <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html> --- class: middle .content-box-red[**An interaction term**] A variable that is a multiplication of two variables -- <br> .content-box-red[**Example**] `\(educ\times exper\)` --- <br> .content-box-red[**A model with an interaction term**] `\(wage = \beta_0 + \beta_1 exper + \beta_2 educ \times exper + u\)` -- <br> .content-box-red[**Marginal impact of experience**] `\(\frac{\partial wage}{\partial exper} = \beta_1+\beta_2\times educ\)` -- <br> .content-box-red[**Implications**] The marginal impact of experience depends on education + `\(\beta_1\)`: the marginal impact of experience when `\(educ=?\)` + if `\(\beta_2>0\)`: additional year of experience is worth more when you have more years of education --- # Regression with interaction terms Just like the quadratic case with `\(educ^2\)`, you can use `I()`. ```r reg_int <- fixest::feols(wage ~ female + exper + I(exper * educ), data = wage1) ``` -- <table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;">  (1) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 6.121*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.267) </td> </tr> <tr> <td style="text-align:left;"> female </td> <td style="text-align:center;"> −2.418*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.277) </td> </tr> <tr> <td style="text-align:left;"> exper </td> <td style="text-align:center;"> −0.188*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.024) </td> </tr> <tr> <td style="text-align:left;"> I(exper * educ) </td> <td style="text-align:center;"> 0.020*** </td> </tr> <tr> 
<td style="text-align:left;box-shadow: 0px 1.5px"> </td> <td style="text-align:center;box-shadow: 0px 1.5px"> (0.002) </td> </tr> <tr> <td style="text-align:left;"> Std.Errors </td> <td style="text-align:center;"> IID </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table> --- class: middle .content-box-red[**Estimated Model**] `\(wage = 6.121 - 2.418 \times female - 0.188 \times exper + 0.020 \times educ \times exper\)` <br> .content-box-red[**Marginal impact of experience**] `\(\frac{\partial wage}{\partial exper} = - 0.188 + 0.020 \times educ\)` --- class: middle <br> <br> .left5[ Marginal impact of `\(exper\)`: <img src="data:image/png;base64,#modeling_x_files/figure-html/unnamed-chunk-14-1.png" width="90%" style="display: block; margin: auto;" /> ] .right5[ Histogram of education: <img src="data:image/png;base64,#modeling_x_files/figure-html/unnamed-chunk-15-1.png" width="90%" style="display: block; margin: auto;" /> ] --- class: middle .content-box-red[**Test of marginal impacts**] + Just like the case of the quadratic specification of education, the marginal impact of experience is not constant + We can test if the marginal impact of experience is statistically significant for a given level of education * When `\(educ=10\)`, `\(\frac{\partial wage}{\partial exper} = - 0.188 + 0.020 \times 10=0.012\)` * When `\(educ=15\)`, `\(\frac{\partial wage}{\partial exper} = - 0.188 + 0.020 \times 15=0.112\)` --- class: middle .content-box-red[**Question**] Does an additional year of experience have a statistically significant impact (positive or negative) if your current education level is 10? <br> .content-box-red[**Hypothesis**] + `\(H_0\)`: `\(\hat{\beta}_{exper} + \hat{\beta}_{exper\_educ} \times 10=0\)` + `\(H_1\)`: `\(\hat{\beta}_{exper} + \hat{\beta}_{exper\_educ} \times 10\ne 0\)` --- class: middle .content-box-red[**R implementation**] ```r
car::linearHypothesis(reg_int, "exper+10*I(exper * educ)=0") ``` ``` ## Linear hypothesis test ## ## Hypothesis: ## exper+10*I(exper * educ) = 0 ## ## Model 1: restricted model ## Model 2: wage ~ female + exper + I(exper * educ) ## ## Df Chisq Pr(>Chisq) ## 1 ## 2 1 2.4627 0.1166 ``` Since the p-value is 0.1166, we do not reject the null at the 5% level. --- class: inverse, center, middle name: qualitative # Including qualitative information <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html> --- # Qualitative information .content-box-green[**Issue**] How do we include qualitative information as an independent variable? -- <br> .content-box-green[**Examples**] + male or female (binary) + married or single (binary) + high school, college, master's, or Ph.D. (more than two states) --- # Binary variables .content-box-red[**Dummy variable**] + Relevant information in binary variables can be captured by a .red[zero-one] variable that takes the value of `\(1\)` for one state and `\(0\)` for the other state + We use "dummy variable" to refer to a binary (zero-one) variable <br> .content-box-red[**Example**] ```r dplyr::select(wage1, wage, educ, exper, female, married) %>% head() ``` ``` ## wage educ exper female married ## 1 3.10 11 2 1 0 ## 2 3.24 12 22 1 1 ## 3 3.00 11 2 0 0 ## 4 6.00 8 44 0 1 ## 5 5.30 12 7 0 1 ## 6 8.75 16 9 0 1 ``` --- class: middle .content-box-red[**Model with a dummy variable**] `\(wage = \beta_0 +\sigma_f female +\beta_2 educ + u\)` <br> .content-box-red[**Interpretation**] + `female`: `\(E[wage|female=1,educ] = \beta_0 + \sigma_f +\beta_2 educ\)` + `male`: `\(E[wage|female=0,educ] = \beta_0 + \beta_2 educ\)` -- This means that `\(\sigma_f = E[wage|female=1,educ]-E[wage|female=0,educ]\)` --- class: middle `\(\sigma_f = E[wage|female=1,educ]-E[wage|female=0,educ]\)` Verbally, + `\(\sigma_f\)` is the difference in the expected wage conditional on education between female and male + `\(\sigma_f\)` measures how much more (less) female workers make compared to male workers (.blue[baseline])
if they were to have the same education level --- class: middle .content-box-red[**Regression with a dummy variable**] ```r reg_df <- fixest::feols(wage ~ female + educ, data = wage1) reg_df ``` ``` ## OLS estimation, Dep. Var.: wage ## Observations: 526 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.622817 0.672533 0.926076 3.5483e-01 ## female -2.273362 0.279044 -8.146954 2.7642e-15 *** ## educ 0.506452 0.050391 10.050520 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 3.17642 Adj. R2: 0.255985 ``` <br> .content-box-red[**Interpretation**] Female workers make 2.27 ($/hour) less than male workers on average, even when they have the same education level. --- .content-box-red[**Visualization of the estimated model**] <img src="data:image/png;base64,#modeling_x_files/figure-html/unnamed-chunk-19-1.png" width="60%" style="display: block; margin: auto;" /> --- class: middle .content-box-red[**Model with a dummy variable**] `\(wage = \beta_0 +\sigma_m male +\beta_2 educ + u\)` <br> .content-box-red[**Interpretation**] + `male`: `\(E[wage|male = 1,educ] = \beta_0 + \sigma_m +\beta_2 educ\)` + `female`: `\(E[wage|male = 0,educ] = \beta_0 + \beta_2 educ\)` -- This means that `\(\sigma_m = E[wage|male=1,educ]-E[wage|male=0,educ]\)` --- class: middle `\(\sigma_m = E[wage|male=1,educ]-E[wage|male=0,educ]\)` Verbally, + `\(\sigma_m\)` is the difference in the expected wage conditional on education between male and female + `\(\sigma_m\)` measures how much more (less) male workers make compared to female workers (.blue[baseline]) if they were to have the same education level .red[Important]: whichever status is given the value of `\(0\)` becomes the baseline --- class: middle .content-box-red[**Regression with a dummy variable**] ```r wage1 <- dplyr::mutate(wage1, male = 1 - female) reg_df <- fixest::feols(wage ~ male + educ, data = wage1) reg_df ``` ``` ## OLS estimation, Dep.
Var.: wage ## Observations: 526 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1.650545 0.652317 -2.53028 1.1689e-02 * ## male 2.273362 0.279044 8.14695 2.7642e-15 *** ## educ 0.506452 0.050391 10.05052 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 3.17642 Adj. R2: 0.255985 ``` <br> .content-box-red[**Interpretation**] Male workers make 2.27 ($/hour) more than female workers on average, even when they have the same education level. --- class: middle .content-box-red[**Question**] What do you think will happen if we include both male and female dummy variables? -- <br> .content-box-red[**Answer**] + They contain redundant information + Indeed, including both of them along with the intercept would cause a .blue[perfect collinearity] problem + So, you .blue[need to] drop either one of them -- <br> .content-box-red[**Perfect Collinearity**] intercept = male + female --- class: middle Here is what happens if you include both: ```r reg_dmf <- fixest::feols(wage ~ male + female + educ, data = wage1) reg_dmf ``` ``` ## OLS estimation, Dep. Var.: wage ## Observations: 526 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1.650545 0.652317 -2.53028 1.1689e-02 * ## male 2.273362 0.279044 8.14695 2.7642e-15 *** ## educ 0.506452 0.050391 10.05052 < 2.2e-16 *** *## ... 1 variable was removed because of collinearity (female) ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 3.17642 Adj. R2: 0.255985 ``` --- class: middle # Interactions with a dummy variable .content-box-green[**Issue**] + In the previous example, the impact of education on wage was modeled to be exactly the same for male and female workers + Can we build a more flexible model that allows us to estimate the differential impacts of education on wage between male and female?
--- class: middle .content-box-red[**A more flexible model**] `\(wage = \beta_0 + \sigma_f female +\beta_2 educ + \gamma female\times educ + u\)` + [female]: `\(E[wage|female=1,educ] = \beta_0 + \sigma_f +(\beta_2+\gamma) educ\)` + [male]: `\(E[wage|female=0,educ] = \beta_0 + \beta_2 educ\)` <br> .content-box-red[**Interpretation**] For female workers, education is more effective by `\(\gamma\)` than it is for male workers. --- .content-box-red[**Example using R**] ```r reg_di <- fixest::feols(wage ~ female + educ + I(female * educ), data = wage1) reg_di ``` ``` ## OLS estimation, Dep. Var.: wage ## Observations: 526 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.200496 0.843562 0.237678 8.1222e-01 ## female -1.198523 1.325040 -0.904518 3.6614e-01 ## educ 0.539476 0.064223 8.400054 4.2437e-16 *** ## I(female * educ) -0.085999 0.103639 -0.829795 4.0703e-01 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 3.17433 Adj. R2: 0.255542 ``` -- <br> .content-box-red[**Interpretation**] The marginal benefit of education is 0.086 ($/hour) less for female workers than for male workers on average (though the interaction term is not statistically significant here). --- <img src="data:image/png;base64,#modeling_x_files/figure-html/unnamed-chunk-24-1.png" width="80%" style="display: block; margin: auto;" /> --- # Categorical variable: more than two states .content-box-green[**Issue**] + Consider a variable called `\(degree\)` which takes three values: college, master, and doctor. + Unlike a binary variable, there are more than two categories. + How do we include a categorical variable like this in a model?
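Before turning to the answer, it is worth seeing how R itself expands a categorical variable. Base R's `model.matrix()` shows the dummy columns a factor generates; a sketch with the three-level `degree` variable:

```r
# a k-level factor is expanded into (k - 1) dummy columns;
# the omitted first level ("college") is absorbed into the intercept
degree <- factor(c("college", "master", "doctor", "college", "doctor"),
                 levels = c("college", "master", "doctor"))
X <- model.matrix(~ degree)
colnames(X) # "(Intercept)" "degreemaster" "degreedoctor"
```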
--- .content-box-green[**What do we do about this?**] You can create three dummy variables like below: + `college`: 1 if the highest degree is college, 0 otherwise + `master`: 1 if the highest degree is Master's, 0 otherwise + `doctor`: 1 if the highest degree is Ph.D., 0 otherwise -- You then include two (the number of categories minus one) of the three dummy variables: --- .content-box-red[**Model**] `\(wage = \beta_0 + \sigma_m master +\sigma_d doctor + \beta_1 educ + u\)` -- + [college]: `\(E[wage|master=0, doctor = 0, educ] = \beta_0 + \beta_1 educ\)` -- + [master]: `\(E[wage|master=1, doctor = 0, educ] = \beta_0 + \sigma_m + \beta_1 educ\)` -- + [doctor]: `\(E[wage|master=0, doctor = 1, educ] = \beta_0 + \sigma_d + \beta_1 educ\)` -- <br> .content-box-red[**Interpretation**] `\(\sigma_m\)`: the impact of having an MS degree .red[relative to] having a .red[college degree] `\(\sigma_d\)`: the impact of having a Ph.D. degree .red[relative to] having a .red[college degree] -- <br> .content-box-red[**Important**] The omitted category (here, `college`) becomes the baseline. --- # Structural differences across groups .content-box-red[**Definition**] Structural difference refers to fundamental differences in the model of a phenomenon in the population. --- .content-box-red[**Example**] .blue[Male]: `\(cumgpa = \alpha_0 + \alpha_1 sat + \alpha_2 hsperc + \alpha_3 tothrs + u\)` .blue[Female]: `\(cumgpa = \beta_0 + \beta_1 sat + \beta_2 hsperc + \beta_3 tothrs + u\)` + `\(cumgpa\)`: college grade point average for male and female college athletes + `\(sat\)`: SAT score + `\(hsperc\)`: high school rank percentile + `\(tothrs\)`: total hours of college courses -- <br> .content-box-red[**In this example,**] `\(cumgpa\)` is determined in a fundamentally different manner between female and male students. You do not want to run a single regression that fits a single model for both female and male students.
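Running separate regressions by group turns out to be equivalent to a single regression in which a group dummy is interacted with every regressor, which is the strategy developed next. A simulated base-R sketch of this equivalence (the variable names are illustrative, not the `gpa3` data):

```r
# a fully interacted model reproduces the group-by-group regressions
set.seed(1)
n <- 200
female <- rep(0:1, each = n / 2)
sat <- rnorm(n, mean = 1000, sd = 100)
cumgpa <- 2 + 0.001 * sat + female * (0.3 + 0.0002 * sat) + rnorm(n, sd = 0.3)

full <- lm(cumgpa ~ female * sat) # intercept, female, sat, female:sat
male_fit <- lm(cumgpa ~ sat, subset = female == 0)

# the male-group coefficients equal the non-interacted terms of the full model
stopifnot(all.equal(unname(coef(full)[c("(Intercept)", "sat")]),
                    unname(coef(male_fit))))
```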
--- .content-box-red[**What to do?**] If you suspect that the underlying process of how the dependent variable is determined varies across groups, then you should test that hypothesis! <br> .content-box-red[**To do so,**] You estimate a model that allows separate models across groups to be estimated within a single regression analysis. `$$cumgpa = \beta_0 + \sigma_0 female + \beta_1 sat + \sigma_1 (sat \times female)$$` `$$\;\; + \beta_2 hsperc + \sigma_2 (hsperc \times female)$$` `$$\qquad + \beta_3 tothrs + \sigma_3 (tothrs \times female) + u$$` --- .content-box-red[**The flexible model**] `$$cumgpa = \beta_0 + \sigma_0 female + \beta_1 sat + \sigma_1 (sat \times female)$$` `$$\;\; + \beta_2 hsperc + \sigma_2 (hsperc \times female)$$` `$$\qquad + \beta_3 tothrs + \sigma_3 (tothrs \times female) + u$$` <br> .content-box-green[**Male**] `\(E[cumgpa] = \beta_0 + \beta_1 sat + \beta_2 hsperc + \beta_3 tothrs\)` <br> .content-box-green[**Female**] `\(E[cumgpa] = (\beta_0 +\sigma_0) + (\beta_1+\sigma_1) sat + (\beta_2+\sigma_2) hsperc + (\beta_3+\sigma_3) tothrs\)` <br> .content-box-red[**Interpretation**] + `\(\beta\)`s are commonly shared by female and male students + `\(\sigma\)`s capture the differences between female and male students --- .content-box-red[**Null Hypothesis (verbal)**] The models of GPA for male and female students are not structurally different. <br> .content-box-red[**Null Hypothesis**] `\(H_0: \;\; \sigma_0=0,\;\; \sigma_1=0, \;\; \sigma_2=0, \;\; \mbox{and} \;\; \sigma_3=0\)` <br> .content-box-green[**Question**] What test do we do? t-test or F-test?
--- class: middle .content-box-red[**R code**] Run the unrestricted model with all the interaction terms: ```r data("gpa3", package = "wooldridge") gpa <- gpa3 %>% dplyr::filter(!is.na(ctothrs)) %>% #--- create interaction terms ---# dplyr::mutate( female_sat := female * sat, female_hsperc := female * hsperc, female_tothrs := female * tothrs ) #--- regression with female dummy ---# reg_full <- fixest::feols( cumgpa ~ female + sat + female_sat + hsperc + female_hsperc + tothrs + female_tothrs, data = gpa ) ``` --- class: middle .left5[ .content-box-red[**What do you see?**] + None of the variables that involve `\(female\)` are statistically significant at the 5% level individually. + Does this mean that `\(male\)` and `\(female\)` students have the same regression function? + No, we are testing the joint significance of the coefficients. We need to do an `\(F\)`-test! ] .right5[ <table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;">  (1) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 1.481*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.207) </td> </tr> <tr> <td style="text-align:left;"> female </td> <td style="text-align:center;"> −0.353 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.411) </td> </tr> <tr> <td style="text-align:left;"> sat </td> <td style="text-align:center;"> 0.001*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.000) </td> </tr> <tr> <td style="text-align:left;"> female_sat </td> <td style="text-align:center;"> 0.001+ </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.000) </td> </tr> <tr> <td style="text-align:left;"> hsperc </td> <td style="text-align:center;"> −0.008*** </td> </tr> 
<tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.001) </td> </tr> <tr> <td style="text-align:left;"> female_hsperc </td> <td style="text-align:center;"> −0.001 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.003) </td> </tr> <tr> <td style="text-align:left;"> tothrs </td> <td style="text-align:center;"> 0.002** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.001) </td> </tr> <tr> <td style="text-align:left;"> female_tothrs </td> <td style="text-align:center;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.002) </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table> ] --- ```r car::linearHypothesis( reg_full, c( "female = 0", "female_hsperc = 0", "female_sat = 0", "female_tothrs = 0" ) ) ``` ``` ## Linear hypothesis test ## ## Hypothesis: ## female = 0 ## female_hsperc = 0 ## female_sat = 0 ## female_tothrs = 0 ## ## Model 1: restricted model ## Model 2: cumgpa ~ female + sat + female_sat + hsperc + female_hsperc + ## tothrs + female_tothrs ## ## Df Chisq Pr(>Chisq) ## 1 ## 2 4 32.716 1.365e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 ``` --- class: inverse, center, middle name: use-i # R coding tips: categorical variables and interaction terms <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html> --- class: middle # R coding tips: categorical variables and interaction terms ```r #* load the package to access the data we want library(wooldridge) #* get big9salary data("big9salary") #* create a variable that indicates university #* this is how the data would look most of the time (instead of having a bunch of dummy variables) big9salary_c <- tibble::as_tibble(big9salary) %>% dplyr::mutate( university = case_when( osu == 1 ~ "Ohio State U", iowa == 1 ~ "U of Iowa", indiana == 1 ~ "Indiana U", purdue == 1 ~ "Purdue U", msu == 1 ~ "Michigan State U", mich == 1 ~ "Michigan U", wisc == 1 ~ "U of Wisconsin", illinois == 1 ~ "U of Illinois" ) ) %>% dplyr::relocate(id, year, salary, pubindx, university) ``` --- class: middle Take a look at the data: ```r head(big9salary_c) ``` ``` ## # A tibble: 6 × 31 ## id year salary pubindx university totpge assist assoc prof chair top20phd yearphd female osu iowa indiana purdue msu minn mich wisc illinois y92 y95 y99 lsalary exper expersq pubindxsq pubindx0 lpubindx ## <int> <int> <int> <dbl> <chr> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <dbl> <dbl> ## 1 101 92 NA 30.5 Indiana U 92.7 0 0 1 0 0 73 0 0 0 1 0 0 0 0 0 0 1 0 0 NA 19 361 933. 0 3.42 ## 2 101 95 NA 31.0 Indiana U 107. 0 0 1 0 0 73 0 0 0 1 0 0 0 0 0 0 0 1 0 NA 22 484 959. 0 3.43 ## 3 101 99 107100 40.5 Indiana U 186. 0 0 1 0 0 73 0 0 0 1 0 0 0 0 0 0 0 0 1 11.6 26 676 1636. 0 3.70 ## 4 102 92 79420 33.5 Indiana U 128. 0 0 1 0 0 76 0 0 0 1 0 0 0 0 0 0 1 0 0 11.3 16 256 1125. 0 3.51 ## 5 102 95 88239 33.9 Indiana U 133 0 0 1 0 0 76 0 0 0 1 0 0 0 0 0 0 0 1 0 11.4 19 361 1149. 0 3.52 ## 6 102 99 100450 36.2 Indiana U 192. 0 0 1 0 0 76 0 0 0 1 0 0 0 0 0 0 0 0 1 11.5 23 529 1313.
0 3.59
```

```r
tail(big9salary_c)
```

```
## # A tibble: 6 × 31
## id year salary pubindx university totpge assist assoc prof chair top20phd yearphd female osu iowa indiana purdue msu minn mich wisc illinois y92 y95 y99 lsalary exper expersq pubindxsq pubindx0 lpubindx
## <int> <int> <int> <dbl> <chr> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <dbl> <dbl>
## 1 932 92 90856 72.7 U of Wisconsin 269. 0 0 1 0 1 73 1 0 0 0 0 0 0 0 1 0 1 0 0 11.4 19 361 5287. 0 4.29
## 2 932 95 110090 73.5 U of Wisconsin 294 0 0 1 0 1 73 1 0 0 0 0 0 0 0 1 0 0 1 0 11.6 22 484 5396. 0 4.30
## 3 932 99 122397 75.2 U of Wisconsin 315 0 0 1 0 1 73 1 0 0 0 0 0 0 0 1 0 0 0 1 11.7 26 676 5649. 0 4.32
## 4 933 92 45755 2.19 U of Wisconsin 9.5 1 0 0 0 1 91 0 0 0 0 0 0 0 0 1 0 1 0 0 10.7 1 1 4.80 0 0.784
## 5 933 95 51846 8.11 U of Wisconsin 88 1 0 0 0 1 92 0 0 0 0 0 0 0 0 1 0 0 1 0 10.9 3 9 65.8 0 2.09
## 6 933 99 69630 59.5 U of Wisconsin 208. 0 1 0 0 1 93 0 0 0 0 0 0 0 0 1 0 0 0 1 11.2 6 36 3534. 0 4.09
```

---
class: middle

You can use the `i()` function inside `fixest::feols()` like below:

```r
fixest::feols(
  salary ~ pubindx + female + i(university, ref = "Indiana U"),
  data = big9salary_c
) %>%
  broom::tidy()
```

```
## # A tibble: 10 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 74544. 3001. 24.8 9.34e-94
## 2 pubindx 346. 26.6 13.0 3.18e-34
## 3 female -5877. 3067. -1.92 5.59e- 2
*## 4 university::Michigan State U -9188. 3631. -2.53 1.17e- 2
## 5 university::Michigan U -11561. 3833. -3.02 2.67e- 3
## 6 university::Ohio State U -4707. 3790. -1.24 2.15e- 1
## 7 university::Purdue U -10517. 4310. -2.44 1.50e- 2
## 8 university::U of Illinois -1809. 3686. -0.491 6.24e- 1
## 9 university::U of Iowa -519. 3951. -0.131 8.95e- 1
## 10 university::U of Wisconsin -6840. 4186. -1.63 1.03e- 1
```

`ref = "Indiana U"` sets the base category to `"Indiana U"`. So, for example, the highlighted line means that faculty members at Michigan State U make `\(9,188\)` USD less annually than those at Indiana U.

<br>

.content-box-green[**Key**] You do not have to make a bunch of dummy variables like in the original dataset. Just use `i(category_variable)`.

---
class: middle

# Interaction terms

You can use `i()` for creating interactions of a categorical variable and a continuous variable. Suppose you are interested in how the impact of `pubindx` (continuous) varies by `university` (categorical), then

```r
fixest::feols(
  salary ~ female + pubindx + i(university, ref = "Indiana U") +
    i(university, pubindx, ref = "Indiana U"),
  data = big9salary_c
) %>%
  broom::tidy()
```

```
## # A tibble: 17 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 79593. 4267. 18.7 3.02e-61
## 2 female -3782. 3113. -1.21 2.25e- 1
## 3 pubindx 42.5 172. 0.247 8.05e- 1
## 4 university::Michigan State U -17995. 5190. -3.47 5.65e- 4
## 5 university::Michigan U -13162. 5577. -2.36 1.86e- 2
## 6 university::Ohio State U -10073. 5633. -1.79 7.42e- 2
## 7 university::Purdue U -19022. 6291. -3.02 2.61e- 3
## 8 university::U of Illinois -12818. 5568. -2.30 2.17e- 2
## 9 university::U of Iowa -11785. 5510. -2.14 3.29e- 2
## 10 university::U of Wisconsin -8197. 6132. -1.34 1.82e- 1
*## 11 university::Michigan State U:pubindx 436. 191. 2.29 2.25e- 2
## 12 university::Michigan U:pubindx 253. 177. 1.43 1.54e- 1
## 13 university::Ohio State U:pubindx 305. 185. 1.65 9.96e- 2
## 14 university::Purdue U:pubindx 422. 212. 2.00 4.65e- 2
## 15 university::U of Illinois:pubindx 594. 225. 2.64 8.44e- 3
## 16 university::U of Iowa:pubindx 588. 206. 2.85 4.50e- 3
## 17 university::U of Wisconsin:pubindx 247. 180. 1.37 1.70e- 1
```

So, the marginal impact of `pubindx` is `\(436\)` greater for those at Michigan State U than for those at Indiana U.
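---
class: middle

To see why an interaction coefficient is a *difference* in slopes, here is a self-contained base-R sketch (the `toy` data, group names, and true slopes are made up for illustration; it does not use `big9salary`):

```r
#* simulate two groups whose slopes on x differ
set.seed(42)
toy <- data.frame(
  group = rep(c("A", "B"), each = 50),
  x = rnorm(100)
)

#* true slopes: 2 for group A (the base category), 2 + 3 = 5 for group B
toy$y <- ifelse(toy$group == "B", 5, 2) * toy$x + rnorm(100, sd = 0.1)

#* the formula interface creates the dummy and the interaction for you
fit <- lm(y ~ x * group, data = toy)

#* slope of x for the base group A: the coefficient on x
slope_A <- unname(coef(fit)["x"])

#* slope of x for group B: base slope plus the interaction coefficient
slope_B <- unname(coef(fit)["x"] + coef(fit)["x:groupB"])
```

`slope_A` comes out near 2 and `slope_B` near 5: the interaction coefficient (`x:groupB`) is the *extra* slope group B has relative to the base group, which is exactly how `university::Michigan State U:pubindx` is read above.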
---
class: inverse, center, middle
name: misc

# Other miscellaneous topics

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html>

---

# Goodness of fit: `\(R^2\)`

.content-box-red[**Important**] A small value of `\(R^2\)` does not mean the end of the world (in fact, we could not care less about it.)

---

.content-box-green[**Example**]

`$$ecolabs = \beta_0 + \beta_1 regprc + \beta_2 ecoprc + u$$`

+ `\(ecolabs\)`: the (hypothetical) pounds of ecologically friendly (ecolabeled) apples a family would demand
+ `\(regprc\)`: prices of regular apples
+ `\(ecoprc\)`: prices of the hypothetical ecolabeled apples

<br>

.content-box-red[**Key**]

+ The data was obtained via a survey, and `\(ecoprc\)` was set randomly by the researcher (so, we know `\(E[u|x] = 0\)`).
+ The (only) objective of the study is to understand the impact of the price of ecolabeled apples on the demand for ecolabeled apples.

---

Suppose you are challenged by somebody who claims that your regression is not .blue[good] because the `\(R^2\)` is tiny. How would you respond to this criticism?

---
class: inverse, center, middle
name: review

# Scaling

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html>

---

.content-box-green[**Questions**]

What happens if you scale up/down variables used in regression?
+ coefficients
+ standard errors
+ t-statistics
+ `\(R^2\)`

---

```r
#--- regression with original scale ---#
reg_no_scale <- fixest::feols(wage ~ female + educ, data = wage1)

#--- regression with scaled educ ---#
reg_scale <- fixest::feols(wage ~ female + I(educ * 12), data = wage1)
```

---

.left5[
```r
modelsummary::msummary(
  list(reg_no_scale, reg_scale),
  stars = TRUE,
  gof_omit = "IC|Log|Adj|F|Pseudo|Within"
)
```

<br>

.content-box-green[**So,**]

+ coefficient: 1/12
+ standard error: 1/12
+ t-stat: the same
+ `\(R^2\)`: the same
]

.right5[
<table style="border-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;">  (1) </th> <th style="text-align:center;">   (2) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 0.623 </td> <td style="text-align:center;"> 0.623 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.673) </td> <td style="text-align:center;"> (0.673) </td> </tr> <tr> <td style="text-align:left;"> female </td> <td style="text-align:center;"> −2.273*** </td> <td style="text-align:center;"> −2.273*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.279) </td> <td style="text-align:center;"> (0.279) </td> </tr> <tr> <td style="text-align:left;"> educ </td> <td style="text-align:center;"> 0.506*** </td> <td style="text-align:center;"> </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.050) </td> <td style="text-align:center;"> </td> </tr> <tr> <td style="text-align:left;"> I(educ * 12) </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> 0.042*** </td> </tr> <tr> <td style="text-align:left;box-shadow: 0px 1.5px"> </td> <td style="text-align:center;box-shadow: 0px 1.5px"> </td> <td style="text-align:center;box-shadow: 0px 1.5px"> (0.004) </td>
</tr> <tr> <td style="text-align:left;"> Num.Obs. </td> <td style="text-align:center;"> 526 </td> <td style="text-align:center;"> 526 </td> </tr> <tr> <td style="text-align:left;"> R2 </td> <td style="text-align:center;"> 0.259 </td> <td style="text-align:center;"> 0.259 </td> </tr> <tr> <td style="text-align:left;"> RMSE </td> <td style="text-align:center;"> 3.18 </td> <td style="text-align:center;"> 3.18 </td> </tr> <tr> <td style="text-align:left;"> Std.Errors </td> <td style="text-align:center;"> IID </td> <td style="text-align:center;"> IID </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table>
]

---

.content-box-red[**Interpretation**]

+ Regression .blue[without] scaling: hourly wage increases by `\(0.506\)` if education increases by a .blue[year]
+ Regression .blue[with] scaling (e.g., 48 months means 4 years): hourly wage increases by `\(0.0422\)` if education increases by a .blue[month]

--

<br>

.content-box-green[**Note**]

According to the scaled model, hourly wage increases by `\(0.0422 \times 12\)` if education increases by a year (12 months). That is, the estimated marginal impact of education on wage from the scaled model is the same as that from the non-scaled model.

---

.content-box-red[**Summary**]

When an independent variable is scaled,

+ its coefficient estimate and standard error are scaled down/up by exactly the factor by which the variable is scaled up/down
+ the t-statistic stays the same (as it should)
+ `\(R^2\)` stays the same (the model does not improve by simply scaling independent variables)
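---
class: middle

These claims can be checked with a quick simulation; below is a base-R sketch using `lm()` on made-up data (the names `fit_orig`/`fit_scaled` and the factor 12 are just for illustration):

```r
#* simulate a simple regression
set.seed(123)
df <- data.frame(x = rnorm(200))
df$y <- 1 + 0.5 * df$x + rnorm(200)

#--- same model, with x multiplied by 12 ---#
fit_orig <- lm(y ~ x, data = df)
fit_scaled <- lm(y ~ I(x * 12), data = df)

s_orig <- summary(fit_orig)
s_scaled <- summary(fit_scaled)

#* coefficient and SE on the scaled variable are exactly 1/12 of the originals
coef_ratio <- unname(coef(fit_orig)["x"] / coef(fit_scaled)["I(x * 12)"])
se_ratio <- s_orig$coefficients["x", "Std. Error"] /
  s_scaled$coefficients["I(x * 12)", "Std. Error"]

#* t-statistic and R^2 are unchanged
t_orig <- s_orig$coefficients["x", "t value"]
t_scaled <- s_scaled$coefficients["I(x * 12)", "t value"]
```

Both `coef_ratio` and `se_ratio` equal 12, `t_orig` equals `t_scaled`, and `s_orig$r.squared` equals `s_scaled$r.squared` (up to floating-point precision).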