class: center, middle, inverse, title-slide .title[ # Econometric Modeling ] .author[ ### AECN 396/896-002 ] --- class: middle layout: true --- <style type="text/css"> @media print { .has-continuation { display: block !important; } } .remark-slide-content.hljs-github h1 { margin-top: 5px; margin-bottom: 25px; } .remark-slide-content.hljs-github { padding-top: 10px; padding-left: 30px; padding-right: 30px; } .panel-tabs { <!-- color: #062A00; --> color: #841F27; margin-top: 0px; margin-bottom: 0px; margin-left: 0px; padding-bottom: 0px; } .panel-tab { margin-top: 0px; margin-bottom: 0px; margin-left: 3px; margin-right: 3px; padding-top: 0px; padding-bottom: 0px; } .panelset .panel-tabs .panel-tab { min-height: 40px; } .remark-slide th { border-bottom: 1px solid #ddd; } .remark-slide thead { border-bottom: 0px; } .gt_footnote { padding: 2px; } .remark-slide table { border-collapse: collapse; } .remark-slide tbody { border-bottom: 2px solid #666; } .important { background-color: lightpink; border: 2px solid blue; font-weight: bold; } .remark-code { display: block; overflow-x: auto; padding: .5em; background: #ffe7e7; } .hljs-github .hljs { background: #f2f2fd; } .remark-inline-code { padding-top: 0px; padding-bottom: 0px; background-color: #e6e6e6; } .r.hljs.remark-code.remark-inline-code{ font-size: 0.9em } .left-full { width: 80%; height: 92%; float: left; } .left-code { width: 38%; height: 92%; float: left; } .right-plot { width: 60%; float: right; padding-left: 1%; } .left5 { width: 49%; height: 92%; float: left; } .right5 { width: 49%; float: right; padding-left: 1%; } .left3 { width: 29%; height: 92%; float: left; } .right7 { width: 69%; float: right; padding-left: 1%; } .left4 { width: 38%; height: 92%; float: left; } .right6 { width: 60%; float: right; padding-left: 1%; } ul li{ margin: 7px; } ul, li{ margin-left: 15px; padding-left: 0px; } ol li{ margin: 7px; } ol, li{ margin-left: 15px; padding-left: 0px; } </style> <style type="text/css"> .content-box { 
box-sizing: border-box; background-color: #e2e2e2; } .content-box-blue, .content-box-gray, .content-box-grey, .content-box-army, .content-box-green, .content-box-purple, .content-box-red, .content-box-yellow { box-sizing: border-box; border-radius: 5px; margin: 0 0 10px; overflow: hidden; padding: 0px 5px 0px 5px; width: 100%; } .content-box-blue { background-color: #F0F8FF; } .content-box-gray { background-color: #e2e2e2; } .content-box-grey { background-color: #F5F5F5; } .content-box-army { background-color: #737a36; } .content-box-green { background-color: #d9edc2; } .content-box-purple { background-color: #e2e2f9; } .content-box-red { background-color: #ffcccc; } .content-box-yellow { background-color: #fef5c4; } .content-box-blue .remark-inline-code, .content-box-blue .remark-inline-code, .content-box-gray .remark-inline-code, .content-box-grey .remark-inline-code, .content-box-army .remark-inline-code, .content-box-green .remark-inline-code, .content-box-purple .remark-inline-code, .content-box-red .remark-inline-code, .content-box-yellow .remark-inline-code { background: none; } .full-width { display: flex; width: 100%; flex: 1 1 auto; } </style> <style type="text/css"> blockquote, .blockquote { display: block; margin-top: 0.1em; margin-bottom: 0.2em; margin-left: 5px; margin-right: 5px; border-left: solid 10px #0148A4; border-top: solid 2px #0148A4; border-bottom: solid 2px #0148A4; border-right: solid 2px #0148A4; box-shadow: 0 0 6px rgba(0,0,0,0.5); /* background-color: #e64626; */ color: #e64626; padding: 0.5em; -moz-border-radius: 5px; -webkit-border-radius: 5px; } .blockquote p { margin-top: 0px; margin-bottom: 5px; } .blockquote > h1:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h2:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h3:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h4:first-of-type { margin-top: 0px; margin-bottom: 5px; } .text-shadow { text-shadow: 0 0 4px #424242; } 
</style> <style type="text/css"> /****************** * Slide scrolling * (non-functional) * not sure if it is a good idea anyway slides > slide { overflow: scroll; padding: 5px 40px; } .scrollable-slide .remark-slide { height: 400px; overflow: scroll !important; } ******************/ .scroll-box-8 { height:8em; overflow-y: scroll; } .scroll-box-10 { height:10em; overflow-y: scroll; } .scroll-box-12 { height:12em; overflow-y: scroll; } .scroll-box-14 { height:14em; overflow-y: scroll; } .scroll-box-16 { height:16em; overflow-y: scroll; } .scroll-box-18 { height:18em; overflow-y: scroll; } .scroll-box-20 { height:20em; overflow-y: scroll; } .scroll-box-24 { height:24em; overflow-y: scroll; } .scroll-box-30 { height:30em; overflow-y: scroll; } .scroll-output { height: 90%; overflow-y: scroll; } </style> # Before we start ## Learning objectives 1. Enhance the understanding of the interpretation of various models 1. Post-estimation simulation ## Table of contents 1. [Expanding on Simple Models](#various-models) 2. [Interaction terms](#interaction) 3. [Categorical variable](#qualitative) 4. [R coding tips: categorical variables and interaction terms](#use-i) 5. 
[Other miscellaneous topics](#misc) --- class: inverse, center, middle name: various-models # More on functional forms <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html> --- # Various econometric models .content-box-red[**log-linear**] `\(log(y_i)= \beta_0+\beta_1 x_i + u_i\)` <br> .content-box-red[**linear-log**] `\(y_i= \beta_0+\beta_1 log(x_i) + u_i\)` <br> .content-box-red[**log-log**] `\(log(y_i)= \beta_0+\beta_1 log(x_i) + u_i\)` -- <br> .content-box-red[**quadratic**] `\(y_i= \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + u_i\)` --- # Quadratic .content-box-red[**Model**] `\(y_i= \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + u_i\)` -- <br> .content-box-red[**Calculus**] Differentiating both sides with respect to `\(x_i\)`, `\(\frac{\partial y_i}{\partial x_i} = \beta_1 + 2*\beta_2 x_i\Rightarrow \Delta y_i = (\beta_1 + 2*\beta_2 x_i)\Delta x_i\)` <br> .content-box-red[**Interpretation**] When `\(x\)` increases by 1 unit `\((\Delta x_i=1)\)`, `\(y\)` changes by `\(\beta_1 + 2*\beta_2 x_i\)` --- # Visualization The quadratic functional form is quite flexible. .left5[ `\(y = x + x^2\)` `\((\beta_1 = 1, \beta_2 = 1)\)` <img src="data:image/png;base64,#modeling_x_files/figure-html/unnamed-chunk-4-1.png" width="90%" height="80%" style="display: block; margin: auto;" /> ] .right5[ `\(y = 3x-2x^2\)` `\((\beta_1 = 3, \beta_2 = -2)\)` <img src="data:image/png;base64,#modeling_x_files/figure-html/unnamed-chunk-5-1.png" width="90%" height="80%" style="display: block; margin: auto;" /> ] --- # Example .content-box-red[**Impact of education on income**] The marginal impact of education (the impact of a small change in education on income) may differ depending on what level of education you have had: + How much does it help to have two more years of education when you have only completed elementary school? + How much does it help to have two more years of education when you have graduated from college?
+ How much does it help to spend two more years as a Ph.D. student if you have already spent six years in a Ph.D. program? -- <br> .content-box-green[**Observation**] The marginal impact of education does not seem to be constant. --- # Implementation in R When you include a variable that is a transformation of an existing variable, use the `I()` function, in which you write the mathematical expression of the desired transformation. ```r #--- prepare a dataset ---# data("wage1", package = "wooldridge") #--- run a regression ---# quad_reg <- fixest::feols(wage ~ female + educ + I(educ^2), data = wage1) #--- look at the results ---# broom::tidy(quad_reg) ``` ``` ## # A tibble: 4 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 5.61 1.38 4.05 5.91e- 5 ## 2 female -2.13 0.277 -7.67 8.50e-14 ## 3 educ -0.416 0.231 -1.81 7.14e- 2 ## 4 I(educ^2) 0.0395 0.00964 4.10 4.80e- 5 ``` --- .content-box-red[**Estimated Model**] `\(wage = 5.61 - 2.13\times female -0.416\times educ + 0.039\times educ^2\)` <img src="data:image/png;base64,#modeling_x_files/figure-html/unnamed-chunk-7-1.png" width="60%" style="display: block; margin: auto;" /> --- .content-box-red[**Estimated Model**] `\(wage = 5.61 - 2.13\times female -0.416\times educ + 0.039\times educ^2\)` <br> -- .content-box-green[**Problem**] What is the marginal impact of `\(educ\)`?
`\(\frac{\partial wage}{\partial educ} = ?\)` -- <br> .content-box-green[**Answer**] `\(\frac{\partial wage}{\partial educ} = -0.416+0.039\times 2\times educ\)` -- + When `\(educ = 4\)`, an additional year of education changes hourly wage by `\(-0.104\)` on average (a decrease) -- + When `\(educ = 10\)`, an additional year of education increases hourly wage by `\(0.364\)` on average --- # Statistical significance of the marginal impact The marginal impact of `\(educ\)` is: `\(\frac{\partial wage}{\partial educ} = -0.416+0.039\times 2\times educ\)` + `\(educ\)`: `\(-0.416\)` `\((t\)`-stat `\(= -1.81)\)` + `\(educ^2\)`: `\(0.039\)` `\((t\)`-stat `\(= 4.10)\)` -- <br> .content-box-green[**Question**] So, is the marginal impact of `\(educ\)` statistically significantly different from `\(0\)`? --- # In the linear case ```r linear_reg <- fixest::feols(wage ~ female + educ, data = wage1) broom::tidy(linear_reg) ``` ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 0.623 0.673 0.926 3.55e- 1 ## 2 female -2.27 0.279 -8.15 2.76e-15 ## 3 educ 0.506 0.0504 10.1 7.56e-22 ``` -- <br> .content-box-red[**Estimated model**] `\(wage = 0.62 - 2.27\times female + 0.51 \times educ\)` --- class: middle .content-box-red[**Estimated model**] `\(wage = 0.62 - 2.27\times female + 0.51 \times educ\)` -- <br> .content-box-green[**Question**] + What is the marginal impact of `\(educ\)`? -- 0.51 -- + Does the marginal impact of education vary depending on the level of education? -- No, the model we estimated assumes that the marginal impact of education is constant. -- <br> .content-box-red[**Testing**] You can just test if `\(\hat{\beta}_{educ}\)` (the marginal impact of education) is statistically significantly different from `\(0\)`, which is just a t-test.
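The quadratic marginal-effect arithmetic shown earlier (`\(-0.104\)` at `\(educ = 4\)`, `\(0.364\)` at `\(educ = 10\)`) is easy to check directly. A minimal base-R sketch, with the two coefficients hard-coded from the `quad_reg` output above (in practice you would pull them with `coef(quad_reg)`):

```r
# marginal effect of educ in the quadratic model: b_educ + 2 * b_educ2 * educ
# (coefficients hard-coded from the quad_reg output above)
b_educ <- -0.416
b_educ2 <- 0.039
marginal_educ <- function(educ) b_educ + 2 * b_educ2 * educ

marginal_educ(4)  # -0.104, matching the educ = 4 case above
marginal_educ(10) # 0.364, matching the educ = 10 case above
```

The function also shows where the marginal effect turns positive: solving `\(\beta_1 + 2\beta_2 educ = 0\)` gives `\(educ \approx 5.3\)`.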
--- # Going back to the quadratic case With the quadratic specification + The marginal impact of education varies depending on your education level -- + There is no single test that tells you whether the marginal impact of education is statistically significant universally -- + Indeed, you need different tests for different education levels --- # Example 1 .content-box-red[**Marginal impact of education**] `\(\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 2 \times educ\)` -- <br> .content-box-red[**Hypothesis testing**] Does an additional year of education have a statistically significant impact (positive or negative) if your current education level is 4? + `\(H_0\)`: `\(\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 2 \times 4 =0\)` + `\(H_1\)`: `\(\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 2 \times 4 \ne 0\)` -- <br> .content-box-green[**Question**] Is this + test of a single coefficient? (t-test) + test of a single equation with multiple coefficients? (t-test) + test of multiple equations with multiple coefficients? (F-test) --- .content-box-red[**t-statistic**] `\(t = \frac{\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 2 \times 4}{se(\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 2 \times 4)} = \frac{\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 8}{se(\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 8)}\)` -- <br> .content-box-red[**R implementation**] Remember, a trick to do this test using R is to take advantage of the fact that `\(F_{1, n-k-1} \sim t_{n-k-1}^2\)`. ```r car::linearHypothesis(quad_reg, "educ + 8*I(educ^2)=0") ``` ``` ## Linear hypothesis test ## ## Hypothesis: ## educ + 8 I(educ^2) = 0 ## ## Model 1: restricted model ## Model 2: wage ~ female + educ + I(educ^2) ## ## Df Chisq Pr(>Chisq) ## 1 ## 2 1 0.4126 0.5207 ``` -- Since the p-value is 0.5207, we do not reject the null.
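The `\(F_{1, n-k-1} \sim t_{n-k-1}^2\)` trick mentioned above can be verified numerically with base R's distribution functions. A small sketch (the `\(t\)` value is illustrative; `\(0.64^2\)` is roughly the `\(0.41\)` statistic reported above):

```r
# the two-sided p-value of a t(df) statistic equals the upper-tail
# p-value of its square under F(1, df)
t_val <- 0.64 # illustrative t value
df <- 522     # n - k - 1 for the wage1 regression: 526 - 3 - 1
p_t <- 2 * pt(-abs(t_val), df)
p_f <- pf(t_val^2, df1 = 1, df2 = df, lower.tail = FALSE)
all.equal(p_t, p_f) # TRUE
```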
--- class: middle # Example 2 .content-box-red[**Marginal impact of education**] `\(\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 2 \times educ\)` -- <br> .content-box-red[**Hypothesis testing**] Does an additional year of education have a statistically significant impact (positive or negative) if your current education level is 10? + `\(H_0\)`: `\(\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 2 \times 10 =0\)` + `\(H_1\)`: `\(\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 2 \times 10 \ne 0\)` -- <br> .content-box-green[**Question**] Is this + test of a single coefficient? (t-test) + test of a single equation with multiple coefficients? (t-test) + test of multiple equations with multiple coefficients? (F-test) --- .content-box-red[**t-statistic**] `\(t = \frac{\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 2 \times 10}{se(\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 2 \times 10)} = \frac{\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 20}{se(\hat{\beta}_{educ} + \hat{\beta}_{educ^2} \times 20)}\)` -- <br> .content-box-red[**R implementation**] ```r car::linearHypothesis(quad_reg, "educ + 20*I(educ^2)=0") ``` ``` ## Linear hypothesis test ## ## Hypothesis: ## educ + 20 I(educ^2) = 0 ## ## Model 1: restricted model ## Model 2: wage ~ female + educ + I(educ^2) ## ## Df Chisq Pr(>Chisq) ## 1 ## 2 1 39.831 2.769e-10 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` -- Since the p-value is much lower than 0.01, we can reject the null at the 1% level.
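Behind `car::linearHypothesis()` is the variance of a linear combination, `\(Var(c'\hat{\beta}) = c'Vc\)`. A base-R sketch of the t-statistic computation: the standard errors below are taken from the regression output above, but the covariance term is a made-up illustrative number (in practice all three pieces would come from `vcov(quad_reg)`):

```r
# t-statistic for b_educ + 20 * b_educ2 = 0 via t = c'beta / sqrt(c'Vc)
beta <- c(-0.416, 0.039)            # estimates from quad_reg above
V <- matrix(c(0.231^2,   -0.0021,   # -0.0021 is a HYPOTHETICAL covariance,
              -0.0021, 0.00964^2),  # not the actual vcov(quad_reg) entry
            nrow = 2)
cvec <- c(1, 20)                    # coefficients of the linear combination
t_stat <- sum(cvec * beta) / sqrt(drop(t(cvec) %*% V %*% cvec))
```

Because the covariance entry here is hypothetical, `t_stat` will not reproduce the chi-square statistic above; the point is only the `\(c'Vc\)` mechanics.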
--- class: inverse, center, middle name: interaction # Interaction terms <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html> --- class: middle .content-box-red[**An interaction term**] A variable that is a multiplication of two variables -- <br> .content-box-red[**Example**] `\(educ\times exper\)` --- <br> .content-box-red[**A model with an interaction term**] `\(wage = \beta_0 + \beta_1 exper + \beta_2 educ \times exper + u\)` -- <br> .content-box-red[**Marginal impact of experience**] `\(\frac{\partial wage}{\partial exper} = \beta_1+\beta_2\times educ\)` -- <br> .content-box-red[**Implications**] The marginal impact of experience depends on education + `\(\beta_1\)`: the marginal impact of experience when `\(educ=?\)` + if `\(\beta_2>0\)`: additional year of experience is worth more when you have more years of education --- # Regression with interaction terms Just like the quadratic case with `\(educ^2\)`, you can use `I()`. ```r reg_int <- fixest::feols(wage ~ female + exper + I(exper * educ), data = wage1) ``` -- <table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;">  (1) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 6.121*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.267) </td> </tr> <tr> <td style="text-align:left;"> female </td> <td style="text-align:center;"> −2.418*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.277) </td> </tr> <tr> <td style="text-align:left;"> exper </td> <td style="text-align:center;"> −0.188*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.024) </td> </tr> <tr> <td style="text-align:left;"> I(exper * educ) </td> <td style="text-align:center;"> 0.020*** </td> </tr> <tr> 
<td style="text-align:left;box-shadow: 0px 1.5px"> </td> <td style="text-align:center;box-shadow: 0px 1.5px"> (0.002) </td> </tr> <tr> <td style="text-align:left;"> Std.Errors </td> <td style="text-align:center;"> IID </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table> --- class: middle .content-box-red[**Estimated Model**] `\(wage = 6.121 - 2.418 \times female - 0.188 \times exper + 0.020 \times educ \times exper\)` <br> .content-box-red[**Marginal impact of experience**] `\(\frac{\partial wage}{\partial exper} = - 0.188 + 0.020 \times educ\)` --- class: middle <br> <br> .left5[ Marginal impact of `\(exper\)`: <img src="data:image/png;base64,#modeling_x_files/figure-html/unnamed-chunk-14-1.png" width="90%" style="display: block; margin: auto;" /> ] .right5[ Histogram of education: <img src="data:image/png;base64,#modeling_x_files/figure-html/unnamed-chunk-15-1.png" width="90%" style="display: block; margin: auto;" /> ] --- class: middle .content-box-red[**Test of marginal impacts**] + Just like the case of the quadratic specification of education, the marginal impact of experience is not constant + We can test if the marginal impact of experience is statistically significant for a given level of education * When `\(educ=10\)`, `\(\frac{\partial wage}{\partial exper} = - 0.188 + 0.020 \times 10=0.012\)` * When `\(educ=15\)`, `\(\frac{\partial wage}{\partial exper} = - 0.188 + 0.020 \times 15=0.112\)` --- class: middle .content-box-red[**Question**] Does an additional year of experience have a statistically significant impact (positive or negative) if your current education level is 10? <br> .content-box-red[**Hypothesis**] + `\(H_0\)`: `\(\hat{\beta}_{exper} + \hat{\beta}_{exper\_educ} \times 10=0\)` + `\(H_1\)`: `\(\hat{\beta}_{exper} + \hat{\beta}_{exper\_educ} \times 10\ne 0\)` --- class: middle .content-box-red[**R implementation**] ```r
car::linearHypothesis(reg_int, "exper+10*I(exper * educ)=0") ``` ``` ## Linear hypothesis test ## ## Hypothesis: ## exper+10*I(exper * educ) = 0 ## ## Model 1: restricted model ## Model 2: wage ~ female + exper + I(exper * educ) ## ## Df Chisq Pr(>Chisq) ## 1 ## 2 1 2.4627 0.1166 ``` Since the p-value is 0.1166, we do not reject the null at the 5% level. --- class: inverse, center, middle name: qualitative # Including qualitative information <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html> --- # Qualitative information .content-box-green[**Issue**] How do we include qualitative information as an independent variable? -- <br> .content-box-green[**Examples**] + male or female (binary) + married or single (binary) + high school, college, master's, or Ph.D. (more than two states) --- # Binary variables .content-box-red[**Dummy variable**] + Relevant information in binary variables can be captured by a .red[zero-one] variable that takes the value of `\(1\)` for one state and `\(0\)` for the other state + We use "dummy variable" to refer to a binary (zero-one) variable <br> .content-box-red[**Example**] ```r dplyr::select(wage1, wage, educ, exper, female, married) %>% head() ``` ``` ## wage educ exper female married ## 1 3.10 11 2 1 0 ## 2 3.24 12 22 1 1 ## 3 3.00 11 2 0 0 ## 4 6.00 8 44 0 1 ## 5 5.30 12 7 0 1 ## 6 8.75 16 9 0 1 ``` --- class: middle .content-box-red[**Model with a dummy variable**] `\(wage = \beta_0 +\sigma_f female +\beta_2 educ + u\)` <br> .content-box-red[**Interpretation**] + `female`: `\(E[wage|female=1,educ] = \beta_0 + \sigma_f +\beta_2 educ\)` + `male`: `\(E[wage|female=0,educ] = \beta_0 + \beta_2 educ\)` -- This means that `\(\sigma_f = E[wage|female=1,educ]-E[wage|female=0,educ]\)` --- class: middle `\(\sigma_f = E[wage|female=1,educ]-E[wage|female=0,educ]\)` Verbally, + `\(\sigma_f\)` is the difference in the expected wage conditional on education between female and male + `\(\sigma_f\)` measures how much more (less) female workers make compared to male workers (.blue[baseline])
if they were to have the same education level --- class: middle .content-box-red[**Regression with a dummy variable**] ```r reg_df <- fixest::feols(wage ~ female + educ, data = wage1) reg_df ``` ``` ## OLS estimation, Dep. Var.: wage ## Observations: 526 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.622817 0.672533 0.926076 3.5483e-01 ## female -2.273362 0.279044 -8.146954 2.7642e-15 *** ## educ 0.506452 0.050391 10.050520 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 3.17642 Adj. R2: 0.255985 ``` <br> .content-box-red[**Interpretation**] Female workers make 2.27 ($/hour) less than male workers on average, even when they have the same education level. --- .content-box-red[**Visualization of the estimated model**] <img src="data:image/png;base64,#modeling_x_files/figure-html/unnamed-chunk-19-1.png" width="60%" style="display: block; margin: auto;" /> --- class: middle .content-box-red[**Model with a dummy variable**] `\(wage = \beta_0 +\sigma_m male +\beta_2 educ + u\)` <br> .content-box-red[**Interpretation**] + `male`: `\(E[wage|male = 1,educ] = \beta_0 + \sigma_m +\beta_2 educ\)` + `female`: `\(E[wage|male = 0,educ] = \beta_0 + \beta_2 educ\)` -- This means that `\(\sigma_m = E[wage|male=1,educ]-E[wage|male=0,educ]\)` --- class: middle `\(\sigma_m = E[wage|male=1,educ]-E[wage|male=0,educ]\)` Verbally, + `\(\sigma_m\)` is the difference in the expected wage conditional on education between male and female + `\(\sigma_m\)` measures how much more (less) male workers make compared to female workers (.blue[baseline]) if they were to have the same education level .red[Important]: whichever status is given the value of `\(0\)` becomes the baseline --- class: middle .content-box-red[**Regression with a dummy variable**] ```r wage1 <- dplyr::mutate(wage1, male = 1 - female) reg_df <- fixest::feols(wage ~ male + educ, data = wage1) reg_df ``` ``` ## OLS estimation, Dep.
Var.: wage ## Observations: 526 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1.650545 0.652317 -2.53028 1.1689e-02 * ## male 2.273362 0.279044 8.14695 2.7642e-15 *** ## educ 0.506452 0.050391 10.05052 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 3.17642 Adj. R2: 0.255985 ``` <br> .content-box-red[**Interpretation**] Male workers make 2.27 ($/hour) more than female workers on average, even when they have the same education level. --- class: middle .content-box-red[**Question**] What do you think will happen if we include both male and female dummy variables? -- <br> .content-box-red[**Answer**] + They contain redundant information + Indeed, including both of them along with the intercept would cause a .blue[perfect collinearity] problem + So, you .blue[need to] drop either one of them -- <br> .content-box-red[**Perfect Collinearity**] intercept = male + female --- class: middle Here is what happens if you include both: ```r reg_dmf <- fixest::feols(wage ~ male + female + educ, data = wage1) reg_dmf ``` ``` ## OLS estimation, Dep. Var.: wage ## Observations: 526 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1.650545 0.652317 -2.53028 1.1689e-02 * ## male 2.273362 0.279044 8.14695 2.7642e-15 *** ## educ 0.506452 0.050391 10.05052 < 2.2e-16 *** *## ... 1 variable was removed because of collinearity (female) ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 3.17642 Adj. R2: 0.255985 ``` --- class: middle # Interactions with a dummy variable .content-box-green[**Issue**] + In the previous example, the impact of education on wage was modeled to be exactly the same for male and female workers + Can we build a more flexible model that allows us to estimate the differential impacts of education on wage between male and female?
--- class: middle .content-box-red[**A more flexible model**] `\(wage = \beta_0 + \sigma_f female +\beta_2 educ + \gamma female\times educ + u\)` + [female]: `\(E[wage|female=1,educ] = \beta_0 + \sigma_f +(\beta_2+\gamma) educ\)` + [male]: `\(E[wage|female=0,educ] = \beta_0 + \beta_2 educ\)` <br> .content-box-red[**Interpretation**] For female workers, education is more effective by `\(\gamma\)` than it is for male workers. --- .content-box-red[**Example using R**] ```r reg_di <- fixest::feols(wage ~ female + educ + I(female * educ), data = wage1) reg_di ``` ``` ## OLS estimation, Dep. Var.: wage ## Observations: 526 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.200496 0.843562 0.237678 8.1222e-01 ## female -1.198523 1.325040 -0.904518 3.6614e-01 ## educ 0.539476 0.064223 8.400054 4.2437e-16 *** ## I(female * educ) -0.085999 0.103639 -0.829795 4.0703e-01 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 3.17433 Adj. R2: 0.255542 ``` -- <br> .content-box-red[**Interpretation**] The marginal benefit of education is 0.086 ($/hour) less for female workers than for male workers on average (though the interaction term is not statistically significant here). --- <img src="data:image/png;base64,#modeling_x_files/figure-html/unnamed-chunk-24-1.png" width="80%" style="display: block; margin: auto;" /> --- # Categorical variable: more than two states .content-box-green[**Issue**] + Consider a variable called `\(degree\)` which takes three values: college, master, and doctor. + Unlike a binary variable, there are more than two categories. + How do we include a categorical variable like this in a model?
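Before turning to the answer, it is worth seeing how R itself expands a categorical variable. Base R's `model.matrix()` shows the dummy columns a factor generates; a sketch with the three-level `degree` variable:

```r
# a k-level factor is expanded into (k - 1) dummy columns;
# the omitted first level ("college") is absorbed into the intercept
degree <- factor(c("college", "master", "doctor", "college", "doctor"),
                 levels = c("college", "master", "doctor"))
X <- model.matrix(~ degree)
colnames(X) # "(Intercept)" "degreemaster" "degreedoctor"
```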
--- .content-box-green[**What do we do about this?**] You can create three dummy variables like below: + `college`: 1 if the highest degree is college, 0 otherwise + `master`: 1 if the highest degree is Master's, 0 otherwise + `doctor`: 1 if the highest degree is Ph.D., 0 otherwise -- You then include two (the number of categories minus one) of the three dummy variables: --- .content-box-red[**Model**] `\(wage = \beta_0 + \sigma_m master +\sigma_d doctor + \beta_1 educ + u\)` -- + [college]: `\(E[wage|master=0, doctor = 0, educ] = \beta_0 + \beta_1 educ\)` -- + [master]: `\(E[wage|master=1, doctor = 0, educ] = \beta_0 + \sigma_m + \beta_1 educ\)` -- + [doctor]: `\(E[wage|master=0, doctor = 1, educ] = \beta_0 + \sigma_d + \beta_1 educ\)` -- <br> .content-box-red[**Interpretation**] `\(\sigma_m\)`: the impact of having an MS degree .red[relative to] having a .red[college degree] `\(\sigma_d\)`: the impact of having a Ph.D. degree .red[relative to] having a .red[college degree] -- <br> .content-box-red[**Important**] The omitted category (here, `college`) becomes the baseline. --- # Structural differences across groups .content-box-red[**Definition**] Structural difference refers to fundamental differences in the model of a phenomenon in the population. --- .content-box-red[**Example**] .blue[Male]: `\(cumgpa = \alpha_0 + \alpha_1 sat + \alpha_2 hsperc + \alpha_3 tothrs + u\)` .blue[Female]: `\(cumgpa = \beta_0 + \beta_1 sat + \beta_2 hsperc + \beta_3 tothrs + u\)` + `\(cumgpa\)`: college grade point average for male and female college athletes + `\(sat\)`: SAT score + `\(hsperc\)`: high school rank percentile + `\(tothrs\)`: total hours of college courses -- <br> .content-box-red[**In this example,**] `\(cumgpa\)` is determined in a fundamentally different manner between female and male students. You do not want to run a single regression that fits a single model for both female and male students.
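Running separate regressions by group turns out to be equivalent to a single regression in which a group dummy is interacted with every regressor, which is the strategy developed next. A simulated base-R sketch of this equivalence (the variable names are illustrative, not the `gpa3` data):

```r
# a fully interacted model reproduces the group-by-group regressions
set.seed(1)
n <- 200
female <- rep(0:1, each = n / 2)
sat <- rnorm(n, mean = 1000, sd = 100)
cumgpa <- 2 + 0.001 * sat + female * (0.3 + 0.0002 * sat) + rnorm(n, sd = 0.3)

full <- lm(cumgpa ~ female * sat) # intercept, female, sat, female:sat
male_fit <- lm(cumgpa ~ sat, subset = female == 0)

# the male-group coefficients equal the non-interacted terms of the full model
stopifnot(all.equal(unname(coef(full)[c("(Intercept)", "sat")]),
                    unname(coef(male_fit))))
```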
--- .content-box-red[**What to do?**] If you suspect that the underlying process of how the dependent variable is determined varies across groups, then you should test that hypothesis! <br> .content-box-red[**To do so,**] You estimate a model that allows separate models across groups to be estimated within a single regression analysis. `$$cumgpa = \beta_0 + \sigma_0 female + \beta_1 sat + \sigma_1 (sat \times female)$$` `$$\;\; + \beta_2 hsperc + \sigma_2 (hsperc \times female)$$` `$$\qquad + \beta_3 tothrs + \sigma_3 (tothrs \times female) + u$$` --- .content-box-red[**The flexible model**] `$$cumgpa = \beta_0 + \sigma_0 female + \beta_1 sat + \sigma_1 (sat \times female)$$` `$$\;\; + \beta_2 hsperc + \sigma_2 (hsperc \times female)$$` `$$\qquad + \beta_3 tothrs + \sigma_3 (tothrs \times female) + u$$` <br> .content-box-green[**Male**] `\(E[cumgpa] = \beta_0 + \beta_1 sat + \beta_2 hsperc + \beta_3 tothrs\)` <br> .content-box-green[**Female**] `\(E[cumgpa] = (\beta_0 +\sigma_0) + (\beta_1+\sigma_1) sat + (\beta_2+\sigma_2) hsperc + (\beta_3+\sigma_3) tothrs\)` <br> .content-box-red[**Interpretation**] + `\(\beta\)`s are commonly shared by female and male students + `\(\sigma\)`s capture the differences between female and male students --- .content-box-red[**Null Hypothesis (verbal)**] The models of GPA for male and female students are not structurally different. <br> .content-box-red[**Null Hypothesis**] `\(H_0: \;\; \sigma_0=0,\;\; \sigma_1=0, \;\; \sigma_2=0, \;\; \mbox{and} \;\; \sigma_3=0\)` <br> .content-box-green[**Question**] What test do we do? t-test or F-test?
--- class: middle .content-box-red[**R code**] Run the unrestricted model with all the interaction terms: ```r data("gpa3", package = "wooldridge") gpa <- gpa3 %>% dplyr::filter(!is.na(ctothrs)) %>% #--- create interaction terms ---# dplyr::mutate( female_sat := female * sat, female_hsperc := female * hsperc, female_tothrs := female * tothrs ) #--- regression with female dummy ---# reg_full <- fixest::feols( cumgpa ~ female + sat + female_sat + hsperc + female_hsperc + tothrs + female_tothrs, data = gpa ) ``` --- class: middle .left5[ .content-box-red[**What do you see?**] + None of the variables that involve `\(female\)` are statistically significant at the 5% level individually. + Does this mean that `\(male\)` and `\(female\)` students have the same regression function? + No, we are testing the joint significance of the coefficients. We need to do an `\(F\)`-test! ] .right5[ <table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;">  (1) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 1.481*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.207) </td> </tr> <tr> <td style="text-align:left;"> female </td> <td style="text-align:center;"> −0.353 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.411) </td> </tr> <tr> <td style="text-align:left;"> sat </td> <td style="text-align:center;"> 0.001*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.000) </td> </tr> <tr> <td style="text-align:left;"> female_sat </td> <td style="text-align:center;"> 0.001+ </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.000) </td> </tr> <tr> <td style="text-align:left;"> hsperc </td> <td style="text-align:center;"> −0.008*** </td> </tr> 
<tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.001) </td> </tr> <tr> <td style="text-align:left;"> female_hsperc </td> <td style="text-align:center;"> −0.001 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.003) </td> </tr> <tr> <td style="text-align:left;"> tothrs </td> <td style="text-align:center;"> 0.002** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.001) </td> </tr> <tr> <td style="text-align:left;"> female_tothrs </td> <td style="text-align:center;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.002) </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table> ] --- ```r car::linearHypothesis( reg_full, c( "female = 0", "female_hsperc = 0", "female_sat = 0", "female_tothrs = 0" ) ) ``` ``` ## Linear hypothesis test ## ## Hypothesis: ## female = 0 ## female_hsperc = 0 ## female_sat = 0 ## female_tothrs = 0 ## ## Model 1: restricted model ## Model 2: cumgpa ~ female + sat + female_sat + hsperc + female_hsperc + ## tothrs + female_tothrs ## ## Df Chisq Pr(>Chisq) ## 1 ## 2 4 32.716 1.365e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 ``` --- class: inverse, center, middle name: use-i # R coding tips: categorical variables and interaction terms <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html> --- class: middle # R coding tips: categorical variables and interaction terms ```r #* load the package to access the data we want library(wooldridge) #* get big9salary data("big9salary") #* create a variable that indicates university #* this is how the data would look most of the time (instead of having a bunch of dummy variables) big9salary_c <- tibble::as_tibble(big9salary) %>% dplyr::mutate( university = case_when( osu == 1 ~ "Ohio State U", iowa == 1 ~ "U of Iowa", indiana == 1 ~ "Indiana U", purdue == 1 ~ "Purdue U", msu == 1 ~ "Michigan State U", mich == 1 ~ "Michigan U", wisc == 1 ~ "U of Wisconsin", illinois == 1 ~ "U of Illinois" ) ) %>% dplyr::relocate(id, year, salary, pubindx, university) ``` --- class: middle Take a look at the data: ```r head(big9salary_c) ``` ``` ## # A tibble: 6 × 31 ## id year salary pubindx university totpge assist assoc prof chair top20phd yearphd female osu iowa indiana purdue msu minn mich wisc illinois y92 y95 y99 lsalary exper expersq pubindxsq pubindx0 lpubindx ## <int> <int> <int> <dbl> <chr> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <dbl> <dbl> ## 1 101 92 NA 30.5 Indiana U 92.7 0 0 1 0 0 73 0 0 0 1 0 0 0 0 0 0 1 0 0 NA 19 361 933. 0 3.42 ## 2 101 95 NA 31.0 Indiana U 107. 0 0 1 0 0 73 0 0 0 1 0 0 0 0 0 0 0 1 0 NA 22 484 959. 0 3.43 ## 3 101 99 107100 40.5 Indiana U 186. 0 0 1 0 0 73 0 0 0 1 0 0 0 0 0 0 0 0 1 11.6 26 676 1636. 0 3.70 ## 4 102 92 79420 33.5 Indiana U 128. 0 0 1 0 0 76 0 0 0 1 0 0 0 0 0 0 1 0 0 11.3 16 256 1125. 0 3.51 ## 5 102 95 88239 33.9 Indiana U 133 0 0 1 0 0 76 0 0 0 1 0 0 0 0 0 0 0 1 0 11.4 19 361 1149. 0 3.52 ## 6 102 99 100450 36.2 Indiana U 192. 0 0 1 0 0 76 0 0 0 1 0 0 0 0 0 0 0 0 1 11.5 23 529 1313.
0 3.59
```

```r
tail(big9salary_c)
```

```
## # A tibble: 6 × 31
## id year salary pubindx university totpge assist assoc prof chair top20phd yearphd female osu iowa indiana purdue msu minn mich wisc illinois y92 y95 y99 lsalary exper expersq pubindxsq pubindx0 lpubindx
## <int> <int> <int> <dbl> <chr> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <dbl> <dbl>
## 1 932 92 90856 72.7 U of Wisconsin 269. 0 0 1 0 1 73 1 0 0 0 0 0 0 0 1 0 1 0 0 11.4 19 361 5287. 0 4.29
## 2 932 95 110090 73.5 U of Wisconsin 294 0 0 1 0 1 73 1 0 0 0 0 0 0 0 1 0 0 1 0 11.6 22 484 5396. 0 4.30
## 3 932 99 122397 75.2 U of Wisconsin 315 0 0 1 0 1 73 1 0 0 0 0 0 0 0 1 0 0 0 1 11.7 26 676 5649. 0 4.32
## 4 933 92 45755 2.19 U of Wisconsin 9.5 1 0 0 0 1 91 0 0 0 0 0 0 0 0 1 0 1 0 0 10.7 1 1 4.80 0 0.784
## 5 933 95 51846 8.11 U of Wisconsin 88 1 0 0 0 1 92 0 0 0 0 0 0 0 0 1 0 0 1 0 10.9 3 9 65.8 0 2.09
## 6 933 99 69630 59.5 U of Wisconsin 208. 0 1 0 0 1 93 0 0 0 0 0 0 0 0 1 0 0 0 1 11.2 6 36 3534. 0 4.09
```

---
class: middle

You can use the `i()` function inside `fixest::feols()` like below:

```r
fixest::feols(
  salary ~ pubindx + female + i(university, ref = "Indiana U"),
  data = big9salary_c
) %>%
  broom::tidy()
```

```
## # A tibble: 10 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 74544. 3001. 24.8 9.34e-94
## 2 pubindx 346. 26.6 13.0 3.18e-34
## 3 female -5877. 3067. -1.92 5.59e- 2
*## 4 university::Michigan State U -9188. 3631. -2.53 1.17e- 2
## 5 university::Michigan U -11561. 3833. -3.02 2.67e- 3
## 6 university::Ohio State U -4707. 3790. -1.24 2.15e- 1
## 7 university::Purdue U -10517. 4310. -2.44 1.50e- 2
## 8 university::U of Illinois -1809. 3686. -0.491 6.24e- 1
## 9 university::U of Iowa -519. 3951. -0.131 8.95e- 1
## 10 university::U of Wisconsin -6840. 4186. -1.63 1.03e- 1
```

`ref = "Indiana U"` sets the base category to `"Indiana U"`. So, for example, the highlighted line means that faculty members at Michigan State U make `\(9,188\)` USD less annually than those at Indiana U.

<br>

.content-box-green[**Key**] You do not have to make a bunch of dummy variables like in the original dataset. Just use `i(category_variable)`.

---
class: middle

# Interaction terms

You can use `i()` for creating interactions of a categorical variable and a continuous variable. Suppose you are interested in how the impact of `pubindx` (continuous) varies by `university` (categorical), then

```r
fixest::feols(
  salary ~ female + pubindx + i(university, ref = "Indiana U") +
    i(university, pubindx, ref = "Indiana U"),
  data = big9salary_c
) %>%
  broom::tidy()
```

```
## # A tibble: 17 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 79593. 4267. 18.7 3.02e-61
## 2 female -3782. 3113. -1.21 2.25e- 1
## 3 pubindx 42.5 172. 0.247 8.05e- 1
## 4 university::Michigan State U -17995. 5190. -3.47 5.65e- 4
## 5 university::Michigan U -13162. 5577. -2.36 1.86e- 2
## 6 university::Ohio State U -10073. 5633. -1.79 7.42e- 2
## 7 university::Purdue U -19022. 6291. -3.02 2.61e- 3
## 8 university::U of Illinois -12818. 5568. -2.30 2.17e- 2
## 9 university::U of Iowa -11785. 5510. -2.14 3.29e- 2
## 10 university::U of Wisconsin -8197. 6132. -1.34 1.82e- 1
*## 11 university::Michigan State U:pubindx 436. 191. 2.29 2.25e- 2
## 12 university::Michigan U:pubindx 253. 177. 1.43 1.54e- 1
## 13 university::Ohio State U:pubindx 305. 185. 1.65 9.96e- 2
## 14 university::Purdue U:pubindx 422. 212. 2.00 4.65e- 2
## 15 university::U of Illinois:pubindx 594. 225. 2.64 8.44e- 3
## 16 university::U of Iowa:pubindx 588. 206. 2.85 4.50e- 3
## 17 university::U of Wisconsin:pubindx 247. 180. 1.37 1.70e- 1
```

So, the marginal impact of `pubindx` is `\(436\)` greater for those at Michigan State U than for those at Indiana U.
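---
class: middle

To see why an interaction coefficient is a *difference* in slopes, here is a self-contained base-R sketch (the `toy` data, group names, and true slopes are made up for illustration; it does not use `big9salary`):

```r
#* simulate two groups whose slopes on x differ
set.seed(42)
toy <- data.frame(
  group = rep(c("A", "B"), each = 50),
  x = rnorm(100)
)

#* true slopes: 2 for group A (the base category), 2 + 3 = 5 for group B
toy$y <- ifelse(toy$group == "B", 5, 2) * toy$x + rnorm(100, sd = 0.1)

#* the formula interface creates the dummy and the interaction for you
fit <- lm(y ~ x * group, data = toy)

#* slope of x for the base group A: the coefficient on x
slope_A <- unname(coef(fit)["x"])

#* slope of x for group B: base slope plus the interaction coefficient
slope_B <- unname(coef(fit)["x"] + coef(fit)["x:groupB"])
```

`slope_A` comes out near 2 and `slope_B` near 5: the interaction coefficient (`x:groupB`) is the *extra* slope group B has relative to the base group, which is exactly how `university::Michigan State U:pubindx` is read above.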
---
class: inverse, center, middle
name: misc

# Other miscellaneous topics

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html>

---

# Goodness of fit: `\(R^2\)`

.content-box-red[**Important**] A small value of `\(R^2\)` does not mean the end of the world (in fact, we could not care less about it.)

---

.content-box-green[**Example**]

`$$ecolabs = \beta_0 + \beta_1 regprc + \beta_2 ecoprc + u$$`

+ `\(ecolabs\)`: the (hypothetical) pounds of ecologically friendly (ecolabeled) apples a family would demand
+ `\(regprc\)`: prices of regular apples
+ `\(ecoprc\)`: prices of the hypothetical ecolabeled apples

<br>

.content-box-red[**Key**]

+ The data was obtained via a survey, and `\(ecoprc\)` was set randomly by the researcher (so, we know `\(E[u|x] = 0\)`).
+ The (only) objective of the study is to understand the impact of the price of ecolabeled apples on the demand for ecolabeled apples.

---

Suppose you are challenged by somebody who claims that your regression is not .blue[good] because the `\(R^2\)` is tiny. How would you respond to this criticism?

---
class: inverse, center, middle
name: review

# Scaling

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html>

---

.content-box-green[**Questions**]

What happens if you scale up/down variables used in regression?
+ coefficients
+ standard errors
+ t-statistics
+ `\(R^2\)`

---

```r
#--- regression with original scale ---#
reg_no_scale <- fixest::feols(wage ~ female + educ, data = wage1)

#--- regression with scaled educ ---#
reg_scale <- fixest::feols(wage ~ female + I(educ * 12), data = wage1)
```

---

.left5[
```r
modelsummary::msummary(
  list(reg_no_scale, reg_scale),
  stars = TRUE,
  gof_omit = "IC|Log|Adj|F|Pseudo|Within"
)
```

<br>

.content-box-green[**So,**]

+ coefficient: 1/12
+ standard error: 1/12
+ t-stat: the same
+ `\(R^2\)`: the same
]

.right5[
<table style="border-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;">  (1) </th> <th style="text-align:center;">   (2) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 0.623 </td> <td style="text-align:center;"> 0.623 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.673) </td> <td style="text-align:center;"> (0.673) </td> </tr> <tr> <td style="text-align:left;"> female </td> <td style="text-align:center;"> −2.273*** </td> <td style="text-align:center;"> −2.273*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.279) </td> <td style="text-align:center;"> (0.279) </td> </tr> <tr> <td style="text-align:left;"> educ </td> <td style="text-align:center;"> 0.506*** </td> <td style="text-align:center;"> </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.050) </td> <td style="text-align:center;"> </td> </tr> <tr> <td style="text-align:left;"> I(educ * 12) </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> 0.042*** </td> </tr> <tr> <td style="text-align:left;box-shadow: 0px 1.5px"> </td> <td style="text-align:center;box-shadow: 0px 1.5px"> </td> <td style="text-align:center;box-shadow: 0px 1.5px"> (0.004) </td>
</tr> <tr> <td style="text-align:left;"> Num.Obs. </td> <td style="text-align:center;"> 526 </td> <td style="text-align:center;"> 526 </td> </tr> <tr> <td style="text-align:left;"> R2 </td> <td style="text-align:center;"> 0.259 </td> <td style="text-align:center;"> 0.259 </td> </tr> <tr> <td style="text-align:left;"> RMSE </td> <td style="text-align:center;"> 3.18 </td> <td style="text-align:center;"> 3.18 </td> </tr> <tr> <td style="text-align:left;"> Std.Errors </td> <td style="text-align:center;"> IID </td> <td style="text-align:center;"> IID </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table>
]

---

.content-box-red[**Interpretation**]

+ Regression .blue[without] scaling: hourly wage increases by `\(0.506\)` if education increases by a .blue[year]
+ Regression .blue[with] scaling (e.g., 48 months means 4 years): hourly wage increases by `\(0.0422\)` if education increases by a .blue[month]

--

<br>

.content-box-green[**Note**]

According to the scaled model, hourly wage increases by `\(0.0422 \times 12\)` if education increases by a year (12 months). That is, the estimated marginal impact of education on wage from the scaled model is the same as that from the non-scaled model.

---

.content-box-red[**Summary**]

When an independent variable is scaled,

+ its coefficient estimate and standard error are scaled down/up by exactly the factor by which the variable is scaled up/down
+ the t-statistic stays the same (as it should)
+ `\(R^2\)` stays the same (the model does not improve by simply scaling independent variables)
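---
class: middle

These claims can be checked with a quick simulation; below is a base-R sketch using `lm()` on made-up data (the names `fit_orig`/`fit_scaled` and the factor 12 are just for illustration):

```r
#* simulate a simple regression
set.seed(123)
df <- data.frame(x = rnorm(200))
df$y <- 1 + 0.5 * df$x + rnorm(200)

#--- same model, with x multiplied by 12 ---#
fit_orig <- lm(y ~ x, data = df)
fit_scaled <- lm(y ~ I(x * 12), data = df)

s_orig <- summary(fit_orig)
s_scaled <- summary(fit_scaled)

#* coefficient and SE on the scaled variable are exactly 1/12 of the originals
coef_ratio <- unname(coef(fit_orig)["x"] / coef(fit_scaled)["I(x * 12)"])
se_ratio <- s_orig$coefficients["x", "Std. Error"] /
  s_scaled$coefficients["I(x * 12)", "Std. Error"]

#* t-statistic and R^2 are unchanged
t_orig <- s_orig$coefficients["x", "t value"]
t_scaled <- s_scaled$coefficients["I(x * 12)", "t value"]
```

Both `coef_ratio` and `se_ratio` equal 12, `t_orig` equals `t_scaled`, and `s_orig$r.squared` equals `s_scaled$r.squared` (up to floating-point precision).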