class: center, middle, inverse, title-slide # Monte Carlo Simulation ### AECN 896-002 --- class: middle <style type="text/css"> @media print { .has-continuation { display: block !important; } } .remark-slide-content.hljs-github h1 { margin-top: 5px; margin-bottom: 25px; } .remark-slide-content.hljs-github { padding-top: 10px; padding-left: 30px; padding-right: 30px; } .panel-tabs { /* color: #062A00; */ color: #841F27; margin-top: 0px; margin-bottom: 0px; margin-left: 0px; padding-bottom: 0px; } .panel-tab { margin-top: 0px; margin-bottom: 0px; margin-left: 3px; margin-right: 3px; padding-top: 0px; padding-bottom: 0px; } .panelset .panel-tabs .panel-tab { min-height: 40px; } .remark-slide th { border-bottom: 1px solid #ddd; } .remark-slide thead { border-bottom: 0px; } .gt_footnote { padding: 2px; } .remark-slide table { border-collapse: collapse; } .remark-slide tbody { border-bottom: 2px solid #666; } .important { background-color: lightpink; border: 2px solid blue; font-weight: bold; } .remark-code { display: block; overflow-x: auto; padding: .5em; background: #ffe7e7; } .hljs-github .hljs { background: #f2f2fd; } .remark-inline-code { padding-top: 0px; padding-bottom: 0px; background-color: #e6e6e6; } .r.hljs.remark-code.remark-inline-code{ font-size: 0.9em } .left-full { width: 80%; height: 92%; float: left; } .left-code { width: 38%; height: 92%; float: left; } .right-plot { width: 60%; float: right; padding-left: 1%; } .left5 { width: 49%; height: 92%; float: left; } .right5 { width: 49%; float: right; padding-left: 1%; } .left3 { width: 29%; height: 92%; float: left; } .right7 { width: 69%; float: right; padding-left: 1%; } .left4 { width: 38%; height: 92%; float: left; } .right6 { width: 60%; float: right; padding-left: 1%; } ul li{ margin: 7px; } ul, li{ margin-left: 15px; padding-left: 0px; } ol li{ margin: 7px; } ol, li{ margin-left: 15px; padding-left: 0px; } </style> <style type="text/css"> .content-box { box-sizing: border-box; background-color: 
#e2e2e2; } .content-box-blue, .content-box-gray, .content-box-grey, .content-box-army, .content-box-green, .content-box-purple, .content-box-red, .content-box-yellow { box-sizing: border-box; border-radius: 5px; margin: 0 0 10px; overflow: hidden; padding: 0px 5px 0px 5px; width: 100%; } .content-box-blue { background-color: #F0F8FF; } .content-box-gray { background-color: #e2e2e2; } .content-box-grey { background-color: #F5F5F5; } .content-box-army { background-color: #737a36; } .content-box-green { background-color: #d9edc2; } .content-box-purple { background-color: #e2e2f9; } .content-box-red { background-color: #ffcccc; } .content-box-yellow { background-color: #fef5c4; } .content-box-blue .remark-inline-code, .content-box-blue .remark-inline-code, .content-box-gray .remark-inline-code, .content-box-grey .remark-inline-code, .content-box-army .remark-inline-code, .content-box-green .remark-inline-code, .content-box-purple .remark-inline-code, .content-box-red .remark-inline-code, .content-box-yellow .remark-inline-code { background: none; } .full-width { display: flex; width: 100%; flex: 1 1 auto; } </style> <style type="text/css"> blockquote, .blockquote { display: block; margin-top: 0.1em; margin-bottom: 0.2em; margin-left: 5px; margin-right: 5px; border-left: solid 10px #0148A4; border-top: solid 2px #0148A4; border-bottom: solid 2px #0148A4; border-right: solid 2px #0148A4; box-shadow: 0 0 6px rgba(0,0,0,0.5); /* background-color: #e64626; */ color: #e64626; padding: 0.5em; -moz-border-radius: 5px; -webkit-border-radius: 5px; } .blockquote p { margin-top: 0px; margin-bottom: 5px; } .blockquote > h1:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h2:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h3:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h4:first-of-type { margin-top: 0px; margin-bottom: 5px; } .text-shadow { text-shadow: 0 0 4px #424242; } </style> <style type="text/css"> 
/****************** * Slide scrolling * (non-functional) * not sure if it is a good idea anyway slides > slide { overflow: scroll; padding: 5px 40px; } .scrollable-slide .remark-slide { height: 400px; overflow: scroll !important; } ******************/ .scroll-box-8 { height:8em; overflow-y: scroll; } .scroll-box-10 { height:10em; overflow-y: scroll; } .scroll-box-12 { height:12em; overflow-y: scroll; } .scroll-box-14 { height:14em; overflow-y: scroll; } .scroll-box-16 { height:16em; overflow-y: scroll; } .scroll-box-18 { height:18em; overflow-y: scroll; } .scroll-box-20 { height:20em; overflow-y: scroll; } .scroll-box-24 { height:24em; overflow-y: scroll; } .scroll-box-30 { height:30em; overflow-y: scroll; } .scroll-output { height: 90%; overflow-y: scroll; } </style> $$ \def\sumten{\sum_{i=1}^{10}} $$ $$ \def\sumn{\sum_{i=1}^{n}} $$ # Outline 1. [Introduction](#mvr) 2. [MC Simulations](#mcs) --- class: inverse, center, middle name: mvr # Monte Carlo Simulation: Introduction <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html> --- class: middle .content-box-green[**Monte Carlo Simulation**] A way to test econometric theories via simulation --- class: middle .content-box-green[**How is it used in econometrics?**] + confirm econometric theory numerically - OLS estimators are unbiased if `\(E[u|x]=0\)` along with other conditions (theory) - I know the above theory is right, but let's check if it is true numerically + You kind of sense that something in your data may cause problems, but there is no proven econometric theory about what will happen (I used MC simulation for this purpose a lot) + assist students in understanding econometric theories by providing actual numbers instead of a series of Greek letters --- class: middle .content-box-green[**Question**] Suppose you are interested in checking what happens to OLS estimators if `\(E[u|x]=0\)` (the error term and `\(x\)` are not correlated) is violated. 
Can you use the real data to do this? --- class: middle .content-box-green[**Key part of MC simulation**] <span style = "color: blue;"> You </span> generate data (you have control over how data are generated) + You know the true parameters, unlike with the real data generating process + You can change just the part of the data generating process or the econometric method that you want to change, holding everything else fixed --- class: middle # Generating data .content-box-green[**Pseudo random number generators**] Algorithms for generating a sequence of numbers whose properties <span style = "color: blue;"> approximate </span> the properties of sequences of random numbers --- class: middle .content-box-green[**Examples in R: Uniform Distribution**] ```r runif(5) # default is min=0 and max=1 ``` ``` ## [1] 0.16009897 0.67268728 0.81840298 0.60820923 0.01744986 ``` --- class: middle ```r x <- runif(10000) hist(x) ``` <img src="data:image/png;base64,#MC_x_files/figure-html/unnamed-chunk-3-1.png" width="60%" style="display: block; margin: auto;" /> --- class: middle .content-box-green[**Pseudo random number generator**] + Pseudo random number generators are not really random number generators + The numbers you will get are pre-determined + Which numbers you will get can be controlled by setting a <span style = "color: red;"> seed </span> .content-box-green[**An example**] ```r set.seed(2387438) runif(5) ``` ``` ## [1] 0.0474233 0.7116970 0.4066674 0.2422949 0.3567480 ``` .content-box-green[**Question**] What benefits does setting a seed have? 
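One benefit can be shown directly in code (a minimal sketch; the seed value below is arbitrary): setting the same seed before each run reproduces exactly the same draws, which makes your simulation results replicable.

```r
set.seed(123) # an arbitrary seed
draws_1 <- runif(5)

set.seed(123) # the same seed again
draws_2 <- runif(5)

#--- the two sets of draws are identical ---#
identical(draws_1, draws_2) # TRUE
```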
--- class: middle .content-box-green[**Examples in R: Normal Distribution**] .left5[ `\(x \sim N(0, 1)\)` ```r # default is mean = 0,sd = 1 x <- rnorm(10000) hist(x) ``` <img src="data:image/png;base64,#MC_x_files/figure-html/norm_1-1.png" width="100%" style="display: block; margin: auto;" /> ] .right5[ `\(x \sim N(2, 2)\)` ```r # mean = 2, sd = 2 x <- rnorm(10000, mean = 2, sd = 2) hist(x) ``` <img src="data:image/png;base64,#MC_x_files/figure-html/norm_2-1.png" width="100%" style="display: block; margin: auto;" /> ] --- class: middle .content-box-green[**Other distributions**] + Beta + Chi-square + F + Logistic + Log-normal + many others --- class: middle .content-box-green[**d, p, q, r**] For each distribution, you have four different kinds of functions: + <span style = "color: red;"> `d`</span>`norm`: density function + <span style = "color: red;"> `p`</span>`norm`: distribution function + <span style = "color: red;"> `q`</span>`norm`: quantile function + <span style = "color: red;"> `r`</span>`norm`: random draw --- class: middle .content-box-green[**dnorm**] `dnorm(x)` gives you the height of the density function at `\(x\)`. --- class: middle `dnorm(-1)` and `dnorm(2)` <img src="data:image/png;base64,#MC_x_files/figure-html/pnorm2-1.png" width="60%" style="display: block; margin: auto;" /> --- class: middle .content-box-green[**pnorm**] `pnorm(x)` gives you the probability that a single random draw is <span style = "color: red;"> less </span> than `\(x\)`. --- class: middle `pnorm(-1)` <img src="data:image/png;base64,#MC_x_files/figure-html/pnorm1_1-1.png" width="60%" style="display: block; margin: auto;" /> --- class: middle `pnorm(2)` <img src="data:image/png;base64,#MC_x_files/figure-html/pnorm1_2-1.png" width="60%" style="display: block; margin: auto;" /> --- class: middle .content-box-green[**Practice**] What is the probability that a single random draw from a Normal distribution with `mean = 1` and `sd = 2` is less than 1? 
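As a quick check of the practice problem (a sketch using `pnorm()` with the distribution's own parameters): 1 is exactly the mean of this distribution, and the Normal density is symmetric around its mean, so the answer should be 0.5.

```r
#--- P(X < 1) for X with mean = 1 and sd = 2 ---#
pnorm(1, mean = 1, sd = 2) # 0.5, since 1 is the mean
```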
--- class: middle `qnorm(x)`, where `\(0 < x < 1\)`, gives you the number `\(q\)` such that a single random draw is less than `\(q\)` with probability `\(x\)`. We call the output of `qnorm(x)` the `\(x \times 100\%\)` quantile of the standard Normal distribution (because the default is `mean = 0` and `sd = 1` for `qnorm()`). --- class: middle `qnorm(0.95)` <img src="data:image/png;base64,#MC_x_files/figure-html/qnorm1-1.png" width="60%" style="display: block; margin: auto;" /> --- class: middle .content-box-green[**Practice**] What is the 88% quantile of a Normal distribution with `mean = 0` and `sd = 9`? --- class: inverse, center, middle name: mcs # Monte Carlo Simulation: Steps and Examples <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1000px></html> --- class: middle .content-box-green[**Monte Carlo Simulation: Steps**] + specify the data generating process + generate data based on the data generating process + get an estimate based on the generated data (e.g. OLS, mean) + repeat the above steps many times + compare your estimates with the true parameter .content-box-green[**Question**] Why repeat steps `\(1-3\)` many times? --- class: middle # Monte Carlo Simulation: Example 1 Is the sample mean really an unbiased estimator of the expected value? 
That is, is `\(E[\frac{1}{n}\sum_{i=1}^n x_i] = E[x]\)`, where `\(x_i\)` is an independent random draw from the same distribution? --- class: middle .content-box-green[**Sample Mean: Steps 1-3**] ```r #--- steps 1 and 2: ---# # specify the data generating process and generate data x <- runif(100) # Here, E[x]=0.5 #--- step 3 ---# # calculate sample mean mean_x <- mean(x) mean_x ``` ``` ## [1] 0.507078 ``` --- class: middle .content-box-green[**Sample Mean: Step 4**] + repeat the above steps many times + We use a <span style = "color: blue;"> loop </span> to do the same (similar) thing over and over again --- class: middle .content-box-green[**Loop: for loop**] ```r #--- the number of iterations ---# B <- 1000 #--- repeat steps 1-3 B times ---# for (i in 1:B) { print(i) # print i } ``` .content-box-green[**Verbally**] For each `\(i\)` in `\(1:B\)` `\((1, 2, \dots, 1000)\)`, do `print(i)`. + `i` takes the value of 1, and then `print(1)` + `i` takes the value of 2, and then `print(2)` + ... 
+ `i` takes the value of 999, and then `print(999)` + `i` takes the value of 1000, and then `print(1000)` --- class: middle .content-box-green[**Step 4**] ```r #--- the number of iterations ---# B <- 1000 #--- create storage for the estimates ---# estimate_storage_mean <- rep(0, B) #--- repeat steps 1-3 B times ---# for (i in 1:B) { #--- steps 1 and 2: ---# # specify the data generating process and generate data x <- runif(100) # Here, E[x]=0.5 #--- step 3 ---# # calculate sample mean mean_x <- mean(x) estimate_storage_mean[i] <- mean_x } ``` --- class: middle Compare your estimates with the true parameter ```r mean(estimate_storage_mean) ``` ``` ## [1] 0.500199 ``` ```r hist(estimate_storage_mean) ``` <img src="data:image/png;base64,#MC_x_files/figure-html/step_5-1.png" width="60%" style="display: block; margin: auto;" /> --- class: middle # Monte Carlo Simulation: Example 2 .content-box-green[**Question**] What happens to the OLS estimator of `\(\beta_1\)` if `\(E[u|x]\ne 0\)` when estimating `\(y=\beta_0+\beta_1 x + u\)`? 
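In the simulation on the next slide, `\(x\)` and `\(u\)` share a common component `\(\mu \sim N(0,1)\)`, so `\(Cov(x, u) = Var(\mu) = 1\)` and `\(Var(x) = 2\)`. For that data generating process, the size of the bias can be worked out by hand (a quick derivation added here, using the usual omitted-variable algebra):

$$ \mathrm{plim} \; \hat{\beta}_1 = \beta_1 + \frac{Cov(x, u)}{Var(x)} = 1 + \frac{1}{2} = 1.5 $$

so the distribution of estimates should center around 1.5 rather than the true value of 1.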
--- class: middle ```r #--- load the fixest package for feols() ---# library(fixest) #--- Preparation ---# B <- 1000 # the number of iterations N <- 100 # sample size estimate_storage <- rep(0, B) # estimates storage #--- repeat steps 1-3 B times ---# for (i in 1:B) { #--- steps 1 and 2: ---# mu <- rnorm(N) # the common term shared by both x and u x <- rnorm(N) + mu # independent variable u <- rnorm(N) + mu # error y <- 1 + x + u # dependent variable data <- data.frame(y = y, x = x) #--- OLS ---# reg <- feols(y ~ x, data = data) # OLS estimate_storage[i] <- reg$coefficient["x"] } ``` --- class: middle ```r hist(estimate_storage) ``` <img src="data:image/png;base64,#MC_x_files/figure-html/ex2_res-1.png" width="60%" style="display: block; margin: auto;" /> --- class: middle # Example 3: Variance of OLS Estimators .content-box-green[**Model**] `\begin{aligned} y = \beta_0 + \beta_1 x + u \end{aligned}` + `\(x\sim N(0,1)\)` + `\(u\sim N(0,4)\)` + `\(E[u|x]=0\)` .content-box-green[**Variance of the OLS estimator**] True Variance of `\(\hat{\beta_1}\)`: `\(V(\hat{\beta_1}) = \frac{\sigma^2}{\sum_{i=1}^n (x_i-\bar{x})^2} = \frac{\sigma^2}{SST_X}\)` Its estimator: `\(\widehat{V(\hat{\beta_1})} =\frac{\hat{\sigma}^2}{SST_X} = \frac{\sum_{i=1}^n \hat{u}_i^2}{n-2} \times \frac{1}{SST_X}\)` .content-box-green[**Question**] Does the estimator really work? (Is it unbiased?) 
--- class: middle ```r set.seed(903478) #--- Preparation ---# B <- 10000 # the number of iterations N <- 100 # sample size beta_storage <- rep(0, B) # estimates storage for beta V_beta_storage <- rep(0, B) # estimates storage for V(beta) x <- rnorm(N) # x values are the same for every iteration SST_X <- sum((x - mean(x))^2) #--- repeat steps 1-3 B times ---# for (i in 1:B) { #--- steps 1 and 2: ---# u <- 2 * rnorm(N) # error y <- 1 + x + u # dependent variable data <- data.frame(y = y, x = x) #--- OLS ---# reg <- feols(y ~ x, data = data) # OLS beta_storage[i] <- reg$coefficient["x"] #* store estimated variance of beta_1_hat V_beta_storage[i] <- vcov(reg)["x", "x"] } ``` --- class: middle .content-box-green[**True Variance**] + `\(SST_X = 112.07\)` + `\(\sigma^2 = 4\)` `$$V(\hat{\beta_1}) = 4/112.07 = 0.0357$$` .content-box-green[**Check**] Your estimates of the variance of `\(\hat{\beta_1}\)`? ```r # === mean ===# mean(V_beta_storage) ``` ``` ## [1] 0.03579118 ``` --- class: middle ```r #--- load ggplot2 for plotting ---# library(ggplot2) ggplot(data = data.frame(x = V_beta_storage)) + geom_density(aes(x = x)) + geom_vline(xintercept = round(4 / SST_X, digits = 4)) ``` <img src="data:image/png;base64,#MC_x_files/figure-html/unnamed-chunk-7-1.png" width="60%" style="display: block; margin: auto;" /> --- class: middle # Exercise .content-box-green[**Problem**] Using MC simulations, find out how the variation in `\(x\)` affects the OLS estimators --- class: middle .content-box-green[**Model setup**] `\begin{aligned} y = \beta_0 + \beta_1 x_1 + u \\ y = \beta_0 + \beta_1 x_2 + u \end{aligned}` + `\(x_1\sim N(0,1)\)` and `\(x_2\sim N(0,9)\)` + `\(u\sim N(0,1)\)` + `\(E[u|x_1]=0\)` and `\(E[u|x_2]=0\)` --- class: middle # Solution ```r #--- load data.table for data.table() ---# library(data.table) #--- Preparation ---# B <- 1000 # the number of iterations N <- 100 # sample size estimate_storage <- matrix(0, B, 2) # estimates storage for (i in 1:B) { #--- generate data ---# x_1 <- rnorm(N, sd = 1) # indep var 1 x_2 <- rnorm(N, sd = 3) # indep var 2 u <- rnorm(N) # error y_1 <- 1 + x_1 + u # 
dependent variable 1 y_2 <- 1 + x_2 + u # dependent variable 2 data <- data.table(y_1 = y_1, y_2 = y_2, x_1 = x_1, x_2 = x_2) #--- OLS ---# reg_1 <- feols(y_1 ~ x_1, data = data) # OLS reg_2 <- feols(y_2 ~ x_2, data = data) # OLS #--- store coef estimates ---# estimate_storage[i, 1] <- reg_1$coefficient["x_1"] # equation 1 estimate_storage[i, 2] <- reg_2$coefficient["x_2"] # equation 2 } ``` ```r #--- assign new names ---# beta_1s <- estimate_storage[, 1] beta_2s <- estimate_storage[, 2] #--- mean ---# mean(beta_1s) ``` ``` ## [1] 0.9961388 ``` ```r mean(beta_2s) ``` ``` ## [1] 0.9994399 ``` ```r #--- sd ---# sd(beta_1s) ``` ``` ## [1] 0.1041601 ``` ```r sd(beta_2s) ``` ``` ## [1] 0.03322701 ``` --- class: middle # Visualization ```r plot_data_1 <- data.table(x = beta_1s, type = "Equation 1") plot_data_2 <- data.table(x = beta_2s, type = "Equation 2") plot_data <- rbind(plot_data_1, plot_data_2) ggplot(data = plot_data) + geom_density(aes(x = x, fill = type), alpha = 0.5) + scale_fill_discrete(name = "") + xlab("Coefficient Estimate") + theme( legend.position = "bottom" ) ``` <img src="data:image/png;base64,#MC_x_files/figure-html/compare_viz_1-1.png" width="60%" style="display: block; margin: auto;" />
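As a closing check (an addition here, using the variance formula from Example 3): with `\(\sigma^2 = 1\)` and `\(SST_X \approx N \cdot Var(x)\)`, the theoretical standard deviations of the two slope estimators are about 0.1 and 0.033, which is what the simulated standard deviations above show.

```r
N <- 100 # sample size used in the exercise

#--- approximate theoretical sd of the slope estimator: sqrt(sigma^2 / (N * Var(x))) ---#
sd_theory_1 <- sqrt(1 / (N * 1)) # x_1 ~ N(0, 1)
sd_theory_2 <- sqrt(1 / (N * 9)) # x_2 ~ N(0, 9)

sd_theory_1 # 0.1
sd_theory_2 # about 0.0333
```

So a ninefold increase in the variance of `\(x\)` cuts the standard deviation of `\(\hat{\beta}_1\)` to one third, matching the two densities in the plot above.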