01-2: Univariate Regression: OLS Mechanics and Implementation

Estimation of Parameters via OLS


The data set and model

Data

Observations of house price and lot size for 546 houses.


Model

\[price_i = \beta_0 + \beta_1 lotsize_i+u_i\]

  • \(price_i\): house price ($) of house \(i\)
  • \(lotsize_i\): lot size of house \(i\)
  • \(u_i\): error term (everything else) of house \(i\)


Objective

Estimate the impact of lot size on house price

Estimation with OLS

  • We want to draw a line through the data (price against lot size), the slope of which is an estimate of \(\beta_1\)
  • One way to do so: Ordinary Least Squares (OLS)

Question

  • Among all the possible values of \(\beta_0\) and \(\beta_1\), which pair is the best?
  • What criteria do we use (what does "best" even mean)?

For particular values of \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) that you pick, the modeled value of \(y\) for individual \(i\) is \(\widehat{\beta}_0 + \widehat{\beta}_1 x_i\).

Then, the residual for individual \(i\) is:

\[ \widehat{u}_i = y_i - (\widehat{\beta}_0 + \widehat{\beta}_1 x_i) \]

That is, the residual is the observed value of the dependent variable less its modeled value. Different values of \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) yield different residuals.

Idea of OLS (Ordinary Least Squares)

Let’s find the values of \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) that minimize the sum of the squared residuals!


Mathematically

Solve the following minimization problem:

\[Min_{\widehat{\beta}_0,\widehat{\beta}_1} \sum_{i=1}^n \widehat{u}_i^2, \mbox{where} \;\; \widehat{u}_i=y_i-(\widehat{\beta}_0+\widehat{\beta}_1 x_i)\]

Questions

  • Why do we square the residuals and then sum them up? What would happen if you just summed up the residuals?

  • How about taking the absolute value of the residuals and then summing them up?

Minimization problem to solve

\[Min_{\widehat{\beta}_0,\widehat{\beta}_1} \sum_{i=1}^n [y_i-(\widehat{\beta}_0+\widehat{\beta}_1 x_i)]^2\]

Steps

  • partial differentiation of the objective function with respect to \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\)
  • solve for \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\)


FOC

\[ \def\sumn{\sum_{i=1}^{n}} \begin{align} \frac{\partial }{\partial \widehat{\beta}_0}=& -2 \sumn [y_i-(\widehat{\beta}_0+\widehat{\beta}_1 x_i)]= -2\sumn \widehat{u}_i = 0 \\\\ \frac{\partial }{\partial \widehat{\beta}_1}=& -2 \sumn x_i\cdot [y_i-(\widehat{\beta}_0+\widehat{\beta}_1 x_i)]= -2\sumn x_i\cdot \widehat{u}_i = 0 \end{align} \]

OLS estimators: analytical formula

Solving the two first-order conditions simultaneously for \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) gives:

\[ \def\sumn{\sum_{i=1}^{n}} \begin{aligned} \widehat{\beta}_1 & = \frac{\sumn (x_i-\bar{x})(y_i-\bar{y})}{\sumn (x_i-\bar{x})^2},\\\\ \widehat{\beta}_0 & = \bar{y}-\widehat{\beta}_1 \bar{x}, \\\\ \mbox{where} & \;\; \bar{y} = \sumn y_i/n \;\; \mbox{and} \;\;\bar{x} = \sumn x_i/n \end{aligned} \]

Estimators

Specific rules (formula) to use once you get the data


Estimates

Numbers you get once you plug values (your data) into the formula

OLS demonstration in R

OLS Estimator Formula

\[ \def\sumn{\sum_{i=1}^{n}} \begin{aligned} \widehat{\beta}_1 & = \frac{\sumn (x_i-\bar{x})(y_i-\bar{y})}{\sumn (x_i-\bar{x})^2}\\\\ \widehat{\beta}_0 & = \bar{y}-\widehat{\beta}_1 \bar{x} \end{aligned} \]
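Before turning to a canned routine, we can apply this formula directly. A minimal sketch, assuming the data sit in a data.frame named housing_data (a hypothetical name) with columns price and lotsize:

```r
# x and y from the (hypothetical) housing_data data.frame
x <- housing_data$lotsize
y <- housing_data$price

# slope: sum of cross-deviations over sum of squared deviations of x
beta_1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# intercept: makes the fitted line pass through the point of means
beta_0_hat <- mean(y) - beta_1_hat * mean(x)

beta_1_hat
beta_0_hat
```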

R code

We can use the feols() function from the fixest package.
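A minimal sketch, again assuming the hypothetical housing_data data.frame:

```r
library(fixest)

# regress price on lotsize and store the results
uni_reg <- feols(price ~ lotsize, data = housing_data)
```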

Lots of information is stored in the regression results (here, uni_reg), which is internally a list.

Apply ls() to see its elements:
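Continuing with uni_reg from above:

```r
# list the names of the elements stored in uni_reg
ls(uni_reg)
```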


Estimated coefficients:
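For example, with the coef() accessor:

```r
# extract the estimated intercept and slope
coef(uni_reg)
```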


Predicted values at the observation points:
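These are stored in the regression results:

```r
# fitted values of price at the observed lotsize values
uni_reg$fitted.values
```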


Residuals:
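Again via a standard accessor:

```r
# residuals: observed price minus fitted price, for each observation
resid(uni_reg)
```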


You can get a nice quick summary of the regression results with the summary() function:
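```r
# coefficient estimates, standard errors, and fit statistics
summary(uni_reg)
```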

Once the model is estimated

Model to be estimated

\[ price = \beta_0 + \beta_1 lotsize + u \]


Estimated Model

This is the estimated version of the expected value of \(y\) conditional on \(x\).

\[ \widehat{price} = 3.4136\times 10^{4} + 6.599 \times lotsize \]

This is called the sample regression function (SRF), and it is an estimate of \(E[price|lotsize]\), the population regression function (PRF).

Important

  • OLS regression predicts the expected value of the dependent variable conditional on the explanatory variables.

  • \(\widehat{\beta}_1\) is an estimate of how a change in \(x\) affects the expected value of \(y\).

You can access the predicted values at the observed points by looking at the fitted.values element of the regression results.

To calculate the predicted value at arbitrary values of \(x\) (see the sketch after this list),

  1. create a new data.frame with values of \(x\) of your choice;

  2. apply predict() to the data.frame using the regression results.
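A minimal sketch, using uni_reg from above and hypothetical lot sizes:

```r
# 1. new data.frame with lot sizes of our choice (hypothetical values)
new_data <- data.frame(lotsize = c(2000, 5000, 8000))

# 2. predicted price at each of those lot sizes
predict(uni_reg, newdata = new_data)
```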

Your current lot size is 3000. You are thinking of expanding your lot by 1000 (with everything else fixed), which would cost you 5,000 USD. Should you do it? Use R to figure it out.
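One way to work it out, as a sketch reusing uni_reg from above:

```r
# predicted prices at the current (3000) and expanded (4000) lot sizes
prices <- predict(uni_reg, newdata = data.frame(lotsize = c(3000, 4000)))

# predicted gain in house price from the extra 1000 units of lot size
price_gain <- prices[2] - prices[1]  # equals beta_1_hat * 1000

# worth doing if the predicted gain exceeds the 5,000 USD cost
price_gain > 5000
```

With \(\widehat{\beta}_1 \approx 6.599\), the predicted gain is about 6,599 USD, which exceeds the 5,000 USD cost.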


\(R^2\): Goodness of fit

\(R^2\) is a measure of how good your model is in predicting the dependent variable (explaining variations in the dependent variable) compared to just using the average of the dependent variable as the predictor.

You can decompose the observed value of \(y\) into two parts, the fitted value and the residual:

\[ y_i=\widehat{y}_i +\widehat{u}_i, \;\;\mbox{where}\;\; \widehat{y}_i = \widehat{\beta}_0+\widehat{\beta}_1 x_i \]

Now, subtracting \(\bar{y}\) (the sample average of \(y\)) from both sides,

\[ y_i-\bar{y}=\widehat{y}_i-\bar{y}+\widehat{u}_i \]

  • \(y_i-\bar{y}\): how far the actual value of \(y\) for the \(i\)th observation is from the sample average \(\bar{y}\) (actual deviation from the mean)
  • \(\widehat{y}_i-\bar{y}\): how far the predicted value of \(y\) for the \(i\)th observation is from the sample average \(\bar{y}\) (explained deviation from the mean)
  • \(\widehat{u}_i\): the residual for the \(i\)th observation

total sum of squares (SST)

\[ SST\equiv \sum_{i=1}^{n}(y_i-\bar{y})^2 \]

explained sum of squares (SSE)

\[ SSE\equiv \sum_{i=1}^{n}(\widehat{y}_i-\bar{y})^2 \]

residual sum of squares (SSR)

\[ SSR\equiv \sum_{i=1}^{n}\widehat{u}_i^2 \]

Definition

\(R^2 = 1 - SSR/SST\)


Where did it come from?

\[\begin{align} & SST = SSE + SSR \\ \Rightarrow & SSE = SST - SSR \\ \Rightarrow & SSE/SST = 1 - SSR/SST = R^2\\ \end{align}\]

The value of \(R^2\) always lies between \(0\) and \(1\) as long as an intercept is included in the econometric model.
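A minimal sketch of this decomposition in R, reusing uni_reg and the hypothetical housing_data:

```r
y <- housing_data$price

sst <- sum((y - mean(y))^2)                # total sum of squares
sse <- sum((fitted(uni_reg) - mean(y))^2)  # explained sum of squares
ssr <- sum(resid(uni_reg)^2)               # residual sum of squares

# R^2 computed two ways; both match because SST = SSE + SSR
1 - ssr / sst
sse / sst
```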


What does it measure?

\(R^2\) is a measure of how much improvement in predicting the dependent variable you’ve made by including independent variable(s) \((y=\beta_0+\beta_1 x+u)\) compared to simply using the mean of the dependent variable as the predictor \((y=\beta_0+u)\).

Important

  • \(R^2\) tells you nothing about how well you have estimated the causal ceteris paribus impact of \(x\) on \(y\) \((\beta_1)\).
  • As economists, we typically do not care about how well we can predict the dependent variable; rather, we care about how well we have estimated \(\beta_1\).

Problem

  • While we observe the dependent variable (otherwise you could not run the regression), we cannot observe \(\beta_1\).
  • So, we can check how good estimated models are at predicting the dependent variable (which we do not care about), but we can never test directly whether they have estimated \(\beta_1\) well.
  • This means that we need to carefully examine whether the assumptions necessary for good estimation of \(\beta_1\) are satisfied (next topic).