Data
Observations of house price and lot size for 546 houses.
Model
\[price_i = \beta_0 + \beta_1 lotsize_i+u_i\]
Objective
Estimate the impact of lot size on house price
Question
For particular values of \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) you pick, the modeled value of \(y\) for individual \(i\) is \(\widehat{\beta}_0 + \widehat{\beta}_1 x_i\).
Then, the residual for individual \(i\) is:
\[ \widehat{u}_i = y_i - (\widehat{\beta}_0 + \widehat{\beta}_1 x_i) \]
That is, the residual is the observed value of the dependent variable less the modeled value. Different choices of \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) produce different residuals.
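As a quick illustration, here is a sketch in R of the residuals implied by one arbitrary guess of the coefficients (housing is a hypothetical name for the data.frame holding the 546 observations; the values of b0 and b1 are arbitrary):

```r
# Residuals implied by one arbitrary guess of (b0, b1)
# (housing is an assumed data.frame with columns price and lotsize)
b0 <- 20000
b1 <- 5
u_hat <- housing$price - (b0 + b1 * housing$lotsize)
head(u_hat)  # a different guess of (b0, b1) gives different residuals
```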
Idea of OLS (Ordinary Least Squares)
Let’s find the values of \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) that minimize the sum of the squared residuals!
Mathematically
Solve the following minimization problem:
\[Min_{\widehat{\beta}_0,\widehat{\beta}_1} \sum_{i=1}^n \widehat{u}_i^2, \mbox{where} \;\; \widehat{u}_i=y_i-(\widehat{\beta}_0+\widehat{\beta}_1 x_i)\]
Questions
Why do we square the residuals and then sum them up? What happens if you just sum up the raw residuals?
How about taking the absolute values of the residuals and then summing them up?
Minimization problem to solve
\[Min_{\widehat{\beta}_0,\widehat{\beta}_1} \sum_{i=1}^n [y_i-(\widehat{\beta}_0+\widehat{\beta}_1 x_i)]^2\]
Steps
FOC
\[ \def\sumn{\sum_{i=1}^{n}} \begin{aligned} \frac{\partial }{\partial \widehat{\beta}_0} \sumn \widehat{u}_i^2 &= -2 \sumn [y_i-(\widehat{\beta}_0+\widehat{\beta}_1 x_i)] = -2 \sumn \widehat{u}_i = 0 \\\\ \frac{\partial }{\partial \widehat{\beta}_1} \sumn \widehat{u}_i^2 &= -2 \sumn x_i\cdot [y_i-(\widehat{\beta}_0+\widehat{\beta}_1 x_i)] = -2 \sumn x_i\cdot \widehat{u}_i = 0 \end{aligned} \]
OLS estimators: analytical formula
\[ \def\sumn{\sum_{i=1}^{n}} \begin{aligned} \widehat{\beta}_1 & = \frac{\sumn (x_i-\bar{x})(y_i-\bar{y})}{\sumn (x_i-\bar{x})^2},\\\\ \widehat{\beta}_0 & = \bar{y}-\widehat{\beta}_1 \bar{x}, \\\\ \mbox{where} & \;\; \bar{y} = \sumn y_i/n \;\; \mbox{and} \;\;\bar{x} = \sumn x_i/n \end{aligned} \]
Estimators
Specific rules (formulas) that you apply once you have the data
Estimates
The numbers you get once you plug your data into those formulas
OLS Estimator Formula
\[ \def\sumn{\sum_{i=1}^{n}} \begin{aligned} \widehat{\beta}_1 & = \frac{\sumn (x_i-\bar{x})(y_i-\bar{y})}{\sumn (x_i-\bar{x})^2}\\\\ \widehat{\beta}_0 & = \bar{y}-\widehat{\beta}_1 \bar{x} \end{aligned} \]
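As a sanity check, you can compute this formula directly. A minimal sketch, again assuming the data sits in a data.frame named housing with columns price and lotsize:

```r
x <- housing$lotsize
y <- housing$price

# OLS slope and intercept from the analytical formula
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat * mean(x)
c(b0_hat, b1_hat)

# The FOCs hold at these values: the residuals sum to (numerically)
# zero and are orthogonal to x
u_hat <- y - (b0_hat + b1_hat * x)
c(sum(u_hat), sum(x * u_hat))
```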
R code
We can use the feols() function from the fixest package.
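A minimal sketch of the call, assuming the data is loaded in a data.frame named housing (a hypothetical name); the result is saved as uni_reg, the name used below:

```r
library(fixest)

# Regress price on lotsize; housing is the assumed data.frame name
uni_reg <- feols(price ~ lotsize, data = housing)
```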
Lots of information is stored in the regression results (here, uni_reg), which is a list. Apply ls() to see its elements:
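```r
ls(uni_reg)  # names of the elements stored in the regression results
```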
The estimated coefficients, the predicted values at the observation points, and the residuals are all stored as elements, as in the sketch below.
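A sketch using the stored element names (the generics coef() and resid() also work):

```r
uni_reg$coefficients         # estimated intercept and slope
head(uni_reg$fitted.values)  # predicted values at the observation points
head(uni_reg$residuals)      # residuals
```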
You can get a nice quick summary of the regression results with the summary() function:
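```r
summary(uni_reg)  # coefficient estimates, SEs, t-stats, p-values, fit
```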
Model to be estimated
\[ price = \beta_0 + \beta_1 lotsize + u \]
Estimated Model
This is the estimated version of the expected value of \(y\) conditional on \(x\).
\[ \widehat{price} = 3.4136\times 10^{4} + 6.599 \times lotsize \]
This is called the sample regression function (SRF), and it is an estimate of \(E[price|lotsize]\), the population regression function (PRF).
Important
OLS regression predicts the expected value of the dependent variable conditional on the explanatory variables.
\(\widehat{\beta}_1\) is an estimate of how a one-unit change in \(x\) affects the expected value of \(y\).
You can access the predicted values at the observed points by looking at the fitted.values element of the regression results.
To calculate the predicted value at arbitrary values of \(x\), create a data.frame with values of \(x\) of your choice, and then apply predict() to that data.frame using the regression results (see the sketch below).
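A minimal sketch, where new_lots is a hypothetical name:

```r
# Predicted price at lot sizes of our choosing
new_lots <- data.frame(lotsize = c(2000, 5000, 10000))
predict(uni_reg, newdata = new_lots)
```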
\(R^2\) is a measure of how well your model predicts the dependent variable (explains variation in the dependent variable) compared to simply using the average of the dependent variable as the predictor.
You can decompose the observed value of \(y\) into two parts, the fitted value and the residual:
\[ y_i=\widehat{y}_i +\widehat{u}_i, \;\;\mbox{where}\;\; \widehat{y}_i = \widehat{\beta}_0+\widehat{\beta}_1 x_i \]
Now, subtracting \(\bar{y}\) (the sample average of \(y\)) from both sides,
\[ y_i-\bar{y}=\widehat{y}_i-\bar{y}+\widehat{u}_i \]
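Before turning to the sums of squares, you can verify the decomposition numerically with the results from above (a sketch, assuming housing and uni_reg):

```r
# y_i equals fitted value plus residual, observation by observation
all.equal(housing$price,
          uni_reg$fitted.values + uni_reg$residuals,
          check.attributes = FALSE)
```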
total sum of squares (SST)
\[ SST\equiv \sum_{i=1}^{n}(y_i-\bar{y})^2 \]
explained sum of squares (SSE) \[ SSE\equiv \sum_{i=1}^{n}(\widehat{y}_i-\bar{y})^2 \]
residual sum of squares (SSR) \[ SSR\equiv \sum_{i=1}^{n}\widehat{u}_i^2 \]
Definition
\(R^2 = 1 - SSR/SST\)
Where did it come from?
\[\begin{aligned} & SST = SSE + SSR \\ \Rightarrow \; & SSE = SST - SSR \\ \Rightarrow \; & SSE/SST = 1 - SSR/SST = R^2 \end{aligned}\]
The value of \(R^2\) always lies between \(0\) and \(1\) as long as an intercept is included in the econometric model.
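A sketch of this decomposition in R, using housing and uni_reg from above; the last line should match the \(R^2\) reported by summary(uni_reg):

```r
y     <- housing$price
y_hat <- uni_reg$fitted.values
u_hat <- uni_reg$residuals

sst <- sum((y - mean(y))^2)      # total sum of squares
sse <- sum((y_hat - mean(y))^2)  # explained sum of squares
ssr <- sum(u_hat^2)              # residual sum of squares

c(sst = sst, sse_plus_ssr = sse + ssr)  # equal, since an intercept is included
1 - ssr / sst                           # R^2 from the definition
```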
What does it measure?
\(R^2\) measures how much improvement in predicting the dependent variable you’ve made by including the independent variable(s) \((y=\beta_0+\beta_1 x+u)\) compared to simply using the mean of the dependent variable as the predictor \((y=\beta_0+u)\).
Important
Problem