00: Introduction to Econometrics

What is econometrics about?

What are we doing?

Estimate quantitative relationships between variables.

Examples

the impact of fertilizer on crop yield
the impact of political campaign expenditure on voting outcomes
the impact of education on wage

Steps in Econometric Analysis

formulate the question of interest (what are you trying to find out?)
develop an economic model of the phenomenon you want to understand (identify the variables that matter)
turn the economic model into an econometric model
collect data
estimate the model using econometrics
test hypotheses

Example: Job training and worker productivity

\[wage = f(educ,exper,training)\]

\(wage\): hourly wage
\(educ\): years of formal education
\(exper\): years of workforce experience
\(training\): weeks spent in job training

Note

Depending on the question you would like to answer, the economic model can (and should) be much more involved

We have built a conceptual model:

\[wage = f(educ,exper,training)\]

Now, the form of the function \(f(\cdot)\) must be specified (almost always) before we can undertake an econometric analysis

\[ wage = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 training + u \]

\(\beta_0,\beta_1,\beta_2,\beta_3\)

are the parameters of the econometric model
describe the direction and strength of the relationship between \(wage\) and each of the factors used to determine \(wage\) in the model

\(u\)

is called the error term
includes ALL the other factors that can affect wage other than the included variables (like innate ability)
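
To make the roles of the parameters and the error term concrete, here is a minimal Python sketch of the wage equation. The parameter values are hypothetical and chosen only for illustration; nothing here is estimated from data.

```python
import random

# Hypothetical parameter values (assumptions, not estimates).
B0, B1, B2, B3 = 1.0, 0.5, 0.1, 0.2


def systematic_wage(educ, exper, training):
    """Part of wage explained by the included variables."""
    return B0 + B1 * educ + B2 * exper + B3 * training


def simulate_wage(educ, exper, training, rng):
    """Draw one wage: systematic part plus the error term u."""
    u = rng.gauss(0.0, 1.0)  # u stands in for everything not in the model
    return systematic_wage(educ, exper, training) + u


rng = random.Random(0)
wage = simulate_wage(educ=12, exper=5, training=2, rng=rng)

# Holding exper, training, and u fixed, one more year of education
# changes wage by B1.
effect_of_one_year = systematic_wage(13, 5, 2) - systematic_wage(12, 5, 2)
print(f"simulated wage: {wage:.2f}")
print(f"effect of one extra year of education: {effect_of_one_year:.2f}")
```

Two people with identical education, experience, and training can still earn different wages here, and that difference is entirely due to \(u\).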

We can collect data in various ways, including surveys, websites, and experiments. Let’s look at different data types:

Cross-sectional Data
Time-series Data
Panel (Longitudinal) Data

Cross-sectional Data

Sample of individuals, households, firms, cities, states, countries, or a variety of other units, taken at a given point in time
The data on all units do not necessarily correspond to precisely the same time period
- for example, some families may be surveyed during different weeks within a year

What cross-sectional data look like in R

      wage  educ exper female married
     <num> <int> <int>  <int>   <int>
  1:  3.10    11     2      1       0
  2:  3.24    12    22      1       1
  3:  3.00    11     2      0       0
  4:  6.00     8    44      0       1
  5:  5.30    12     7      0       1
 ---                                 
522: 15.00    16    14      1       1
523:  2.27    10     2      1       0
524:  4.67    15    13      0       1
525: 11.56    16     5      0       1
526:  3.50    14     5      1       0

Time-series Data

Observations on a variable or several variables over time
- corn price
- oil price

Note

The econometric frameworks necessary to analyze time series data are quite different from those for cross-sectional data
We do NOT cover time-series econometric methods in this course

Panel (Longitudinal) Data

Time-series data for each cross-sectional member in the data set (the same cross-sectional units are tracked over a given period of time)

Example

wage data for individuals collected every five years over the past 30 years
yearly GDP data for 60 countries over the past 10 years

What panel data look like in R

     county  year    crmrte   prbarr  prbpris
      <int> <int>     <num>    <num>    <num>
  1:      1    81 0.0398849 0.289696 0.472222
  2:      1    82 0.0383449 0.338111 0.506993
  3:      1    83 0.0303048 0.330449 0.479705
  4:      1    84 0.0347259 0.362525 0.520104
  5:      1    85 0.0365730 0.325395 0.497059
 ---                                         
626:    197    83 0.0155747 0.226667 0.428571
627:    197    84 0.0136619 0.204188 0.372727
628:    197    85 0.0130857 0.180556 0.333333
629:    197    86 0.0128740 0.112676 0.244444
630:    197    87 0.0141928 0.207595 0.360825
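
The defining feature of a panel is that each cross-sectional unit carries its own time series. The Python sketch below, which uses only the county, year, and crmrte values visible in the printed excerpt above (the rows hidden behind `---` are not reproduced), groups observations by county to show that structure.

```python
from collections import defaultdict

# (county, year, crmrte) rows copied from the printed excerpt above.
rows = [
    (1, 81, 0.0398849), (1, 82, 0.0383449), (1, 83, 0.0303048),
    (1, 84, 0.0347259), (1, 85, 0.0365730),
    (197, 83, 0.0155747), (197, 84, 0.0136619), (197, 85, 0.0130857),
    (197, 86, 0.0128740), (197, 87, 0.0141928),
]

# One time series (year -> crime rate) per cross-sectional unit (county).
series = defaultdict(dict)
for county, year, crmrte in rows:
    series[county][year] = crmrte

for county in sorted(series):
    years = sorted(series[county])
    print(f"county {county}: {len(years)} observations, years {years[0]}-{years[-1]}")
```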

This is what you will learn over the next few months!!

estimate the model using econometrics
test hypotheses

Causality and Association

Distinction between causality and association

Association

An association between two variables arises when one variable affects the other, or when each affects the other

\[\begin{align} A \longleftrightarrow B \\ A \longrightarrow B \\ A \longleftarrow B \end{align}\]

Association does NOT concern which variable affects which. In all three cases above, A and B are associated; that is, we say there is an association between A and B. This is what the correlation coefficient measures.
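
A quick way to see that correlation is silent about direction is to compute it both ways. The following Python sketch uses made-up numbers and a hand-written Pearson correlation function (`pearson` is a helper defined here, not a library call).

```python
import math


def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up numbers for illustration only.
A = [1.0, 2.0, 3.0, 4.0, 5.0]
B = [2.1, 3.9, 6.2, 8.1, 9.8]

r_ab = pearson(A, B)
r_ba = pearson(B, A)
print(f"corr(A, B) = {r_ab:.4f}")
print(f"corr(B, A) = {r_ba:.4f}")
```

Swapping the arguments gives the same number, so the coefficient alone cannot tell you whether A drives B, B drives A, or both.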

Causality

When A has a causal impact on B,

\[ A \longrightarrow B \]

Here, changes in \(A\) cause changes in \(B\), not the other way around

Example: Glasses

Let’s watch this interesting commercial.

Claims

People who wear glasses are

much smarter than those who don’t
more likely to pursue higher education
200% more likely to graduate college

But, for you to be convinced to buy glasses, these claims need to be causal, not mere associations:

Does wearing glasses make you much smarter?
Does wearing glasses make it more likely for you to pursue higher education?
Does wearing glasses make it 200% more likely for you to graduate college?

However, this seems to be a more likely explanation of the association:

Someone who spends more time studying academic subjects
- becomes smarter (or more knowledgeable) \(\Rightarrow\) pursues higher education and graduates from college
- ends up with worse eyesight \(\Rightarrow\) wears glasses
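
This explanation can be checked with a small simulation. In the Python sketch below, study time is the only thing that drives both wearing glasses and graduating; glasses have no effect on graduation by construction. All probabilities are invented for illustration.

```python
import random

rng = random.Random(42)
n = 10_000

wears_glasses, graduates = [], []
for _ in range(n):
    study_hours = rng.uniform(0, 10)  # the common cause (assumed)
    # Both outcomes depend on study time only; glasses never enter
    # the graduation probability.
    p_glasses = 0.1 + 0.07 * study_hours
    p_grad = 0.1 + 0.08 * study_hours
    wears_glasses.append(rng.random() < p_glasses)
    graduates.append(rng.random() < p_grad)


def grad_rate(with_glasses):
    """Share of graduates among people with (or without) glasses."""
    outcomes = [g for w, g in zip(wears_glasses, graduates) if w == with_glasses]
    return sum(outcomes) / len(outcomes)


rate_glasses = grad_rate(True)
rate_no_glasses = grad_rate(False)
print(f"graduation rate with glasses:    {rate_glasses:.3f}")
print(f"graduation rate without glasses: {rate_no_glasses:.3f}")
```

People with glasses graduate at a visibly higher rate even though buying glasses would change nothing: the association comes entirely from study time.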

Important

We care about isolating causal effects, not mere associations
Identifying association is super easy
Identifying causal effects is extremely hard (this is what we tackle)

Endogeneity: Your Nemesis


It is super easy to find an association among multiple variables, but it is incredibly hard to find a causal effect (at least in economics)!!

That is due to a problem called endogeneity, which will be defined formally later.

Example

You are interested in the causal impact of the number of firefighters deployed on the death toll in fire events

fire event	death toll	# of firefighters deployed
1	10	20
2	0	3
3	5	10
4	3	5
5	50	50

Questions

How are they associated?
Can you say anything about the causal effect of firefighter deployment on the death toll?

What happened?

You ignored an important variable!!

fire event	death toll	# of firefighters deployed	scale of fire
1	10	20	20
2	0	3	5
3	5	10	20
4	3	5	10
5	50	50	100
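
The associations in this table can be computed directly. The Python sketch below uses the five fire events exactly as listed and a hand-written correlation helper (`pearson` is defined here, not taken from a library).

```python
import math


def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Values copied from the table above.
death_toll = [10, 0, 5, 3, 50]
firefighters = [20, 3, 10, 5, 50]
scale = [20, 5, 20, 10, 100]

r_death_ff = pearson(death_toll, firefighters)
r_ff_scale = pearson(firefighters, scale)
r_death_scale = pearson(death_toll, scale)

print(f"corr(death toll, firefighters) = {r_death_ff:.3f}")
print(f"corr(firefighters, scale)      = {r_ff_scale:.3f}")
print(f"corr(death toll, scale)        = {r_death_scale:.3f}")
```

All three correlations are strongly positive. The positive association between firefighters and deaths does not mean firefighters cause deaths: the scale of the fire moves with both.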

Definition

Endogeneity arises when a variable of interest is correlated with unobservables (variables that cannot be observed or are missing) that have a nonzero impact on the variable you want to explain

These unobserved variables are also called confounders (confounding factors).

The example

In the firefighter example,

variable of interest : the number of firefighters
unobservables/confounder : the scale of fire events (and other factors)
variable to explain : death toll

The model

\[\begin{align} \mbox{death toll} & = \alpha + \beta\; \mbox{# of firefighters} + \mu, \\ \mbox{where } \mu & = \gamma\; \mbox{scale} + v \mbox{ is the error term (collection of unobservables)} \end{align}\]

Endogeneity Problem

The number of firefighters is correlated with scale, which we ignored, so it is correlated with the error term \(\mu\)
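
The consequence of ignoring scale can be sketched with simulated data. In the Python example below, the "true" parameters are assumptions chosen for illustration (in particular, firefighters are assumed to reduce deaths), the data are stylized, and the regressions are computed by hand from sample covariances.

```python
import random

rng = random.Random(1)
n = 5_000

# Assumed "true" parameters, for illustration only.
ALPHA, BETA, GAMMA = 2.0, -0.5, 1.0  # BETA < 0: firefighters reduce deaths

scale, ff, deaths = [], [], []
for _ in range(n):
    s = rng.uniform(0, 100)            # scale of the fire
    f = 0.5 * s + rng.gauss(0, 5)      # bigger fires get more firefighters
    d = ALPHA + BETA * f + GAMMA * s + rng.gauss(0, 5)
    scale.append(s)
    ff.append(f)
    deaths.append(d)


def cov(x, y):
    """Sample covariance (dividing by n)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


# Short regression: deaths on firefighters only (scale stays in the error term).
beta_short = cov(ff, deaths) / cov(ff, ff)

# Long regression: deaths on firefighters and scale (two-regressor OLS formula).
det = cov(ff, ff) * cov(scale, scale) - cov(ff, scale) ** 2
beta_long = (cov(ff, deaths) * cov(scale, scale)
             - cov(ff, scale) * cov(scale, deaths)) / det

print(f"true beta:                 {BETA:.2f}")
print(f"estimate ignoring scale:   {beta_short:.2f}")
print(f"estimate controlling for scale: {beta_long:.2f}")
```

Under these assumptions, the regression that ignores scale gets even the sign wrong, while the one that includes scale lands close to the assumed value.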

Another example

\[wage = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 training + u\]

Which unobservables in \(u\) are likely to be correlated with \(educ\)?

An important unobservable

innate ability \(\Rightarrow\) wage
innate ability \(\Rightarrow\) education

How to deal with endogeneity


Problem

Most of the time, you will face endogeneity problems caused by at least one of the following:

omitted variables (the scale of fire events, innate ability)
self-selection
simultaneity
measurement error

Central Question

How can we avoid or solve endogeneity problems?

You have two opportunities to deal with endogeneity problems
- at the design stage (designing how the data are collected)
- at the regression stage (what you will learn in this course)
Econometrics has evolved mostly to address endogeneity problems at the regression stage because randomized experiments are infeasible most of the time
How does econometrics compare with other fields such as statistics, psychometrics, and biometrics?

Field	Design	Estimation Method
Econometrics	not feasible (often)	intricate
Many other fields	feasible	relatively simple

Randomized experiments

In randomized experiments,

you are free to set the level of the variable of interest
by randomizing the value of the variable of interest, you can effectively break the link (association) with whatever is included in the error term

Example (Non-Randomized)

Data

Yield and nitrogen rate data obtained from a field that is managed by a farmer

Farmer’s decision

Farmer

decides the nitrogen rate based on soil/field characteristics (some of which we researchers do not get to observe)

Researcher

soil characteristics are not observable, so they end up in the error term

\[yield = \beta_0 + \beta_1 N + (\gamma SC + \mu)\]

N (nitrogen rate) and SC (soil characteristics) are correlated

Bias

Suppose the farmer applied more nitrogen to areas whose soil characteristics lead to higher corn yield

Question: If the researcher estimates the model (which ignores soil characteristics), will they over- or under-estimate the impact of nitrogen rate on corn yield?
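
One way to reason about the direction of the bias is the standard omitted-variable result for a simple regression. Assuming \(\mu\) is uncorrelated with \(N\), regressing yield on \(N\) alone gives, in large samples,

\[\hat{\beta}_1 \longrightarrow \beta_1 + \gamma \, \frac{Cov(N, SC)}{Var(N)}\]

In this scenario \(\gamma > 0\) (better soil raises yield) and \(Cov(N, SC) > 0\) (more nitrogen goes where the soil is better), so the second term is positive and the estimate tends to overstate the impact of nitrogen.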

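Randomized

Suppose instead that the researcher assigns nitrogen rates at random across the field.

Important

Soil quality (in the error term) is no longer correlated with N!!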
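
To see what randomization buys, the Python sketch below simulates the yield model twice: once with a farmer who applies more nitrogen on better soil, and once with nitrogen assigned at random. The parameter values and the linear yield response are assumptions made for illustration only.

```python
import random

rng = random.Random(7)
n = 5_000

# Assumed "true" parameters, for illustration only.
B0, B1, GAMMA = 100.0, 0.4, 2.0


def cov(x, y):
    """Sample covariance (dividing by n)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def estimate_b1(randomized):
    """Regress yield on N alone; soil characteristics stay in the error term."""
    nitrogen, yields = [], []
    for _ in range(n):
        sc = rng.gauss(0, 10)  # unobserved soil characteristics
        if randomized:
            nrate = rng.uniform(100, 200)  # researcher assigns N at random
        else:
            nrate = 150 + 2.0 * sc + rng.gauss(0, 10)  # farmer: more N on better soil
        y = B0 + B1 * nrate + GAMMA * sc + rng.gauss(0, 5)
        nitrogen.append(nrate)
        yields.append(y)
    return cov(nitrogen, yields) / cov(nitrogen, nitrogen)


b1_farmer = estimate_b1(randomized=False)
b1_random = estimate_b1(randomized=True)
print(f"true effect of N:           {B1:.2f}")
print(f"farmer-chosen N estimate:   {b1_farmer:.2f}")
print(f"randomized N estimate:      {b1_random:.2f}")
```

With farmer-chosen rates the estimate is far above the assumed effect; with randomized rates it sits close to it, even though soil characteristics are ignored in both regressions.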

Randomized Experiment?

Can researchers randomly determine how much education subjects (people) get?

Endogeneity Problem in Economics

Economics is about understanding human behavior
Almost always, you need to deal with endogeneity problems because people are smart: we make decisions based on available information (not just randomly) so that our decisions lead to good outcomes (whether our decisions actually turn out to be good is irrelevant)
- how much education one gets is determined by one's judgment of one's own ability (not by rolling a die)
- how many firefighters are deployed is determined by the scale of the fire (not by rolling a die)
- how much nitrogen to apply is determined by soil characteristics (not by rolling a die)
If people were not smart and just rolled a die to make decisions, we would have a much easier time identifying causal effects