10: Panel Data Methods

Panel (longitudinal) Data Methods

Definition

The data follow the same individuals, families, firms, cities, states, and so on, across time.


Example

  • Randomly select people from a population at a given point in time

  • Then the same people are reinterviewed at several subsequent points in time, which would result in data on wages, hours, education, and so on, for the same group of people in different years.

Panel Data as data.frame


  • year: year
  • fcode: factory id
  • employ: the number of employees
  • sales: sales in USD
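
As a minimal sketch (the numbers below are made up; only the variable names mirror the list above), such a dataset can be stored as a data.frame in long format, with one row per factory-year:

```r
# A toy panel data.frame: each factory (fcode) appears once per year
panel_data <- data.frame(
  year   = rep(c(1987, 1988, 1989), times = 2),  # year
  fcode  = rep(c(410032, 410440), each = 3),     # factory id
  employ = c(100, 131, 123, 12, 13, 14),         # number of employees
  sales  = c(4.7e7, 4.3e7, 4.9e7, 1.6e6, 1.7e6, 1.7e6)  # sales in USD
)
panel_data
```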

Can we do anything to deal with the endogeneity problem by taking advantage of the panel data structure?

Panel Data Estimation Methods

Demand for massage (cross-sectional)

Location    Year   P    Q
Chicago     2003   75   2.0
Peoria      2003   50   1.0
Milwaukee   2003   60   1.5
Madison     2003   55   0.8
  • P: the price of one massage
  • Q: the number of massages received per capita

Across the four cities, how are price and quantity associated? Positively or negatively?


Answer They are positively correlated. So, does that mean people want more massages as their price increases? Probably not.

What could be causing the positive correlation?


Answer
  • Income (can be observed)
  • Quality of massages (hard to observe)
  • How physically taxing jobs are (?)

Demand for massage (cross-sectional)

Location    Year   P    Q     Ql
Chicago     2003   75   2.0   10
Peoria      2003   50   1.0    5
Milwaukee   2003   60   1.5    7
Madison     2003   55   0.8    6


Key

Massage quality was hidden (omitted), affecting both price and massages per capita.


Problem

Massage quality is not observable, and thus cannot be controlled for.

Mathematically

\(Q = \beta_0 + \beta_1 P + v \;\;( = \beta_2 Ql + u)\)

  • \(P\): the price of one massage
  • \(Q\): the number of massages received per capita
  • \(Ql\): the quality of massages
  • \(u\): everything else that affects \(Q\)


Endogeneity Problem

\(P\) is correlated with \(Ql\).

Location    Year   P    Q     Ql
Chicago     2003   75   2.0   10
Chicago     2004   85   1.8   10
Peoria      2003   50   1.0    5
Peoria      2004   48   1.1    5
Milwaukee   2003   60   1.5    7
Milwaukee   2004   65   1.4    7
Madison     2003   55   0.8    6
Madison     2004   60   0.7    6

There are two kinds of variations:

  • inter-city (across city) variation
  • intra-city (within city) variation

The cross-sectional data offers only the inter-city (across city) variations.

Now, compare the massage price and massages per capita within each city (over time). What do you see?


Answer Price and quantity are negatively correlated!

Why does looking at the intra-city (within city) variation seem to help us estimate the impact of massage price on demand more credibly?


Answer The omitted variable, massage quality, did not change over time within city, which means it is controlled for as long as you look only at the intra-city variations (you do not compare across cities).

Using only the intra-city variations

Question

So, how do we use only the intra-city variations in a regression framework?


first-differencing

One way to do this is to compute the changes in prices and the changes in quantities within each city \((\Delta P\) and \(\Delta Q)\) and then regress \(\Delta Q\) on \(\Delta P\).


First-differenced Data

Location    Year   P    Q     Ql   P_dif   Q_dif   Ql_dif
Chicago     2003   75   2.0   10   NA      NA      NA
Chicago     2004   85   1.8   10   10      -0.2    0
Peoria      2003   50   1.0    5   NA      NA      NA
Peoria      2004   48   1.1    5   -2       0.1    0
Milwaukee   2003   60   1.5    7   NA      NA      NA
Milwaukee   2004   65   1.4    7    5      -0.1    0
Madison     2003   55   0.8    6   NA      NA      NA
Madison     2004   60   0.7    6    5      -0.1    0

Key

Variations in quality are eliminated after first-differencing!! (quality is controlled for)

A new way of writing a model

\(Q_{i,t} = \beta_0 + \beta_1 P_{i,t} + v_{i,t} \;\; ( = \beta_2 Ql_{i,t} + u_{i,t})\)

  • i: indicates city
  • t: indicates time


First differencing

\(Q_{i,1} = \beta_0 + \beta_1 P_{i,1} + v_{i,1} \;\; ( = \beta_2 Ql_{i,1} + u_{i,1})\)

\(Q_{i,2} = \beta_0 + \beta_1 P_{i,2} + v_{i,2} \;\; ( = \beta_2 Ql_{i,2} + u_{i,2})\)

\(\Rightarrow\)

\(\Delta Q = \beta_1 \Delta P + \Delta v ( = \beta_2 \Delta Ql + \Delta u)\)


Endogeneity Problem?

Since \(Ql_{i,1} = Ql_{i,2}\), \(\Delta Ql = 0 \Rightarrow \Delta Q = \beta_1 \Delta P + \Delta u\)

No endogeneity problem after first-differencing!

Data

OLS on the original data:

OLS on the first-differenced data:
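
As a sketch, we can construct the hypothetical massage data from the tables above and run both regressions (dplyr is assumed for the grouped first-differencing):

```r
library(dplyr)

# Hypothetical massage data from the tables above
massage_data <- data.frame(
  Location = rep(c("Chicago", "Peoria", "Milwaukee", "Madison"), each = 2),
  Year     = rep(c(2003, 2004), times = 4),
  P        = c(75, 85, 50, 48, 60, 65, 55, 60),
  Q        = c(2.0, 1.8, 1.0, 1.1, 1.5, 1.4, 0.8, 0.7)
)

# OLS on the original data: uses the contaminated inter-city variations
lm(Q ~ P, data = massage_data)

# First-difference P and Q within each city, then regress changes on changes
fd_data <- massage_data %>%
  arrange(Location, Year) %>%
  group_by(Location) %>%
  mutate(P_dif = P - lag(P), Q_dif = Q - lag(Q)) %>%
  ungroup()

# lm() drops the NA rows (each city's first year) automatically
lm(Q_dif ~ P_dif, data = fd_data)
```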

  • As long as the omitted variable that affects both the dependent and independent variables is constant over time (time-invariant), using only the variations over time (ignoring variations across cross-sectional units) can eliminate the omitted variable bias

  • First-differencing the data and then regressing changes on changes does the trick of ignoring variations across cross-sectional units

  • Of course, first-differencing is possible only because the same cross-sectional units are observed multiple times over time.

Multi-year (general) panel datasets

within-transformation

  • If we have lots of years of data, we could, in principle, compute all of the first differences (i.e., 2004 versus 2003, 2005 versus 2004, etc.) and then run a single regression. But there is an easier way.

  • Instead of thinking of each year’s observation in terms of how much it differs from the prior year for the same city, let’s think about how much each observation differs from the average for that city.

Example

How much each observation differs from the average for that city?

Location    Year   P    P_mean   P_dev   Q     Q_mean   Q_dev   Ql   Ql_mean   Ql_dev
Chicago     2003   75   80.0     -5.0    2.0   1.90      0.10   10   10        0
Chicago     2004   85   80.0      5.0    1.8   1.90     -0.10   10   10        0
Peoria      2003   50   49.0      1.0    1.0   1.05     -0.05    5    5        0
Peoria      2004   48   49.0     -1.0    1.1   1.05      0.05    5    5        0
Milwaukee   2003   60   62.5     -2.5    1.5   1.45      0.05    7    7        0
Milwaukee   2004   65   62.5      2.5    1.4   1.45     -0.05    7    7        0
Madison     2003   55   57.5     -2.5    0.8   0.75      0.05    6    6        0
Madison     2004   60   57.5      2.5    0.7   0.75     -0.05    6    6        0


Note

We call this data transformation the within-transformation or demeaning.

Model

  • Dependent variable: Q_dev
  • Independent variable: P_dev

Key

In working with deviations from the city means, massage quality is eliminated: Ql_dev is zero for every observation.

\(Q_{i,1} = \beta_0 + \beta_1 P_{i,1} + v_{i,1} \;\; ( = \beta_2 Ql_{i,1} + u_{i,1})\)

\(Q_{i,2} = \beta_0 + \beta_1 P_{i,2} + v_{i,2} \;\; ( = \beta_2 Ql_{i,2} + u_{i,2})\)

\(\vdots\)

\(Q_{i,T} = \beta_0 + \beta_1 P_{i,T} + v_{i,T} \;\; ( = \beta_2 Ql_{i,T} + u_{i,T})\)

\(\Rightarrow\)

\(Q_{i,t} - \bar{Q_{i}} = \beta_1 [P_{i,t} - \bar{P_{i}}] + [v_{i,t} - \bar{v_{i}}] ( = \beta_2 [Ql_{i,t} - \bar{Ql_{i}}] + [u_{i,t} - \bar{u_{i}}])\)

\(Ql_{i,1} = Ql_{i,2} = \dots = Ql_{i,T} = \bar{Ql_i}\)

\(\Rightarrow\)

\(Q_{i,t} - \bar{Q_{i}} = \beta_1 [P_{i,t} - \bar{P_{i}}] + [u_{i,t} - \bar{u_{i}}]\)


No endogeneity problem after the within-transformation because \(Ql\) is gone.

Fixed Effects (FE) Estimation (in general)

Consider the following general model

\(y_{i,t}=\beta_1 x_{i,t} + \alpha_i + u_{i,t}\)

  • \(\alpha_i\): the impact of time-invariant unobserved factor that is specific to \(i\) (also termed individual fixed effect)
  • \(\alpha_i\) is thought to be correlated with \(x_{i,t}\)

For each \(i\), averaging this equation over time, we get

\(\frac{\sum_{t=1}^T y_{i,t}}{T} = \beta_1 \frac{\sum_{t=1}^T x_{i,t}}{T} + \alpha_i + \frac{\sum_{t=1}^T u_{i,t}}{T}\)

We use \(\bar{z}_i\) to indicate the average of \(z_{i,t}\) over time for individual \(i\). Using this notation,

\(\bar{y}_i = \beta_1 \bar{x}_i + \alpha_i + \bar{u}_i\)

Note that \(\frac{\sum_{t=1}^T \alpha_{i}}{T} = \alpha_i\)

Subtracting the equation of the average from the original model,

\(y_{i,t}-\bar{y}_i=\beta_1 (x_{i,t} -\bar{x}_i) + (u_{i,t} -\bar{u}_i) + \alpha_i - \alpha_i\)


Important

\(\alpha_i\) is gone!

We then regress \((y_{i,t}-\bar{y}_i)\) on \((x_{i,t}-\bar{x}_i)\) to estimate \(\beta_1\).

Here is the model after within-transformation:

\[\begin{align*} y_{i,t}-\bar{y}_i=\beta_1 (x_{i,t} -\bar{x}_i) + (u_{i,t} -\bar{u}_i) \end{align*}\]

So,

\(x_{i,t} -\bar{x}_i\) needs to be uncorrelated with \(u_{i,t} -\bar{u}_i\).


Important

The above condition is satisfied if

\(E[u_{i,s}|x_{i,t}] = 0 \;\; \forall \; s \;\; \mbox{and} \;\; t\)

e.g., \(E[u_{i,1}|x_{i,4}]=0\)

Fixed effects estimation

Regress within-transformed Q on within-transformed P:
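
A minimal sketch using the hypothetical massage_data constructed earlier (demeaning by city with dplyr):

```r
library(dplyr)

# Within-transformation: deviations from the city means
demeaned_data <- massage_data %>%
  group_by(Location) %>%
  mutate(P_dev = P - mean(P), Q_dev = Q - mean(Q)) %>%
  ungroup()

# Regress demeaned quantity on demeaned price
lm(Q_dev ~ P_dev, data = demeaned_data)
```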

An alternative way to view the Fixed Effects estimation methods

Important

The two approaches below will result in the same coefficient estimates (mathematically identical).

  • Running OLS on the within-transformed (demeaned) data
  • Running OLS on the untransformed data but including the dummy variables for the individuals (city in our example)

You can use the original data (no within-transformation) and include dummy variables for all the cities except one.
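
For instance, with the hypothetical massage_data from earlier:

```r
# OLS with city dummies; R automatically drops one category as the base
lm(Q ~ P + factor(Location), data = massage_data)
```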


Note that the coefficient estimate on P is exactly the same as the one we saw earlier when we regressed Q_dev on P_dev.

By including individual dummies (individual fixed effects), you are effectively eliminating the between (inter-city) variations and using only the clean within (within-city) variations for estimation.


Very Important

More generally, including dummy variables for a categorical variable (like city in the example above) eliminates the variations between the elements of the category (e.g., different cities) and uses only the variations within each element of the category.

Fixed Effects Estimation in Practice Using R

Advice

  • Do not within-transform the data yourself and run a regression
  • Do not create dummy variables yourself and run a regression with the dummies


In practice

We will use the fixest package.


Syntax

fixest::feols(dep_var ~ indep_vars | FE, data)
  • FE: the name of the variable that identifies the cross-sectional units that are observed over time (Location in our example)
  • dep_var: (non-transformed) dependent variable
  • indep_vars: list of (non-transformed) independent variables

Data

We continue to use the hypothetical massage_data constructed earlier.


Example
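
A minimal sketch with the hypothetical massage_data from earlier (Location identifies the cross-sectional units observed over time):

```r
library(fixest)

# City (Location) fixed effects absorb all time-invariant city factors
feols(Q ~ P | Location, data = massage_data)
```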

Random Effects (RE) Estimation

How is it different from the FE model?

  • Can be more efficient (lower variance) than FE in certain cases

  • If \(\alpha_i\) and the independent variables are correlated, then RE estimators are biased

  • Unless \(\alpha_i\) and the independent variables are uncorrelated (which rarely holds unless you have data from controlled experiments), RE is not an attractive option

  • You almost never see this estimation method used in papers that use non-experimental data


Note

We do not cover this estimation method as you almost certainly would not use this estimation method.

Fixed effects (dummy variables) to harness clean variations

Weekly Sales of Avocados in California, Jan 2015 - March 2018


Objective

You are interested in understanding the impact of avocado price on its consumption.

Observations

  • They are negatively associated with each other
    • Avocado sales tend to be lower in weeks where the price of avocados is high.
    • Prices tend to be higher in weeks where fewer avocados are sold


Question

If you just regress avocado sales on its price, is the estimate of the coefficient on the price unbiased?


Answer

No.

  • Reverse causality
    • price affects demand
    • demand affects price

Problem

Reverse Causality: Price affects demand and demand affects price.


contextual knowledge

Now, suppose you learned the following fact after studying the supply and purchasing mechanism on the avocado market:

At the beginning of each month, avocado suppliers make a plan for what avocado prices will be each week in that month, and never change their plans until the next month.

This means that, within the same month, the weekly changes in avocado price are not a function of how many avocados have been bought in the previous weeks, effectively breaking the causal effect of demand on price.

So, our estimation strategy would be to just look at the variations in demand and price within individual months, but ignore variations in price between months.

The figure below presents avocado sales and price of avocado in March, 2015. This is an example of clean variations in price (intra-month observations).

We have three months of avocado purchase and price observed weekly.


Question

What should we do?


Answer Include month dummy variables.

We have two years of avocado purchase and price observed weekly.


Question

What should we do?


Answer
  • Include month-year dummy variables.

  • Including month dummy variables will not do it, because the observations in the same month in two different years are considered to belong to the same group. That is, variations between two different years of the same month (e.g., January in 2014 and January in 2015) would be used for estimation (see the sketch below).
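
A minimal sketch, assuming hypothetical weekly data with year, month, price, and sales columns:

```r
library(dplyr)
library(fixest)

# Hypothetical weekly avocado data spanning two years (numbers are made up)
avocado <- data.frame(
  year  = rep(2014:2015, each = 8),
  month = rep(rep(1:2, each = 4), times = 2),
  price = runif(16, 0.8, 1.6),
  sales = runif(16, 30, 60)
)

# month alone would pool January 2014 with January 2015; a month-year
# variable keeps only the within-month variations of a given year
avocado <- avocado %>% mutate(month_year = paste(year, month, sep = "-"))

feols(sales ~ price | month_year, data = avocado)
```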

Message 1

By understanding the data generating process (knowing how the economic market at hand works), we recognize the problem of simply looking at the relationship between the avocado price and demand to conclude the causal impact of price on demand (reverse causality).


Message 2

We studied the context very well, i.e., how the avocado market works in California (of course, it is not really how the CA avocado market works in reality), and made use of that information to identify the “clean” variations in avocado price with which to identify its impact on demand.

Year Fixed Effects

Just a collection of year dummies, each of which takes 1 for a specific year and 0 otherwise.

They capture anything that happened to all the individuals for a specific year relative to the base year


Example

Education and wage data from \(2012\) to \(2014\),

\(log(income) = \beta_0 + \beta_1 educ + \beta_2 exper + \sigma_1 FE_{2012} + \sigma_2 FE_{2013}\)

  • \(\sigma_1\): captures the difference in \(log(income)\) between \(2012\) and \(2014\) (base year)

  • \(\sigma_2\): captures the difference in \(log(income)\) between \(2013\) and \(2014\) (base year)


Interpretation

\(\sigma_1=0.05\) would mean that \(log(income)\) is greater in \(2012\) than \(2014\) by \(5\%\) on average for whatever reasons with everything else fixed.

Recommendation

It is almost always a good practice to include year FEs if you are using a panel dataset with annual observations.


Why?

  • Remember year FEs capture anything that happened to all the individuals for a specific year relative to the base year

  • In other words, all the unobserved factors that are common to all the individuals in a specific year are controlled for (taken out of the error term)


Example

Economic trend in the following model:

\(log(income) = \beta_0 + \beta_1 educ + \sigma_1 FE_{2012} + \sigma_2 FE_{2013}\)

  • Education is non-decreasing through time

  • Economy might have either been going down or up during the observed period

Without year FEs, \(\beta_1\) may capture the impact of the overall economic trend.

In order to add year FEs on top of individual FEs, you can simply add the variable that indicates year, like below:
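
For example, with a hypothetical panel of individuals (id) observed annually:

```r
library(fixest)

# A toy panel: 3 individuals observed in 2012-2014 (numbers are made up)
wage_panel <- data.frame(
  id         = rep(1:3, each = 3),
  year       = rep(2012:2014, times = 3),
  educ       = c(12, 12, 13, 16, 16, 16, 14, 15, 15),
  log_income = c(10.1, 10.2, 10.4, 10.8, 10.9, 11.0, 10.3, 10.5, 10.6)
)

# Adding year to the FE part includes year FEs alongside individual FEs
feols(log_income ~ educ | id + year, data = wage_panel)
```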

Caveats

  • Year FEs would be perfectly collinear with variables that change only across time, but not across individuals.

  • If your variable of interest is such a variable, you cannot include year FEs, which would then make your estimation subject to omitted variable bias due to other unobserved yearly-changing factors.

Standard Error Estimation for Panel Data Methods

Heteroskedasticity

Just like we saw for OLS using cross-sectional data, heteroskedasticity leads to biased estimation of the standard error of the coefficient estimators if not taken into account


Serial Correlation

Correlation of errors over time, which we call serial correlation

  • just like heteroskedasticity, serial correlation could lead to biased estimation of the standard error of the coefficient estimators if not taken into account

  • does not affect the unbiasedness and consistency of your estimators

Important

  • Taking into account the potential of serial correlation when estimating the standard error of the coefficient estimators can dramatically change your conclusions about the statistical significance of some independent variables!!

  • When serial correlation is ignored, you tend to underestimate the standard error (why?), inflating the \(t\)-statistic, which in turn leads to rejecting the null more often than you should.

Bertrand, Duflo, and Mullainathan (2004)

  • Examined how problematic serial correlation is in terms of inference via Monte Carlo simulation

    • generate a fake treatment dummy variable in a way that it has no impact on the outcome (dependent variable) in the dataset of women’s wages from the Current Population Survey (CPS)

    • run a regression of the outcome on the treatment variable

    • test if the treatment variable has statistically significant effect via \(t\)-test

  • They rejected the null \(67.5\%\) of the time at the \(5\%\) significance level!!

SE robust to heteroskedasticity and serial correlation

  • You can take into account both heteroskedasticity and serial correlation by clustering by individual (whatever the unit of individual is: state, county, farmer)

  • Clustering by individual takes into account the correlation of errors within individuals (over time)


R implementation

With fixest::feols(), you can cluster the standard error estimation by a variable using the cluster option, like below.
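
A minimal sketch using the hypothetical massage_data from earlier (note that when fixed effects are present, feols clusters by the first fixed effect by default; the cluster option just makes the choice explicit):

```r
library(fixest)

# Standard errors clustered by Location: robust to heteroskedasticity
# and to the correlation of errors within each city over time
feols(Q ~ P | Location, data = massage_data, cluster = ~Location)
```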