Long format
A single column representing a single variable
Wide format
Multiple column representing a single variable
Note: there is nothing in the dataset that tells you what the data values represent in the wide format.
Suppose you are interested in estimating the following statistical model:
\[corn yield = \beta_0 + \beta_1 R_{May} + \beta_2 R_{June} + \beta_2 R_{July} + \beta_2 R_{August} + \beta_2 R_{September} + v\]
where \(R\) refers to rainfall.
Then the following dataset is in a long format:
This is too long for your analysis.
Point
Whether a dataset is wide or long is determined based on what you are doing with the dataset.
Create the following dataset in long format:
How
To convert a long-formatted data.frame
into a wide-formatted data.frame
, you can use tidyr::pivot_wider()
function from the tidyr
package.
variable 1
becomes the name the new variablesvariable 2
becomes the value of the new variablesExample
year
becomes the name the new variablesyield
becomes the value of the new variablesYou can append a character string to the new variable names. The previous example had 2019
and 2010
as the name of the new variables.
Create the following data in long format;
You can simply supply multiple variables to be made wide like this:
How
To convert a long-formatted data.frame
into a wide-formatted data.frame
, you can use tidyr::pivot_longer()
function from the tidyr
package.
x
: list of the name of the columns to pivot into longer formaty
: what the name of x
representsz
: what the values stored in x
representsExample
x
: all the variables except state
y
: “year”z
: “yield”You do not want year_
in front of the year numbers in the new year
variable? You can use the names_prefix
option as follows:
Notice year
is character. Convert it to numeric using as.numeric()
if you use is as a numeric variable.
Long
Create the following dataset in the long format;
Wide
Convert the long dataset into the wide format:
Objective : We would like to convert the wide data back to the original long data.
You cannot revert this data back to the original long-formatted data in one step.
However, you take advantage of dplyr::separate()
function, which separate a variable of type character by a user-specified separator into two variables in the dataset.
Before separation:
After separation:
After separating type_year
to type
and year
, all you have to do is to apply tidyr::pivot_wider()
to have the desired long-formatted data.
We will use flights
data from the nycflights13
package.
Using flights
data, calculate the total number of flights by carrier-month, which is in the long format. Name the resulting object num_flights
.
Work here
Answer
Here is the output you are supposed to get if done correctly:
Reshape the num_flights
data into a wide format with the number of flights per month as columns, and assign the result to an R object named num_flights_wide
.
Work here
Answer
Here is the output you are supposed to get if done correctly:
Reshape the data (num_flights_wide
) back into the long format so that a single columns has all the flight number values
Work here
Answer
Here is the output you are supposed to get if done correctly:
It is very common that you have data stored in separate files, and you need to combine them before you conduct any statistical analysis.
For example, if you are interested in how crop price affects the supply of crops, you want to have price and production data in a single dataset. However, it may be that price and production data are stored in two separate files.
Now suppose, you have collected price and production data for Lancaster and Douglas County from 2015 to 2016.
Here is what the datasets look like (these are made-up numbers).
Question
Can you merge the two?
Let’s display one more variable from each of the datasets.
Okay, great. At least we know which price and prod belong to which county! In other words, we know which price and prod belong to who (or where).
Question
Can you merge the two?
Let’s display one more variable from each of the datasets.
Question
Can you merge the two now?
Key
You can use the left_join()
function from the dplyr
package to merge two datasets.
There are different types of join functions:
right_join()
(you never need to use this one)inner_join()
full_join()
semi_join()
nest_join()
But, most of the time, left_join()
is sufficient.
Try to learn other functions when you encounter a case where left_join()
is not sufficient. Do not waster your time until then.
dplyr::left_join()
Syntax
Rules to be aware of
data_x
, and all columns from data_x
and data_y
data_x
with no match in data_y
will have NA values in the new columnsdata_x
and data_y
, all combinations of the matches are returnedNote
The order of datasets matter.
We use price_data
and yield_data
for demonstrations.
Question
What are the keys?
The keys are county
and year
, so
Switching the two?
Note
In this instance, which comes first does not matter because all the individual rows in yield_data
(left data) have exactly one match in price_data
(right data) without fail, and vice versa.
Let’s expand the yield_data
as follows:
yield_data_with_chase
on the right:
yield_data_with_chase
on the left:
Remember?
left_join()
returns all rows from data_x
, and all columns from data_x
and data_y
data_x
with no match in data_y
will have NA values in the new columnsWe saw in the previous slide having price_data
(as data_x
) and yield_data_with_chase
as (data_y
), left_join()
discarded rows in yield_data_with_chase
(data_y
).
If you would like to keep unmatched rows in data_y
, you can use full_join()
.
Create the following datasets and take a look at them to understand what’s in them:
Flights in January:
daily temperature in January:
hourly temperature in January:
You are interested in learning the impact of daily temperature on departure delay for the flights in January. To do so, you need to have the variables in a single dataset.
Is this going to be a 1-to-1 matching or 1-to-m matching?
Merge daily_temp_Jan
to flights_Jan
using left_join()
Work here
Here is the output you are supposed to get if done correctly:
Answer
You are interested in learning the impact of hourly temperature on departure delay for the flights in January. To do so, you need to have them in a single dataset.
Is this going to be a 1-to-1 matching or 1-to-m matching?
Merge hourly_temp_Jan
to flights_Jan
using left_join()
Work here
Here is the output you are supposed to get if done correctly:
Answer