03-3: Reshaping and Merging

Reshaping


Long and wide formats

Long format

A single column representing a single variable

Wide format

Multiple column representing a single variable

Note: there is nothing in the dataset that tells you what the data values represent in the wide format.

Suppose you are interested in estimating the following statistical model:

\[corn yield = \beta_0 + \beta_1 R_{May} + \beta_2 R_{June} + \beta_2 R_{July} + \beta_2 R_{August} + \beta_2 R_{September} + v\]

where \(R\) refers to rainfall.

Then the following dataset is in a long format:

This is too long for your analysis.


Point

Whether a dataset is wide or long is determined based on what you are doing with the dataset.

Long to wide

Create the following dataset in long format:

How

To convert a long-formatted data.frame into a wide-formatted data.frame, you can use tidyr::pivot_wider() function from the tidyr package.

#--- NOT RUN ---#
tidyr::pivot_wider(
  data,
  names_from = variable 1,
  values_from = variable 2
)
  • the value of variable 1 becomes the name the new variables
  • the value of variable 2 becomes the value of the new variables

Example

  • the value of year becomes the name the new variables
  • the value of yield becomes the value of the new variables

You can append a character string to the new variable names. The previous example had 2019 and 2010 as the name of the new variables.

Create the following data in long format;

You can simply supply multiple variables to be made wide like this:

Wide to long

How

To convert a long-formatted data.frame into a wide-formatted data.frame, you can use tidyr::pivot_longer() function from the tidyr package.

#--- NOT RUN ---#
tidyr::pivot_longer(
  data, 
  cols = x,
  names_to = y, 
  values_to = z 
)
  • x: list of the name of the columns to pivot into longer format
  • y: what the name of x represents
  • z: what the values stored in x represents

Example

  • x: all the variables except state
  • y: “year”
  • z: “yield”

You do not want year_ in front of the year numbers in the new year variable? You can use the names_prefix option as follows:

Notice year is character. Convert it to numeric using as.numeric() if you use is as a numeric variable.

Long

Create the following dataset in the long format;

Wide

Convert the long dataset into the wide format:

Objective : We would like to convert the wide data back to the original long data.

You cannot revert this data back to the original long-formatted data in one step.

However, you take advantage of dplyr::separate() function, which separate a variable of type character by a user-specified separator into two variables in the dataset.

#--- NOT RUN ---#
separate(data, variable name, the name of variables, separator)

Before separation:

After separation:

After separating type_year to type and year, all you have to do is to apply tidyr::pivot_wider() to have the desired long-formatted data.

Exercises

We will use flights data from the nycflights13 package.


Using flights data, calculate the total number of flights by carrier-month, which is in the long format. Name the resulting object num_flights.


Work here


Answer

Code
num_flights <-
  flights %>% 
  group_by(carrier, month) %>% 
  summarize(num_obs = n()) 

Here is the output you are supposed to get if done correctly:

Reshape the num_flights data into a wide format with the number of flights per month as columns, and assign the result to an R object named num_flights_wide.


Work here


Answer

Code
num_flights %>% 
  tidyr::pivot_wider(
    names_from = month,
    names_prefix = "month_",
    values_from = num_obs
  ) 

Here is the output you are supposed to get if done correctly:

Reshape the data (num_flights_wide) back into the long format so that a single columns has all the flight number values


Work here


Answer

Code
num_flights_wide %>% 
  tidyr::pivot_longer(
    starts_with("month_"),
    names_to = "month",
    names_prefix = "month_",
    values_to = "num_flights"
  )

Here is the output you are supposed to get if done correctly:

Merging multiple datasets


Merging multiple datasets

It is very common that you have data stored in separate files, and you need to combine them before you conduct any statistical analysis.

For example, if you are interested in how crop price affects the supply of crops, you want to have price and production data in a single dataset. However, it may be that price and production data are stored in two separate files.

Now suppose, you have collected price and production data for Lancaster and Douglas County from 2015 to 2016.

Here is what the datasets look like (these are made-up numbers).


Question

Can you merge the two?

Let’s display one more variable from each of the datasets.


Okay, great. At least we know which price and prod belong to which county! In other words, we know which price and prod belong to who (or where).


Question

Can you merge the two?

Let’s display one more variable from each of the datasets.


Question

Can you merge the two now?


Key

  • The variables that let you merge two datasets are called keys.
  • What are the keys here?
  • You can use the left_join() function from the dplyr package to merge two datasets.

  • There are different types of join functions:

    • right_join() (you never need to use this one)
    • inner_join()
    • full_join()
    • semi_join()
    • nest_join()
  • But, most of the time, left_join() is sufficient.

  • Try to learn other functions when you encounter a case where left_join() is not sufficient. Do not waster your time until then.

Joining datasets with dplyr::left_join()

Syntax

#--- Syntax (NOT RUN) ---#
left_join(data_x, data_y, by = keys)  


Rules to be aware of

  • Rule 1: It returns all rows from data_x, and all columns from data_x and data_y
  • Rule 2: Rows in data_x with no match in data_y will have NA values in the new columns
  • Rule 3: If there are multiple matches between data_x and data_y, all combinations of the matches are returned

Note

The order of datasets matter.

We use price_data and yield_data for demonstrations.


Question

What are the keys?

The keys are county and year, so


Switching the two?


Note

In this instance, which comes first does not matter because all the individual rows in yield_data (left data) have exactly one match in price_data (right data) without fail, and vice versa.

Let’s expand the yield_data as follows:

yield_data_with_chase on the right:


yield_data_with_chase on the left:


Remember?

  • Rule 1: left_join() returns all rows from data_x, and all columns from data_x and data_y
  • Rule 2: Rows in data_x with no match in data_y will have NA values in the new columns

We saw in the previous slide having price_data (as data_x) and yield_data_with_chase as (data_y), left_join() discarded rows in yield_data_with_chase (data_y).

If you would like to keep unmatched rows in data_y, you can use full_join().

Let’s create a weather dataset where you have more than one observations per county-year:

Remember?

Rule 3: If there are multiple matches between data_x and data_y, all combinations of the matches are returned.

Create the following datasets and take a look at them to understand what’s in them:


Flights in January:

daily temperature in January:

hourly temperature in January:

  • You are interested in learning the impact of daily temperature on departure delay for the flights in January. To do so, you need to have the variables in a single dataset.

  • Is this going to be a 1-to-1 matching or 1-to-m matching?

  • Merge daily_temp_Jan to flights_Jan using left_join()


Work here


Here is the output you are supposed to get if done correctly:


Answer

Code
left_join(flights_Jan, daily_temp_Jan, by = c("origin", "month", "day"))
  • You are interested in learning the impact of hourly temperature on departure delay for the flights in January. To do so, you need to have them in a single dataset.

  • Is this going to be a 1-to-1 matching or 1-to-m matching?

  • Merge hourly_temp_Jan to flights_Jan using left_join()


Work here


Here is the output you are supposed to get if done correctly:


Answer

Code
left_join(flights_Jan, hourly_temp_Jan, by = c("origin", "month", "day"))