03-2: Data wrangling with tidyverse: The Basics

Tips to make the most of the lecture notes

  • Click on the three horizontally stacked lines at the bottom left corner of the slide, then you will see table of contents, and you can jump to the section you want

  • Hit letter “o” on your keyboard and you will have a panel view of all the slides

  • The box area with a hint of blue as the background color is where you can write code (hereafter referred to as the “code area”).
  • Hit the “Run Code” button to execute all the code inside the code area.
  • You can evaluate (run) code selectively by highlighting the parts you want to run and hitting Command + Enter for Mac (Ctrl + Enter for Windows).
  • If you want to run the codes on your computer, you can first click on the icon with two sheets of paper stacked on top of each other (top right corner of the code chunk), which copies the code in the code area. You can then paste it onto your computer.
  • You can click on the reload button (top right corner of the code chunk, left to the copy button) to revert back to the original code.

tidyverse package

  • The tidyverse is a package of a collection of packages. Some of the packages included are

    • readr: read datasets in various formats
    • dplyr: manipulate and merge datasets
    • tidyr: reshape datasets
    • ggplot2: data visualization
    • stringr: character string operations
  • These packages are by no means the only ways to do the operations we will learn today. However, we learn these packages because they are easy to use.

  • They are also easy to learn and get help as they are extremely popular and very well-documented. Go to the tidyverse website, and click on the icon of the package you would like to learn, then you will have a nice documentation of the packages.

  • You could also google the package name, and you will see bunch of introductions/tutorials.

Install the package if you have not.

install.packages("tidyverse")


When you load the tidyverse package, it automatically loads many of the packages contained in it.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Sometimes packages share the same functions names.

When those packages are loaded, there are function name conflicts where the function from the package loaded later masks the function from the other package.

data.frame and tibble


data.frame and tibble

data.frame

  • The traditional (most common) class of two-dimensional data (rows and columns) supported by R.


tibble

  • A relatively newer class of two-dimensional data that brings minor enhancements over data.frame.

  • tibble is defined by the tibble package, which is part of the tidyverse package.

  • Learn basic operations on data.frame and tibble

  • Highlight some differences between the two (They are almost interchangeable, and you do no need to pay much attention to which class of data you are dealing with)

We use mtcars dataset. It is readily available to you.


You can convert a data.frame into a tibble using as_tibble():


mtcars is a tibble (and also a data.frame).

display method

Accessing parts of the data

Subsetting a data.frame works in a very similar manner as a matrix:

matrix[rows of interest, columns of interest]

Example:

2nd to 8th rows, the 3rd and 5th columns

data.frame

tibble

But, subsetting a data.frame using numerical indexes are not recommended because it is not immediately clear to you (or your collaborators who might look at the code) what you intended to do with the code.

Instead, the following is better:

data[, c("variable 1", "variable 2", ...)]

Examples:

data.frame

tibble

Alternatively, to access a column (variable) from a dataset, you can use $ like below (remember? A data.frame is a special kind of list).

Examples:

data.frame

tibble

Data manipulation using the dplyr package


Data Preparation

We use flights, which is from the nycflights13 package.

We also use flights_mini, which is created by running the code below.


You do not have to understand the code just yet. You will, once you have completed the lecture. Just inspect the data and familiarize yourself with it.

Piping with %>%

Let f() be a function and x is an R object that f() accepts. Then,

x %>% f() is the same as f(x)

Try the following codes and confirm they return the same results:


Note

The piping operator %>% is actually from the magrittr package. But, when you load the dplyr package, it is automatically loaded as well.

Suppose you have more than one arguments to the function like this:

f(x1, x2, option 1, option 2)


Then,

z %>% f(x2, option 1, option 2)


is equivalent to

f(z, x2, option 1, option 2) 


Important

That is, in general, an R object that precedes the piping operator (%>%) becomes the first argument of the function that comes after the piping operator.

What if the object before the piping operator is not the first argument of the subsequent function?


This does not work:


because the above is equivalent to


You can refer to the preceding object by . like this:

You can keep piping like this:


Important

The object created by all the codes preceding the piping operator is passed down to the function after the piping operator.

For example, relocate(dep_time) receives as its first argument the outcome of the evaluation of the highlighted parts of the code below.

Consider the following sequence of actions:


  • Notice that you generated two intermediate datasets (a1 and a2) to obtain the dataset you wanted (a3).

  • These intermediate objects are generated only for the purpose of generating the final dataset.

  • It is easy to imagine that you will soon have lots of unnecessary intermediate objects on R.

Alternatively, you can do the following:


  • This does not create any intermediate objects unlike the first example.

  • However, it can be difficult to understand the code because the order of execution is the reverse of the order in which the functions are written when you read the code from left to right.

Taking advantage of the piping operator,


  • This is much easier to read as the order of execution is the same as the order in which the functions are written.

Data wrangling with the dplyr package

The dplyr packages provides useful functions to transform data.

  • filter(): select rows that satisfy user-specified conditions
  • select(): keep (remove) only the variables the user specified
  • mutate(): create (over-write) a variable based on user-specified formula
  • rename(): rename variables
  • arrange(): sort by variables specified by the user

Note

There are other useful functions. But, we limit our attention to only the above as they are sufficient to keep up with the main lectures.

dplyr::filter() subsets data row-wise using logical conditions based on variables.


Syntax

#--- syntax ---# 
dplyr::filter(dataset, condition 1, ..., condition K)

Observations where month is 4:


Observations where month is NOT 4:


Observations where month is less than 4:

This is very useful when you have many values to check.

Find the observations in June and July.


Answer

Code
dplyr::filter(flights_mini, month %in% c(6, 7))

Find the observations in January, April, July, September, and December.


Answer

Code
dplyr::filter(flights_mini, month %in% c(1, 4, 6, 9, 12))

Find the observations by carrier “US”


Answer

Code
dplyr::filter(flights_mini, carrier == "US")

You can select a subset of variables using dplyr::select().


Syntax

dplyr::select(dataset, variable 1, variable 2, ...)




If you want to drop some variables, but want to keep all the other variables, you can take advantage of operator:


Syntax

dplyr::select(dataset, - variable 1, - variable 2, ...)

Select arr_delay:


Select month, arr_delay, and dep_delay:


Deselect (remove) year and month:

Select all the columns except arr_delay


Answer

Code
dplyr::select(flights_mini, - arr_delay)

Select arr_delay and month


Answer

Code
dplyr::select(flights_mini, month, arr_delay)

You can use relocate() to change the column order.


Chosen variables come front with the order of the rest of the variable unchanged.

You can use mutate() to create a new variable (or overwrite the existing one) in the dataset:


Syntax

dplyr::mutate(data, new variable name = expression)

You can define multiple variables within a single mutate() function.

You can create a new variable based on the variables that have been just created within the same mutate() function.

You can apply functions to variables when creating new variables:


Note

The function you apply has to accept a vector (a variable column).

Sometimes, you want to to alter the values of a variable for specific rows that satisfy certain conditions.

Suppose you found out that dep_time for all the flights from JFK was misreported so that dep_time is 10 minutes earlier than the true departure times.

So, we would like to add 10 minutes to all the flights by JFK.

You can use ifelse() like this:

Suppose you want to label flights with arr_delay > 0 to be time-loss and time-gain otherwise:

You can use ifelse() for defining a dichotomous variable like this:

The case_when() function is useful if you have more than two cases.

Syntax:

case_when(
  condition 1 ~ value to assign,
  condition 2 ~ value to assign,
  condition 3 ~ value to assign,
  ...
)

Example:

Find the mean value of arr_delay in April and May (combined) and define it as a new variable named avg_arr_delay


Code
flights_mini %>%
  filter(month %in% c(4, 5)) %>%
  mutate(avg_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  relocate(avg_arr_delay)

Find the sum of dep_delay in January, February, and December (combined) and define it as a new variable named sum_dep_delay, and then move the variable to the first column of the dataset.


Code
flights %>%
  dplyr::filter(month %in% c(1, 2, 12)) %>%
  dplyr::mutate(sum_dep_delay = sum(dep_delay, na.rm = TRUE)) %>%
  dplyr::relocate(sum_dep_delay)

You can rename variables using dplyr::rename().


Syntax

rename(data, new variable name = old variable name, ...)


Example

Renamed:

Original:

You can use arrange() to reorder rows based on the value of variables.


Syntax

#--- Syntax (NOT RUN) ---#
arrange(flights_mini, variable name)


The default is the ascending order.

To arrange in the descending order, you use desc() function:

Notice that the original data flights_mini was not affected by the dplyr::filter() operations in the previous slides.

This is consistent across all the verbs in dplyr. Whatever actions you take, the original data is unaltered.

To use the transformed data for later use, you need to assign it to a new object (or overwrite if that is okay):

Grouped Operations

Group-wise operations, such as the mean of arrival delay by carrier are very useful to gain an insight into differences across groups.

The group_by() function in conjunction with summarize() function does exactly that.


Syntax

#--- group by variables ---#
group_by(dataset, variable to group by, variable to group by, ...)  

#--- summarize ---#
summarize(grouped dataset, expression)  

You first use group_by() to set the group for a dataset:

There are no apparent differences in flights_carrier from flights_slim. The only thing you did by group_by() was to tell R that whatever we will do to the new dataset is going to be done by group, which is reflected in “Groups” of the printed data.

Once the group is set, we are ready to do some group-wise (by carrier) operations. Let’s now find the means of arr_delay by carrier so we know which carriers perform better than others. We can do so using summarize().

Using the piping operator,

You can apply any functions that work on a vector (a variable)

You can assign the results of the grouped operations to new variables using mutate()

Using flights, find the carrier that had the longest average departure delay (dep_delay) during May through August.

For the rest of the exercises, we are going to use the weather data for the three airports in NY. First load the weather data and get familiar with the data set.

Find the daily mean temperature (temp), humidity (humid), wind speed (wind_speed), and precipitation (precip) by the origin of departure. Do not forget to name each daily weather variable.

Code
(
daily_weather <- 
  weather %>%
  dplyr::group_by(origin, month, day) %>%
  dplyr::summarize(
    temp = mean(temp),
    humid = mean(humid),
    wind_speed = mean(wind_speed),
    precip = mean(precip)
  )
) 

Subset the daily weather data you obtained in exercise 2 so that it contains weather information only in Nov, Dec, Jan, and Feb for flights that depart from “EWR”.

Code
daily_weather %>%
  dplyr::filter(month %in% c(11, 12, 1, 2)) %>%
  dplyr::filter(origin == "EWR")

Extensions (may be helpful occasionally)


across()

  • across() lets you apply the same operation (function) to multiple columns at the same time.

  • It is used with summarize() and mutate()

  • It is useful when you want to apply the same function to many variables.

Syntax

across(which variables to apply the function, function, options to the function)

Examples

Note: na.rm = TRUE is an option for the mean() function.

Without across()


With across()

across() with pattern functions


You can use across() (and other dplyr functions) in conjunction with functions that identify variables that fit with the pattern you specify:

  • is.numeric()
  • is.character()
  • starts_with()
  • ends_with()


Applying the same function to all numeric variables using where(is.numeric):


Applying the same function to all character variables using where(is.character):

  • n_distinct() identifies the number of unique values.


Applying the same function to all the variables that end with delay:

  • Use starts_with() to apply the same function to all the variables that star with a particular sting of characters:


You can use logical operators:


Mixing explicitly naming variables and using functions:

select() and relocate()

The approaches to refer to the variables that fit with a specified pattern can be applied to other dplyr operations as well:

Try: