A primer of the dplyr package

Before you start


Learning objectives

Learn dplyr basics for data wrangling.


Tips to make the most of the lecture notes


Interactive navigation tools

  • Click on the three horizontally stacked lines at the bottom left corner of the slide, then you will see table of contents, and you can jump to the section you want

  • Hit letter “o” on your keyboard and you will have a panel view of all the slides


Running and writing codes

  • The box area with a hint of blue as the background color is where you can write code (hereafter referred to as the “code area”).
  • Hit the “Run Code” button to execute all the code inside the code area.
  • You can evaluate (run) code selectively by highlighting the parts you want to run and hitting Command + Enter for Mac (Ctrl + Enter for Windows).
  • If you want to run the codes on your computer, you can first click on the icon with two sheets of paper stacked on top of each other (top right corner of the code chunk), which copies the code in the code area. You can then paste it onto your computer.
  • You can click on the reload button (top right corner of the code chunk, left to the copy button) to revert back to the original code.

Data Preparation

We use flights, which is from the nycflights13 package.

We also use flights_mini, which is created by running the code below.


You do not have to understand the code just yet. You will, once you have completed the lecture. Just inspect the data and familiarize yourself with it.

Piping with %>%

Let f() be a function and x is an R object that f() accepts. Then,

x %>% f() is the same as f(x)

Try the following codes and confirm they return the same results:


Note

The piping operator %>% is actually from the magrittr package. But, when you load the dplyr package, it is automatically loaded as well.

Suppose you have more than one arguments to the function like this:

f(x1, x2, option 1, option 2)


Then,

z %>% f(x2, option 1, option 2)


is equivalent to

f(z, x2, option 1, option 2) 


Important

That is, in general, an R object that precedes the piping operator (%>%) becomes the first argument of the function that comes after the piping operator.

What if the object before the piping operator is not the first argument of the subsequent function?


This does not work:


because the above is equivalent to


You can refer to the preceding object by . like this:

You can keep piping like this:


Important

The object created by all the codes preceding the piping operator is passed down to the function after the piping operator.

For example, relocate(dep_time) receives as its first argument the outcome of the evaluation of the highlighted parts of the code below.

Consider the following sequence of actions:


  • Notice that you generated two intermediate datasets (a1 and a2) to obtain the dataset you wanted (a3).

  • These intermediate objects are generated only for the purpose of generating the final dataset.

  • It is easy to imagine that you will soon have lots of unnecessary intermediate objects on R.

Alternatively, you can do the following:


  • This does not create any intermediate objects unlike the first example.

  • However, it can be difficult to understand the code because the order of execution is the reverse of the order in which the functions are written when you read the code from left to right.

Taking advantage of the piping operator,


  • This is much easier to read as the order of execution is the same as the order in which the functions are written.

Data wrangling with the dplyr package

The dplyr packages provides useful functions to transform data.

  • filter(): select rows that satisfy user-specified conditions
  • select(): keep (remove) only the variables the user specified
  • mutate(): create (over-write) a variable based on user-specified formula
  • rename(): rename variables
  • arrange(): sort by variables specified by the user

Note

There are other useful functions. But, we limit our attention to only the above as they are sufficient to keep up with the main lectures.

dplyr::filter() subsets data row-wise using logical conditions based on variables.


Syntax

#--- syntax ---# 
dplyr::filter(dataset, condition 1, ..., condition K)

Observations where month is 4:


Observations where month is NOT 4:


Observations where month is less than 4:

This is very useful when you have many values to check.

Find the observations in June and July.


Answer

Code
dplyr::filter(flights_mini, month %in% c(6, 7))

Find the observations in January, April, July, September, and December.


Answer

Code
dplyr::filter(flights_mini, month %in% c(1, 4, 6, 9, 12))

Find the observations by carrier “US”


Answer

Code
dplyr::filter(flights_mini, carrier == "US")

You can select a subset of variables using dplyr::select().


Syntax

dplyr::select(dataset, variable 1, variable 2, ...)




If you want to drop some variables, but want to keep all the other variables, you can take advantage of operator:


Syntax

dplyr::select(dataset, - variable 1, - variable 2, ...)

Select arr_delay:


Select month, arr_delay, and dep_delay:


Deselect (remove) year and month:

Select all the columns except arr_delay


Answer

Code
dplyr::select(flights_mini, - arr_delay)

Select arr_delay and month


Answer

Code
dplyr::select(flights_mini, month, arr_delay)

You can use mutate() to create a new variable (or overwrite the existing one) in the dataset:


Syntax

dplyr::mutate(data, new variable name = expression)

You can define multiple variables within a single mutate() function.

You can create a new variable based on the variables that have been just created within the same mutate() function.

You can apply functions to variables when creating new variables:


Note

The function you apply has to accept a vector (a variable column).

Sometimes, you want to to alter the values of a variable for specific rows that satisfy certain conditions.

Suppose you found out that dep_time for all the flights from JFK was misreported so that dep_time is 10 minutes earlier than the true departure times.

So, we would like to add 10 minutes to all the flights by JFK.

You can use ifelse() like this:

Suppose you want to label flights with arr_delay > 0 to be time-loss and time-gain otherwise:

You can use ifelse() for defining a dichotomous variable like this:

The case_when() function is useful if you have more than two cases.

Syntax:

case_when(
  condition 1 ~ value to assign,
  condition 2 ~ value to assign,
  condition 3 ~ value to assign,
  ...
)

Example:

Find the mean value of arr_delay in April and May (combined) and define it as a new variable named avg_arr_delay


Answer

Find the sum of dep_delay in January, February, and December (combined) and define it as a new variable named sum_arr_delay, and then move the variable to the first column of the dataset.


Answer

You can rename variables using dplyr::rename().


Syntax

rename(data, new variable name = old variable name, ...)


Example

Renamed:

Original:

You can use arrange() to reorder rows based on the value of variables.


Syntax

#--- Syntax (NOT RUN) ---#
arrange(flights_mini, variable name)


The default is the ascending order.

To arrange in the descending order, you use desc() function:

Notice that the original data flights_mini was not affected by the dplyr::filter() operations in the previous slides.

This is consistent across all the verbs in dplyr. Whatever actions you take, the original data is unaltered.

To use the transformed data for later use, you need to assign it to a new object (or overwrite if that is okay):

Grouped Operations

Group-wise operations, such as the mean of arrival delay by carrier are very useful to gain an insight into differences across groups.

The group_by() function in conjunction with summarize() function does exactly that.


Syntax

#--- group by variables ---#
group_by(dataset, variable to group by, variable to group by, ...)  

#--- summarize ---#
summarize(grouped dataset, expression)  

You first use group_by() to set the group for a dataset:

There are no apparent differences in flights_carrier from flights_slim. The only thing you did by group_by() was to tell R that whatever we will do to the new dataset is going to be done by group, which is reflected in “Groups” of the printed data.

Once the group is set, we are ready to do some group-wise (by carrier) operations. Let’s now find the means of arr_delay by carrier so we know which carriers perform better than others. We can do so using summarize().

Using the piping operator,

You can apply any functions that work on a vector (a variable)

You can assign the results of the grouped operations to new variables using mutate()

Find the carrier that had the longest average delay during May through August. Below is the output you will see if you get it right.

For the rest of the exercises, we are going to use the weather data for the three airports in NY. First load the weather data and get familiar with the data set.

Find the daily mean temperature (temp), humidity (humid), wind speed (wind_speed), and precipitation (precip) by the origin of departure. Do not forget to name each daily weather variable. Below is the output you will see if you get it right.


Answer

Code
daily_weather <- 
  weather %>%
  dplyr::group_by(origin, month, day) %>%
  dplyr::summarize(
    temp = mean(temp),
    humid = mean(humid),
    wind_speed = mean(wind_speed),
    precip = mean(precip)
  )

Subset the daily weather data you obtained in exercise 2 so that it contains weather information only in Nov, Dec, Jan, and Feb for flights that depart from “EWR”. Below is the output you will see if you get it right.


Answer

Code
(
daily_weather %>%
  dplyr::filter(month %in% c(11, 12, 1, 2)) %>%
  dplyr::filter(origin == "EWR")
)