Learn dplyr
basics for data wrangling.
Click on the three horizontally stacked lines at the bottom left corner of the slide, then you will see table of contents, and you can jump to the section you want
Hit letter “o” on your keyboard and you will have a panel view of all the slides
We use flights
, which is from the nycflights13
package.
We also use flights_mini
, which is created by running the code below.
You do not have to understand the code just yet. You will, once you have completed the lecture. Just inspect the data and familiarize yourself with it.
%>%
Let f()
be a function and x
is an R object that f()
accepts. Then,
x %>% f()
is the same as f(x)
Try the following codes and confirm they return the same results:
Note
The piping operator %>%
is actually from the magrittr
package. But, when you load the dplyr
package, it is automatically loaded as well.
Suppose you have more than one arguments to the function like this:
Then,
is equivalent to
Important
That is, in general, an R object that precedes the piping operator (%>%
) becomes the first argument of the function that comes after the piping operator.
What if the object before the piping operator is not the first argument of the subsequent function?
This does not work:
because the above is equivalent to
You can refer to the preceding object by .
like this:
You can keep piping like this:
Important
The object created by all the codes preceding the piping operator is passed down to the function after the piping operator.
For example, relocate(dep_time)
receives as its first argument the outcome of the evaluation of the highlighted parts of the code below.
Consider the following sequence of actions:
Notice that you generated two intermediate datasets (a1
and a2
) to obtain the dataset you wanted (a3
).
These intermediate objects are generated only for the purpose of generating the final dataset.
It is easy to imagine that you will soon have lots of unnecessary intermediate objects on R.
Alternatively, you can do the following:
This does not create any intermediate objects unlike the first example.
However, it can be difficult to understand the code because the order of execution is the reverse of the order in which the functions are written when you read the code from left to right.
Taking advantage of the piping operator,
dplyr
packageThe dplyr
packages provides useful functions to transform data.
filter()
: select rows that satisfy user-specified conditionsselect()
: keep (remove) only the variables the user specifiedmutate()
: create (over-write) a variable based on user-specified formularename()
: rename variablesarrange()
: sort by variables specified by the userNote
There are other useful functions. But, we limit our attention to only the above as they are sufficient to keep up with the main lectures.
dplyr::filter()
subsets data row-wise using logical conditions based on variables.
Syntax
Observations where month
is 4:
Observations where month
is NOT 4:
Observations where month
is less than 4:
This is very useful when you have many values to check.
Find the observations in January, April, July, September, and December.
Answer
You can select a subset of variables using dplyr::select()
.
Syntax
If you want to drop some variables, but want to keep all the other variables, you can take advantage of −
operator:
Syntax
Select arr_delay
:
Select month
, arr_delay
, and dep_delay
:
Deselect (remove) year
and month
:
You can use mutate()
to create a new variable (or overwrite the existing one) in the dataset:
Syntax
You can define multiple variables within a single mutate()
function.
You can create a new variable based on the variables that have been just created within the same mutate()
function.
You can apply functions to variables when creating new variables:
Note
The function you apply has to accept a vector (a variable column).
Sometimes, you want to to alter the values of a variable for specific rows that satisfy certain conditions.
Suppose you found out that dep_time
for all the flights from JFK
was misreported so that dep_time
is 10 minutes earlier than the true departure times.
So, we would like to add 10 minutes to all the flights by JFK
.
You can use ifelse()
like this:
Suppose you want to label flights with arr_delay > 0
to be time-loss
and time-gain
otherwise:
You can use ifelse()
for defining a dichotomous variable like this:
Find the mean value of arr_delay
in April and May (combined) and define it as a new variable named avg_arr_delay
Answer
Find the sum of dep_delay
in January, February, and December (combined) and define it as a new variable named sum_arr_delay
, and then move the variable to the first column of the dataset.
Answer
You can rename variables using dplyr::rename()
.
Syntax
Example
Renamed:
Original:
You can use arrange()
to reorder rows based on the value of variables.
Syntax
The default is the ascending order.
To arrange in the descending order, you use desc()
function:
Notice that the original data flights_mini
was not affected by the dplyr::filter()
operations in the previous slides.
This is consistent across all the verbs in dplyr
. Whatever actions you take, the original data is unaltered.
To use the transformed data for later use, you need to assign it to a new object (or overwrite if that is okay):
Group-wise operations, such as the mean of arrival delay by carrier are very useful to gain an insight into differences across groups.
The group_by()
function in conjunction with summarize()
function does exactly that.
Syntax
You first use group_by()
to set the group for a dataset:
There are no apparent differences in flights_carrier
from flights_slim
. The only thing you did by group_by()
was to tell R that whatever we will do to the new dataset is going to be done by group, which is reflected in “Groups” of the printed data.
Once the group is set, we are ready to do some group-wise (by carrier) operations. Let’s now find the means of arr_delay
by carrier so we know which carriers perform better than others. We can do so using summarize()
.
Using the piping operator,
You can apply any functions that work on a vector (a variable)
You can assign the results of the grouped operations to new variables using mutate()
Find the carrier that had the longest average delay during May through August. Below is the output you will see if you get it right.
For the rest of the exercises, we are going to use the weather data for the three airports in NY. First load the weather data and get familiar with the data set.
Find the daily mean temperature (temp), humidity (humid), wind speed (wind_speed), and precipitation (precip) by the origin of departure. Do not forget to name each daily weather variable. Below is the output you will see if you get it right.
Answer
Subset the daily weather data you obtained in exercise 2 so that it contains weather information only in Nov, Dec, Jan, and Feb for flights that depart from “EWR”. Below is the output you will see if you get it right.
Answer