data.table package for data wranglingClick on the three horizontally stacked lines at the bottom left corner of the slide, then you will see table of contents, and you can jump to the section you want
Hit letter “o” on your keyboard and you will have a panel view of all the slides
The objectives of this chapter is to learn how to use the data.table package to
data.table packagedata.table package is a popular alternative to dplyr that is much faster than dplyr for most data operations particularly when the dataset is large.
dplyr and data.table.dply vs data.table side by side.data.table has its own class of “data.frame” called data.tablePackage
Install the package if you have not and library it.
Dataset
We use the following weather dataset.
Here is the general form of data.table operation.
i: specify which rows (like dplyr::filter)j: specify the operations on selected columnsby: specify the variable to be used as groups by which operations specified in j are implementedYou can implement all the three main actions in a single statement unlike dplyr:
For example, the following set of codes below will give the same results (the number of flights by American Airline by origin-month):
data.table way:
dplyr way :
You can filter rows that satisfy certain conditions like below:
i: origin == "JFK" & month == 6Lj: no action (all columns)by: nonsingle column as a vector
i: no action (all rows)j: get arr_delay itself as it isby: nonsingle column as a data.table
. here is a short hand for list
multiple column as a data.table
select and rename multiple column as a data.table
Note
As long as j-expression returns a list, each element of the list will be converted to a column in the resulting data.table.
Select variables (like data.frame)
Another way to select particular columns is to provide a concatenated list of variable names in double quotes (just like a data.frame):
Dropping variables
You can drop variables by using - or ! in front of the list of the variables to drop:
An operation in j
Row-wise subset and operation in j at the same time
Remember the rule? .(m_arr = mean(arr_delay), m_dep = mean(dep_delay)) is a list, so the output is a data.table.
dplyr way:
Create the following dataset we are going to use in this slide:
flights_mini has two observations per month.
In data.table, you use := to create a new variable instead of =.
Important
:= operator updates data.table columns in-place (by reference), meaning the original data is altered.flights and confirm that speed is indeed in flights_mini.:=.dply.r::mutate() which does not alter the original datasetIf you have a reason to not wanting the original data to be altered after := operations. You can create a deep copy of the dataset using data.table::copy() function.
The object created by copy() is independent of the original dataset in the sense that actions on one of them do not affect the other.
Here are how you define multiple variables at the same time.
Multiple variable 1
The results of the nth expression is assigned to nth variable name on the left.
Multiple variable 2
Confirm that flights_mini was updated to have the new variables defined just above.
You can update column values for some rows that satisfy certain conditions by using logical evaluations in i and := in j.
Example
You can calculate grouped summary and assign the values to a variable by grouping in by and := with summary expressions in j.
Of course, all the rows in the same month will have the same value (mean of the arr_delay of the group).
The number of flights by origin.
i: no action (all rows)j: the number of observationsby: group by originNote: .N is a special symbol from the data.table package that means .red[the number of observations].
The number of flights by origin and month.
The number of flights by origin and month for carrier == "AA"
dplyr way :
You can use order() from the base package to sort a data.table.
Sorting is about shuffling rows, so you will be working on i.
Syntax
Example
You can use setnames() to rename variables.
Note: setnames() is one of the data.table operations that updates the dataset in-place (by reference) just like :=.
The data.table::shift() function can move up or down a variable.
By group
duplicated() checks whether each of the observations have other observations that are identical in values of the user-specified variables, and returns a TRUE/FALSE vector of length equal to the number of rows of the data.
The following code checks if there are any other flights that fly on the same hour of the same day.
unique() does the opposite of duplicated(). After applying unique(), you will be left with only the observations that are unique in all of the variables specified by the user (There will be only one observation that has the same values in all the user-specified variables).
fcase() is like case_when() in dplyr.
Example
Note: for this example, we could have just used fifelse() as the created variable is dichotomous.
.SD (which stands for Subset Data) is a special symbol that allows you to do many cool things.
Let’s create a small data.table that will help us understand what .SD does (we will come back to the code later).
Without grouping specified in by, .SD is the data.table itself. So, flights_mini[, .SD] is the same as flights_mini
But, when grouped, it becomes the subset (grouped) of the data.table.
Note that .SD contains all the columns except the grouping columns by default.
Apply the same function across all the variables:
Of course, you could get the same results by this, but the output is a list, not a data.table.
dplyr way:
Apply the same function across all the variables by group:
dplyr way:
Instead of let .SD contain all the columns, you can use .SDcols to pick variables to be included in .SD after by,.
Example
You cannot use .(variable name 1, variable name 1, ...) for .SDcols. This would fail:
A very important use case of .SD is identifying the observation with the maximum (or minimum) value of a variable by group.
Suppose you are trying to identify the flight that had the longest arrival delay by month-carrier.
Remember that .SD is a list of data.tables grouped by carrier and month. .SD[which.max(arr_dealy), ] will find the row where the arr_delay is the highest by group (month-carrier).
Create the following datasets in the long format:
You can use dcast() function to make a long dataset wide:
Syntax
Using the pivot_wider() language,
B is equivalent to variables you specify for names_fromC is equivalent to variables you specify for values_fromExample
Original long data:
Wide data:
Long data
This data has multiple rows to be spread: yield and rainfall.
Wide data
This data has multiple sets of columns to be melted: yield_* and rainfall_*.
It is easy to cast multiple variables to make a long data wide. You just need to give a list of variable names to the value.var option.
It is not as simple to make a wide data with multiple sets of columns to long.
You can provide a list of sets of variables names to melt() to tell R which variables are belong to the same group using the measure() option.
Note however that year information from the variable names are lost. In the resulting dataset, variable == 1 and variable == 2 correspond to 2019 and 2020, respectively. So, you need an additional step to recover the original long data format.
Alternatively, it is probably better to follow the multi-step approach we took we used when we use pivot_*() in dplyr.
The strategy here is to
yield_* and rainfall_* to be separate variables eventuallytstrsplit())dcast() to make it widerYou can use the following syntax:
This is the same as:
So, dt2 is the base dataset and you are attaching dt1 to dt2.
Let’s first create a price data:
Now merge:
There is nothing to prevent you from using dplyr::left_join().
Remember, data.table is also a data.frame. So, any function that works with data.frame works for data.table as well.