A-1: data.table package for data wrangling

Tips to make the most of the lecture notes

  • Click on the three horizontally stacked lines at the bottom left corner of the slide, then you will see table of contents, and you can jump to the section you want

  • Hit letter “o” on your keyboard and you will have a panel view of all the slides

  • The box area with a hint of blue as the background color is where you can write code (hereafter referred to as the “code area”).
  • Hit the “Run Code” button to execute all the code inside the code area.
  • You can evaluate (run) code selectively by highlighting the parts you want to run and hitting Command + Enter for Mac (Ctrl + Enter for Windows).
  • If you want to run the codes on your computer, you can first click on the icon with two sheets of paper stacked on top of each other (top right corner of the code chunk), which copies the code in the code area. You can then paste it onto your computer.
  • You can click on the reload button (top right corner of the code chunk, left to the copy button) to revert back to the original code.

Learning objectives

The objectives of this chapter is to learn how to use the data.table package to

  • manipulate data
  • reshape a dataset
  • merge multiple datasets

data.table package

  • The data.table package is a popular alternative to dplyr that is much faster than dplyr for most data operations particularly when the dataset is large.
    • See here for the speed comparison of dplyr and data.table.
    • This website compares dply vs data.table side by side.
  • data.table has its own class of “data.frame” called data.table

Package

Install the package if you have not and library it.

#--- install ---# 
install.packages("data.table")

#--- library ---#
library(data.table)

Dataset

We use the following weather dataset.

Here is the general form of data.table operation.

data.table[i, j, by]
  • i: specify which rows (like dplyr::filter)
  • j: specify the operations on selected columns
  • by: specify the variable to be used as groups by which operations specified in j are implemented

You can implement all the three main actions in a single statement unlike dplyr:

For example, the following set of codes below will give the same results (the number of flights by American Airline by origin-month):

data.table way:


dplyr way :

Key actions

You can filter rows that satisfy certain conditions like below:

  • i: origin == "JFK" & month == 6L
  • j: no action (all columns)
  • by: non

single column as a vector

  • i: no action (all rows)
  • j: get arr_delay itself as it is
  • by: non

single column as a data.table

. here is a short hand for list

multiple column as a data.table


select and rename multiple column as a data.table

Note

As long as j-expression returns a list, each element of the list will be converted to a column in the resulting data.table.

Select variables (like data.frame)

Another way to select particular columns is to provide a concatenated list of variable names in double quotes (just like a data.frame):


Dropping variables

You can drop variables by using - or ! in front of the list of the variables to drop:

An operation in j


Row-wise subset and operation in j at the same time

Remember the rule? .(m_arr = mean(arr_delay), m_dep = mean(dep_delay)) is a list, so the output is a data.table.


dplyr way:

Create the following dataset we are going to use in this slide:


flights_mini has two observations per month.

In data.table, you use := to create a new variable instead of =.

Important

  • := operator updates data.table columns in-place (by reference), meaning the original data is altered.
  • Evaluate flights and confirm that speed is indeed in flights_mini.
  • This holds true for any operations involving :=.
  • This is different from dply.r::mutate() which does not alter the original dataset

If you have a reason to not wanting the original data to be altered after := operations. You can create a deep copy of the dataset using data.table::copy() function.

The object created by copy() is independent of the original dataset in the sense that actions on one of them do not affect the other.

Here are how you define multiple variables at the same time.

Multiple variable 1

The results of the nth expression is assigned to nth variable name on the left.


Multiple variable 2

Confirm that flights_mini was updated to have the new variables defined just above.

You can update column values for some rows that satisfy certain conditions by using logical evaluations in i and := in j.

Example

You can calculate grouped summary and assign the values to a variable by grouping in by and := with summary expressions in j.

Of course, all the rows in the same month will have the same value (mean of the arr_delay of the group).

Grouped operations

The number of flights by origin.

  • i: no action (all rows)
  • j: the number of observations
  • by: group by origin

Note: .N is a special symbol from the data.table package that means .red[the number of observations].

The number of flights by origin and month.

The number of flights by origin and month for carrier == "AA"


dplyr way :

Other useful operations and tips

You can use order() from the base package to sort a data.table.

Sorting is about shuffling rows, so you will be working on i.

Syntax

#--- NOT RUN ---#
data.table[order(variable 1, vairable 2, ..), ]  


Example

You can use setnames() to rename variables.

Note: setnames() is one of the data.table operations that updates the dataset in-place (by reference) just like :=.

The data.table::shift() function can move up or down a variable.


By group

duplicated() checks whether each of the observations have other observations that are identical in values of the user-specified variables, and returns a TRUE/FALSE vector of length equal to the number of rows of the data.

The following code checks if there are any other flights that fly on the same hour of the same day.

unique() does the opposite of duplicated(). After applying unique(), you will be left with only the observations that are unique in all of the variables specified by the user (There will be only one observation that has the same values in all the user-specified variables).

fcase() is like case_when() in dplyr.

fcase(
  condition 1, value 1,  
  condition 2, value 2,  
  condition 3, value 3,  
  .
  .
  .
) 

Example

Note: for this example, we could have just used fifelse() as the created variable is dichotomous.

You can use %>% to chain piped operations just like dplyr using . to refer to the data.table generated through the preceding actions.

Example:

flights[, .(.N), by = .(origin, month)] %>% 
  .[i, j, by] %>% 
  .[i, j, by] %>% 
  .[i, j, by]  

.SD

.SD (which stands for Subset Data) is a special symbol that allows you to do many cool things.

Let’s create a small data.table that will help us understand what .SD does (we will come back to the code later).

Without grouping specified in by, .SD is the data.table itself. So, flights_mini[, .SD] is the same as flights_mini

But, when grouped, it becomes the subset (grouped) of the data.table.

Note that .SD contains all the columns except the grouping columns by default.

Apply the same function across all the variables:


Of course, you could get the same results by this, but the output is a list, not a data.table.


dplyr way:

Apply the same function across all the variables by group:


dplyr way:

Instead of let .SD contain all the columns, you can use .SDcols to pick variables to be included in .SD after by,.

Example


You cannot use .(variable name 1, variable name 1, ...) for .SDcols. This would fail:

A very important use case of .SD is identifying the observation with the maximum (or minimum) value of a variable by group.

Suppose you are trying to identify the flight that had the longest arrival delay by month-carrier.

Remember that .SD is a list of data.tables grouped by carrier and month. .SD[which.max(arr_dealy), ] will find the row where the arr_delay is the highest by group (month-carrier).

Reshaping datasets

Create the following datasets in the long format:

You can use dcast() function to make a long dataset wide:

Syntax

#--- NOT RUN ---#  
dcast(data.table, A ~ B, value.var = C)

Using the pivot_wider() language,

  • B is equivalent to variables you specify for names_from
  • C is equivalent to variables you specify for values_from
  • A are all the variables except B and C.


Example

Original long data:

Wide data:

You can use melt() function to make a wide dataset long:

Syntax

melt(data.table, id.vars = "state")  
  • id.vars are the variables except those that will be melt into long format


Example

Original wide data:

Long data:

Reshaping datasets: multiple columns

Long data

This data has multiple rows to be spread: yield and rainfall.

Wide data

This data has multiple sets of columns to be melted: yield_* and rainfall_*.

It is easy to cast multiple variables to make a long data wide. You just need to give a list of variable names to the value.var option.

It is not as simple to make a wide data with multiple sets of columns to long.

You can provide a list of sets of variables names to melt() to tell R which variables are belong to the same group using the measure() option.

Note however that year information from the variable names are lost. In the resulting dataset, variable == 1 and variable == 2 correspond to 2019 and 2020, respectively. So, you need an additional step to recover the original long data format.

Alternatively, it is probably better to follow the multi-step approach we took we used when we use pivot_*() in dplyr.

The strategy here is to

  • first make the data long ignoring the fact we want yield_* and rainfall_* to be separate variables eventually
  • split variable names into two: variable meaning and year (done by tstrsplit())
  • and then use dcast() to make it wider

Merging datasets

You can use the following syntax:

dt1[dt2, on = .(list of variables)]


This is the same as:

left_join(dt2, dt1, by = c(list of variables)) 


So, dt2 is the base dataset and you are attaching dt1 to dt2.

Let’s first create a price data:


Now merge:

There is nothing to prevent you from using dplyr::left_join().


Remember, data.table is also a data.frame. So, any function that works with data.frame works for data.table as well.