A.2 Parallelization of embarrassingly parallel processes

Parallelization of computation involves distributing the task at hand to multiple cores so that multiple processes run in parallel. Here, we learn how to parallelize computation in R. Our focus is on so-called embarrassingly parallel processes: collections of processes in which each process is completely independent of the others. That is, no process uses the output of any other process. The integer-squaring example is embarrassingly parallel: in order to calculate \(1^2\), you do not need the result of \(2^2\) or any other square. Embarrassingly parallel processes are very easy to parallelize because you do not have to worry about which process must finish before another can start. Fortunately, most of the processes you will be interested in parallelizing fall into this category104.
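To make the distinction concrete, here is a minimal illustrative sketch (the objects below are hypothetical and not used later): each iteration of the squaring task depends only on its own input, whereas each step of a cumulative sum needs the result of the previous step and therefore cannot be farmed out independently.

#--- embarrassingly parallel: each result depends only on its own input ---#
squares <- lapply(1:10, function(x) x^2)

#--- not embarrassingly parallel: step i needs the result of step i - 1 ---#
cumulative <- numeric(10)
cumulative[1] <- 1
for (i in 2:10) {
  cumulative[i] <- cumulative[i - 1] + i
}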

We will use the future_lapply() function from the future.apply package for parallelization105. With this package, parallelization is a piece of cake, as the syntax is essentially identical to that of lapply().

#--- load packages ---#
library(future.apply) 

You can find out how many cores you have available for parallel computation on your computer using the detectCores() function from the parallel package.

library(parallel)

#--- number of all cores ---#
detectCores()
[1] 16

Before we use future_lapply(), we need to declare which backend process to use with plan(). Here, we use plan(multiprocess)106. In plan(), we can also specify the number of workers. Here, I use the total number of cores less one107.

plan(multiprocess, workers = detectCores() - 1)
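As a side remark, you can switch back to serial execution at any point in a session by resetting the backend; plan(sequential) below is the future package’s default serial backend:

#--- revert to serial processing ---#
plan(sequential)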

future_lapply() works exactly like lapply().

sq_ls <- future_lapply(1:1000, function(x) x^2)

This is it. The only difference from the serial processing with lapply() is that you changed the function name to future_lapply().

Okay, now we know how to parallelize computation. Let’s check how much of an improvement in execution time we get from parallelization.

#--- load the benchmarking package ---#
library(microbenchmark)

microbenchmark(
  #--- parallelized ---#
  "parallelized" = { sq_ls <- future_lapply(1:1000, function(x) x^2) }, 
  #--- non-parallelized ---#
  "not parallelized" = { sq_ls <- lapply(1:1000, function(x) x^2) },
  times = 100, 
  unit = "ms"
)
Unit: milliseconds
             expr       min         lq       mean    median         uq       max neval
     parallelized 74.337087 88.5161295 99.5311075 97.196593 108.210706 136.86259   100
 not parallelized  0.411587  0.4805975  0.7229046  0.536682   0.683037  11.07314   100

Hmmmm, okay, so parallelization made the code slower… How could this be? The reason is that communicating jobs to each core takes time as well. If each of the iterative processes is very fast (like this example, where you just square a number), the time spent communicating with the cores outweighs the time saved by computing in parallel. Parallelization is more beneficial when each of the repetitive processes takes a long time.
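To see the flip side, here is a sketch of the same benchmark with a heavier per-iteration task; heavy_task and the iteration counts below are made up for illustration and the timings are not shown, but with enough work per call the communication overhead should be dwarfed by the computation and the parallelized version should come out ahead.

#--- a heavier task per iteration (hypothetical example) ---#
heavy_task <- function(x) mean(sort(rnorm(5e5)))

microbenchmark(
  #--- parallelized (future.seed = TRUE gives workers sound random-number streams) ---#
  "parallelized" = { future_lapply(1:100, heavy_task, future.seed = TRUE) },
  #--- non-parallelized ---#
  "not parallelized" = { lapply(1:100, heavy_task) },
  times = 10,
  unit = "s"
)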

Monte Carlo (MC) simulation is one of the best use cases for parallelization. The following MC simulation tests whether correlation between an independent variable and the error term causes bias (yes, we know the answer). The MC_sim function first generates a dataset of 50,000 observations according to the following data generating process:

\[ y = 1 + x + v \]

where \(\mu \sim N(0,1)\), \(x \sim N(0,1) + \mu\), and \(v \sim N(0,1) + \mu\). The \(\mu\) term causes correlation between \(x\) (the covariate) and \(v\) (the error term). The function then estimates the coefficient on \(x\) via OLS and returns the estimate. We would like to repeat this process 1,000 times to understand the properties of the OLS estimator under this data generating process. This Monte Carlo simulation is embarrassingly parallel because each iteration is independent of the others.

#--- load data.table for constructing the dataset ---#
library(data.table)

#--- define a single MC iteration ---#
MC_sim <- function(i){

  N <- 50000 # sample size

  #--- generate the data ---#
  mu <- rnorm(N) # the common term shared by both x and v
  x <- rnorm(N) + mu # independent variable
  v <- rnorm(N) + mu # error
  y <- 1 + x + v # dependent variable
  data <- data.table(y = y, x = x)

  #--- OLS ---#
  reg <- lm(y ~ x, data = data) # OLS

  #--- return the coef on x ---#
  return(reg$coef['x'])
}

Let’s run one iteration:

library(tictoc) # provides the tic()/toc() timers

tic()
MC_sim(1)
toc()
       x 
1.503353 
elapsed 
  0.023 

Okay, so one iteration takes about 0.023 seconds, which suggests roughly 23 seconds (0.023 × 1,000) if all 1,000 iterations are run serially. Now, let’s run this 1,000 times with and without parallelization.

Not parallelized

#--- non-parallel ---#
tic()
MC_results <- lapply(1:1000, MC_sim)
toc()
elapsed 
 22.987 

Parallelized

#--- parallel ---#
tic()
MC_results <- future_lapply(1:1000, MC_sim)
toc()
elapsed 
  4.376 

As you can see, parallelization makes the code much quicker, with a noticeable difference in elapsed time: the parallelized run is about 5.25 times faster (22.987/4.376). However, we did not get a 15-fold speedup even though we used 15 cores, because of the overhead associated with distributing tasks to the cores. The relative advantage of parallelization grows as each iteration takes longer. For example, a process that takes about 2 minutes per iteration and is repeated 1,000 times would take approximately 33 hours and 20 minutes to run serially, but only a little over 2 hours if parallelized over 15 cores (somewhat more in practice because of the overhead), and even less with more cores.
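For reference, the back-of-the-envelope arithmetic behind these projected times (ignoring overhead) is simple division:

#--- projected run times (illustrative arithmetic) ---#
2 * 1000 / 60        # roughly 33.3 hours of serial computation at 2 minutes per iteration
(2 * 1000 / 60) / 15 # roughly 2.2 hours under an ideal 15-fold speedup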

A.2.1 Mac or Linux users

For Mac or Linux users, parallel::mclapply() is just as compelling (or pbmclapply::pbmclapply() if you want a nice progress report, which is particularly helpful when the process is long). It is just as easy to use as future_lapply() because its syntax is the same as that of lapply(). You can control the number of cores to employ with the mc.cores option. Here is example code that runs the same MC simulations we conducted above:

#--- mclapply ---#
library(parallel)
MC_results <- mclapply(1:1000, MC_sim, mc.cores = detectCores() - 1)

#--- or with progress bar ---#
library(pbmclapply)
MC_results <- pbmclapply(1:1000, MC_sim, mc.cores = detectCores() - 1)
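One practical aside: MC_sim draws random numbers, and parallel workers need properly managed random-number streams for the draws to be statistically sound and reproducible. With future_lapply(), that is what the future.seed argument is for; a minimal sketch (the seed value below is arbitrary):

#--- reproducible parallel random numbers with future_lapply() ---#
MC_results <- future_lapply(1:1000, MC_sim, future.seed = 2023)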

  104. A good example of a non-embarrassingly parallel process is dynamic optimization via backward induction: you need to know the optimal solution at \(t = T\) before you can find the optimal solution at \(t = T-1\).↩︎

  105. There are many other options, including the parallel and foreach packages.↩︎

  106. If you are a Mac or Linux user, the multicore process will be used, while the multisession process will be used on a Windows machine. The multicore process is superior to the multisession process. See this lecture note on parallel programming using R by Dr. Grant McDermott at the University of Oregon for the distinction between the two and many other useful concepts for parallelization. At the time of this writing, if you run R through RStudio, the multiprocess option is always redirected to the multisession option because of instability in multiprocess. If you use Linux or Mac and want to take full advantage of future.apply, you should not run your R programs through RStudio, at least for now.↩︎

  107. This way, you have one core left over to do other tasks comfortably. However, if you don’t mind devoting your computer entirely to the processing task at hand, there is no reason not to use all the cores.↩︎