A Loop and Parallel Computing
Before you start
Here we will learn how to program repetitive operations effectively and quickly. We start from the basics of a loop for those who are not familiar with the concept. We then cover parallel computation using the future.apply and parallel packages. Those who are already familiar with lapply() can go straight to Section A.2.
Here are the specific learning objectives of this chapter.
- Learn how to use a for loop and lapply() to complete repetitive jobs
- Learn how not to loop over things that can be easily vectorized
- Learn how to parallelize repetitive jobs using the future_lapply() function from the future.apply package
Packages to install and load
Run the following code to install (or load, if already installed) the pacman package, and then install or load the packages listed inside the pacman::p_load() function.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
dplyr, # data wrangling
data.table # data wrangling
)
There are other packages that will be loaded during the demonstration.
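If you prefer to install everything up front, the following would also work. This list is my addition based on the packages that appear later in the chapter; in particular, it assumes that the tic() and toc() timers used later come from the tictoc package.
pacman::p_load(
  microbenchmark, # timing comparisons
  future.apply, # parallelized lapply()
  tictoc # tic()/toc() timers (assumed source)
)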
A.1 Repetitive processes and looping
A.1.1 What is looping?
We sometimes need to run the same process over and over again, often with slight changes in parameters. In such cases, it is very time-consuming and messy to write out all of the steps one by one. For example, suppose you are interested in knowing the square of 1 through 5 (\([1, 2, 3, 4, 5]\)). The following code certainly works:
1^2
[1] 1
2^2
[1] 4
3^2
[1] 9
4^2
[1] 16
5^2
[1] 25
However, imagine you have to do this for 1,000 integers. Yes, you don't want to write each one of them out one by one, as that would occupy 1,000 lines of your code and would be time-consuming. Things will be even worse if you need to repeat much more complicated processes like Monte Carlo simulations. So, let's learn how to write a program to do repetitive jobs effectively using a loop.
Looping means repeatedly evaluating the same process (except for changes in parameters) over and over again. In the example above, the repeated process is the action of squaring, which does not change across iterations. What changes is what you square. Looping helps you write concise code to implement these repetitive processes.
A.1.2 For loop
Here is how a for loop works in general:
for (x in a_list_of_values){
you do what you want to do with x
}
As an example, let’s use this looping syntax to get the same results as the manual squaring of 1 through 5:
for (x in 1:5) {
print(x^2)
}
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
Here, the list of values is \([1, 2, 3, 4, 5]\). For each value in the list, you square it (x^2) and then print it (print()). If you want to get the square of \(1:1000\), the only thing you need to change is the list of values to loop over, as in:
#--- evaluation not reported as it's too long ---#
for (x in 1:1000) {
print(x^2)
}
So, the length of the code does not depend on how many repetitions you do, which is an obvious improvement over manually typing every single process one by one. Note that you do not have to use \(x\) to refer to the object you loop over. It could be any combination of letters as long as you use it consistently inside the loop. So, this would work just fine:
for (bluh_bluh_bluh in 1:5) {
print(bluh_bluh_bluh^2)
}
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
A.1.3 For loop using the lapply()
function
You can write a for loop using the lapply() function as well. Here is how it works:
#--- NOT RUN ---#
lapply(A, B)
where \(A\) is the list of values you go through one by one in the order the values are stored, and \(B\) is the function you would like to apply to each of the values in \(A\). For example, the following code does exactly the same thing as the above for loop example.
lapply(1:5, function(x) {
x^2
})
[[1]]
[1] 1
[[2]]
[1] 4
[[3]]
[1] 9
[[4]]
[1] 16
[[5]]
[1] 25
Here, \(A\) is \([1, 2, 3, 4, 5]\). In \(B\) you have a function that takes \(x\) and squares it. So, the above code applies the function to each element of \([1, 2, 3, 4, 5]\) one by one. In many circumstances, you can write the same looping actions much more concisely using the lapply() function than by explicitly writing out the loop as in the for loop examples above. You might have noticed that the output is a list. Yes, lapply() returns its outcomes in a list. That is where the l in lapply() comes from.
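If you would rather have a plain numeric vector than a list, you can flatten the result with unlist() (base R):
#--- flatten the list returned by lapply() into a numeric vector ---#
unlist(lapply(1:5, function(x) x^2))
[1]  1  4  9 16 25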
When the operation you would like to repeat becomes complicated (almost always the case), it is advisable that you create a function of that process first.
#--- define the function first ---#
square_it <- function(x) {
return(x^2)
}
#--- lapply using the pre-defined function ---#
lapply(1:5, square_it)
[[1]]
[1] 1
[[2]]
[1] 4
[[3]]
[1] 9
[[4]]
[1] 16
[[5]]
[1] 25
Finally, it is a myth that you should always use lapply() instead of the explicit for loop syntax because lapply() (or the other members of the apply() family) is faster. They are basically the same in terms of speed.
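If you want to verify this on your own machine, here is a minimal sketch using the microbenchmark package (which is also used later in this chapter); the exact numbers will of course differ across machines:
library(microbenchmark)

microbenchmark(
  #--- explicit for loop ---#
  "for loop" = {
    squares <- vector("list", 1000)
    for (i in 1:1000) {
      squares[[i]] <- i^2
    }
  },
  #--- lapply() ---#
  "lapply" = {
    squares <- lapply(1:1000, function(i) i^2)
  },
  times = 100
)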
A.1.4 Looping over multiple variables using lapply()
lapply() allows you to loop over only one variable. However, it is often the case that you want to loop over multiple variables. This is easy to achieve: the trick is to create a data.frame that stores the complete list of combinations of the variables, and then loop over the rows of that data.frame. As an example, suppose we are interested in understanding the sensitivity of corn revenue to corn price and applied nitrogen amount. We consider the range of $3.0/bu to $5.0/bu for corn price and 0 lb/acre to 300 lb/acre for nitrogen rate.
#--- corn price vector ---#
corn_price_vec <- seq(3, 5, by = 1)
#--- nitrogen vector ---#
nitrogen_vec <- seq(0, 300, by = 100)
After creating vectors of the parameters, you combine them to create the complete set of parameter combinations using the expand.grid() function, and then convert it to a data.frame object (expand.grid() itself already returns a data.frame, so this step just makes the class explicit).
#--- create a data.frame that holds parameter sets to loop over ---#
parameters_data <-
expand.grid(
corn_price = corn_price_vec,
nitrogen = nitrogen_vec
) %>%
#--- make sure it is a data.frame ---#
data.frame()
#--- take a look ---#
parameters_data
corn_price nitrogen
1 3 0
2 4 0
3 5 0
4 3 100
5 4 100
6 5 100
7 3 200
8 4 200
9 5 200
10 3 300
11 4 300
12 5 300
We now define a function that takes a row number, refers to parameters_data to extract the parameters stored in that row, and then calculates corn yield and revenue based on the extracted parameters.
gen_rev_corn <- function(i) {
#--- define corn price ---#
corn_price <- parameters_data[i, "corn_price"]
#--- define nitrogen ---#
nitrogen <- parameters_data[i, "nitrogen"]
#--- calculate yield ---#
yield <- 240 * (1 - exp(0.4 - 0.02 * nitrogen))
#--- calculate revenue ---#
revenue <- corn_price * yield
#--- combine all the information you would like to have ---#
data_to_return <- data.frame(
corn_price = corn_price,
nitrogen = nitrogen,
revenue = revenue
)
return(data_to_return)
}
This function takes \(i\) (acting as a row number within the function), extracts corn price and nitrogen from the \(i\)th row of parameters_data, and uses them to calculate yield and revenue. Finally, it returns a data.frame of all the information you used (the parameters and the outcomes).
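Before looping over all the rows, you can sanity-check the function on a single row; the result should match the first element of the loop output shown below.
#--- run the function for the first parameter set ---#
gen_rev_corn(1)
  corn_price nitrogen   revenue
1          3        0 -354.1138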
#--- loop over all the parameter combinations ---#
rev_data <- lapply(1:nrow(parameters_data), gen_rev_corn)
#--- take a look ---#
rev_data
[[1]]
corn_price nitrogen revenue
1 3 0 -354.1138
[[2]]
corn_price nitrogen revenue
1 4 0 -472.1517
[[3]]
corn_price nitrogen revenue
1 5 0 -590.1896
[[4]]
corn_price nitrogen revenue
1 3 100 574.6345
[[5]]
corn_price nitrogen revenue
1 4 100 766.1793
[[6]]
corn_price nitrogen revenue
1 5 100 957.7242
[[7]]
corn_price nitrogen revenue
1 3 200 700.3269
[[8]]
corn_price nitrogen revenue
1 4 200 933.7692
[[9]]
corn_price nitrogen revenue
1 5 200 1167.212
[[10]]
corn_price nitrogen revenue
1 3 300 717.3375
[[11]]
corn_price nitrogen revenue
1 4 300 956.4501
[[12]]
corn_price nitrogen revenue
1 5 300 1195.563
Success! Now, in order to use the outcome for other purposes like further analysis and visualization, we need to combine all the results into a single data.frame instead of a list of data.frames. To do this, use either bind_rows() from the dplyr package or rbindlist() from the data.table package.
#--- bind_rows ---#
bind_rows(rev_data)
corn_price nitrogen revenue
1 3 0 -354.1138
2 4 0 -472.1517
3 5 0 -590.1896
4 3 100 574.6345
5 4 100 766.1793
6 5 100 957.7242
7 3 200 700.3269
8 4 200 933.7692
9 5 200 1167.2115
10 3 300 717.3375
11 4 300 956.4501
12 5 300 1195.5626
#--- rbindlist ---#
rbindlist(rev_data)
corn_price nitrogen revenue
<num> <num> <num>
1: 3 0 -354.1138
2: 4 0 -472.1517
3: 5 0 -590.1896
4: 3 100 574.6345
5: 4 100 766.1793
6: 5 100 957.7242
7: 3 200 700.3269
8: 4 200 933.7692
9: 5 200 1167.2115
10: 3 300 717.3375
11: 4 300 956.4501
12: 5 300 1195.5626
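In base R, do.call() combined with rbind() achieves the same result, although bind_rows() and rbindlist() are generally faster when the list is long:
#--- base R alternative ---#
do.call(rbind, rev_data)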
A.1.5 Do you really need to loop?
Actually, in practice, we should not have used a for loop or lapply() in any of the examples above. This is because they can be easily vectorized. Vectorized operations are those that take vectors as inputs and work on all of their elements at once.
A typical example of a vectorized operation would be this:
#--- define numeric vectors ---#
x <- 1:1000
y <- 1:1000
#--- element wise addition ---#
z_vec <- x + y
A non-vectorized version of the same calculation is this:
z_la <- lapply(1:1000, function(i) x[i] + y[i]) %>% unlist()
#--- check if identical with z_vec ---#
all.equal(z_la, z_vec)
[1] TRUE
Both produce the same results. However, R is much better at doing vectorized operations. Let's time them using the microbenchmark() function from the microbenchmark package. Here, we do not unlist() after lapply() so that we focus just on the addition part.
library(microbenchmark)
microbenchmark(
#--- vectorized ---#
"vectorized" = {
x + y
},
#--- not vectorized ---#
"not vectorized" = {
lapply(1:1000, function(i) x[i] + y[i])
},
times = 100,
unit = "ms"
)
Unit: milliseconds
           expr      min       lq       mean   median        uq      max neval cld
     vectorized 0.002706 0.002829 0.00292494 0.002870 0.0029725 0.003977   100  a
 not vectorized 0.282654 0.284499 0.29955912 0.285524 0.2871435 1.508513   100   b
As you can see, the vectorized version is faster. The time difference comes from R having to conduct many more internal checks and hidden operations for the non-vectorized version. Yes, we are talking about fractions of a millisecond here. But, as the objects you operate on get larger, the difference between vectorized and non-vectorized operations can become substantial.
The lapply()
examples can be easily vectorized.
Instead of this:
lapply(1:1000, square_it)
You can just do this:
square_it(1:1000)
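You can confirm that the vectorized call gives exactly the same values as the looped version:
#--- check that the two approaches match ---#
all.equal(unlist(lapply(1:1000, square_it)), square_it(1:1000))
[1] TRUE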
You can also easily vectorize the revenue calculation demonstrated above. First, define the function differently so that revenue calculation can take corn price and nitrogen vectors and return a revenue vector.
gen_rev_corn_short <- function(corn_price, nitrogen) {
#--- calculate yield ---#
yield <- 240 * (1 - exp(0.4 - 0.02 * nitrogen))
#--- calculate revenue ---#
revenue <- corn_price * yield
return(revenue)
}
Then use the function to calculate revenue and assign it to a new variable in the parameters_data
data.
rev_data_2 <- mutate(
parameters_data,
revenue = gen_rev_corn_short(corn_price, nitrogen)
)
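Since the data.table package is already loaded, an equivalent calculation in data.table syntax would look like the sketch below (this is an alternative I am adding for illustration, not part of the original demonstration):
#--- data.table alternative: add revenue by reference ---#
parameters_dt <- data.table(parameters_data)
parameters_dt[, revenue := gen_rev_corn_short(corn_price, nitrogen)]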
Let's compare the vectorized and non-vectorized (lapply()) versions:
microbenchmark(
#--- vectorized ---#
"vectorized" = {
rev_data <- mutate(parameters_data, revenue = gen_rev_corn_short(corn_price, nitrogen))
},
#--- not vectorized ---#
"not vectorized" = {
parameters_data$revenue <- lapply(1:nrow(parameters_data), gen_rev_corn)
},
times = 100,
unit = "ms"
)
Unit: milliseconds
           expr      min        lq      mean    median        uq      max neval cld
     vectorized 0.264163 0.2731625 0.3273038 0.2873895 0.2975985 4.040181   100  a
 not vectorized 0.885846 0.8979205 0.9236779 0.9207370 0.9378135 1.101629   100   b
Yes, the vectorized version is faster. So, the lesson here is that if you can vectorize, then vectorize instead of using lapply()
. But, of course, things cannot be vectorized in many cases.
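For instance (a hypothetical illustration), a calculation in which each value depends on the previous one cannot simply be replaced by a single element-wise vectorized call, because the order of evaluation matters:
#--- a sequential process: each element depends on the previous one ---#
x <- numeric(10)
x[1] <- 1
for (i in 2:10) {
  x[i] <- 0.5 * x[i - 1] + i # needs x[i - 1], so you cannot compute all elements at once
}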
A.2 Parallelization of embarrassingly parallel processes
Parallelization of computation involves distributing the task at hand across multiple cores so that multiple processes are run in parallel. Here, we learn how to parallelize computation in R. Our focus is on so-called embarrassingly parallel processes. Embarrassingly parallel processes refer to a collection of processes where each process is completely independent of any other. That is, one process does not use the outputs of any of the other processes. The integer-squaring example is embarrassingly parallel: in order to calculate \(1^2\), you do not need the result of \(2^2\) or any other square. Embarrassingly parallel processes are very easy to parallelize because you do not have to worry about the order in which the processes are completed. Fortunately, most of the processes you will be interested in parallelizing fall into this category.
We will use the future_lapply() function from the future.apply package for parallelization. Using this package, parallelization is a piece of cake, as it is syntactically almost the same as lapply().
#--- load packages ---#
library(future.apply)
You can find out how many cores you have available for parallel computation on your computer using the detectCores()
function from the parallel
package.
library(parallel)
#--- number of all cores ---#
detectCores()
[1] 20
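As a side note, parallelly::availableCores() is often recommended over detectCores() because it respects limits imposed by, for example, cluster job schedulers. The parallelly package comes as a dependency of the future framework; this is an aside rather than something used in the rest of this chapter.
#--- a more conservative alternative (assumes the parallelly package is available) ---#
parallelly::availableCores()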
Before we implement parallelized lapply(), we need to declare what backend process we will be using with plan(). Here, we use plan(multisession). In the plan() function, we can specify the number of workers. Here, I will use the total number of cores minus one.
plan(multisession, workers = detectCores() - 1)
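For reference, when you are done with parallel computation you can switch back to sequential processing (this is standard future usage, though it is not a step in this chapter). The call is commented out here so that the multisession plan above stays in effect:
#--- revert to sequential (non-parallel) processing when done ---#
# plan(sequential)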
future_lapply()
works exactly like lapply()
.
sq_ls <- future_lapply(1:1000, function(x) x^2)
This is it. The only difference you see from the serialized processing using lapply()
is that you changed the function name to future_lapply()
.
Okay, now we know how we parallelize computation. Let’s check how much improvement in implementation time we got by parallelization.
microbenchmark(
#--- parallelized ---#
"parallelized" = {
sq_ls <- future_lapply(1:1000, function(x) x^2)
},
#--- non-parallelized ---#
"not parallelized" = {
sq_ls <- lapply(1:1000, function(x) x^2)
},
times = 100,
unit = "ms"
)
Unit: milliseconds
             expr        min          lq         mean       median           uq         max neval cld
     parallelized 1213.63743 1247.100465 1255.5660916 1255.9259195 1262.6401820 1422.499961   100  a
 not parallelized    0.22591    0.238866    0.2609256    0.2472505    0.2577055    1.135946   100   b
Hmmmm, okay, so parallelization made the code slower… How could this be? This is because communicating jobs to each core takes some time as well. So, if each of the iterative processes is super fast (like this example, where you just square a number), the time spent communicating with the cores outweighs the time saved by parallel computation. Parallelization is more beneficial when each of the repetitive processes takes a long time.
One very good use case for parallelization is Monte Carlo (MC) simulation. The following MC simulation tests whether correlation between an independent variable and the error term causes bias (yes, we know the answer). The MC_sim function first generates a dataset (50,000 observations) according to the following data generating process:
\[ y = 1 + x + v \]
where \(\mu \sim N(0,1)\), \(x \sim N(0,1) + \mu\), and \(v \sim N(0,1) + \mu\). The \(\mu\) term causes correlation between \(x\) (the covariate) and \(v\) (the error term). The function then estimates the coefficient on \(x\) via OLS and returns the estimate. We would like to repeat this process 1,000 times to understand the properties of the OLS estimator under this data generating process. This Monte Carlo simulation is embarrassingly parallel because each iteration is independent of any other.
#--- repeat steps 1-3 B times ---#
MC_sim <- function(i) {
N <- 50000 # sample size
#--- steps 1 and 2: ---#
mu <- rnorm(N) # the common term shared by both x and u
x <- rnorm(N) + mu # independent variable
v <- rnorm(N) + mu # error
y <- 1 + x + v # dependent variable
data <- data.table(y = y, x = x)
#--- OLS ---#
reg <- lm(y ~ x, data = data) # OLS
#--- return the coef ---#
return(reg$coef["x"])
}
Let's run one iteration. The tic() and toc() calls below measure elapsed time; if they are not already defined in your session, the tictoc package provides functions with these names.
tic()
MC_sim(1)
toc()
x
1.503353
elapsed
0.008
Okay, so it takes about 0.008 seconds for one iteration. Now, let's run this 1,000 times with and without parallelization.
Not parallelized
#--- non-parallel ---#
tic()
MC_results <- lapply(1:1000, MC_sim)
toc()
elapsed
8.811
Parallelized
#--- parallel ---#
tic()
MC_results <- future_lapply(1:1000, MC_sim)
toc()
elapsed
2.653
As you can see, parallelization makes it much quicker, with a noticeable difference in elapsed time. We made the code 3.32 times faster. However, we did not make the process 19 times faster even though we used 19 cores for the parallelized process. This is because of the overhead associated with distributing tasks to the cores. The relative advantage of parallelization becomes greater as each iteration takes more time. For example, if you run a process that takes about 2 minutes 1,000 times, it would take approximately 33 hours and 20 minutes in total. But it may take only 4 hours if you parallelize it on 19 cores, or maybe even 2 hours if you run it on 30 cores.
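One caveat when parallelizing simulations that draw random numbers: recent versions of future_lapply() warn unless you ask for parallel-safe random number streams via the future.seed option. This option is part of the future.apply API, although it is not shown in the code above:
#--- parallel-safe, reproducible random numbers across workers ---#
MC_results <- future_lapply(1:1000, MC_sim, future.seed = TRUE)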
A.2.1 Mac or Linux users
For Mac or Linux users, parallel::mclapply() is just as compelling (or pbmcapply::pbmclapply() if you want a nice progress report, which is very helpful particularly when the process is long). It is just as easy to use as future_lapply() because its syntax is the same as lapply(). You can control the number of cores to employ by adding the mc.cores option. Note that mclapply() relies on forking, which is not available on Windows, so it will not run in parallel there. Here is example code that does the same MC simulations we conducted above:
#--- mclapply ---#
library(parallel)
MC_results <- mclapply(1:1000, MC_sim, mc.cores = detectCores() - 1)
#--- or with progress bar ---#
library(pbmcapply)
MC_results <- pbmclapply(1:1000, MC_sim, mc.cores = detectCores() - 1)