A.1 Repetitive processes and looping

A.1.1 What is looping?

We sometimes need to run the same process over and over again often with slight changes in parameters. In such a case, it is very time-consuming and messy to write all of the steps one bye one. For example, suppose you are interested in knowing the square of 1 through 5 ($[1, 2, 3, 4, 5]$). The following code certainly works:

1^2

[1] 1

2^2

[1] 4

3^2

[1] 9

4^2

[1] 16

5^2

[1] 25

However, imagine you have to do this for 1000 integers. Yes, you don’t want to write each one of them one by one as that would occupy 1000 lines of your code, and it would be time-consuming. Things will be even worse if you need to repeat much more complicated processes like Monte Carlo simulations. So, let’s learn how to write a program to do repetitive jobs effectively using loop.

Looping is repeatedly evaluating the same (except parameters) process over and over again. In the example above, the same process is the action of squaring. This does not change among the processes you run. What changes is what you square. Looping can help you write a concise code to implement these repetitive processes.

A.1.2 For loop

Here is how for loop works in general:

#--- NOT RUN ---#
for (x in a_list_of_values){
  you do what you want to do with x
}

As an example, let’s use this looping syntax to get the same results as the manual squaring of 1 through 5:

for (x in 1:5){
  print(x^2)
}

[1] 1
[1] 4
[1] 9
[1] 16
[1] 25

Here, a list of values is $1, 2, 3, 4, 5]$. For each value in the list, you square it (x^2) and then print it (print()). If you want to get the square of $1:1000$, the only thing you need to change is the list of values to loop over as in:

#--- evaluation not reported as it's too long ---#
for (x in 1:1000){
  print(x^2)
}

So, the length of the code does not depend on how many repeats you do, which is an obvious improvement over manual typing of every single process one by one. Note that you do not have to use $x$ to refer to an object you are going to use. It could be any combination of letters as long as you use it when you code what you want to do inside the loop. So, this would work just fine,

for (bluh_bluh_bluh in 1:5){
  print(bluh_bluh_bluh^2)
}

[1] 1
[1] 4
[1] 9
[1] 16
[1] 25

A.1.3 For loop using the `lapply()` function

You can do for loop using the lapply() function as well.⁹⁵ Here is how it works:

#--- NOT RUN ---#  
lapply(A,B)

where $A$ is the list of values you go through one by one in the order the values are stored, and $B$ is the function you would like to apply to each of the values in $A$. For example, the following code does exactly the same thing as the above for loop example.

lapply(1:5, function(x){x^2})

[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9

[[4]]
[1] 16

[[5]]
[1] 25

Here, $A$ is $[1, 2, 3, 4, 5]$. In $B$ you have a function that takes $x$ and square it. So, the above code applies the function to each of $[1, 2, 3, 4, 5]$ one by one. In many circumstances, you can write the same looping actions in a much more concise manner using the lapply function than explicitly writing out the loop process as in the above for loop examples. You might have noticed that the output is a list. Yes, lapply() returns the outcomes in a list. That is where l in lapply() comes from.

When the operation you would like to repeat becomes complicated (almost always the case), it is advisable that you create a function of that process first.

#--- define the function first ---#
square_it <- function(x){
  return(x^2)
}

#--- lapply using the pre-defined function ---#
lapply(1:5, square_it)

[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9

[[4]]
[1] 16

[[5]]
[1] 25

Finally, it is a myth that you should always use lapply() instead of the explicit for loop syntax because lapply() (or other apply() families) is faster. They are basically the same.⁹⁶

A.1.4 Looping over multiple variables using lapply()

lapply() allows you to loop over only one variable. However, it is often the case that you want to loop over multiple variables⁹⁷. However, it is easy to achieve this. The trick is to create a data.frame of the variables where the complete list of the combinations of the variables are stored, and then loop over row of the data.frame. As an example, suppose we are interested in understanding the sensitivity of corn revenue to corn price and applied nitrogen amount. We consider the range of $3.0/bu to $5.0/bu for corn price and 0 lb/acre to 300/acre for nitrogen rate.

#--- corn price vector ---#
corn_price_vec <- seq(3, 5, by = 1)

#--- nitrogen vector ---#
nitrogen_vec <- seq(0, 300, by = 100)

After creating vectors of the parameters, you combine them to create a complete combination of the parameters using the expand.grid() function, and then convert it to a data.frame object⁹⁸.

#--- crate a data.frame that holds parameter sets to loop over ---#
parameters_data <- expand.grid(corn_price = corn_price_vec, nitrogen = nitrogen_vec) %>% 
  #--- convert the matrix to a data.frame ---#
  data.frame()

#--- take a look ---#
parameters_data

   corn_price nitrogen
1           3        0
2           4        0
3           5        0
4           3      100
5           4      100
6           5      100
7           3      200
8           4      200
9           5      200
10          3      300
11          4      300
12          5      300

We now define a function that takes a row number, refer to parameters_data to extract the parameters stored at the row number, and then calculate corn yield and revenue based on the extracted parameters.

gen_rev_corn <- function(i) {

  #--- define corn price ---#
  corn_price <- parameters_data[i,'corn_price']

  #--- define nitrogen  ---#
  nitrogen <- parameters_data[i,'nitrogen']

  #--- calculate yield ---#
  yield <- 240 * (1 - exp(0.4 - 0.02 * nitrogen))

  #--- calculate revenue ---#
  revenue <- corn_price * yield 

  #--- combine all the information you would like to have  ---#
  data_to_return <- data.frame(
    corn_price = corn_price,
    nitrogen = nitrogen,
    revenue = revenue
  )

  return(data_to_return)
}

This function takes $i$ (act as a row number within the function), extract corn price and nitrogen from the $i$th row of parameters_mat, which are then used to calculate yield and revenue⁹⁹. Finally, it returns a data.frame of all the information you used (the parameters and the outcomes).

#--- loop over all the parameter combinations ---#
rev_data <- lapply(1:nrow(parameters_data), gen_rev_corn)

#--- take a look ---#
rev_data

[[1]]
  corn_price nitrogen   revenue
1          3        0 -354.1138

[[2]]
  corn_price nitrogen   revenue
1          4        0 -472.1517

[[3]]
  corn_price nitrogen   revenue
1          5        0 -590.1896

[[4]]
  corn_price nitrogen  revenue
1          3      100 574.6345

[[5]]
  corn_price nitrogen  revenue
1          4      100 766.1793

[[6]]
  corn_price nitrogen  revenue
1          5      100 957.7242

[[7]]
  corn_price nitrogen  revenue
1          3      200 700.3269

[[8]]
  corn_price nitrogen  revenue
1          4      200 933.7692

[[9]]
  corn_price nitrogen  revenue
1          5      200 1167.212

[[10]]
  corn_price nitrogen  revenue
1          3      300 717.3375

[[11]]
  corn_price nitrogen  revenue
1          4      300 956.4501

[[12]]
  corn_price nitrogen  revenue
1          5      300 1195.563

Successful! Now, for us to use the outcome for other purposes like further analysis and visualization, we would need to have all the results combined into a single data.frame instead of a list of data.frames. To do this, use either bind_rows() from the dplyr package or rbindlist() from the data.table package.

#--- bind_rows ---#
bind_rows(rev_data)

   corn_price nitrogen   revenue
1           3        0 -354.1138
2           4        0 -472.1517
3           5        0 -590.1896
4           3      100  574.6345
5           4      100  766.1793
6           5      100  957.7242
7           3      200  700.3269
8           4      200  933.7692
9           5      200 1167.2115
10          3      300  717.3375
11          4      300  956.4501
12          5      300 1195.5626

#--- rbindlist ---#
rbindlist(rev_data)

    corn_price nitrogen   revenue
 1:          3        0 -354.1138
 2:          4        0 -472.1517
 3:          5        0 -590.1896
 4:          3      100  574.6345
 5:          4      100  766.1793
 6:          5      100  957.7242
 7:          3      200  700.3269
 8:          4      200  933.7692
 9:          5      200 1167.2115
10:          3      300  717.3375
11:          4      300  956.4501
12:          5      300 1195.5626

A.1.5 Do you really need to loop?

Actually, we should not have used for loop or lapply() in any of the examples above in practice¹⁰⁰ This is because they can be easily vectorized. Vectorized operations are those that take vectors as inputs and work on each element of the vectors in parallel¹⁰¹.

A typical example of a vectorized operation would be this:

#--- define numeric vectors ---#
x <- 1:1000
y <- 1:1000

#--- element wise addition ---#
z_vec <- x + y

A non-vectorized version of the same calculation is this:

z_la <- lapply(1:1000, function(i) x[i] + y[i]) %>%  unlist

#--- check if identical with z_vec ---#
all.equal(z_la, z_vec)

[1] TRUE

Both produce the same results. However, R is written in a way that is much better at doing vectorized operations. Let’s time them using the microbenchmark() function from the microbenchmark package. Here, we do not unlist() after lapply() to just focus on the multiplication part.

library(microbenchmark)

microbenchmark(
  #--- vectorized ---#
  "vectorized" = { x + y }, 
  #--- not vectorized ---#
  "not vectorized" = { lapply(1:1000, function(i) x[i] + y[i])},
  times = 100, 
  unit = "ms"
)

Unit: milliseconds
           expr      min        lq       mean   median        uq       max
     vectorized 0.002428 0.0028800 0.00316221 0.003082 0.0033105  0.005597
 not vectorized 0.496875 0.5368535 0.72333386 0.552854 0.5681425 14.101327
 neval
   100
   100

As you can see, the vectorized version is faster. The time difference comes from R having to conduct many more internal checks and hidden operations for the non-vectorized one¹⁰². Yes, we are talking about a fraction of milliseconds here. But, as the objects to operate on get larger, the difference between vectorized and non-vectorized operations can become substantial¹⁰³.

The lapply() examples can be easily vectorized.

Instead of this:

lapply(1:1000, square_it)

You can just do this:

square_it(1:1000)

You can also easily vectorize the revenue calculation demonstrated above. First, define the function differently so that revenue calculation can take corn price and nitrogen vectors and return a revenue vector.

gen_rev_corn_short <- function(corn_price, nitrogen) {

  #--- calculate yield ---#
  yield <- 240 * (1 - exp(0.4 - 0.02 * nitrogen))

  #--- calculate revenue ---#
  revenue <- corn_price * yield 

  return(revenue)
}

Then use the function to calculate revenue and assign it to a new variable in the parameters_data data.

rev_data_2 <- mutate(
  parameters_data,
  revenue = gen_rev_corn_short(corn_price, nitrogen)
)

Let’s compare the two:

microbenchmark(

  #--- vectorized ---#
  "vectorized" = { rev_data <- mutate(parameters_data, revenue = gen_rev_corn_short(corn_price, nitrogen)) },
  #--- not vectorized ---#
  "not vectorized" = { parameters_data$revenue <- lapply(1:nrow(parameters_data), gen_rev_corn) },
  times = 100, 
  unit = "ms"
)

Unit: milliseconds
           expr      min       lq     mean   median       uq      max neval
     vectorized 0.961107 1.041289 1.554750 1.129748 1.309716 23.56163   100
 not vectorized 1.894142 2.118015 2.514545 2.282599 2.796153  5.66208   100

Yes, the vectorized version is faster. So, the lesson here is that if you can vectorize, then vectorize instead of using lapply(). But, of course, things cannot be vectorized in many cases.

lpply() in only one of the family of apply() functions. We do not talk about other types of apply() functions here (e.g., apply(), spply(), mapply(),, tapply()). Personally, I found myself only rarely using them. But, if you are interested in learning those, take a look at here or here.↩︎
Check this discussion on StackOverflow. You might want to check out this video at 6:10 as well.↩︎
the map() function from the purrr package allows you to loop over two variable.↩︎
Converting to a data.frame is not strictly necessary.↩︎
Yield is generated based on the Mitscherlich-Baule functional form. Yield increases at the decreasing rate as you apply more nitrogen, and yield eventually hits the plateau.↩︎
By the way, note that lapply() is no magic. It’s basically a for loop and not rally any faster than for loop.↩︎
This does not mean that the process is parallelized by using multiple cores.↩︎
See this or this to have a better understanding of why non-vectorized operations can be slower than vectorized operations.↩︎
See here for a good example of such a case. R is often regarded very slow compared to other popular software. But, many of such claims come from not vectorizing operations that can be vectorized. Indeed, many of the base and old R functions are written in C. More recent functions relies on C++ via the Rcpp package.↩︎