05-1: User-defined Function, Loop, and Parallelization

Tips to make the most of the lecture notes

  • Click on the three horizontally stacked lines at the bottom left corner of the slide; you will then see the table of contents, and you can jump to the section you want

  • Hit the letter “o” on your keyboard and you will get a panel view of all the slides

  • The box area with a hint of blue as the background color is where you can write code (hereafter referred to as the “code area”).
  • Hit the “Run Code” button to execute all the code inside the code area.
  • You can evaluate (run) code selectively by highlighting the parts you want to run and hitting Command + Enter for Mac (Ctrl + Enter for Windows).
  • If you want to run the code on your own computer, you can first click on the icon with two sheets of paper stacked on top of each other (top right corner of the code chunk), which copies the code in the code area. You can then paste it on your computer.
  • You can click on the reload button (top right corner of the code chunk, left to the copy button) to revert back to the original code.

Learning objectives

  • Learn how to write functions yourself
  • Learn how to use a for loop and lapply() to complete repetitive jobs
  • Learn how not to loop things that can be easily vectorized
  • Learn how to parallelize repetitive jobs using the future_lapply() function from the future.apply package

User-defined Function


User-defined Function

It is beneficial to write your own function when you expect to repeat the same action with different inputs.


Example: mean()

Calculating the average of a variable is such a common task

  • You do not want to do sum(x)/length(x) every time you get a mean
  • You can just use the mean() function

A function is more useful when the task is longer and more complicated.

Here is the general structure of a user-defined function:

#--- NOT RUN ---#
function_name <- function(x) {

  # 1. do something with x

  # 2. return the results

}


Then, you can use the function like this:

#--- NOT RUN ---#
function_name(x)


Note: the argument does not have to be named x

The following function takes numbers, squares them, and returns the squared values:
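Here is a sketch of such a function (the names square_it and squared_a are illustrative):

square_it <- function(a) {
  squared_a <- a^2 # square the input
  return(squared_a) # return the squared value
}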


Try the function:
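For example (using the sketch above):

square_it(4) # returns 16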

Any objects that are defined within a function are not created in the global environment.

For example, squared_a and original_a below are defined only internally, and are not registered in the global environment.


You can confirm this by looking at the environment tab in RStudio after running the following:
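A sketch consistent with the discussion (the exact function body is illustrative):

square_it <- function(a) {
  original_a <- a # exists only inside the function
  squared_a <- original_a^2 # exists only inside the function
  return(original_a)
}

out <- square_it(2)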


Even though we are returning original_a, only its content is returned; the name original_a itself never appears in the global environment.

When R sees objects that are not explicitly provided as arguments to the function, R looks for them in the global environment:
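For example (a sketch; the name multiply_it is illustrative):

multiply_it <- function(a) {
  multiplied_a <- a * multiplier # multiplier is not an argument
  return(multiplied_a)
}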


Here, multiplier is not provided as an argument to the function.

Try this:
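Since multiplier is defined neither inside the function nor in the global environment yet, this fails:

multiply_it(2) # Error: object 'multiplier' not found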


Now, define multiplier on the global environment, and then run it again:
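#--- a sketch; the value 10 is arbitrary ---#
multiplier <- 10

multiply_it(2) # now works, returns 20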

You can set default values for function arguments by argument = value like below:
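For example (a sketch mirroring multiply_it() above):

multiply_it <- function(a, multiplier = 10) {
  multiplied_a <- a * multiplier
  return(multiplied_a)
}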


Try this:
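multiply_it(2) # uses the default, so returns 20
multiply_it(2, multiplier = 5) # overrides the default, so returns 10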

It is easy to create a function with multiple arguments. You simply add more arguments within function() like below:
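For example (square_them_add() reappears later in this lecture; its internals here are a sketch):

square_them_add <- function(a, b) {
  squared_and_added <- a^2 + b
  return(squared_and_added)
}

square_them_add(4, 3) # returns 19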


As you have likely noticed, the order of the input values is assumed to follow the order of the function's arguments. Above,

  • a = 4
  • b = 3

You can change the order of the input arguments if you name them as follows:
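#--- same as square_them_add(4, 3) ---#
square_them_add(b = 3, a = 4) # returns 19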

Define a function that takes temperature in Fahrenheit, converts it to Celsius, and returns it.

Here is the formula: temp_C <- (temp_F - 32) * 5 / 9

Work here


Answer

Code
f_to_c <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}

After running a randomized nitrogen trial, you found the following relationship between corn yield (bu/acre) and nitrogen rate (lb/acre):

\[ \mbox{corn yield} = 120 + 25 \times \log(\mbox{nitrogen rate}) \]

Write a function that takes a nitrogen rate as an argument, calculates the estimated yield for that nitrogen rate, and then returns the estimated yield.

Work here


Answer

Code
n_to_yield <- function(N) {
  yield <- 120 + 25 * log(N)
  return(yield)
}

You would like to calculate the revenue of corn production (defined below) as a function of nitrogen rate based on the yield response function.

\[ P_{corn} \times \mbox{corn yield} \]

Write a function that takes corn price and nitrogen rate as its arguments, calculates revenue (and yield as an intermediate step), and returns revenue. In doing so, use the function you created in Exercise 2.

Work here


Answer

Code
calc_rev <- function(corn_price, N) {
  yield <- n_to_yield(N)
  revenue <- corn_price * yield
  return(revenue)
}

Loop


Loop

  • We sometimes need to run the same process over and over again often with slight changes in parameters.

  • In such a case, it is very time-consuming and messy to write out all of the steps one by one.

  • For example, suppose you are interested in knowing the square of 1 through 5 with a step of 1 ([1,2,3,4,5]). The following code certainly works:

1^2 
2^2 
3^2 
4^2 
5^2 
  • However, imagine you have to do this for 1000 integers.

  • Yes, you don’t want to write each one of them one by one as that would occupy 1000 lines of your code, and it would be time-consuming.

  • Things will be even worse if you need to repeat much more complicated processes like Monte Carlo simulations. So, let’s learn how to write a program that does repetitive jobs effectively using loops.

  • Looping is repeatedly evaluating the same process over and over again, changing only the parameters.

  • In the example above,

    • the same repeated process is the action of squaring
    • what you square (parameter) changes

Syntax

#--- NOT RUN ---#
for (x in a_list_of_values){
  # do what you want to do with x
} 


This does the same:
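A sketch (print() makes the results show up on the console):

for (x in 1:5){
  print(x^2)
}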

Write a for loop that cubes each element of the sequence of numbers that starts from 5 and increases up to 50 with the incremental step of 5.

Work here


Answer

Code
for (i in seq(5, 50, by = 5)){
  print(i^3)
}

Looping with lapply()

Instead of using a for loop, we can use the lapply() function from the base package to loop.


Syntax

#--- NOT RUN ---#  
lapply(A, B) 
  • A is the list of values
  • B is the function you would like to apply to each of the values in A
    • You can use an existing function (e.g., mean())
    • You can also write a function on the spot


Note:

  • If A is a vector, lapply() works on each of the vector elements
  • If A is a list, lapply() works on each of the list elements, whatever they may be
  • If A is a data.frame, lapply() works on each of the columns (a data.frame is a list of columns of equal length)

This does the same thing as the for loop example we looked at earlier:
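A sketch:

sq_ls <- lapply(1:5, \(x) x^2)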

  • \(x): shorthand for function(x)


The key difference from a for loop is the class of the object you get back after the loop.

Important: the output of lapply() is always a list (that’s why it is called lapply)

It is often the case that you want to write a function of the action you intend to repeat first and then loop.

For example, for the loop of squaring numbers, you can first define a function that implements the action of squaring:
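A sketch:

square_it <- function(x) {
  return(x^2)
}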


And then loop:
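#--- loop over 1 through 5 with the sketched function ---#
sq_ls <- lapply(1:5, square_it)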

Oftentimes, you would like to loop over a single parameter of a function that has multiple arguments:

For example, you would like to fix the value of b at 5 while trying different values of a of the following function:
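For example (the square_them_add() sketch from earlier):

square_them_add <- function(a, b) {
  squared_and_added <- a^2 + b
  return(squared_and_added)
}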


Then you can do this:
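#--- loop over a, with b fixed at 5 ---#
lapply(1:10, \(x) square_them_add(x, b = 5))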


As you can see, lapply() tries each element of 1:10 (called x internally), and gives it to square_them_add() as its first argument while b is fixed at 5.

Use lapply() to cube each element of the sequence of numbers that starts from 5 and increases up to 50 with the incremental step of 5.


Work here


Answer

Code
lapply(seq(5, 50, by = 5), \(x) x^3) 

Define a function that takes a nitrogen rate and corn price as arguments and calculates revenue. Yield can be estimated using the following equation:

\[ \mbox{corn yield} = 120 + 25 \times \log(\mbox{nitrogen rate}) \]

At each value of the corn price sequence of seq(2.5, 4.0, by = 0.1), calculate the revenue using lapply() where nitrogen rate is fixed at 200 (lb/acre).


Work here


Answer

Code
calc_rev <- function(N, corn_price) {
  yield <- 120 + 25 * log(N)
  revenue <- corn_price * yield
  return(revenue)
}

corn_price_seq <- seq(2.5, 4.0, by = 0.1)

lapply(corn_price_seq, \(x) calc_rev(N = 200, corn_price = x))

Looping over multiple variables

  • The example we have looked at is a very simple case where a loop is done over a single list of values

  • It is often the case that you want to loop over multiple variables.

Example

You are interested in understanding the sensitivity of the profitability of corn production with respect to corn price and nitrogen application rate.

So, you would like to loop over two sets of sequences of values:

  • corn price
  • nitrogen application rate

How

The trick is to

  • create a data.frame of the two variables (corn price and nitrogen application rate), or of as many variables as you would like to loop over, which stores all the permutations of their values

  • then loop over the rows of the data.frame

  • We are interested in understanding the sensitivity of corn revenue to corn price and applied nitrogen amount.

  • We consider

    • the range of $3.0/bu to $5.0/bu for corn price
    • 0 lb/acre to 300 lb/acre for nitrogen rate

Get a sequence of values for corn price and nitrogen rate:
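A sketch (the step sizes are illustrative):

corn_price_seq <- seq(3, 5, by = 0.5)
n_rate_seq <- seq(0, 300, by = 50) # note: log(0) = -Inf, so the 0 rate yields -Inf revenue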


We then create a complete combination of the values using the expand.grid() function, and then convert it to a data.frame object (this is not strictly necessary).
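A sketch (the name parameters_data is used in what follows; %>% and bind_rows() come from the dplyr package):

library(dplyr)

parameters_data <-
  expand.grid(corn_price = corn_price_seq, n_rate = n_rate_seq) %>%
  data.frame()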

Define a function that

  • takes a row number
  • refers to parameters_data to extract the parameters stored at that row number
  • calculates corn yield and revenue based on the extracted parameters (corn price and nitrogen rate), as sketched below.
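A sketch of such a function (the name calc_rev_i is illustrative):

calc_rev_i <- function(i) {

  #--- extract the parameters stored at row i ---#
  corn_price <- parameters_data[i, "corn_price"]
  n_rate <- parameters_data[i, "n_rate"]

  #--- calculate yield and revenue ---#
  yield <- 120 + 25 * log(n_rate)
  revenue <- corn_price * yield

  #--- return the results as a data.frame ---#
  return_data <- data.frame(revenue = revenue, corn_price = corn_price, n_rate = n_rate)

  return(return_data)
}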


This function

  • takes i (which acts as a row number within the function)
  • extracts corn price and nitrogen rate from the i-th row of parameters_data
  • uses the extracted values to calculate yield and revenue
  • creates a data.frame of the resulting revenue, corn price, and nitrogen rate
  • returns the data.frame

Do a loop using lapply():
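#--- a sketch: loop over the row numbers ---#
rev_data_list <- lapply(1:nrow(parameters_data), calc_rev_i)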

Combine the list of data.frames into a single data.frame using bind_rows() from the dplyr package.
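#--- combine into one data.frame ---#
rev_data <- bind_rows(rev_data_list)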

Before defining a function, write code that works for one row.

We will work with a specific value of i: here, i = 1.
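A sketch of the one-row version:

corn_price <- parameters_data[1, "corn_price"]
n_rate <- parameters_data[1, "n_rate"]

yield <- 120 + 25 * log(n_rate)
revenue <- corn_price * yield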


After you confirm that the code gives you the desired outcome, turn it into a function by replacing 1 with i.

Find the profit of corn production at different price combinations of corn and nitrogen where nitrogen rate is fixed at 200 lb/acre.

  • Step 1: Define the following sequences of numbers
    • corn price: seq(2.5, 4.0, by = 0.05)
    • nitrogen price: seq(0.2, 0.6, by = 0.01)
  • Step 2: Create a data.frame of the complete combinations of the values from the price vectors
  • Step 3: Define a function that takes a row number, extracts corn price and nitrogen price, and then calculates profit based on the price combination using the following equations:

\[ \mbox{corn yield} = 120 + 25 \times \log(\mbox{nitrogen rate}) \]

\[ \mbox{profit} = \mbox{corn price} \times \mbox{corn yield} - \mbox{nitrogen price} \times \mbox{nitrogen rate} \]

  • Step 4: loop over the row numbers of the parameter data
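Here is a sketch of one possible answer (object and function names are illustrative):

#--- Step 1 ---#
corn_price_seq <- seq(2.5, 4.0, by = 0.05)
n_price_seq <- seq(0.2, 0.6, by = 0.01)

#--- Step 2 ---#
price_data <- expand.grid(corn_price = corn_price_seq, n_price = n_price_seq) %>% data.frame()

#--- Step 3 ---#
calc_profit_i <- function(i) {
  corn_price <- price_data[i, "corn_price"]
  n_price <- price_data[i, "n_price"]
  N <- 200 # nitrogen rate fixed at 200 lb/acre
  yield <- 120 + 25 * log(N)
  profit <- corn_price * yield - n_price * N
  return(data.frame(corn_price = corn_price, n_price = n_price, profit = profit))
}

#--- Step 4 ---#
profit_data <- bind_rows(lapply(1:nrow(price_data), calc_profit_i))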

Do you really need to loop?

  • Actually, in practice we should not have used a for loop or lapply() in any of the examples above

  • This is because they can be easily vectorized.

  • Vectorized operations are those that take vectors as inputs and work on all of the elements of those vectors at once

Example

Vectorized
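A sketch (multiplying each element by 2; the operation itself is illustrative):

x <- 1:1000
x_times_2 <- x * 2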


Non-vectorized (loop)
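The same operation with lapply() (a sketch):

x_times_2_list <- lapply(x, \(a) a * 2)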


Compare

Both produce the same results. However, R is written in a way that is much better at doing vectorized operations.

Let’s time them using the microbenchmark() function from the microbenchmark package.

Here, we do not unlist() after lapply() to just focus on the multiplication part.
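A sketch of the benchmark:

library(microbenchmark)

x <- 1:1000

microbenchmark(
  "vectorized" = { x * 2 },
  "non-vectorized" = { lapply(x, \(a) a * 2) },
  times = 100
)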


  • As you can see, the vectorized version is faster.
  • The time difference comes from R having to conduct many more internal checks and hidden operations for the non-vectorized one

Instead of this:
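For example, something like this (re-using the calc_rev() function from the earlier exercise):

lapply(corn_price_seq, \(x) calc_rev(N = 200, corn_price = x))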


You can just do this:
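#--- calc_rev() works on the whole price vector at once ---#
calc_rev(N = 200, corn_price = corn_price_seq)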

Here is the vectorized version of the revenue sensitivity analysis:
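A sketch (the name calc_rev_vec is illustrative; it works on whole vectors because every operation inside is elementwise):

calc_rev_vec <- function(corn_price, n_rate) {
  yield <- 120 + 25 * log(n_rate)
  revenue <- corn_price * yield
  return(revenue)
}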


Then use the function to calculate revenue and assign it to a new variable in the parameters_data data.
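A sketch using dplyr::mutate():

parameters_data <- parameters_data %>%
  mutate(revenue = calc_rev_vec(corn_price, n_rate))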

Let’s compare the vectorized and non-vectorized version:
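A sketch of the comparison (calc_rev_i() is the row-by-row function sketched earlier):

microbenchmark(
  "vectorized" = { calc_rev_vec(parameters_data$corn_price, parameters_data$n_rate) },
  "non-vectorized" = { lapply(1:nrow(parameters_data), calc_rev_i) },
  times = 100
)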

Parallelized computation


Parallel processing

  • Parallelization of computation involves distributing the task at hand to multiple cores so that multiple processes are done in parallel.

  • Our focus is on the so-called embarrassingly parallel processes.

Embarrassingly parallel process: a collection of processes where each process is completely independent of the others (one process does not use the outputs of any of the other processes)

  • The example of integer squaring is embarrassingly parallel. In order to calculate \(1^2\), you do not need to use the result of \(2^2\) or any other squares.

  • Embarrassingly parallel processes are very easy to parallelize because you do not have to worry about which process to complete first to make other processes happen.

  • Fortunately, most of the processes you are interested in parallelizing fall under this category

  • We will use the future_lapply() function from the future.apply package for parallelization.

  • Using the package, parallelization is a piece of cake as it is basically the same syntactically as lapply().

#--- install the package ---#
install.packages("future.apply") 

#--- load packages ---#
library(future.apply)


How

You can simply replace lapply() with future_lapply()!

#--- not parallelized ---#
sq_ls <- lapply(1:1000, function(x) x^2) 

#--- parallelized ---#
sq_ls_par <- future_lapply(1:1000, function(x) x^2) 

You can find out how many cores you have available for parallel computation on your computer using the detectCores() function from the parallel package.

#--- number of all cores ---#
parallel::detectCores()
[1] 20


Before we implement parallelized lapply(), we need to declare what backend process we will be using via plan().

plan(multicore, workers = parallel::detectCores() - 1)


Other backend processes are:

  • sequential: this is just a regular loop
  • multicore: forked sessions (not available on Windows)
  • multisession: multiple sessions (less performant than multicore)

With the multicore option, R figures out whether multicore or multisession should be used (or can be used) and automatically redirects the backend process to the appropriate (available) one.


Note

Unless you tell R explicitly to parallelize things (like using future_lapply()), R always uses a single core by default. So, you do not have to change anything manually when you do not want to use multiple cores.

sq_ls <- future_lapply(1:1000, function(x) x^2)
microbenchmark(

  #--- parallelized ---#
  "parallelized" = { sq_ls <- future_lapply(1:1000, function(x) x^2) }, 

  #--- non-parallelized ---#
  "not parallelized" = { sq_ls <- lapply(1:1000, function(x) x^2) },
  times = 100, 
  unit = "ms"

) 
Unit: milliseconds
             expr        min         lq       mean      median        uq       max neval
     parallelized 214.475455 252.632071 288.473396 278.4967715 321.18057 467.03261   100
 not parallelized   0.579383   0.742785   1.089473   0.8525375   1.15469   7.06664   100
  • Notice that the parallelized version is actually slower! This is because communicating jobs to each core takes some time as well.

  • So, if each of the iterative processes is super fast (like this example where you just square a number), the time spent on communicating with the cores outweighs the time saving due to parallel computation.

  • Parallelization is more beneficial when each of the repetitive processes takes long.

Parallel processing: a less trivial example

  • One of the very good use cases of parallelization is MC simulation

  • We will run MC simulations that test whether the correlation between an independent variable and error term would cause bias (yes, we know the answer).

  1. generate a dataset (50,000 observations) according to the following data generating process:

\[ y = 1 + x + v \]

where

  • \(\mu \sim N(0,1)\)
  • \(x \sim N(0,1) + \mu\)
  • \(v \sim N(0,1) + \mu\)

The \(\mu\) term causes correlation between \(x\) (the covariate) and \(v\) (the error term).

  2. estimate the coefficient on \(x\) via OLS, and return the estimate.

  3. repeat this process \(1,000\) times to understand the properties of the OLS estimator under this data generating process.

This Monte Carlo simulation is embarrassingly parallel because each process is independent of any other.

Here is a function that implements the steps described in the previous slide:

#--- steps 1 and 2 (to be repeated 1,000 times) ---#
library(data.table)

MC_sim <- function(i){

  N <- 50000 # sample size

  #--- step 1: generate data ---#
  mu <- rnorm(N) # the common term shared by both x and v
  x <- rnorm(N) + mu # independent variable
  v <- rnorm(N) + mu # error
  y <- 1 + x + v # dependent variable
  data <- data.table(y = y, x = x)

  #--- step 2: OLS ---# 
  reg <- lm(y ~ x, data = data) # OLS

  #--- return the coef ---#
  return(reg$coef['x'])
}  

Single run:

library(tictoc) # tic()/toc() timers

tic()
single_res <- MC_sim(1)
toc()
0.014 sec elapsed

Not parallelized (sequential):

tic()
MC_results <- lapply(1:1000, MC_sim)
toc() 
18.541 sec elapsed

Parallelized:

tic()
MC_results <- future_lapply(1:1000, MC_sim)
toc() 
2.763 sec elapsed
  • For Mac or Linux users, parallel::mclapply() is just as compelling (or pbmcapply::pbmclapply() if you want a nice progress report, which is very helpful particularly when the process is long).

  • It is just as easy to use as future_lapply() because its syntax is the same as lapply().

  • You can control the number of cores to employ by adding the mc.cores option. Here is an example code that does the same MC simulations we conducted above:

#--- mclapply ---#
library(parallel)
MC_results <- mclapply(1:1000, MC_sim, mc.cores = detectCores() - 1)

#--- or with progress bar ---#
library(pbmcapply)
MC_results <- pbmclapply(1:1000, MC_sim, mc.cores = detectCores() - 1)

Low-dimensional optimization


Low-dimensional optimization

Suppose you have run a randomized nitrogen experiment for corn production on a field, collected data, and run a regression to find the following quantitative relationship between corn yield (bu/acre) and nitrogen rate (lb/acre):

\[ \mbox{corn yield} = 120 + 25 \times \log(\mbox{nitrogen rate}) \]

You are interested in finding the best nitrogen rates that maximize profit at different combinations of corn and nitrogen prices for this field.

\[ \max_{N} \; P_C \times Y(N) - P_N \times N \]

  • N: nitrogen rate (lb/acre)
  • Y(N): corn yield (bu/acre) as a function of N as described above
  • P_C: corn price ($/bu)
  • P_N: nitrogen price ($/lb)

Here, N is a decision variable, and P_C and P_N are parameters.

Grid search is a very inefficient yet effective tool for finding solutions to optimization problems as long as the dimension of the optimization problem is low (1 or 2 variables).

Grid search basically evaluates the objective function (profit here) at many levels of the decision variable (nitrogen here) and picks the one that maximizes the objective function (profit).


Example

Let’s define a function that takes N, P_C, and P_N values and returns profits.
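A sketch (the name calc_profit is illustrative):

calc_profit <- function(N, P_C, P_N) {
  yield <- 120 + 25 * log(N)
  profit <- P_C * yield - P_N * N
  return(profit)
}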


Let’s create a sequence of N values at which we evaluate the profit, and then calculate profit at each level of N.
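A sketch (the N grid and the price values are illustrative):

library(dplyr)

N_seq <- seq(1, 300, by = 1)

profit_data <-
  data.frame(N = N_seq) %>%
  mutate(profit = calc_profit(N, P_C = 3.5, P_N = 0.4))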

Here is the visualization of profit-N relationship:
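A sketch using the ggplot2 package:

library(ggplot2)

ggplot(data = profit_data, aes(x = N, y = profit)) +
  geom_line()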

Once the profit-N relationship is found, we can use dplyr::filter() combined with max() to identify the optimal N rate.
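#--- keep the row(s) where profit is at its maximum ---#
profit_data %>% filter(profit == max(profit))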


Alternatively, you can sort the data by profit in the ascending order (default) and pick the last row using dplyr::slice(n()).
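#--- sort by profit and pick the last row ---#
profit_data %>% arrange(profit) %>% slice(n())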


This method is faster than the first one.

Coding strategy 1: looping

Now that you have written code that finds the optimal N at a given combination of corn and nitrogen prices, we can move on to the next step: finding the optimal N rates at many different combinations of corn and nitrogen prices.

Here is the coding strategy:

  1. Define a set of all the combinations of corn and nitrogen prices you want to analyze as a data.frame.

  2. Define a function that extracts corn and nitrogen prices from the parameter data.frame and finds the optimal N rate at the given price combination.

  3. Loop over the set of parameters.

Here, we define a data.frame of parameters to be explored. We will be looping over the rows of the parameter data.frame.
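A sketch (the price grids are illustrative; the name price_parameters is used below):

price_parameters <-
  expand.grid(
    corn_price = seq(3, 5, by = 0.5),
    n_price = seq(0.3, 0.6, by = 0.1)
  ) %>%
  data.frame()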

Now, we will define a function that extracts a combination of corn and nitrogen prices from price_parameters (extracts a row from price_parameters), and then finds the optimal N.
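A sketch of such a function (the name find_opt_N is illustrative; calc_profit() and N_seq come from the sketches above):

find_opt_N <- function(i) {

  #--- extract the price combination at row i ---#
  corn_price <- price_parameters[i, "corn_price"]
  n_price <- price_parameters[i, "n_price"]

  #--- grid search over N ---#
  opt_N_data <-
    data.frame(N = N_seq) %>%
    mutate(profit = calc_profit(N, P_C = corn_price, P_N = n_price)) %>%
    filter(profit == max(profit)) %>%
    mutate(corn_price = corn_price, n_price = n_price)

  return(opt_N_data)
}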


Check if this function works:
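#--- optimal N at the first price combination ---#
find_opt_N(1)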



Combine the list of data.frames into a single data.frame using bind_rows().
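A sketch of the loop and the combining step:

opt_N_list <- lapply(1:nrow(price_parameters), find_opt_N)

opt_N_results <- bind_rows(opt_N_list)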

Coding strategy 2: vectorized

Instead of writing a loop like above, we can actually vectorize the process. Here are the steps:

  1. Define a set of all the combinations of nitrogen rate, corn price, and nitrogen price you want to analyze as a data.frame.

  2. Calculate profits for all the combinations of nitrogen rate, corn price, and nitrogen price inside the data.frame

  3. Find the optimal N rate for each of the combinations of corn price and nitrogen price

Here, we define all the combinations of nitrogen rate, corn price, and nitrogen price you want to analyze as a data.frame.
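A sketch (re-using the illustrative grids from above):

price_parameters <-
  expand.grid(
    N = N_seq,
    corn_price = seq(3, 5, by = 0.5),
    n_price = seq(0.3, 0.6, by = 0.1)
  ) %>%
  data.frame()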

Now, we will calculate profit for all the rows in price_parameters.
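#--- a vectorized profit calculation over all the rows at once ---#
price_parameters <- price_parameters %>%
  mutate(profit = corn_price * (120 + 25 * log(N)) - n_price * N)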


Now, we can identify the optimal N rate at each of the corn and nitrogen price combinations:
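A sketch using group-wise filtering (an assumption; any group-wise maximization would do):

price_parameters %>%
  group_by(corn_price, n_price) %>%
  filter(profit == max(profit))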

Which strategy?

  • Which strategy you should take depends on the size of your computer’s RAM memory.

  • Going over the RAM memory limit will suffocate your computer, which leads to a substantial loss in computing performance.

  • Vectorized version is more memory-hungry:

    • Strategy 1: loops over the price combinations (works on one price combination at a time)
    • Strategy 2: works on all the price combinations at the same time
  • If you can fit the entire dataset in the RAM memory, then take Strategy 2. Otherwise, break up the entire task into pieces like Strategy 1.

Keep track of RAM memory usage

  • Mac users: go to Applications \(\rightarrow\) Utilities \(\rightarrow\) Activity Monitor

  • Windows users: press Windows Key + R \(\rightarrow\) type “resmon” into the search box

Optimization (fixed attribute combinations)

Suppose you have run a randomized nitrogen experiment for corn production on a field, collected data, and run a regression to find the following quantitative relationship between corn yield (bu/acre) and nitrogen rate (lb/acre):

\[ \mbox{corn yield} = 120 + (EC/40) \times (1 + slope) \times 25 \times \log(\mbox{nitrogen rate}) \]

You are interested in finding the best nitrogen rates that maximize profit for different parts of the field at a given corn and nitrogen price combination.

\[ \max_{N} \; P_C \times [120 + (EC/40) \times (1 + slope) \times 25 \times \log(\mbox{nitrogen rate})] - P_N \times N \]

  • N: nitrogen rate (lb/acre)
  • slope: the slope
  • EC: electrical conductivity
  • P_C: corn price ($/bu)
  • P_N: nitrogen price ($/lb)

Here, N is a decision variable, slope and EC are attributes, and P_C and P_N are parameters.

Data

Consider a 2-plot field like below for the sake of tractability:
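A sketch of such a field (the slope and EC values are chosen to match the discussion below):

field_data <- data.frame(
  plot_id = c(1, 2),
  slope = c(0, 0.2),
  ec = c(40, 30)
)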


Objective

You want to find the optimal nitrogen rate for each plot for a given combination of corn and nitrogen prices.

You can expand.grid() over all the variables: nitrogen rate (the decision variable), slope and EC (attributes), and corn and nitrogen prices (parameters):
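A sketch (the price grids are illustrative):

expand.grid(
  N = N_seq,
  slope = c(0, 0.2),
  ec = c(30, 40),
  corn_price = seq(3, 5, by = 0.5),
  n_price = seq(0.3, 0.6, by = 0.1)
) %>%
  data.frame()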


  • You only need the rows whose slope-EC combination actually exists in the field: no plot in this dataset has the slope-EC combination c(0, 30) or c(0.2, 40)
  • This approach therefore creates unnecessary rows

Instead of expanding over all the vectors with expand.grid(), we can use expand.grid.df() from the reshape package as follows.
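A sketch (expand.grid.df() takes data.frames and combines their rows):

library(reshape)

expand.grid.df(
  field_data,
  data.frame(N = N_seq)
)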


As you can see, it creates the unique complete combinations of the rows from the first and second data.frames. Consequently, it does not create any observations that do not exist in reality.