05-1: User-defined Function, Loop, and Parallelization

Tips to make the most of the lecture notes

  • Click on the three horizontally stacked lines at the bottom left corner of the slide; you will then see the table of contents, and you can jump to the section you want

  • Hit the letter “o” on your keyboard and you will get a panel view of all the slides

  • The box area with a hint of blue as the background color is where you can write code (hereafter referred to as the “code area”).
  • Hit the “Run Code” button to execute all the code inside the code area.
  • You can evaluate (run) code selectively by highlighting the parts you want to run and hitting Command + Enter for Mac (Ctrl + Enter for Windows).
  • If you want to run the code on your own computer, you can first click on the icon with two sheets of paper stacked on top of each other (top right corner of the code chunk), which copies the code in the code area. You can then paste it on your computer.
  • You can click on the reload button (top right corner of the code chunk, left to the copy button) to revert back to the original code.

Learning objectives

  • Learn how to write functions yourself
  • Learn how to use a for loop and lapply() to complete repetitive jobs
  • Learn how not to loop things that can be easily vectorized
  • Learn how to parallelize repetitive jobs using the future_lapply() function from the future.apply package

User-defined Function


User-defined Function

It is beneficial to write your own function when you expect to repeat the same action with different inputs.


Example: mean()

Calculating the average of a variable is such a common task

  • You do not want to do sum(x)/length(x) every time you get a mean
  • You can just use the mean() function

A function is more useful when the task is longer and more complicated.

Here is the general structure of a user-defined function:

#--- NOT RUN ---#
function_name <- function(x) {

  # 1. do something with x

  # 2. return the results

}


Then, you can use the function like this:

#--- NOT RUN ---#
function_name(x)


Note: the argument does not have to be named x

The following function takes numbers, squares them, and returns the squared values:
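Here is a sketch of such a function (the names square_it and squared_a are illustrative):

square_it <- function(a) {
  squared_a <- a^2 # square the input
  return(squared_a) # return the squared value
}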


Try the function:
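For example (using the sketch above):

square_it(4) # returns 16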

Any objects that are defined within a function are not created in the global environment.

For example, squared_a and original_a below are defined only internally, and are not registered in the global environment.


You can confirm this by looking at the environment tab in RStudio after running the following:
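A sketch consistent with the discussion (the exact function body is illustrative):

square_it <- function(a) {
  original_a <- a # exists only inside the function
  squared_a <- original_a^2 # exists only inside the function
  return(original_a)
}

out <- square_it(2)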


Even though we are returning original_a, only its content is returned; the name original_a itself never appears in the global environment.

When R sees objects that are not explicitly provided as arguments to the function, R looks for them in the global environment:
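For example (a sketch; the name multiply_it is illustrative):

multiply_it <- function(a) {
  multiplied_a <- a * multiplier # multiplier is not an argument
  return(multiplied_a)
}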


Here, multiplier is not provided as an argument to the function.

Try this:
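Since multiplier is defined neither inside the function nor in the global environment yet, this fails:

multiply_it(2) # Error: object 'multiplier' not found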


Now, define multiplier on the global environment, and then run it again:
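#--- a sketch; the value 10 is arbitrary ---#
multiplier <- 10

multiply_it(2) # now works, returns 20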

You can set default values for function arguments by argument = value like below:
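For example (a sketch mirroring multiply_it() above):

multiply_it <- function(a, multiplier = 10) {
  multiplied_a <- a * multiplier
  return(multiplied_a)
}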


Try this:
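multiply_it(2) # uses the default, so returns 20
multiply_it(2, multiplier = 5) # overrides the default, so returns 10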

It is easy to create a function with multiple arguments. You simply add more arguments within function() like below:
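For example (square_them_add() reappears later in this lecture; its internals here are a sketch):

square_them_add <- function(a, b) {
  squared_and_added <- a^2 + b
  return(squared_and_added)
}

square_them_add(4, 3) # returns 19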


As you have likely noticed, the order of the input values is assumed to follow the order of the function's arguments. Above,

  • a = 4
  • b = 3

You can change the order of the input arguments if you name them as follows:
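#--- same as square_them_add(4, 3) ---#
square_them_add(b = 3, a = 4) # returns 19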

Define a function that takes temperature in Fahrenheit, converts it to Celsius, and returns it.

Here is the formula: temp_C <- (temp_F - 32) * 5 / 9

Work here


Answer

Code
f_to_c <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}

After running a randomized nitrogen trial, you found the following relationship between corn yield (bu/acre) and nitrogen rate (lb/acre):

\[ \mbox{corn yield} = 120 + 25 \times \log(\mbox{nitrogen rate}) \]

Write a function that takes a nitrogen rate as an argument, calculates the estimated yield for that nitrogen rate, and then returns the estimated yield.

Work here


Answer

Code
n_to_yield <- function(N) {
  yield <- 120 + 25 * log(N)
  return(yield)
}

You would like to calculate the revenue of corn production (defined below) as a function of nitrogen rate based on the yield response function.

\[ P_{corn} \times \mbox{corn yield} \]

Write a function that takes corn price and nitrogen rate as its arguments, calculates revenue (and yield as an intermediate step), and returns revenue. In doing so, use the function you created in Exercise 2.

Work here


Answer

Code
calc_rev <- function(corn_price, N) {
  yield <- n_to_yield(N)
  revenue <- corn_price * yield
  return(revenue)
}

Loop


Loop

  • We sometimes need to run the same process over and over again often with slight changes in parameters.

  • In such a case, it is very time-consuming and messy to write out all of the steps one by one.

  • For example, suppose you are interested in knowing the square of 1 through 5 with a step of 1 ([1,2,3,4,5]). The following code certainly works:

1^2 
2^2 
3^2 
4^2 
5^2 
  • However, imagine you have to do this for 1000 integers.

  • Yes, you don’t want to write each one of them one by one as that would occupy 1000 lines of your code, and it would be time-consuming.

  • Things will be even worse if you need to repeat much more complicated processes like Monte Carlo simulations. So, let’s learn how to write a program that does repetitive jobs effectively using loops.

  • Looping is repeatedly evaluating the same process over and over again, changing only the parameters.

  • In the example above,

    • the same repeated process is the action of squaring
    • what you square (parameter) changes

Syntax

#--- NOT RUN ---#
for (x in a_list_of_values){
  # do what you want to do with x
} 


This does the same:
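A sketch (print() makes the results show up on the console):

for (x in 1:5){
  print(x^2)
}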

Write a for loop that cubes each element of the sequence of numbers that starts from 5 and increases up to 50 with the incremental step of 5.

Work here


Answer

Code
for (i in seq(5, 50, by = 5)){
  print(i^3)
}

Looping with lapply()

Instead of using a for loop, we can use the lapply() function from the base package to loop.


Syntax

#--- NOT RUN ---#  
lapply(A, B) 
  • A is the list of values
  • B is the function you would like to apply to each of the values in A
    • You can use an existing function (e.g., mean())
    • You can also write a function on the spot


Note:

  • If A is a vector, lapply() works on each of the vector elements
  • If A is a list, lapply() works on each of the list elements, whatever they may be
  • If A is a data.frame, lapply() works on each of the columns (a data.frame is a list of columns of equal length)

This does the same thing as the for loop example we looked at earlier:
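A sketch:

sq_ls <- lapply(1:5, \(x) x^2)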

  • \(x): shorthand for function(x)


The key difference from a for loop is the class of the object you get back after the loop.

Important: the output of lapply() is always a list (that’s why it is called lapply)

It is often the case that you want to write a function of the action you intend to repeat first and then loop.

For example, for the loop of squaring numbers, you can first define a function that implements the action of squaring:
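A sketch:

square_it <- function(x) {
  return(x^2)
}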


And then loop:
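#--- loop over 1 through 5 with the sketched function ---#
sq_ls <- lapply(1:5, square_it)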

Oftentimes, you would like to loop over a single parameter of a function that has multiple arguments:

For example, you would like to fix the value of b at 5 while trying different values of a of the following function:
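For example (the square_them_add() sketch from earlier):

square_them_add <- function(a, b) {
  squared_and_added <- a^2 + b
  return(squared_and_added)
}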


Then you can do this:
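#--- loop over a, with b fixed at 5 ---#
lapply(1:10, \(x) square_them_add(x, b = 5))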


As you can see, lapply() tries each element of 1:10 (called x internally), and gives it to square_them_add() as its first argument while b is fixed at 5.

Use lapply() to cube each element of the sequence of numbers that starts from 5 and increases up to 50 with the incremental step of 5.


Work here


Answer

Code
lapply(seq(5, 50, by = 5), \(x) x^3) 

Define a function that takes a nitrogen rate and corn price as arguments and calculates revenue. Yield can be estimated using the following equation:

\[ \mbox{corn yield} = 120 + 25 \times \log(\mbox{nitrogen rate}) \]

At each value of the corn price sequence of seq(2.5, 4.0, by = 0.1), calculate the revenue using lapply() where nitrogen rate is fixed at 200 (lb/acre).


Work here


Answer

Code
calc_rev <- function(N, corn_price) {
  yield <- 120 + 25 * log(N)
  revenue <- corn_price * yield
  return(revenue)
}

corn_price_seq <- seq(2.5, 4.0, by = 0.1)

lapply(corn_price_seq, \(x) calc_rev(N = 200, corn_price = x))

Looping over multiple variables

  • The example we have looked at is a very simple case where a loop is done over a single list of values

  • It is often the case that you want to loop over multiple variables.

Example

You are interested in understanding the sensitivity of the profitability of corn production with respect to corn price and nitrogen application rate.

So, you would like to loop over two sets of sequences of values:

  • corn price
  • nitrogen application rate

How

The trick is to

  • create a data.frame of the two variables (corn price and nitrogen application rate), or of as many variables as you would like to loop over, which stores all the permutations of their values

  • then loop over the rows of the data.frame

  • We are interested in understanding the sensitivity of corn revenue to corn price and applied nitrogen amount.

  • We consider

    • the range of $3.0/bu to $5.0/bu for corn price
    • 0 lb/acre to 300 lb/acre for nitrogen rate

Get a sequence of values for corn price and nitrogen rate:
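A sketch (the step sizes are illustrative):

corn_price_seq <- seq(3, 5, by = 0.5)
n_rate_seq <- seq(0, 300, by = 50) # note: log(0) = -Inf, so the 0 rate yields -Inf revenue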


We then create a complete combination of the values using the expand.grid() function, and then convert it to a data.frame object (this is not strictly necessary).
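A sketch (the name parameters_data is used in what follows; %>% and bind_rows() come from the dplyr package):

library(dplyr)

parameters_data <-
  expand.grid(corn_price = corn_price_seq, n_rate = n_rate_seq) %>%
  data.frame()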

Define a function that

  • takes a row number
  • refers to parameters_data to extract the parameters stored at that row number
  • calculates corn yield and revenue based on the extracted parameters (corn price and nitrogen rate), as sketched below.
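A sketch of such a function (the name calc_rev_i is illustrative):

calc_rev_i <- function(i) {

  #--- extract the parameters stored at row i ---#
  corn_price <- parameters_data[i, "corn_price"]
  n_rate <- parameters_data[i, "n_rate"]

  #--- calculate yield and revenue ---#
  yield <- 120 + 25 * log(n_rate)
  revenue <- corn_price * yield

  #--- return the results as a data.frame ---#
  return_data <- data.frame(revenue = revenue, corn_price = corn_price, n_rate = n_rate)

  return(return_data)
}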


This function

  • takes i (which acts as a row number within the function)
  • extracts corn price and nitrogen rate from the i-th row of parameters_data
  • uses the extracted values to calculate yield and revenue
  • creates a data.frame of the resulting revenue, corn price, and nitrogen rate
  • returns the data.frame

Do a loop using lapply():
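#--- a sketch: loop over the row numbers ---#
rev_data_list <- lapply(1:nrow(parameters_data), calc_rev_i)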

Combine the list of data.frames into a single data.frame using bind_rows() from the dplyr package.
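#--- combine into one data.frame ---#
rev_data <- bind_rows(rev_data_list)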

Before defining a function, write code that works for one row.

We will work with a specific value of i: here, i = 1.
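A sketch of the one-row version:

corn_price <- parameters_data[1, "corn_price"]
n_rate <- parameters_data[1, "n_rate"]

yield <- 120 + 25 * log(n_rate)
revenue <- corn_price * yield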


After you confirm that the code gives you the desired outcome, turn it into a function by replacing 1 with i.

Find the profit of corn production at different price combinations of corn and nitrogen where nitrogen rate is fixed at 200 lb/acre.

  • Step 1: Define the following sequences of numbers
    • corn price: seq(2.5, 4.0, by = 0.05)
    • nitrogen price: seq(0.2, 0.6, by = 0.01)
  • Step 2: Create a data.frame of the complete combinations of the values from the price vectors
  • Step 3: Define a function that takes a row number, extracts corn price and nitrogen price, and then calculates profit based on the price combination using the following equations:

\[ \mbox{corn yield} = 120 + 25 \times \log(\mbox{nitrogen rate}) \]

\[ \mbox{profit} = \mbox{corn price} \times \mbox{corn yield} - \mbox{nitrogen price} \times \mbox{nitrogen rate} \]

  • Step 4: loop over the row numbers of the parameter data
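Here is a sketch of one possible answer (object and function names are illustrative):

#--- Step 1 ---#
corn_price_seq <- seq(2.5, 4.0, by = 0.05)
n_price_seq <- seq(0.2, 0.6, by = 0.01)

#--- Step 2 ---#
price_data <- expand.grid(corn_price = corn_price_seq, n_price = n_price_seq) %>% data.frame()

#--- Step 3 ---#
calc_profit_i <- function(i) {
  corn_price <- price_data[i, "corn_price"]
  n_price <- price_data[i, "n_price"]
  N <- 200 # nitrogen rate fixed at 200 lb/acre
  yield <- 120 + 25 * log(N)
  profit <- corn_price * yield - n_price * N
  return(data.frame(corn_price = corn_price, n_price = n_price, profit = profit))
}

#--- Step 4 ---#
profit_data <- bind_rows(lapply(1:nrow(price_data), calc_profit_i))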

Do you really need to loop?

  • Actually, in practice we should not have used a for loop or lapply() in any of the examples above

  • This is because they can be easily vectorized.

  • Vectorized operations are those that take vectors as inputs and work on all of the elements of those vectors at once

Example

Vectorized
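A sketch (multiplying each element by 2; the operation itself is illustrative):

x <- 1:1000
x_times_2 <- x * 2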


Non-vectorized (loop)
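The same operation with lapply() (a sketch):

x_times_2_list <- lapply(x, \(a) a * 2)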


Compare

Both produce the same results. However, R is written in a way that is much better at doing vectorized operations.

Let’s time them using the microbenchmark() function from the microbenchmark package.

Here, we do not unlist() after lapply() to just focus on the multiplication part.
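A sketch of the benchmark:

library(microbenchmark)

x <- 1:1000

microbenchmark(
  "vectorized" = { x * 2 },
  "non-vectorized" = { lapply(x, \(a) a * 2) },
  times = 100
)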


  • As you can see, the vectorized version is faster.
  • The time difference comes from R having to conduct many more internal checks and hidden operations for the non-vectorized one

Instead of this:
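For example, something like this (re-using the calc_rev() function from the earlier exercise):

lapply(corn_price_seq, \(x) calc_rev(N = 200, corn_price = x))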


You can just do this:
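#--- calc_rev() works on the whole price vector at once ---#
calc_rev(N = 200, corn_price = corn_price_seq)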

Here is the vectorized version of the revenue sensitivity analysis:
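A sketch (the name calc_rev_vec is illustrative; it works on whole vectors because every operation inside is elementwise):

calc_rev_vec <- function(corn_price, n_rate) {
  yield <- 120 + 25 * log(n_rate)
  revenue <- corn_price * yield
  return(revenue)
}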


Then use the function to calculate revenue and assign it to a new variable in the parameters_data data.
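A sketch using dplyr::mutate():

parameters_data <- parameters_data %>%
  mutate(revenue = calc_rev_vec(corn_price, n_rate))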

Let’s compare the vectorized and non-vectorized version:
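A sketch of the comparison (calc_rev_i() is the row-by-row function sketched earlier):

microbenchmark(
  "vectorized" = { calc_rev_vec(parameters_data$corn_price, parameters_data$n_rate) },
  "non-vectorized" = { lapply(1:nrow(parameters_data), calc_rev_i) },
  times = 100
)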

Parallelized computation


Parallel processing

  • Parallelization of computation involves distributing the task at hand to multiple cores so that multiple processes are done in parallel.

  • Our focus is on the so-called embarrassingly parallel processes.

Embarrassingly parallel process: a collection of processes where each process is completely independent of the others (one process does not use the outputs of any of the other processes)

  • The example of integer squaring is embarrassingly parallel. In order to calculate \(1^2\), you do not need to use the result of \(2^2\) or any other squares.

  • Embarrassingly parallel processes are very easy to parallelize because you do not have to worry about which process to complete first to make other processes happen.

  • Fortunately, most of the processes you are interested in parallelizing fall under this category

  • We will use the future_lapply() function from the future.apply package for parallelization.

  • Using the package, parallelization is a piece of cake as it is basically the same syntactically as lapply().

#--- install the package ---#
install.packages("future.apply") 

#--- load packages ---#
library(future.apply)


How

You can simply replace lapply() with future_lapply()!

#--- not parallelized ---#
sq_ls <- lapply(1:1000, function(x) x^2) 

#--- parallelized ---#
sq_ls_par <- future_lapply(1:1000, function(x) x^2) 

You can find out how many cores you have available for parallel computation on your computer using the detectCores() function from the parallel package.

#--- number of all cores ---#
parallel::detectCores()
[1] 20


Before we implement parallelized lapply(), we need to declare what backend process we will be using via plan().

plan(multicore, workers = parallel::detectCores() - 1)


Other backend processes are:

  • sequential: this is just a regular loop
  • multicore: forked sessions (not available on Windows)
  • multisession: multiple sessions (less performant than multicore)

With the multicore option, R figures out whether multicore or multisession should be used (or can be used) and automatically redirects the backend process to the appropriate (available) one.


Note

Unless you tell R explicitly to parallelize things (like using future_lapply()), R always uses a single core by default. So, you do not have to change anything manually when you do not want to use multiple cores.

sq_ls <- future_lapply(1:1000, function(x) x^2)
microbenchmark(

  #--- parallelized ---#
  "parallelized" = { sq_ls <- future_lapply(1:1000, function(x) x^2) }, 

  #--- non-parallelized ---#
  "not parallelized" = { sq_ls <- lapply(1:1000, function(x) x^2) },
  times = 100, 
  unit = "ms"

) 
Unit: milliseconds
             expr        min         lq       mean      median        uq       max neval
     parallelized 214.475455 252.632071 288.473396 278.4967715 321.18057 467.03261   100
 not parallelized   0.579383   0.742785   1.089473   0.8525375   1.15469   7.06664   100
  • Notice that the parallelized version is actually slower! This is because communicating jobs to each core takes some time as well.

  • So, if each of the iterative processes is super fast (like this example where you just square a number), the time spent on communicating with the cores outweighs the time saving due to parallel computation.

  • Parallelization is more beneficial when each of the repetitive processes takes long.

Parallel processing: a less trivial example

  • One of the very good use cases of parallelization is MC simulation

  • We will run MC simulations that test whether the correlation between an independent variable and error term would cause bias (yes, we know the answer).

  1. generate a dataset (50,000 observations) according to the following data generating process:

\[ y = 1 + x + v \]

where

  • \(\mu \sim N(0,1)\)
  • \(x \sim N(0,1) + \mu\)
  • \(v \sim N(0,1) + \mu\)

The \(\mu\) term causes correlation between \(x\) (the covariate) and \(v\) (the error term).

  2. estimate the coefficient on \(x\) via OLS, and return the estimate.

  3. repeat this process \(1,000\) times to understand the properties of the OLS estimator under this data generating process.

This Monte Carlo simulation is embarrassingly parallel because each process is independent of any other.

Here is a function that implements the steps described in the previous slide:

#--- steps 1 and 2 (to be repeated 1,000 times) ---#
library(data.table)

MC_sim <- function(i){

  N <- 50000 # sample size

  #--- step 1: generate data ---#
  mu <- rnorm(N) # the common term shared by both x and v
  x <- rnorm(N) + mu # independent variable
  v <- rnorm(N) + mu # error
  y <- 1 + x + v # dependent variable
  data <- data.table(y = y, x = x)

  #--- step 2: OLS ---# 
  reg <- lm(y ~ x, data = data) # OLS

  #--- return the coef ---#
  return(reg$coef['x'])
}  

Single run:

library(tictoc) # tic()/toc() timers

tic()
single_res <- MC_sim(1)
toc()
0.014 sec elapsed

Not parallelized (sequential):

tic()
MC_results <- lapply(1:1000, MC_sim)
toc() 
18.541 sec elapsed

Parallelized:

tic()
MC_results <- future_lapply(1:1000, MC_sim)
toc() 
2.763 sec elapsed
  • For Mac or Linux users, parallel::mclapply() is just as compelling (or pbmcapply::pbmclapply() if you want a nice progress report, which is very helpful particularly when the process is long).

  • It is just as easy to use as future_lapply() because its syntax is the same as lapply().

  • You can control the number of cores to employ by adding the mc.cores option. Here is an example code that does the same MC simulations we conducted above:

#--- mclapply ---#
library(parallel)
MC_results <- mclapply(1:1000, MC_sim, mc.cores = detectCores() - 1)

#--- or with progress bar ---#
library(pbmcapply)
MC_results <- pbmclapply(1:1000, MC_sim, mc.cores = detectCores() - 1)

Low-dimensional optimization


Low-dimensional optimization

Suppose you have run a randomized nitrogen experiment for corn production on a field, collected data, and run a regression to find the following quantitative relationship between corn yield (bu/acre) and nitrogen rate (lb/acre):

\[ \mbox{corn yield} = 120 + 25 \times \log(\mbox{nitrogen rate}) \]

You are interested in finding the best nitrogen rates that maximize profit at different combinations of corn and nitrogen prices for this field.

\[ \max_{N} \; P_C \times Y(N) - P_N \times N \]

  • N: nitrogen rate (lb/acre)
  • Y(N): corn yield (bu/acre) as a function of N as described above
  • P_C: corn price ($/bu)
  • P_N: nitrogen price ($/lb)

Here, N is a decision variable, and P_C and P_N are parameters.

Grid search is a very inefficient yet effective tool for finding solutions to optimization problems as long as the dimension of the optimization problem is low (1 or 2 variables).

Grid search basically evaluates the objective function (profit here) at many levels of the decision variable (nitrogen here) and picks the one that maximizes the objective function (profit).


Example

Let’s define a function that takes N, P_C, and P_N values and returns profits.
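A sketch (the name calc_profit is illustrative):

calc_profit <- function(N, P_C, P_N) {
  yield <- 120 + 25 * log(N)
  profit <- P_C * yield - P_N * N
  return(profit)
}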


Let’s create a sequence of N values at which we evaluate the profit, and then calculate profit at each level of N.
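A sketch (the N grid and the price values are illustrative):

library(dplyr)

N_seq <- seq(1, 300, by = 1)

profit_data <-
  data.frame(N = N_seq) %>%
  mutate(profit = calc_profit(N, P_C = 3.5, P_N = 0.4))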

Here is the visualization of profit-N relationship:
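A sketch using the ggplot2 package:

library(ggplot2)

ggplot(data = profit_data, aes(x = N, y = profit)) +
  geom_line()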

Once the profit-N relationship is found, we can use dplyr::filter() combined with max() to identify the optimal N rate.
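#--- keep the row(s) where profit is at its maximum ---#
profit_data %>% filter(profit == max(profit))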


Alternatively, you can sort the data by profit in the ascending order (default) and pick the last row using dplyr::slice(n()).
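#--- sort by profit and pick the last row ---#
profit_data %>% arrange(profit) %>% slice(n())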


This method is faster than the first one.

Coding strategy 1: looping

Now that you have written code that finds the optimal N at a given combination of corn and nitrogen prices, we can move on to the next step: finding the optimal N rates at many different combinations of corn and nitrogen prices.

Here is the coding strategy:

  1. Define a set of all the combinations of corn and nitrogen prices you want to analyze as a data.frame.

  2. Define a function that extracts corn and nitrogen prices from the parameter data.frame and finds the optimal N rate at the given price combination.

  3. Loop over the set of parameters.

Here, we define a data.frame of parameters to be explored. We will be looping over the rows of the parameter data.frame.
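A sketch (the price grids are illustrative; the name price_parameters is used below):

price_parameters <-
  expand.grid(
    corn_price = seq(3, 5, by = 0.5),
    n_price = seq(0.3, 0.6, by = 0.1)
  ) %>%
  data.frame()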

Now, we will define a function that extracts a combination of corn and nitrogen prices from price_parameters (extracts a row from price_parameters), and then finds the optimal N.
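A sketch of such a function (the name find_opt_N is illustrative; calc_profit() and N_seq come from the sketches above):

find_opt_N <- function(i) {

  #--- extract the price combination at row i ---#
  corn_price <- price_parameters[i, "corn_price"]
  n_price <- price_parameters[i, "n_price"]

  #--- grid search over N ---#
  opt_N_data <-
    data.frame(N = N_seq) %>%
    mutate(profit = calc_profit(N, P_C = corn_price, P_N = n_price)) %>%
    filter(profit == max(profit)) %>%
    mutate(corn_price = corn_price, n_price = n_price)

  return(opt_N_data)
}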


Check if this function works:
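#--- optimal N at the first price combination ---#
find_opt_N(1)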



Combine the list of data.frames into a single data.frame using bind_rows().
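A sketch of the loop and the combining step:

opt_N_list <- lapply(1:nrow(price_parameters), find_opt_N)

opt_N_results <- bind_rows(opt_N_list)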

Coding strategy 2: vectorized

Instead of writing a loop like above, we can actually vectorize the process. Here are the steps:

  1. Define a set of all the combinations of nitrogen rate, corn price, and nitrogen price you want to analyze as a data.frame.

  2. Calculate profits for all the combinations of nitrogen rate, corn price, and nitrogen price inside the data.frame

  3. Find the optimal N rate for each of the combinations of corn price and nitrogen price

Here, we define all the combinations of nitrogen rate, corn price, and nitrogen price you want to analyze as a data.frame.
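A sketch (re-using the illustrative grids from above):

price_parameters <-
  expand.grid(
    N = N_seq,
    corn_price = seq(3, 5, by = 0.5),
    n_price = seq(0.3, 0.6, by = 0.1)
  ) %>%
  data.frame()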

Now, we will calculate profit for all the rows in price_parameters.
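#--- a vectorized profit calculation over all the rows at once ---#
price_parameters <- price_parameters %>%
  mutate(profit = corn_price * (120 + 25 * log(N)) - n_price * N)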


Now, we can identify the optimal N rate at each of the corn and nitrogen price combinations:
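A sketch using group-wise filtering (an assumption; any group-wise maximization would do):

price_parameters %>%
  group_by(corn_price, n_price) %>%
  filter(profit == max(profit))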

Which strategy?

  • Which strategy you should take depends on the size of your computer’s RAM memory.

  • Going over the RAM memory limit will suffocate your computer, which leads to a substantial loss in computing performance.

  • Vectorized version is more memory-hungry:

    • Strategy 1: loops over the price combinations (works on one price combination at a time)
    • Strategy 2: works on all the price combinations at the same time
  • If you can fit the entire dataset in the RAM memory, then take Strategy 2. Otherwise, break up the entire task into pieces like Strategy 1.

Keep track of RAM memory usage

  • Mac users: go to Applications \(\rightarrow\) Utilities \(\rightarrow\) Activity Monitor

  • Windows users: press Windows Key + R \(\rightarrow\) type “resmon” into the search box

Optimization (fixed attribute combinations)

Suppose you have run a randomized nitrogen experiment for corn production on a field, collected data, and run a regression to find the following quantitative relationship between corn yield (bu/acre) and nitrogen rate (lb/acre):

\[ \mbox{corn yield} = 120 + (EC/40) \times (1 + slope) \times 25 \times \log(\mbox{nitrogen rate}) \]

You are interested in finding the best nitrogen rates that maximize profit for different parts of the field at a given corn and nitrogen price combination.

\[ \max_{N} \; P_C \times [120 + (EC/40) \times (1 + slope) \times 25 \times \log(\mbox{nitrogen rate})] - P_N \times N \]

  • N: nitrogen rate (lb/acre)
  • slope: the slope
  • EC: electrical conductivity
  • P_C: corn price ($/bu)
  • P_N: nitrogen price ($/lb)

Here, N is a decision variable, slope and EC are attributes, and P_C and P_N are parameters.

Data

Consider a 2-plot field like below for the sake of tractability:
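A sketch of such a field (the slope and EC values are chosen to match the discussion below):

field_data <- data.frame(
  plot_id = c(1, 2),
  slope = c(0, 0.2),
  ec = c(40, 30)
)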


Objective

You want to find the optimal nitrogen rate for each plot for a given combination of corn and nitrogen prices.

You can expand.grid() over all the variables: nitrogen rate (the decision variable), slope and EC (attributes), and corn and nitrogen prices (parameters):
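A sketch (the price grids are illustrative):

expand.grid(
  N = N_seq,
  slope = c(0, 0.2),
  ec = c(30, 40),
  corn_price = seq(3, 5, by = 0.5),
  n_price = seq(0.3, 0.6, by = 0.1)
) %>%
  data.frame()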


  • You only need the rows whose slope-EC combination actually exists in the field: no plot in this dataset has the slope-EC combination c(0, 30) or c(0.2, 40)
  • This approach therefore creates unnecessary rows

Instead of expanding over all the vectors with expand.grid(), we can use expand.grid.df() from the reshape package as follows.
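A sketch (expand.grid.df() takes data.frames and combines their rows):

library(reshape)

expand.grid.df(
  field_data,
  data.frame(N = N_seq)
)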


As you can see, it creates the unique complete combinations of the rows from the first and second data.frames. Consequently, it does not create any observations that do not exist in reality.