Click on the three horizontally stacked lines at the bottom left corner of the slide to see the table of contents, where you can jump to the section you want
Hit the letter "o" on your keyboard to get a panel view of all the slides
- lapply() to complete repetitive jobs
- future_lapply() function from the future.apply package

It is beneficial to write your own function when you expect to repeat the same action with different inputs.
Example: mean()

Calculating the average of a variable is such a common task that you do not want to write sum(x)/length(x) every time you get a mean; you just use the mean() function.

A function is more useful when the task is longer and more complicated.
Here is the general structure of a user-defined function:
Then, you can use the function like this:
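A minimal sketch of that structure (the function name my_fcn, its argument, and its body are all illustrative):

```r
# define a function named my_fcn with one argument, x
my_fcn <- function(x) {
  # do whatever operations you want on x
  result <- x + 1
  # return the result
  return(result)
}

# use the function
my_fcn(2)
```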
Note: the argument does not have to be named x
The following function takes numeric values, squares them, and returns the squared values:
Try the function:
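A sketch of such a function (the name square_it is an assumption):

```r
# square the input and return the result
square_it <- function(a) {
  return(a^2)
}

# try the function
square_it(4)
```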
Any objects that are defined within a function are not created in the global environment.
For example, squared_a and original_a are defined only internally, and are not registered in the global environment.
You can confirm this by looking at the environment tab in RStudio after running the following:
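A sketch consistent with the object names mentioned above (the function name is illustrative):

```r
square_it <- function(a) {
  squared_a <- a^2 # created only inside the function
  original_a <- a  # created only inside the function
  return(original_a)
}

square_it(3)

# neither internal object exists in the global environment
exists("squared_a")
exists("original_a")
```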
Even though we are returning original_a, only its content is returned.
When R sees objects that are not provided explicitly as arguments to the function, R looks for them in the global environment:
Here, multiplier is not provided as an argument to the function.
Try this:
Now, define multiplier in the global environment, and then run it again:
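A sketch of this behavior (the name multiply_it is illustrative):

```r
multiply_it <- function(a) {
  # multiplier is not an argument, so R looks for it
  # in the global environment
  return(a * multiplier)
}

# multiply_it(2) # errors: object 'multiplier' not found

# define multiplier in the global environment, then run it again
multiplier <- 10
multiply_it(2)
```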
You can set default values for function arguments with argument = value, like below:
Try this:
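A sketch with a default value (names are illustrative):

```r
multiply_it <- function(a, multiplier = 10) {
  return(a * multiplier)
}

multiply_it(2)    # uses the default value of multiplier
multiply_it(2, 2) # overrides the default
```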
It is easy to create a function with multiple arguments: you simply add more arguments within function() like below:
As you have likely noticed, the order of the input arguments is assumed to be the same as the order of the arguments of the function. Above, a = 4 and b = 3.

You can change the order of the input arguments if you name them, as follows:
Define a function that takes temperature in Fahrenheit, converts it into Celsius, and returns it.
Here is the formula: temp_C <- (temp_F - 32) * 5 / 9
Work here
Answer
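One possible answer (the function name is up to you):

```r
f_to_c <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}

f_to_c(212)
```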
After running a randomized nitrogen trial, you found the following relationship between corn yield (bu/acre) and nitrogen rate (lb/acre):
\[ \mbox{corn yield} = 120 + 25 \times log(\mbox{nitrogen rate}) \]
Write a function that takes a nitrogen rate as an argument, calculates the estimated yield for that nitrogen rate, and then returns the estimated yield.
Work here
Answer
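One possible answer (the function name is illustrative):

```r
get_yield <- function(N) {
  yield <- 120 + 25 * log(N)
  return(yield)
}

get_yield(200)
```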
You would like to calculate the revenue of corn production (defined below) as a function of nitrogen rate based on the yield response function.
\[\begin{align*} P_{corn} \times \mbox{corn yield} \end{align*}\]

Write a function that takes corn price and nitrogen rate as its arguments, calculates revenue (and yield as an intermediate step), and returns revenue. In doing so, use the function you created in Exercise 2.
Work here
Answer
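One possible answer, reusing a yield function like the one from Exercise 2 (names are illustrative):

```r
get_yield <- function(N) {
  120 + 25 * log(N)
}

get_revenue <- function(P_corn, N) {
  yield <- get_yield(N) # the function from Exercise 2
  revenue <- P_corn * yield
  return(revenue)
}

get_revenue(3.5, 200)
```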
We sometimes need to run the same process over and over again often with slight changes in parameters.
In such a case, it is very time-consuming and messy to write out all of the steps one by one.
For example, suppose you are interested in knowing the square of 1 through 5 with a step of 1 ([1,2,3,4,5]). The following code certainly works:
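The naive version might look like this (the object names are illustrative):

```r
x_1 <- 1^2
x_2 <- 2^2
x_3 <- 3^2
x_4 <- 4^2
x_5 <- 5^2
```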
However, imagine you have to do this for 1000 integers.
Yes, you don't want to write each one of them one by one, as that would occupy 1,000 lines of your code and be time-consuming.
Things will be even worse if you need to repeat much more complicated processes like Monte Carlo simulations. So, let's learn how to write a program that does repetitive jobs effectively using a loop.
Looping is repeatedly evaluating the same process (except for its parameters) over and over again.
In the example above, the repeated process is squaring, and the parameter is the integer being squared.
Syntax
This does the same:
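A sketch of the for loop (object names are illustrative):

```r
squares <- rep(0, 5) # storage for the results

for (i in 1:5) {
  squares[i] <- i^2 # square the i-th value and store it
}

squares
```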
lapply()
Instead of using a for loop, we can use the lapply() function from the base package.
Syntax
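In code form, lapply(A, B), with illustrative inputs:

```r
# A: the list of values; B: the function applied to each value
A <- list(1:10, 11:20)
B <- mean

lapply(A, B)
```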
- A is the list of values
- B is the function you would like to apply to each of the values in A (e.g., mean())

Note:

- if A is a vector, lapply() works on each of the vector elements
- if A is a list, lapply() works on each of the list elements, whatever they may be
- if A is a data.frame, lapply() works on each of the columns (a data.frame is a list of columns of equal length)

This does the same thing as the for loop example we looked at earlier:
\(x): shorthand for function(x)
The key difference from a for loop is the object class of the output after the loop.

Important: the output type of lapply() is always a list (that is why it is called lapply).
It is often the case that you want to write a function of the action you intend to repeat first and then loop.
For example, for the loop of squaring numbers, you can first define a function that implements the action of squaring:
And then loop:
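A sketch, with square_it as the assumed function name:

```r
# the action to repeat
square_it <- function(a) {
  return(a^2)
}

# then loop
sq_ls <- lapply(1:5, square_it)
```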
Oftentimes, you would like to loop over a single parameter of a function that has multiple arguments:
For example, you would like to fix the value of b at 5 while trying different values of a in the following function:
Then you can do this:
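A sketch; the definition of square_them_add() below is an assumption (square the first argument, then add the second):

```r
# assumed definition
square_them_add <- function(a, b) {
  return(a^2 + b)
}

# try a = 1, ..., 10 while fixing b at 5
results <- lapply(1:10, function(x) square_them_add(x, b = 5))
```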
As you can see, this tries each element of 1:10 (called x internally), passing it to square_them_add() as its first argument while b is fixed at 5.
Use lapply()
to cube each element of the sequence of numbers that starts from 5 and increases up to 50 with the incremental step of 5.
Work here
Answer
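One possible answer:

```r
cubed_ls <- lapply(seq(5, 50, by = 5), function(x) x^3)
```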
Define a function that takes a nitrogen rate and corn price as arguments and calculate revenue. Yield can be estimated using the following equation:
\[ \mbox{corn yield} = 120 + 25 \times log(\mbox{nitrogen rate}) \]
At each value of the corn price sequence of seq(2.5, 4.0, by = 0.1)
, calculate the revenue using lapply()
where nitrogen rate is fixed at 200 (lb/acre).
Work here
Answer
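One possible answer (the function name is illustrative):

```r
get_revenue <- function(P_corn, N) {
  P_corn * (120 + 25 * log(N))
}

corn_price_seq <- seq(2.5, 4.0, by = 0.1)

# revenue at each corn price, with nitrogen rate fixed at 200
revenue_ls <- lapply(corn_price_seq, function(p) get_revenue(P_corn = p, N = 200))
```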
The example we have looked at is a very simple case where the loop is done over a single list of values.
It is often the case that you want to loop over multiple variables.
Example
You are interested in understanding the sensitivity of the profitability of corn production with respect to corn price and nitrogen application rate.
So, you would like to loop over two sets of sequences of values:
How

The trick is to:

- create a data.frame of the two variables (corn price and nitrogen application rate; or as many variables as you would like to loop over) that stores all the permutations of their values
- then loop over the rows of the data.frame
We are interested in understanding the sensitivity of corn revenue to corn price and applied nitrogen amount.
We consider a sequence of values for each variable. Get a sequence of values for corn price and nitrogen rate:
We then create a complete combination of the values using the expand.grid() function, and then convert it to a data.frame object (this is not strictly necessary).
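A sketch with assumed parameter values:

```r
corn_price_seq <- seq(2.5, 4.0, by = 0.5) # assumed values
N_seq <- seq(100, 300, by = 100)          # assumed values

# complete combinations of the two, converted to a data.frame
parameters_data <- data.frame(expand.grid(corn_price = corn_price_seq, N = N_seq))
```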
Define a function that accesses a row of parameters_data to extract the parameters stored at that row number.

This function:

- takes i (which acts as a row number within the function)
- extracts the ith row of parameters_data
- calculates revenue at those parameter values
- returns a data.frame of the resulting revenue, corn price, and nitrogen rate
Do a loop using lapply():
Combine the list of data.frames into a single data.frame using bind_rows() from the dplyr package.
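Putting the steps together in one sketch (the parameter values and function name are assumptions):

```r
library(dplyr)

# assumed parameter grid
parameters_data <- expand.grid(
  corn_price = seq(2.5, 4.0, by = 0.5),
  N = seq(100, 300, by = 100)
)

calc_revenue_i <- function(i) {
  #--- extract the parameters stored at row i ---#
  corn_price <- parameters_data[i, "corn_price"]
  N <- parameters_data[i, "N"]

  #--- calculate revenue ---#
  revenue <- corn_price * (120 + 25 * log(N))

  #--- return a data.frame ---#
  data.frame(revenue = revenue, corn_price = corn_price, N = N)
}

#--- loop over the rows, then combine into one data.frame ---#
results_ls <- lapply(1:nrow(parameters_data), calc_revenue_i)
results <- bind_rows(results_ls)
```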
Before defining a function, write code that works for one row. We will work on a specific value of i; here, i = 1.

After you confirm that the code gives you the desired outcomes, make it a function by replacing 1 with i.
Find the profit of corn production at different price combinations of corn and nitrogen where nitrogen rate is fixed at 200 lb/acre.
- corn price: seq(2.5, 4.0, by = 0.05)
- nitrogen price: seq(0.2, 0.6, by = 0.01)
\[ \mbox{corn yield} = 120 + 25 \times log(\mbox{nitrogen rate}) \]
\[ \mbox{profit} = \mbox{corn price} \times \mbox{corn yield} - \mbox{nitrogen price} \times \mbox{nitrogen rate} \]
Actually, in practice we should not have used a for loop or lapply() in any of the examples above.
This is because they can easily be vectorized.
Vectorized operations are those that take vectors as inputs and work on each element of the vectors in parallel.
Example
Vectorized
Non-vectorized (loop)
Compare
Both produce the same results. However, R is written in a way that is much better at doing vectorized operations.
Let's time them using the microbenchmark() function from the microbenchmark package.
Here, we do not unlist() after lapply(), to focus on just the multiplication part.
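A sketch of the timing comparison (assumes the microbenchmark package is installed; object names are illustrative):

```r
library(microbenchmark)

x <- 1:1000

microbenchmark(
  "vectorized" = { sq_vec <- x^2 },
  "not vectorized" = { sq_ls <- lapply(x, function(i) i^2) },
  times = 100
)
```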
Instead of this:
You can just do this:
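In code (a minimal sketch):

```r
#--- instead of this ---#
sq_ls <- lapply(1:1000, function(x) x^2)

#--- you can just do this ---#
sq <- (1:1000)^2
```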
Here is the vectorized version of the revenue sensitivity analysis:
Then use the function to calculate revenue and assign it to a new variable in the parameters_data data.
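A sketch, with assumed parameter values and an illustrative function name:

```r
# a vectorized revenue function: works on whole columns at once
get_revenue <- function(corn_price, N) {
  corn_price * (120 + 25 * log(N))
}

# assumed parameter grid
parameters_data <- expand.grid(
  corn_price = seq(2.5, 4.0, by = 0.5),
  N = seq(100, 300, by = 100)
)

# calculate revenue for all rows at once and assign it as a new variable
parameters_data$revenue <- get_revenue(parameters_data$corn_price, parameters_data$N)
```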
Let’s compare the vectorized and non-vectorized version:
Parallelization of computation involves distributing the task at hand to multiple cores so that multiple processes are done in parallel.
Our focus is on the so called embarrassingly parallel processes.
Embarrassingly parallel process: a collection of processes where each process is completely independent of any other (one process does not use the outputs of any of the other processes).
The example of integer squaring is embarrassingly parallel: in order to calculate \(1^2\), you do not need to use the result of \(2^2\) or any other square.
Embarrassingly parallel processes are very easy to parallelize because you do not have to worry about which process to complete first to make other processes happen.
Fortunately, most of the processes you are interested in parallelizing fall under this category.
We will use the future_lapply() function from the future.apply package for parallelization.
Using this package, parallelization is a piece of cake, as it is syntactically basically the same as lapply().
How
You can simply replace lapply() with future_lapply()!
You can find out how many cores you have available for parallel computation on your computer using the detectCores() function from the parallel package.
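For example:

```r
library(parallel)

# the number of cores available on this machine
detectCores()
```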
Before we implement a parallelized lapply(), we need to declare which backend process we will be using with plan().
Other backend processes are:

- sequential: this is just a regular loop
- multicore: forked sessions (not available on Windows)
- multisession: multiple sessions (less performant than multicore)

With the multiprocess option, R figures out whether multicore or multisession should be used (or can be used) and automatically redirects the backend process to the appropriate (available) one.
Note
Unless you tell R explicitly to parallelize things (like using future_lapply()), R always uses a single core by default. So, you do not have to change anything manually when you do not want to use multiple cores.
microbenchmark(
#--- parallelized ---#
"parallelized" = { sq_ls <- future_lapply(1:1000, function(x) x^2) },
#--- non-parallelized ---#
"not parallelized" = { sq_ls <- lapply(1:1000, function(x) x^2) },
times = 100,
unit = "ms"
)
Unit: milliseconds
             expr        min         lq       mean      median        uq       max neval
     parallelized 214.475455 252.632071 288.473396 278.4967715 321.18057 467.03261   100
 not parallelized   0.579383   0.742785   1.089473   0.8525375   1.15469   7.06664   100
This is because communicating jobs to each core takes some time as well.
So, if each of the iterative processes is super fast (like this example where you just square a number), the time spent on communicating with the cores outweighs the time saving due to parallel computation.
Parallelization is more beneficial when each of the repetitive processes takes long.
One of the very good use cases of parallelization is Monte Carlo (MC) simulation.
We will run MC simulations that test whether the correlation between an independent variable and error term would cause bias (yes, we know the answer).
\[ y = 1 + x + v \]
where

- \(\mu \sim N(0,1)\)
- \(x \sim N(0,1) + \mu\)
- \(v \sim N(0,1) + \mu\)
The \(\mu\) term causes correlation between \(x\) (the covariate) and \(v\) (the error term).
- estimate the coefficient on \(x\) via OLS, and return the estimate
- repeat this process \(1,000\) times to understand the property of the OLS estimator under this data generating process
This Monte Carlo simulation is embarrassingly parallel because each process is independent of any other.
Here is a function that implements the steps described in the previous slide:
#--- steps 1-3, to be repeated B times ---#
library(data.table)

MC_sim <- function(i) {

  N <- 50000 # sample size

  #--- steps 1 and 2: generate data ---#
  mu <- rnorm(N) # the common term shared by both x and v
  x <- rnorm(N) + mu # independent variable
  v <- rnorm(N) + mu # error
  y <- 1 + x + v # dependent variable
  data <- data.table(y = y, x = x)

  #--- step 3: OLS ---#
  reg <- lm(y ~ x, data = data) # OLS

  #--- return the coef ---#
  return(reg$coef["x"])
}
Single run:
Not parallelized (sequential):
Parallelized:
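A sketch of the three runs; it assumes the future.apply and data.table packages are installed, and it repeats MC_sim() here (with a smaller sample size and fewer iterations) so the block is self-contained:

```r
library(future.apply)
library(data.table)

plan(multisession, workers = 2)

# MC_sim() from the previous slide, with a reduced sample size
# so this sketch runs quickly
MC_sim <- function(i) {
  N <- 1000
  mu <- rnorm(N)
  x <- rnorm(N) + mu
  v <- rnorm(N) + mu
  y <- 1 + x + v
  data <- data.table(y = y, x = x)
  reg <- lm(y ~ x, data = data)
  return(reg$coef["x"])
}

#--- single run ---#
MC_sim(1)

#--- not parallelized (sequential) ---#
b_seq <- lapply(1:100, MC_sim)

#--- parallelized ---#
b_par <- future_lapply(1:100, MC_sim, future.seed = TRUE)
```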
For Mac or Linux users, parallel::mclapply() is just as compelling (or pbmclapply::pbmclapply() if you want a nice progress report, which is very helpful particularly when the process is long).
It is just as easy to use as future_lapply() because its syntax is the same as lapply(). You can control the number of cores to employ by adding the mc.cores option. Here is an example code that does the same MC simulations we conducted above:
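A sketch; MC_sim() is repeated here with a smaller sample size so the block is self-contained, and mc.cores is set to 2 for illustration (mc.cores > 1 is not supported on Windows):

```r
library(parallel)

# MC_sim() as defined earlier, with a reduced sample size
MC_sim <- function(i) {
  N <- 1000
  mu <- rnorm(N)
  x <- rnorm(N) + mu
  v <- rnorm(N) + mu
  y <- 1 + x + v
  coef(lm(y ~ x))["x"]
}

b_ls <- mclapply(1:100, MC_sim, mc.cores = 2)
```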
Suppose you have run a randomized nitrogen experiment for corn production on a field, collected data, and run a regression to find the following quantitative relationship between corn yield (bu/acre) and nitrogen rate (lb/acre):
\[ \mbox{corn yield} = 120 + 25 \times log(\mbox{nitrogen rate}) \]
You are interested in finding the best nitrogen rates that maximize profit at different combinations of corn and nitrogen prices for this field.
\[ \max_{N} \; P_C \times Y(N) - P_N \times N \]
- N: nitrogen rate (lb/acre)
- Y(N): corn yield (bu/acre) as a function of N, as described above
- P_C: corn price ($/bu)
- P_N: nitrogen price ($/lb)

Here, N is a decision variable, and P_C and P_N are parameters.
Grid search is a very inefficient yet effective tool for finding solutions to optimization problems as long as the dimension of the optimization problem is low (1 or 2 variables).
Grid search simply evaluates the objective function (profit here) at many levels of the decision variable (nitrogen here) and picks the level that maximizes the objective function.
Example
Let's define a function that takes N, P_C, and P_N values and returns profit.
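A sketch (the function name is illustrative):

```r
calc_profit <- function(N, P_C, P_N) {
  yield <- 120 + 25 * log(N)
  profit <- P_C * yield - P_N * N
  return(profit)
}
```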
Let's create a sequence of N values at which we evaluate profit, and then calculate profit at each level of N.
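A sketch with assumed prices (P_C = 3.5, P_N = 0.4) and an assumed grid of N values:

```r
calc_profit <- function(N, P_C, P_N) {
  P_C * (120 + 25 * log(N)) - P_N * N
}

# evaluate profit over a fine grid of N
profit_data <- data.frame(N = seq(1, 400, by = 1))
profit_data$profit <- calc_profit(profit_data$N, P_C = 3.5, P_N = 0.4)
```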
Here is the visualization of profit-N relationship:
Once the profit-N relationship is found, we can use dplyr::filter() combined with max() to identify the optimal N rate.
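A sketch (assumes dplyr; the prices and the N grid are assumptions):

```r
library(dplyr)

profit_data <- data.frame(N = seq(1, 400, by = 1))
profit_data$profit <- 3.5 * (120 + 25 * log(profit_data$N)) - 0.4 * profit_data$N

# keep the row(s) at which profit attains its maximum
opt_N_data <- filter(profit_data, profit == max(profit))
```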
Alternatively, you can sort the data by profit in ascending order (the default) and pick the last row using dplyr::slice(n()). This method is faster than the first one.
Now that you have written code to find the optimal N at a given combination of corn and nitrogen prices, we can move on to the next step of finding the optimal N rates at many different combinations of corn and nitrogen prices.
Here is the coding strategy:
Define a set of all the combinations of corn and nitrogen prices you want to analyze as a data.frame
.
Define a function that extracts corn and nitrogen prices from the parameter data.frame and finds the optimal N rate at that price combination.
Loop over the set of parameters.
Here, we define a data.frame of parameters to be explored. We will be looping over the rows of this parameter data.frame.
Now, we will define a function that extracts a combination of corn and nitrogen prices from price_parameters (a row of price_parameters), and then finds the optimal N.
Check if this function works:
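A sketch of both steps (the price sequences, the N grid, and the function name are assumptions):

```r
price_parameters <- expand.grid(
  corn_price = seq(2.5, 4.0, by = 0.5), # assumed values
  N_price = seq(0.2, 0.6, by = 0.1)     # assumed values
)

find_opt_N <- function(i) {
  #--- extract a price combination (row i of price_parameters) ---#
  P_C <- price_parameters[i, "corn_price"]
  P_N <- price_parameters[i, "N_price"]

  #--- grid search over N ---#
  N_seq <- seq(1, 600, by = 1)
  profit <- P_C * (120 + 25 * log(N_seq)) - P_N * N_seq

  data.frame(opt_N = N_seq[which.max(profit)], corn_price = P_C, N_price = P_N)
}

#--- check that it works for the first row ---#
find_opt_N(1)
```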
Combine the list of data.frames into a single data.frame using bind_rows().
Instead of writing a loop like above, we can actually vectorize the process. Here are the steps:
- Define a set of all the combinations of nitrogen rate, corn price, and nitrogen price you want to analyze as a data.frame.
- Calculate profits for all the combinations inside the data.frame.
- Find the optimal N rate for each combination of corn and nitrogen prices.

Here, we define all the combinations of nitrogen rate, corn price, and nitrogen price we want to analyze as a data.frame.
Now, we will calculate profit for all the rows in price_parameters.
Now, we can identify the optimal N rate at each of the corn and nitrogen combinations:
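The three steps can be sketched as follows (assumes dplyr; the value grids are assumptions):

```r
library(dplyr)

# all combinations of N, corn price, and nitrogen price
price_parameters <- expand.grid(
  N = seq(1, 600, by = 1),
  corn_price = seq(2.5, 4.0, by = 0.5),
  N_price = seq(0.2, 0.6, by = 0.1)
)

# profit for every row at once (vectorized)
price_parameters$profit <-
  price_parameters$corn_price * (120 + 25 * log(price_parameters$N)) -
  price_parameters$N_price * price_parameters$N

# optimal N at each corn/nitrogen price combination
opt_N_data <-
  price_parameters |>
  group_by(corn_price, N_price) |>
  filter(profit == max(profit)) |>
  ungroup()
```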
Which strategy you should take depends on the size of your computer's RAM.
Going over the RAM limit will suffocate your computer, which leads to a substantial loss in computing performance.
The vectorized version is more memory-hungry:
If you can fit the entire dataset in RAM, then take Strategy 2. Otherwise, break up the entire task into pieces like Strategy 1.
Keep track of RAM memory usage
Mac users: go to Applications \(\rightarrow\) Utilities \(\rightarrow\) Activity Monitor
Windows users: press Windows Key + R \(\rightarrow\) type "resmon" into the search box
Suppose you have run a randomized nitrogen experiment for corn production on a field, collected data, and run a regression to find the following quantitative relationship between corn yield (bu/acre) and nitrogen rate (lb/acre):
\[ \mbox{corn yield} = 120 + (EC/40) \times (1 + slope)\times 25 \times log(\mbox{nitrogen rate}) \]
You are interested in finding the best nitrogen rates that maximize profit for different parts of the field at a given corn and nitrogen price combination.
\[ \max_{N} \; P_C \times [120 + (EC/40) \times (1 + slope) \times 25 \times log(\mbox{nitrogen rate})] - P_N \times N \]
- N: nitrogen rate (lb/acre)
- slope: the slope
- EC: electrical conductivity
- P_C: corn price ($/bu)
- P_N: nitrogen price ($/lb)

Here, N is a decision variable, slope and EC are attributes, and P_C and P_N are parameters.
Data
Consider a 2-plot field like below for the sake of tractability:
Objective
You want to find the optimal nitrogen rate for each plot for a given combination of corn and nitrogen prices.
You could expand on all the variables: nitrogen rate (the decision variable), slope and EC (attributes), and corn and nitrogen prices (parameters):
However, the only slope-ec combinations that actually exist are c(0, 30) and c(0.2, 40).
Instead of expanding on all three vectors using expand.grid(), we can use expand.grid.df() from the reshape package as follows.
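A sketch (assumes the reshape package is installed; the nitrogen rates are assumed values):

```r
library(reshape)

# the slope-EC combinations that actually exist (the two plots)
plot_data <- data.frame(slope = c(0, 0.2), ec = c(30, 40))

# nitrogen rates to try
N_data <- data.frame(N = seq(100, 300, by = 100))

# complete combinations of the ROWS of the two data.frames
analysis_data <- expand.grid.df(plot_data, N_data)
```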
As you can see, it creates the complete combinations of the rows from the first and second data.frames. Consequently, it does not create any observations that do not exist in reality.