6.2 Many multi-layer raster files

Here we discuss ways to parallelize the process of extracting values from many multi-layer raster files.

6.2.1 Datasets

We will use the following datasets:

  • raster: daily PRISM data 2009 through 2018 stacked by month
  • polygons: US County polygons

daily PRISM precipitation 2009 through 2018

You can download all the PRISM files from here. For those who are interested in learning how to generate the series of daily PRISM data files stored by month, see Section 9.3 for the code.
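
The idea is to stack a month's worth of daily layers and write them out as a single multi-layer GeoTIFF. Below is a minimal sketch of that idea; the daily file names used here are hypothetical, and the actual code in Section 9.3 may differ.

#--- year-month to process ---#
year <- 2009
month <- 1

#--- hypothetical names of the daily single-layer files ---#
daily_files <- paste0("Data/PRISM/daily/PRISM_ppt_y", year, "_m", month, "_d", 1:31, ".tif")
daily_files <- daily_files[file.exists(daily_files)]

#--- stack the daily layers and save them as one multi-layer GeoTIFF ---#
writeRaster(
  stack(daily_files), 
  paste0("Data/PRISM/PRISM_ppt_y", year, "_m", month, ".tif"), 
  overwrite = TRUE
)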

US counties

(
US_county <- st_as_sf(maps::map(database = "county", plot = FALSE, fill = TRUE)) %>% 
  #--- get state name from ID  ---#
  mutate(state = str_split(ID, ",") %>% lapply(., `[[`, 1) %>% unlist) %>% 
  #--- project to the CRS of the PRISM data ---#
  st_transform(projection(brick("Data/PRISM/PRISM_ppt_y2017_m7.tif"))) 
)
Simple feature collection with 3076 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -124.6813 ymin: 25.12993 xmax: -67.00742 ymax: 49.38323
CRS:           +proj=longlat +datum=NAD83 +no_defs
First 10 features:
                 ID                           geom   state
1   alabama,autauga MULTIPOLYGON (((-86.50517 3... alabama
2   alabama,baldwin MULTIPOLYGON (((-87.93757 3... alabama
3   alabama,barbour MULTIPOLYGON (((-85.42801 3... alabama
4      alabama,bibb MULTIPOLYGON (((-87.02083 3... alabama
5    alabama,blount MULTIPOLYGON (((-86.9578 33... alabama
6   alabama,bullock MULTIPOLYGON (((-85.66866 3... alabama
7    alabama,butler MULTIPOLYGON (((-86.8604 31... alabama
8   alabama,calhoun MULTIPOLYGON (((-85.74313 3... alabama
9  alabama,chambers MULTIPOLYGON (((-85.59416 3... alabama
10 alabama,cherokee MULTIPOLYGON (((-85.46812 3... alabama

6.2.2 Non-parallelized extraction

We have already learned in Chapter 5.3 that extracting values from stacked raster layers is faster than doing so from multiple single-layer raster datasets one at a time. Here, the daily precipitation datasets are stacked by year-month and saved as multi-layer GeoTIFF files. For example, PRISM_ppt_y2009_m1.tif stores the daily precipitation data for January 2009. This is how long it takes to extract values for US counties from one month of daily PRISM precipitation data.

tic()
temp <- exact_extract(stack("Data/PRISM/PRISM_ppt_y2009_m1.tif"), US_county, progress = F) 
toc()
elapsed 
 26.571 
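
The object temp is a list of data.frames, one per county, with one column per day (one per raster layer) plus a coverage_fraction column. As a minimal sketch (this step is not part of the timed code above), you could collapse it into daily coverage-weighted county averages like this:

#--- combine the per-county data.frames, keeping the county index ---#
daily_means <- rbindlist(temp, idcol = "county_id") %>% 
  #--- wide (one column per day) to long ---#
  melt(id.var = c("county_id", "coverage_fraction")) %>% 
  #--- coverage-weighted average by county and day ---#
  .[, .(ppt = sum(value * coverage_fraction)/sum(coverage_fraction)), by = .(county_id, variable)]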

Now, to process all the precipitation data from 2009 through 2018, we consider two approaches in this section:

  1. parallelize over the polygons (grouped into blocks) and loop over the year-months sequentially
  2. parallelize over year-month

6.2.3 Approach 1: parallelize over polygons and loop over year-month sequentially

For this approach, let’s measure the time spent on processing one year-month PRISM dataset and then extrapolate to estimate how long it would take to process all 120 year-month PRISM datasets.

#--- number of polygons in a group ---#
num_in_group <- ceiling(nrow(US_county)/num_cores)

#--- define group id ---#
US_county <- US_county %>%  
  mutate(
    #--- create polygon id ---#
    poly_id = 1:nrow(.),
    #--- assign group id (1 through num_cores) ---#
    group_id = (poly_id - 1) %/% num_in_group + 1
  )

extract_by_group <- function(i){
  temp_return <- exact_extract(
    stack("Data/PRISM/PRISM_ppt_y2009_m1.tif"), 
    filter(US_county, group_id == i)
    ) %>% 
    #--- combine the list of data.frames into one with polygon id ---#
    rbindlist(idcol = "id_within_group") %>% 
    #--- wide (one column per day) to long ---#
    melt(id.var = c("id_within_group", "coverage_fraction")) %>% 
    #--- calculate the coverage-weighted mean by polygon and day ---#
    .[, sum(value * coverage_fraction)/sum(coverage_fraction), by = .(id_within_group, variable)]

  return(temp_return)
}

tic()
temp <- mclapply(1:num_cores, extract_by_group, mc.cores = num_cores)
toc()
elapsed 
 83.694 

Okay, so this approach does not really help; it is actually slower than the non-parallelized extraction (83.694 versus 26.571 seconds for a single year-month). If we are to process all 10 years of daily PRISM data, then it would take roughly 167.39 minutes (83.694 seconds × 120 year-months).
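
For reference, here is a minimal sketch (not run) of what the full Approach 1 workflow would look like: a sequential loop over year-months, with the extraction for each year-month parallelized over the polygon groups. It assumes a generalized version of extract_by_group() that takes the raster file path as an argument.

#--- a generalized extract_by_group() that takes the raster file path as an argument ---#
extract_by_group_from_file <- function(i, file_path){
  exact_extract(
    stack(file_path), 
    filter(US_county, group_id == i)
    ) %>% 
    rbindlist(idcol = "id_within_group") %>% 
    melt(id.var = c("id_within_group", "coverage_fraction")) %>% 
    .[, sum(value * coverage_fraction)/sum(coverage_fraction), by = .(id_within_group, variable)]
}

#--- loop over year-months sequentially, parallelizing over polygon groups within each ---#
results_approach_1 <- list()

for (y in 2009:2018){
  for (m in 1:12){
    file_path <- paste0("Data/PRISM/PRISM_ppt_y", y, "_m", m, ".tif")
    results_approach_1[[paste0(y, "_", m)]] <- mclapply(
        1:num_cores, 
        function(g) extract_by_group_from_file(g, file_path), 
        mc.cores = num_cores
      ) %>% 
      rbindlist()
  }
}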

6.2.4 Approach 2: parallelize over the temporal dimension (year-month)

Instead of parallelizing over polygons, let’s parallelize over time (year-month). To do so, we first create a data.frame that has all the year-month combinations we will work on.

(
month_year_data <- expand.grid(month  = 1:12, year = 2009:2018) %>% 
  data.table()
)

The following function extracts data for a single year-month:

get_prism_by_month <- function(i, vector){

  temp_month <- month_year_data[i, month] # month to work on
  temp_year <- month_year_data[i, year] # year to work on

  #--- import raster data ---#
  temp_raster <- stack(paste0("Data/PRISM/PRISM_ppt_y", temp_year, "_m", temp_month, ".tif")) 

  temp <- exact_extract(temp_raster, vector) %>% 
    #--- combine the extraction results into one data.frame ---#
    rbindlist(idcol = "row_id") %>% 
    #--- wide to long ---#
    melt(id.var = c("row_id", "coverage_fraction")) %>% 
    #--- find coverage-weighted average ---#
    .[, sum(value*coverage_fraction)/sum(coverage_fraction), by = .(row_id, variable)]

  #--- clean up memory used in this iteration ---#
  gc()

  return(temp)

}

We then loop over the rows of month_year_data in parallel.

tic()
temp <- mclapply(1:nrow(month_year_data), function(x) get_prism_by_month(x, US_county), mc.cores = num_cores)
toc()
elapsed 
451.477 

It took 7.52 minutes (451.477 seconds) to process all 120 year-months. So, Approach 2 is the clear winner.
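
The object temp is a list of 120 data.tables, one per year-month. As a minimal sketch of an assumed post-processing step (not part of the timed code), you can combine them into one long data.table and attach the corresponding year and month:

#--- combine the per-year-month results into one data.table ---#
all_ppt <- rbindlist(temp, idcol = "ym_id")

#--- attach the year and month each result corresponds to ---#
all_ppt[, `:=`(
  month = month_year_data[ym_id, month],
  year = month_year_data[ym_id, year]
)]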

6.2.5 Memory consideration

So far, we have paid no attention to the memory footprint of the parallelized processes. But, it is crucial when parallelizing many large datasets. Approaches 1 and 2 differ substantially in their memory footprints.

Approach 1 divides the polygons into groups and parallelizes over those groups when extracting raster values, so at any given time only about one year-month’s worth of raster values is held in memory across all the cores combined. Approach 2, on the other hand, extracts and holds raster values for all the U.S. polygons for as many year-months as there are cores (here, 15) at once. So, Approach 1 clearly has a lesser memory footprint. Approach 2 used about 40 GB of memory, almost maxing out my computer’s 64 GB of RAM (it is not just R and C++ that consume RAM at that time). As long as you do not go over the limit, that is perfectly fine, and Approach 2 is definitely the better option for me. However, if I had only 32 GB of RAM, Approach 2 would have suffered a significant loss in performance, while Approach 1 would not have. Likewise, if the raster data had twice as many cells over the same spatial extent, Approach 2 would have suffered a significant loss in performance, while Approach 1 would not have.
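
A rough back-of-the-envelope calculation helps make this concrete; the layer count and core count below are assumptions for illustration:

#--- rough memory estimate for Approach 2 (numbers are illustrative assumptions) ---#
n_cells <- ncell(brick("Data/PRISM/PRISM_ppt_y2009_m1.tif")) #--- cells per layer ---#
n_layers <- 31 #--- daily layers in one month ---#
n_cores <- 15 #--- cores extracting simultaneously ---#

#--- approximate GB of cell values held in memory at once (8 bytes per double) ---#
n_cells * n_layers * n_cores * 8 / 1e9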

It is easy to come up with a case where Approach 1 is preferable. For example, suppose you have multiple 10 GB raster files and your computer has 16 GB of RAM. Then Approach 2 clearly does not work, and Approach 1 is your only choice, which is still better than not parallelizing at all.

In summary, while letting each core process a larger amount of data can speed things up, you need to be careful not to exceed the RAM limit of your computer.
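
One practical way to stay within that limit is to cap the number of cores based on a rough per-task memory estimate. A minimal sketch with assumed numbers:

#--- assumed numbers: memory held by one year-month extraction and a RAM budget ---#
per_task_gb <- 2.7
ram_budget_gb <- 48

#--- cap the number of cores so the combined footprint stays within budget ---#
usable_cores <- min(num_cores, floor(ram_budget_gb / per_task_gb))

temp <- mclapply(
  1:nrow(month_year_data), 
  function(x) get_prism_by_month(x, US_county), 
  mc.cores = usable_cores
)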