09-4-2: R as GIS: Interaction of Vector Datasets II

Before you start


Learning objectives

The objectives of this chapter is to learn spatial interactive operations that involves two sf objects. Specifically,

  • overlay an sf object on another sf object to extract (or join) values from the sf object


Tips to make the most of the lecture notes


Interactive navigation tools

  • Click on the three horizontally stacked lines at the bottom left corner of the slide, then you will see table of contents, and you can jump to the section you want

  • Hit letter “o” on your keyboard and you will have a panel view of all the slides


Running and writing codes

  • The box area with a hint of blue as the background color is where you can write code (hereafter referred to as the “code area”).
  • Hit the “Run Code” button to execute all the code inside the code area.
  • You can evaluate (run) code selectively by highlighting the parts you want to run and hitting Command + Enter for Mac (Ctrl + Enter for Windows).
  • If you want to run the codes on your computer, you can first click on the icon with two sheets of paper stacked on top of each other (top right corner of the code chunk), which copies the code in the code area. You can then paste it onto your computer.
  • You can click on the reload button (top right corner of the code chunk, left to the copy button) to revert back to the original code.

Spatial join

Spatial join involves all of the following:

  • overlay one spatial layer (target layer) onto another spatial layer (source layer)
  • for each of the observation in the target layer
    • identify which objects in the source layer it geographically intersects (or a different topological relation) with
    • extract values associated with the intersecting objects in the source layer (and summarize if necessary),
    • assign the extracted value to the object in the target layer

Definitions: target layer

The sf layer that has sfgs to which you would like to assign the values of attributes from the source layer.

Definitions: source layer

The sf layer that has sfgs from which you would like to assign the values of its attributes to the target layer.

We can classify spatial join into four categories by the type of the underlying spatial objects:

  • vector-vector: vector data (target) against vector data (source)
  • vector-raster: vector data (target) against raster data (source)
  • raster-vector: raster data (target) against vector data (source)
  • raster-raster: raster data (target) against raster data (source)

Among the four, our focus here is the first case (the second case will be discussed later).

We will not cover the third and fourth cases in this course because it is almost always the case that our target data is a vector data (e.g., city or farm fields as points, political boundaries as polygons, etc).

As noted earlier, we will look at vector-vector interactions in this lecture.

This category can be further broken down into different sub categories depending on the type of spatial object (point, line, and polygon).

Here, we will ignore any spatial joins that involve lines. This is because objects represented by lines are rarely observation units in our analysis nor the source data from which we will extract values.

Here is the list of the types of spatial joins we will learn.

  1. points (target) against polygons (source)
  2. polygons (target) against points (source)
  3. polygons (target) against polygons (source)

Spatial join: points (target) vs polygons (source)

What?

For each of the observations (points) in the target points data,

  • finds which polygon in the source polygons data it intersects with
  • assigns the value associated with the intersected polygon to the point

How?

In order to achieve this, we can use the st_join() function, whose syntax is as follows:

st_join(points_sf, polygons_sf)


Note

Similar to spatial sub-setting, the default topological relation is st_intersects()

We use wells_ne (points) and ne_counties (polygons) data for illustration. Here is the map:

For each of the points (wells) in wells_ne, the code below will find the polygon (county) in which the point (well) is located, and attach the value of the variables of the polygon (county) to the point (well).

Evaluate wells_joined_with_county and you will see that statefp, countyfp, and name variables are appended.

Let’s check if the two datasets are indeed joined based on their spatial locations.

Spatial join: polygons (target) vs points (source)

What?

For each of the observations (polygons) in the target data,

  • find which points in the source file it intersects
  • assign the values associated with the points to the polygon.

How?

In order to achieve this, we can use the st_join() function, whose syntax is as follows:

st_join(polygons_sf, points_sf)


Note

Similar to spatial sub-setting, the default topological relation is st_intersects()

We use ne_counties (polygons) and wells_ne (points) data for illustration. Here is the map:


We create a fake variable that represents groundwater extraction (acre-feet) from the aquifer.


Our goal is to find average groundwater extraction by county.

For each of the polygons (counties) in ne_counties, the code below will find all the points (wells) that are located inside the county, and attach the value of the variables of the points (wells) to the polygon (county).

Evaluate county_joined_with_wells and you will see that wellid and gw_extraction variables are appended.


One thing that is different from the previous case is that

  • For each of the polygons (counties), the resulting dataset has as many observations as the number of wells that intersect with the polygon (county).

  • If a polygon has no wells inside, then you will simply have a single row of data for that polygon.

For example, countyfp of 039 and 109 (first two rows), there are no wells inside them. So, we only have a single row with wellid and gw_extraction missing. But, for countyfp of 129, we have many wells inside it.


All the rows there have exactly the same geometry, which is the MULTIPOLYGON that represents the boundary of the Adams county.

Since we joined the two layers, we can now do calculations that were not possible before. Here, we will calculate the average groundwater extraction by county.


  • dplyr::summarize() takes a long time if it is applied to an sf object (this was not the case before).
  • as.data.frame() converts county_joined_with_wells into a data.frame and that saves lots of time in doing dplyr::summarize().

Of course, it is just as easy to get other types of statistics by simply modifying the summarize() part.

Spatial join: polygons (target) vs polygons (source)

For each of the observations (polygons) in the target data,

  • find which observations (polygons) in the source data it intersects
  • then, assign the values associated with the intersecting polygons from the source data to the polygon

Nitrogen use (lb/acre) by county in Iowa (Note: this is a fake dataset that is generated using R):


Hydrologic units that cover Iowa:

You are interested in understanding the impact of nitrogen use for agricultural production on water quality.

  • You observe water quality for each of the hydrologic units (huc_ia)
  • You observe nitrogen use (lb/acre) at the county level (ia_nitrogen)

You would like to associate nitrogen use values with water quality values so that you can run statistical analysis on the impact of corn production on water quality.

Let’s join the two:


Here, for each of the HUC units from huc_ia, all the intersecting counties from ia_nitrogen are matched.

For example, for HUC_CODE == "07060004", seven counties intersect with it.


The geometry column of the four rows has exactly the same geometry, which represents the HUC unit with HUC_CODE == 07060004.

We can now find average nitrogen use (lb/acre) by HUC unit:

Important

Note that the resulting dataset does not tell you the nature of intersections. The only thing we know from huc_joined_with_acres is which counties the HUC units are intersecting with no matter how small or large the overlapping area are.



We simply take the average of the value of nitrogen_rate of the intersecting counties. But, this does not take into account the degree of the overlaps between the intersecting counties and the HUC unit. Later, we will talk about how to find area-weighted average of nitrogen_rate this section.

Spatial Join: other topological relations

Spatial join with spatial_join() uses st_intersects() as the default topological relationship for joining.

Syntax

st_join(sf_1, sf_2, join = \(x, y) st_*(x, y), dist = 5)

where st_* is the function that determines the topological relationships between sf_1 and sf_2.

We use soy bean yield (points) data and as-applied seed rate (points) data.

You have run on-farm field experiment to understand the impact of seed rate on soybean yield. They are available as points data.


You want to merge them together based on their proximity so that you can run statistical analysis. Specifically, for each of the yield points, we would like to link the seed rate points that are within 10-meter from the yield point.

Let’s join using st_join():


We can now summarize the joined data like below:


  • According to the join, the 1st yield point from soy_yield did not have any seed rate data points from as_applied_s_rate within its 10 meter radius, so NA in seed_rate.n

  • The second yield point from soy_yield are matched with two seed rate points from as_applied_s_rate: seed_id = 1 and seed_id = 558. Are they indeed less than 10 meter away from the second yield point?


Spatial join and summary in one step with aggregate()

In the example of finding average groundwater use by county (go here to remind yourself of this example), we took a two-step procedure to do so.

  • spatial-join two layers with st_join()
  • apply dplyr::summarize() to the joined object

However, this can actually be done in one step using aggregate(), in which you specify how you want to aggregate with the FUN option:


Syntax

aggregate(points_sf, polygons_sf, FUN = function)


Here, for each of the rows of the second sf (here, polygons_sf), all the intersecting points in points_sf are found and then the average of all the columns of points_sf are calculated. Yes, st_intersects() is the default topological relationship just like spatial sub-setting with sf1[sf2, ] and st_join().


Note that wellid was also averaged by county. We could just do this:

aggregate() is a fairly general procedure of spatial joining and summarization, and you can use it for many other cases including our example of soybean yield and seed rate (go here for the example).

Here is the code:


  • This looks almost identical with the code to spatial join the two sf layers. However, the order of the sf objects is flipped.

  • You are aggregating the columns of first sf for each of the geometry in the second sf (here yield point).

Cropping Join

  • In the example of finding total corn production acreage by HUC unit using st_join(), we had a problem of not knowing how much of each of the intersecting counties shares the same area with the HUC unit.

  • If we can get the geometry of the intersecting part of the HUC unit and the county, then we can calculate its area, which in turn allows us to find area-weighted averages of joined attributes.

For these purposes, we can use sf::st_intersection().

  • While st_intersects() returns the indices of intersecting objects, st_intersection() returns intersecting spatial objects with the non-intersecting parts of the sf objects cut out.

  • Moreover, attribute values of the source sf will be merged to its intersecting sfg in the target sf.

The following code gets the intersection of the lines and the polygons.


Here is how the lines and polygons look like:

Here is how the intersections and polygons look like.

The following code gets the intersection of polygon 1 and polygon 3 with polygon 2. Each instance of the intersections of polygons 1 and 3 against polygon 2 becomes an observation (polygon 1-polygon 2 and polygon 3-polygon 2).


Just like the lines-polygons case, the non-intersecting part of polygons 1 and 3 are cut out and do not remain in the returned sf.

Let’s now get back to the example of HUC units and county-level nitrogen use data. We would like to find area-weighted average of nitrogen use instead of the simple average.

Using st_intersection(), for each of the HUC polygons, we can find the intersecting counties, and then divide it into parts based on the boundary of the intersecting polygons.


The key difference from the st_join() example is that each observation of the returned data is a unique HUC-county intersection.

The figure below is a map of all the intersections of the HUC unit with HUC_CODE == 07060004 and the seven intersecting counties.

We use poygon_1 and polygon_2 in this exercise. Inspect them to familiarize yourself with them:


Here is what they look like:

Find the intersection of the two polygons using st_intersection().


Answer

Code
intersection <- st_intersection(polygon_2, polygon_1)

Find the area-weighted average of value for polygon_2 based on the area of overlaps with the polygons in polygon_1?


Answer

Code
intersection %>%
  dplyr::mutate(area = as.numeric(st_area(geometry))) %>%
  dplyr::summarize(value = sum(value * area) / sum(area))