The objectives of this chapter is to learn spatial interactive operations that involves two sf
objects. Specifically,
sf
object on another sf
object to extract (or join) values from the sf
objectClick on the three horizontally stacked lines at the bottom left corner of the slide, then you will see table of contents, and you can jump to the section you want
Hit letter “o” on your keyboard and you will have a panel view of all the slides
Spatial join involves all of the following:
Definitions: target layer
The sf
layer that has sfg
s to which you would like to assign the values of attributes from the source layer.
Definitions: source layer
The sf
layer that has sfg
s from which you would like to assign the values of its attributes to the target layer.
We can classify spatial join into four categories by the type of the underlying spatial objects:
Among the four, our focus here is the first case (the second case will be discussed later).
We will not cover the third and fourth cases in this course because it is almost always the case that our target data is a vector data (e.g., city or farm fields as points, political boundaries as polygons, etc).
As noted earlier, we will look at vector-vector interactions in this lecture.
This category can be further broken down into different sub categories depending on the type of spatial object (point, line, and polygon).
Here, we will ignore any spatial joins that involve lines. This is because objects represented by lines are rarely observation units in our analysis nor the source data from which we will extract values.
Here is the list of the types of spatial joins we will learn.
What?
For each of the observations (points) in the target points data,
How?
In order to achieve this, we can use the st_join()
function, whose syntax is as follows:
Note
Similar to spatial sub-setting, the default topological relation is st_intersects()
We use wells_ne
(points) and ne_counties
(polygons) data for illustration. Here is the map:
For each of the points (wells) in wells_ne
, the code below will find the polygon (county) in which the point (well) is located, and attach the value of the variables of the polygon (county) to the point (well).
Evaluate wells_joined_with_county
and you will see that statefp
, countyfp
, and name
variables are appended.
Let’s check if the two datasets are indeed joined based on their spatial locations.
What?
For each of the observations (polygons) in the target data,
How?
In order to achieve this, we can use the st_join()
function, whose syntax is as follows:
Note
Similar to spatial sub-setting, the default topological relation is st_intersects()
We use ne_counties
(polygons) and wells_ne
(points) data for illustration. Here is the map:
We create a fake variable that represents groundwater extraction (acre-feet) from the aquifer.
Our goal is to find average groundwater extraction by county.
For each of the polygons (counties) in ne_counties
, the code below will find all the points (wells) that are located inside the county, and attach the value of the variables of the points (wells) to the polygon (county).
Evaluate county_joined_with_wells
and you will see that wellid
and gw_extraction
variables are appended.
One thing that is different from the previous case is that
For each of the polygons (counties), the resulting dataset has as many observations as the number of wells that intersect with the polygon (county).
If a polygon has no wells inside, then you will simply have a single row of data for that polygon.
For example, countyfp
of 039 and 109 (first two rows), there are no wells inside them. So, we only have a single row with wellid
and gw_extraction
missing. But, for countyfp
of 129, we have many wells inside it.
All the rows there have exactly the same geometry, which is the MULTIPOLYGON
that represents the boundary of the Adams county.
Since we joined the two layers, we can now do calculations that were not possible before. Here, we will calculate the average groundwater extraction by county.
dplyr::summarize()
takes a long time if it is applied to an sf
object (this was not the case before).as.data.frame()
converts county_joined_with_wells
into a data.frame
and that saves lots of time in doing dplyr::summarize()
.Of course, it is just as easy to get other types of statistics by simply modifying the summarize()
part.
For each of the observations (polygons) in the target data,
Nitrogen use (lb/acre) by county in Iowa (Note: this is a fake dataset that is generated using R):
Hydrologic units that cover Iowa:
You are interested in understanding the impact of nitrogen use for agricultural production on water quality.
huc_ia
)ia_nitrogen
)You would like to associate nitrogen use values with water quality values so that you can run statistical analysis on the impact of corn production on water quality.
Let’s join the two:
Here, for each of the HUC units from huc_ia
, all the intersecting counties from ia_nitrogen
are matched.
For example, for HUC_CODE == "07060004"
, seven counties intersect with it.
The geometry column of the four rows has exactly the same geometry, which represents the HUC unit with HUC_CODE == 07060004
.
We can now find average nitrogen use (lb/acre) by HUC unit:
Important
Note that the resulting dataset does not tell you the nature of intersections. The only thing we know from huc_joined_with_acres
is which counties the HUC units are intersecting with no matter how small or large the overlapping area are.
We simply take the average of the value of nitrogen_rate
of the intersecting counties. But, this does not take into account the degree of the overlaps between the intersecting counties and the HUC unit. Later, we will talk about how to find area-weighted average of nitrogen_rate
this section.
Spatial join with spatial_join()
uses st_intersects()
as the default topological relationship for joining.
Syntax
where st_*
is the function that determines the topological relationships between sf_1
and sf_2
.
We use soy bean yield (points) data and as-applied seed rate (points) data.
You have run on-farm field experiment to understand the impact of seed rate on soybean yield. They are available as points data.
You want to merge them together based on their proximity so that you can run statistical analysis. Specifically, for each of the yield points, we would like to link the seed rate points that are within 10-meter from the yield point.
Let’s join using st_join()
:
We can now summarize the joined data like below:
According to the join, the 1st yield point from soy_yield
did not have any seed rate data points from as_applied_s_rate
within its 10 meter radius, so NA
in seed_rate
.n
The second yield point from soy_yield
are matched with two seed rate points from as_applied_s_rate
: seed_id
= 1 and seed_id
= 558. Are they indeed less than 10 meter away from the second yield point?
aggregate()
In the example of finding average groundwater use by county (go here to remind yourself of this example), we took a two-step procedure to do so.
st_join()
dplyr::summarize()
to the joined objectHowever, this can actually be done in one step using aggregate()
, in which you specify how you want to aggregate with the FUN
option:
Syntax
Here, for each of the rows of the second sf
(here, polygons_sf
), all the intersecting points in points_sf
are found and then the average of all the columns of points_sf
are calculated. Yes, st_intersects()
is the default topological relationship just like spatial sub-setting with sf1[sf2, ]
and st_join()
.
Note that wellid
was also averaged by county. We could just do this:
aggregate()
is a fairly general procedure of spatial joining and summarization, and you can use it for many other cases including our example of soybean yield and seed rate (go here for the example).
Here is the code:
This looks almost identical with the code to spatial join the two sf
layers. However, the order of the sf
objects is flipped.
You are aggregating the columns of first sf
for each of the geometry in the second sf
(here yield point).
In the example of finding total corn production acreage by HUC unit using st_join()
, we had a problem of not knowing how much of each of the intersecting counties shares the same area with the HUC unit.
If we can get the geometry of the intersecting part of the HUC unit and the county, then we can calculate its area, which in turn allows us to find area-weighted averages of joined attributes.
For these purposes, we can use sf::st_intersection()
.
While st_intersects()
returns the indices of intersecting objects, st_intersection()
returns intersecting spatial objects with the non-intersecting parts of the sf
objects cut out.
Moreover, attribute values of the source sf
will be merged to its intersecting sfg
in the target sf
.
The following code gets the intersection of the lines and the polygons.
Here is how the lines and polygons look like:
Here is how the intersections and polygons look like.
The following code gets the intersection of polygon 1 and polygon 3 with polygon 2. Each instance of the intersections of polygons 1 and 3 against polygon 2 becomes an observation (polygon 1-polygon 2
and polygon 3-polygon 2
).
Just like the lines-polygons case, the non-intersecting part of polygons 1 and 3 are cut out and do not remain in the returned sf
.
Let’s now get back to the example of HUC units and county-level nitrogen use data. We would like to find area-weighted average of nitrogen use instead of the simple average.
Using st_intersection()
, for each of the HUC polygons, we can find the intersecting counties, and then divide it into parts based on the boundary of the intersecting polygons.
The key difference from the st_join()
example is that each observation of the returned data is a unique HUC-county intersection.
The figure below is a map of all the intersections of the HUC unit with HUC_CODE == 07060004
and the seven intersecting counties.
We use poygon_1
and polygon_2
in this exercise. Inspect them to familiarize yourself with them:
Here is what they look like:
Find the intersection of the two polygons using st_intersection()
.
Answer