2 Vector Data Handling with sf
Before you start
In this chapter we learn how to use the sf
package to handle and operate on spatial datasets. The sf
package uses the class of simple feature (sf
)35 for spatial objects in R. We first learn how sf
objects store and represent spatial datasets. We then move on to the following practical topics:
- read and write a shapefile and spatial data in other formats (and why you might not want to use the shapefile system any more, but use other alternative formats)
- project and reproject spatial objects
- convert
sf
objects intosp
objects, vice versa - confirm that
dplyr
works well withsf
objects - implement non-interactive (does not involve two
sf
objects) geometric operations onsf
objects- create buffers
- find the area of polygons
- find the centroid of polygons
- calculate the length of lines
sf
or sp
?
The sf
package was designed to replace the sp
package, which has been one of the most popular and powerful spatial packages in R for more than a decade. It has been about four years since the sf
package was first registered on CRAN. A couple of years back, many other spatial packages did not have support for the package yet. In this blog post the author responded to the questions of whether one should learn sp
or sf
saying,
“That’s a tough question. If you have time, I would say, learn to use both. sf is pretty new, so a lot of packages that depend on spatial classes still rely on sp. So you will need to know sp if you want to do any integration with many other packages, including raster (as of March 2018).
However, in the future we should see an increasing shift toward the sf package and greater use of sf classes in other packages. I also think that sf is easier to learn to use than sp.”
The future has come, and it’s not a tough question anymore. I cannot think of any major spatial packages that do not support sf
package, and sf
has largely becomes the standard for handling vector data in \(R\)36. Thus, this lecture note does not cover how to use sp
at all.
sf
has several advantages over the sp
package (Pebesma 2018).37 First, it cut off the tie that sp
had with ESRI shapefile system, which has a somewhat loose way of representing spatial data. Instead, it uses simple feature access, which is an open standard supported by Open Geospatial Consortium (OGC). Another important benefit is its compatibility with the tidyverse
package, which includes widely popular packages like ggplot2
and dplyr
. Consequently, map-making with ggplot()
and data wrangling with a family of dplyr
functions come very natural to many \(R\) users. sp
objects have different slots for spatial information and attributes data, and they are not amenable to dplyr
way of data transformation.
Direction for replication
Datasets
All the datasets that you need to import are available here. In this chapter, the path to files is set relative to my own working directory (which is hidden). To run the codes without having to mess with paths to the files, follow these steps:
- set a folder (any folder) as the working directory using
setwd()
- create a folder called “Data” inside the folder designated as the working directory (if you have created a “Data” folder to replicate demonstrations in Chapter 1, then skip this step)
- download the pertinent datasets from here
- place all the files in the downloaded folder in the “Data” folder
Packages
Run the following code to install or load (if already installed) the pacman
package, and then install or load (if already installed) the listed package inside the pacman::p_load()
function.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
sf, # vector data operations
dplyr, # data wrangling
data.table, # data wrangling
tmap, # make maps
mapview, # create an interactive map
patchwork # arranging maps
)
2.1 Spatial Data Structure
Here we learn how the sf
package stores spatial data along with the definition of three key sf
object classes: simple feature geometry (sfg
), simple feature geometry list-column (sfc
), and simple feature (sf
). The sf
package provides a simply way of storing geographic information and the attributes of the geographic units in a single dataset. This special type of dataset is called simple feature (sf
). It is best to take a look at an example to see how this is achieved. We use North Carolina county boundaries with county attributes (Figure 2.1).
#--- a dataset that comes with the sf package ---#
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
Reading layer `nc' from data source
`/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/sf/shape/nc.shp'
using driver `ESRI Shapefile'
Simple feature collection with 100 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
Geodetic CRS: NAD27
As you can see below, this dataset is of class sf
(and data.frame
at the same time).
class(nc)
[1] "sf" "data.frame"
Now, let’s take a look inside of nc
.
#--- take a look at the data ---#
head(nc)
Simple feature collection with 6 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -81.74107 ymin: 36.07282 xmax: -75.77316 ymax: 36.58965
Geodetic CRS: NAD27
AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74
1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1
2 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0
3 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5
4 0.070 2.968 1831 1831 Currituck 37053 37053 27 508 1
5 0.153 2.206 1832 1832 Northampton 37131 37131 66 1421 9
6 0.097 1.670 1833 1833 Hertford 37091 37091 46 1452 7
NWBIR74 BIR79 SID79 NWBIR79 geometry
1 10 1364 0 19 MULTIPOLYGON (((-81.47276 3...
2 10 542 3 12 MULTIPOLYGON (((-81.23989 3...
3 208 3616 6 260 MULTIPOLYGON (((-80.45634 3...
4 123 830 2 145 MULTIPOLYGON (((-76.00897 3...
5 1066 1606 3 1197 MULTIPOLYGON (((-77.21767 3...
6 954 1838 5 1237 MULTIPOLYGON (((-76.74506 3...
Just like a regular data.frame
, you see a number of variables (attributes) except that you have a variable called geometry
at the end. Each row represents a single geographic unit (here, county). Ashe County (1st row) has area of \(0.114\), FIPS code of \(37009\), and so on. And the entry in geometry
column at the first row represents the geographic information of Ashe County. An entry in the geometry
column is a simple feature geometry (sfg
), which is an \(R\) object that represents the geographic information of a single geometric feature (county in this example). There are different types of sfg
s (POINT
, LINESTRING
, POLYGON
, MULTIPOLYGON
, etc). Here, sfg
s representing counties in NC are of type MULTIPOLYGON
. Let’s take a look inside the sfg
for Ashe County using st_geometry()
.
st_geometry(nc[1, ])[[1]][[1]]
[[1]]
[,1] [,2]
[1,] -81.47276 36.23436
[2,] -81.54084 36.27251
[3,] -81.56198 36.27359
[4,] -81.63306 36.34069
[5,] -81.74107 36.39178
[6,] -81.69828 36.47178
[7,] -81.70280 36.51934
[8,] -81.67000 36.58965
[9,] -81.34530 36.57286
[10,] -81.34754 36.53791
[11,] -81.32478 36.51368
[12,] -81.31332 36.48070
[13,] -81.26624 36.43721
[14,] -81.26284 36.40504
[15,] -81.24069 36.37942
[16,] -81.23989 36.36536
[17,] -81.26424 36.35241
[18,] -81.32899 36.36350
[19,] -81.36137 36.35316
[20,] -81.36569 36.33905
[21,] -81.35413 36.29972
[22,] -81.36745 36.27870
[23,] -81.40639 36.28505
[24,] -81.41233 36.26729
[25,] -81.43104 36.26072
[26,] -81.45289 36.23959
[27,] -81.47276 36.23436
As you can see, the sfg
consists of a number of points (pairs of two numbers). Connecting the points in the order they are stored delineates the Ashe County boundary.
plot(st_geometry(nc[1, ]))
We will take a closer look at different types of sfg
in the next section.
Finally, the geometry
variable is a list of individual sfg
s, called simple feature geometry list-column (sfc
).
dplyr::select(nc, geometry)
Simple feature collection with 100 features and 0 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
Geodetic CRS: NAD27
First 10 features:
geometry
1 MULTIPOLYGON (((-81.47276 3...
2 MULTIPOLYGON (((-81.23989 3...
3 MULTIPOLYGON (((-80.45634 3...
4 MULTIPOLYGON (((-76.00897 3...
5 MULTIPOLYGON (((-77.21767 3...
6 MULTIPOLYGON (((-76.74506 3...
7 MULTIPOLYGON (((-76.00897 3...
8 MULTIPOLYGON (((-76.56251 3...
9 MULTIPOLYGON (((-78.30876 3...
10 MULTIPOLYGON (((-80.02567 3...
Elements of a geometry list-column are allowed to be different in nature from other elements38. In the nc
data, all the elements (sfg
s) in geometry
column are MULTIPOLYGON
. However, you could also have LINESTRING
or POINT
objects mixed with MULTIPOLYGONS
objects in a single sf
object if you would like.
2.2 Simple feature geometry, simple feature geometry list-column, and simple feature
Here, we learn how different types of sfg
are constructed. We also learn how to create sfc
and sf
from sfg
from scratch.39
2.2.1 Simple feature geometry (sfg
)
The sf
package uses a class of sfg
(simple feature geometry) objects to represent a geometry of a single geometric feature (say, a city as a point, a river as a line, county and school district as polygons). There are different types of sfg
s. Here are some example feature types that we commonly encounter as an economist40:
-
POINT
: area-less feature that represents a point (e.g., well, city, farmland) -
LINESTRING
: (e.g., a tributary of a river) -
MULTILINESTRING
: (e.g., river with more than one tributary) -
POLYGON
: geometry with a positive area (e.g., county, state, country) -
MULTIPOLYGON
: collection of polygons to represent a single object (e.g., countries with islands: U.S., Japan)
POINT
is the simplest geometry type and is represented by a vector of two41 numeric values. An example below shows how a POINT
feature can be made from scratch:
#--- create a POINT ---#
a_point <- st_point(c(2, 1))
The st_point()
function creates a POINT
object when supplied with a vector of two numeric values. If you check the class of the newly created object,
#--- check the class of the object ---#
class(a_point)
[1] "XY" "POINT" "sfg"
you can see that it’s indeed a POINT
object. But, it’s also an sfg
object. So, a_point
is an sfg
object of type POINT
.
A LINESTRING
objects are represented by a sequence of points:
#--- collection of points in a matrix form ---#
s1 <- rbind(c(2, 3), c(3, 4), c(3, 5), c(1, 5))
#--- see what s1 looks like ---#
s1
[,1] [,2]
[1,] 2 3
[2,] 3 4
[3,] 3 5
[4,] 1 5
#--- create a "LINESTRING" ---#
a_linestring <- st_linestring(s1)
#--- check the class ---#
class(a_linestring)
[1] "XY" "LINESTRING" "sfg"
s1
is a matrix where each row represents a point. By applying st_linestring()
function to s1
, you create a LINESTRING
object. Let’s see what the line looks like.
plot(a_linestring)
As you can see, each pair of consecutive points in the matrix are connected by a straight line to form a line.
A POLYGON
is very similar to LINESTRING
in the manner it is represented.
#--- collection of points in a matrix form ---#
p1 <- rbind(c(0, 0), c(3, 0), c(3, 2), c(2, 5), c(1, 3), c(0, 0))
#--- see what s1 looks like ---#
p1
[,1] [,2]
[1,] 0 0
[2,] 3 0
[3,] 3 2
[4,] 2 5
[5,] 1 3
[6,] 0 0
#--- create a "POLYGON" ---#
a_polygon <- st_polygon(list(p1))
#--- check the class ---#
class(a_polygon)
[1] "XY" "POLYGON" "sfg"
#--- see what it looks like ---#
plot(a_polygon)
Just like the LINESTRING
object we created earlier, a POLYGON
is represented by a collection of points. The biggest difference between them is that we need to have some positive area enclosed by lines connecting the points. To do that, you have the the same point for the first and last points to close the loop: here, it’s c(0,0)
. A POLYGON
can have holes in it. The first matrix of a list becomes the exterior ring, and all the subsequent matrices will be holes within the exterior ring.
#--- a hole within p1 ---#
p2 <- rbind(c(1, 1), c(1, 2), c(2, 2), c(1, 1))
#--- create a polygon with hole ---#
a_plygon_with_a_hole <- st_polygon(list(p1, p2))
#--- see what it looks like ---#
plot(a_plygon_with_a_hole)
You can create a MULTIPOLYGON
object in a similar manner. The only difference is that you supply a list of lists of matrices, with each inner list representing a polygon. An example below:
#--- second polygon ---#
p3 <- rbind(c(4, 0), c(5, 0), c(5, 3), c(4, 2), c(4, 0))
#--- create a multipolygon ---#
a_multipolygon <- st_multipolygon(list(list(p1, p2), list(p3)))
#--- see what it looks like ---#
plot(a_multipolygon)
Each of list(p1,p2)
and list(p3)
represents a polygon. You supply a list of these lists to the st_multipolygon()
function to make a MULTIPOLYGON
object.
2.2.2 Create simple feature geometry list-column (sfc
) and simple feature (sf
) from scratch
To make a simple feature geometry list-column (sfc
), you can simply supply a list of sfg
to the st_sfc()
function as follows:
#--- create an sfc ---#
sfc_ex <- st_sfc(list(a_point, a_linestring, a_polygon, a_multipolygon))
To create an sf
object, you first add an sfc
as a column to a data.frame
.
#--- create a data.frame ---#
df_ex <- data.frame(name = c("A", "B", "C", "D"))
#--- add the sfc as a column ---#
df_ex$geometry <- sfc_ex
#--- take a look ---#
df_ex
name geometry
1 A POINT (2 1)
2 B LINESTRING (2 3, 3 4, 3 5, ...
3 C POLYGON ((0 0, 3 0, 3 2, 2 ...
4 D MULTIPOLYGON (((0 0, 3 0, 3...
At this point, it is not yet recognized as an sf
by R yet.
#--- see what it looks like (this is not an sf object yet) ---#
class(df_ex)
[1] "data.frame"
You can register it as an sf
object using st_as_sf()
.
#--- let R recognize the data frame as sf ---#
sf_ex <- st_as_sf(df_ex)
#--- see what it looks like ---#
sf_ex
Simple feature collection with 4 features and 1 field
Geometry type: GEOMETRY
Dimension: XY
Bounding box: xmin: 0 ymin: 0 xmax: 5 ymax: 5
CRS: NA
name geometry
1 A POINT (2 1)
2 B LINESTRING (2 3, 3 4, 3 5, ...
3 C POLYGON ((0 0, 3 0, 3 2, 2 ...
4 D MULTIPOLYGON (((0 0, 3 0, 3...
As you can see sf_ex
is now recognized also as an sf
object.
#--- check the class ---#
class(sf_ex)
[1] "sf" "data.frame"
2.3 Reading and writing vector data
The vast majority of people still use ArcGIS to handle spatial data, which has its own system of storing spatial data42 called shapefile. So, chances are that your collaborators use shapefiles. Moreover, there are many GIS data online that are available only as shapefiles. So, it is important to learn how to read and write shapefiles.
2.3.1 Reading a shapefile
We can use st_read()
to read a shapefile. It reads in a shapefile and then turn the data into an sf object. Let’s take a look at an example.
#--- read a NE county boundary shapefile ---#
nc_loaded <- st_read(dsn = "Data", "nc")
Typically, you have two arguments to specify for st_read()
. The first one is dsn
, which is basically the path to folder in which the shapefile you want to import is stored. The second one is the name of the shapefile. Notice that you do not add .shp
extension to the file name: nc
, not nc.shp
.43.
2.3.2 Writing to a shapefile
Writing an sf
object as a shapefile is just as easy. You use the st_write()
function, with the first argument being the name of the sf
object you are exporting, and the second being the name of the new shapefile. For example, the code below will export an sf
object called nc_loaded
as nc2.shp
(along with other supporting files).
st_write(
nc_loaded,
dsn = "Data",
layer = "nc2",
driver = "ESRI Shapefile",
append = FALSE
)
append = FALSE
forces writing the data when a file already exists with the same name. Without the option, this happens.
st_write(
nc_loaded,
dsn = "Data",
layer = "nc2",
driver = "ESRI Shapefile"
)
Layer nc2 in dataset Data already exists:
use either append=TRUE to append to layer or append=FALSE to overwrite layer
Error in eval(expr, envir, enclos): Dataset already exists.
2.3.3 Better alternatives
Now, if your collaborator is using ArcGIS and demanding that he/she needs a shapefile for his/her work, sure you can use the above command to write a shapefile. But, there is really no need to work with the shapefile system. One of the alternative data formats that is considered superior to the shapefile system is GeoPackage44, which overcomes various limitations associated with shapefile45. Unlike the shapefile system, it produces only a single file with .gpkg extension.46 Note that GeoPackage files can also be easily read into ArcGIS. So, it might be worthwhile to convince your collaborators to stop using shapefiles and start using GeoPackage.
#--- write as a gpkg file ---#
st_write(nc, dsn = "Data/nc.gpkg", append = FALSE)
#--- read a gpkg file ---#
nc <- st_read("Data/nc.gpkg")
Or better yet, if your collaborator uses R (or if it is only you who is going to use the data), then just save it as an rds file using saveRDS()
, which can be of course read using readRDS()
.
#--- save as an rds ---#
saveRDS(nc, "Data/nc_county.rds")
#--- read an rds ---#
nc <- readRDS("Data/nc_county.rds")
The use of rds files can be particularly attractive when the dataset is large because rds files are typically more memory efficient than shapefiles, eating up less of your disk memory.
As you can see here, it is a myth that spatial datasets have to be stored as shapefiles.
2.4 Projection with a different Coordinate Reference Systems
You often need to reproject an sf
using a different coordinate reference system (CRS) because you need it to have the same CRS as an sf
object that you are interacting it with (spatial join) or mapping it with. In order to check the current CRS for an sf
object, you can use the st_crs()
function.
st_crs(nc)
Coordinate Reference System:
User input: NAD27
wkt:
GEOGCRS["NAD27",
DATUM["North American Datum 1927",
ELLIPSOID["Clarke 1866",6378206.4,294.978698213898,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["latitude",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["longitude",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4267]]
wkt
stands for Well Known Text47, which is one of many many formats to store CRS information.48 4267 is the SRID (Spatial Reference System Identifier) defined by the European Petroleum Survey Group (EPSG) for the CRS49.
When you transform your sf
using a different CRS, you can use its EPSG number if the CRS has an EPSG number.50 Let’s transform the sf
to WGS 84
(another commonly used GCS), whose EPSG number is 4326. We can use the st_transform()
function to achieve that, with the first argument being the sf
object you are transforming and the second being the EPSG number of the new CRS.
#--- transform ---#
nc_wgs84 <- st_transform(nc, 4326)
#--- check if the transformation was successful ---#
st_crs(nc_wgs84)
Coordinate Reference System:
User input: EPSG:4326
wkt:
GEOGCRS["WGS 84",
ENSEMBLE["World Geodetic System 1984 ensemble",
MEMBER["World Geodetic System 1984 (Transit)"],
MEMBER["World Geodetic System 1984 (G730)"],
MEMBER["World Geodetic System 1984 (G873)"],
MEMBER["World Geodetic System 1984 (G1150)"],
MEMBER["World Geodetic System 1984 (G1674)"],
MEMBER["World Geodetic System 1984 (G1762)"],
MEMBER["World Geodetic System 1984 (G2139)"],
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ENSEMBLEACCURACY[2.0]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic latitude (Lat)",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["geodetic longitude (Lon)",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
USAGE[
SCOPE["Horizontal component of 3D system."],
AREA["World."],
BBOX[-90,-180,90,180]],
ID["EPSG",4326]]
Notice that wkt
was also altered accordingly to reflect the change in CRS: datum was changed to WGS 84. Now, let’s transform (reproject) the data using NAD83 / UTM zone 17N
CRS. Its EPSG number is \(26917\).51 So, the following code does the job.
#--- transform ---#
nc_utm17N <- st_transform(nc_wgs84, 26917)
#--- check if the transformation was successful ---#
st_crs(nc_utm17N)
Coordinate Reference System:
User input: EPSG:26917
wkt:
PROJCRS["NAD83 / UTM zone 17N",
BASEGEOGCRS["NAD83",
DATUM["North American Datum 1983",
ELLIPSOID["GRS 1980",6378137,298.257222101,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4269]],
CONVERSION["UTM zone 17N",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",0,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",-81,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",0.9996,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",500000,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",0,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Engineering survey, topographic mapping."],
AREA["North America - between 84°W and 78°W - onshore and offshore. Canada - Nunavut; Ontario; Quebec. United States (USA) - Florida; Georgia; Kentucky; Maryland; Michigan; New York; North Carolina; Ohio; Pennsylvania; South Carolina; Tennessee; Virginia; West Virginia."],
BBOX[23.81,-84,84,-78]],
ID["EPSG",26917]]
As you can see in its CRS information, the projection system is now UTM zone 17N.
You often need to change the CRS of an sf
object when you interact (e.g., spatial subsetting, joining, etc) it with another sf
object. In such a case, you can extract the CRS of the other sf
object using st_crs()
and use it for transformation.52 So, you do not need to find the EPSG of the CRS of the sf
object you are interacting it with.
#--- transform ---#
nc_utm17N_2 <- st_transform(nc_wgs84, st_crs(nc_utm17N))
#--- check if the transformation was successful ---#
st_crs(nc_utm17N_2)
Coordinate Reference System:
User input: EPSG:26917
wkt:
PROJCRS["NAD83 / UTM zone 17N",
BASEGEOGCRS["NAD83",
DATUM["North American Datum 1983",
ELLIPSOID["GRS 1980",6378137,298.257222101,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4269]],
CONVERSION["UTM zone 17N",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",0,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",-81,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",0.9996,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",500000,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",0,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Engineering survey, topographic mapping."],
AREA["North America - between 84°W and 78°W - onshore and offshore. Canada - Nunavut; Ontario; Quebec. United States (USA) - Florida; Georgia; Kentucky; Maryland; Michigan; New York; North Carolina; Ohio; Pennsylvania; South Carolina; Tennessee; Virginia; West Virginia."],
BBOX[23.81,-84,84,-78]],
ID["EPSG",26917]]
2.5 Quick and interactive view of an sf
object
2.5.1 Quick view
The easiest way to visualize an sf
object is to use plot()
:
plot(nc)
As you can see, plot()
create a map for each variable where the spatial units are color-differentiated based on the values of the variable. For creating more elaborate maps that are of publication-quality, see Chapter 8.
2.5.2 Interactive view
Sometimes it is useful to be able to tell where certain spatial objects are and what values are associated with them on a map. The tmap_leaflet()
function from the tmap
package can create an interactive map where you can point to a spatial object and the associated information is revealed on the map. Let’s use the North Carolina county map as an example here:
As you can see, if you put your cursor on a polygon (county) and click on it, then its information pops up.
Alternatively, you could use the tmap
package to create interactive maps. You can first create a static map following a syntax like this:
#--- NOT RUN (for polygons) ---#
tm_shape(sf) +
tm_polygons()
#--- NOT RUN (for points) ---#
tm_shape(sf) +
tm_symbols()
This creates a static map of nc
where county boundaries are drawn:
(
tm_nc_polygons <- tm_shape(nc) + tm_polygons()
)
Then, you can apply tmap_leaflet()
to the static map to have an interactive view of the map:
tmap_leaflet(tm_nc_polygons)
You could also change the view mode of tmap
objects to the view
mode using tmap_mode("view")
and then simply evaluate tm_nc_polygons
.
#--- change to the "view" mode ---#
tmap_mode("view")
#--- now you have an interactive biew ---#
tm_nc_polygons
Note that once you change the view mode to “view”, then the evaluation of all tmap
objects become interactive. I prefer the first option, as I need to revert the view mode back to “plot” by tmap_mode("plot")
if I don’t want interactive views.
2.6 Turning a data.frame of points into an sf
Often times, you have a dataset with geographic coordinates as variables in a csv or other formats, which would not be recognized as a spatial dataset by R immediately when it is read into R. In this case, you need to identify which variables represent the geographic coordinates from the data set, and create an sf
yourself. Fortunately, it is easy to do so using the st_as_sf()
function. Let’s first read a dataset (irrigation wells in Nebraska):
#--- read irrigation well registration data ---#
(
wells <- readRDS("Data/well_registration.rds")
)
Key: <wellid>
wellid ownerid nrdname acres regdate section longdd
<int> <int> <char> <num> <char> <int> <num>
1: 2 106106 Central Platte 160 12/30/55 3 -99.58401
2: 3 14133 South Platte 46 4/29/31 8 -102.62495
3: 4 14133 South Platte 46 4/29/31 8 -102.62495
4: 5 14133 South Platte 46 4/29/31 8 -102.62495
5: 6 15837 Central Platte 160 8/29/32 20 -99.62580
---
105818: 244568 135045 Upper Big Blue NA 8/26/16 30 -97.58872
105819: 244569 105428 Little Blue NA 8/26/16 24 -97.60752
105820: 244570 135045 Upper Big Blue NA 8/26/16 30 -97.58294
105821: 244571 135045 Upper Big Blue NA 8/26/16 25 -97.59775
105822: 244572 105428 Little Blue NA 8/26/16 15 -97.64086
latdd
<num>
1: 40.69825
2: 41.11699
3: 41.11699
4: 41.11699
5: 40.73268
---
105818: 40.89017
105819: 40.13257
105820: 40.88722
105821: 40.89639
105822: 40.13380
#--- check the class ---#
class(wells)
[1] "data.table" "data.frame"
As you can see the data is not an sf
object. In this dataset, longdd
and latdd
represent longitude and latitude, respectively. We now turn the dataset into an sf
object:
#--- recognize it as an sf ---#
wells_sf <- st_as_sf(wells, coords = c("longdd", "latdd"))
#--- take a look at the data ---#
head(wells_sf[, 1:5])
Simple feature collection with 6 features and 5 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -102.6249 ymin: 40.69824 xmax: -99.58401 ymax: 41.11699
CRS: NA
wellid ownerid nrdname acres regdate geometry
1 2 106106 Central Platte 160 12/30/55 POINT (-99.58401 40.69825)
2 3 14133 South Platte 46 4/29/31 POINT (-102.6249 41.11699)
3 4 14133 South Platte 46 4/29/31 POINT (-102.6249 41.11699)
4 5 14133 South Platte 46 4/29/31 POINT (-102.6249 41.11699)
5 6 15837 Central Platte 160 8/29/32 POINT (-99.6258 40.73268)
6 7 90248 Central Platte 120 2/15/35 POINT (-99.64524 40.73164)
Note that the CRS of wells_sf
is NA. Obviously, \(R\) does not know the reference system without you telling it. We know53 that the geographic coordinates in the wells data is NAD 83 (\(epsg=4269\)) for this dataset. So, we can assign the right CRS using either st_set_crs()
or st_crs()
.
#--- set CRS ---#
wells_sf <- st_set_crs(wells_sf, 4269)
#--- or this ---#
st_crs(wells_sf) <- 4269
#--- see the change ---#
head(wells_sf[, 1:5])
Simple feature collection with 6 features and 5 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -102.6249 ymin: 40.69824 xmax: -99.58401 ymax: 41.11699
Geodetic CRS: NAD83
wellid ownerid nrdname acres regdate geometry
1 2 106106 Central Platte 160 12/30/55 POINT (-99.58401 40.69825)
2 3 14133 South Platte 46 4/29/31 POINT (-102.6249 41.11699)
3 4 14133 South Platte 46 4/29/31 POINT (-102.6249 41.11699)
4 5 14133 South Platte 46 4/29/31 POINT (-102.6249 41.11699)
5 6 15837 Central Platte 160 8/29/32 POINT (-99.6258 40.73268)
6 7 90248 Central Platte 120 2/15/35 POINT (-99.64524 40.73164)
2.7 Conversion to and from sp objects
You may find instances where sp
objects are necessary or desirable.54 In that case, it is good to know how to convert an sf
object to an sp
object, vice versa. You can convert an sf
object to its sp
counterpart using as(sf_object, "Spatial")
:
#--- conversion ---#
wells_sp <- as(wells_sf, "Spatial")
#--- check the class ---#
class(wells_sp)
[1] "SpatialPointsDataFrame"
attr(,"package")
[1] "sp"
As you can see wells_sp
is a class of SpatialPointsDataFrame
, points with a data.frame
supported by the sp
package. The above syntax works for converting an sf
of polygons into SpatialPolygonsDataFrame
as well55.
You can revert wells_sp
back to an sf
object using the st_as_sf()
function, as follows:
#--- revert back to sf ---#
wells_sf <- st_as_sf(wells_sp)
#--- check the class ---#
class(wells_sf)
[1] "sf" "data.frame"
We do not cover how to use the sp
package as the benefit of learning it has become marginal compared to when sf
was just introduced a few years back56.
2.8 Non-spatial transformation of sf
2.8.1 Using dplyr
An important feature of an sf
object is that it is basically a data.frame
with geometric information stored as a variable (column). This means that transforming an sf
object works just like transforming a data.frame
. Basically, everything you can do to a data.frame
, you can do to an sf
as well. The code below just provides an example of basic operations including dplyr::select()
, dplyr::filter()
, and dplyr::mutate()
in action with an sf
object to just confirm that dplyr
operations works with an sf
object just like a data.frame
.
#--- here is what the data looks like ---#
dplyr::select(wells_sf, wellid, nrdname, acres, regdate, nrdname)
Simple feature collection with 105822 features and 4 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -104.0531 ymin: 40.00161 xmax: -96.87681 ymax: 41.85942
Geodetic CRS: NAD83
First 10 features:
wellid nrdname acres regdate geometry
1 2 Central Platte 160 12/30/55 POINT (-99.58401 40.69825)
2 3 South Platte 46 4/29/31 POINT (-102.6249 41.11699)
3 4 South Platte 46 4/29/31 POINT (-102.6249 41.11699)
4 5 South Platte 46 4/29/31 POINT (-102.6249 41.11699)
5 6 Central Platte 160 8/29/32 POINT (-99.6258 40.73268)
6 7 Central Platte 120 2/15/35 POINT (-99.64524 40.73164)
7 8 South Platte 113 8/7/37 POINT (-103.5257 41.24492)
8 10 South Platte 160 5/4/38 POINT (-103.0284 41.13243)
9 11 Middle Republican 807 5/6/38 POINT (-101.1193 40.3527)
10 12 Middle Republican 148 11/29/77 POINT (-101.1146 40.35631)
Notice that geometry
column will be retained after dplyr::select()
even if you did not tell R to keep it above.
Let’s apply dplyr::select()
, dplyr::filter()
, and dplyr::mutate()
to the dataset.
#--- do some transformations ---#
wells_sf %>%
#--- select variables (geometry will always remain after select) ---#
dplyr::select(wellid, nrdname, acres, regdate, nrdname) %>%
#--- removes observations with acre < 30 ---#
dplyr::filter(acres > 30) %>%
#--- hectare instead of acre ---#
dplyr::mutate(hectare = acres * 0.404686)
Simple feature collection with 63271 features and 5 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -104.0529 ymin: 40.00161 xmax: -96.87681 ymax: 41.73599
Geodetic CRS: NAD83
First 10 features:
wellid nrdname acres regdate geometry hectare
1 2 Central Platte 160 12/30/55 POINT (-99.58401 40.69825) 64.74976
2 3 South Platte 46 4/29/31 POINT (-102.6249 41.11699) 18.61556
3 4 South Platte 46 4/29/31 POINT (-102.6249 41.11699) 18.61556
4 5 South Platte 46 4/29/31 POINT (-102.6249 41.11699) 18.61556
5 6 Central Platte 160 8/29/32 POINT (-99.6258 40.73268) 64.74976
6 7 Central Platte 120 2/15/35 POINT (-99.64524 40.73164) 48.56232
7 8 South Platte 113 8/7/37 POINT (-103.5257 41.24492) 45.72952
8 10 South Platte 160 5/4/38 POINT (-103.0284 41.13243) 64.74976
9 11 Middle Republican 807 5/6/38 POINT (-101.1193 40.3527) 326.58160
10 12 Middle Republican 148 11/29/77 POINT (-101.1146 40.35631) 59.89353
Now, let’s try to get a summary of a variable by group using the group_by()
and summarize()
functions. Here, we use only the first 100 observations because dplyr::summarize()
takes just too long.
#--- summary by group ---#
wells_by_nrd <-
wells_sf[1:100, ] %>%
#--- group by nrdname ---#
dplyr::group_by(nrdname) %>%
#--- summarize ---#
dplyr::summarize(tot_acres = sum(acres, na.rm = TRUE))
#--- take a look ---#
wells_by_nrd
Simple feature collection with 8 features and 2 fields
Geometry type: MULTIPOINT
Dimension: XY
Bounding box: xmin: -103.6467 ymin: 40.0475 xmax: -97.89758 ymax: 41.24492
Geodetic CRS: NAD83
# A tibble: 8 × 3
nrdname tot_acres geometry
<chr> <dbl> <MULTIPOINT [°]>
1 Central Platte 6690. ((-100.174 40.95732), (-100.1032 41.01639), (-99.…
2 Lower Republican 350 ((-99.30403 40.05116), (-99.238 40.0475), (-99.21…
3 Middle Republican 1396 ((-101.1663 40.36351), (-101.1289 40.37464), (-10…
4 South Platte 1381 ((-103.6467 41.2375), (-103.6224 41.23759), (-103…
5 Tri-Basin 480 ((-98.94967 40.64285), (-98.94906 40.65061), (-98…
6 Twin Platte 1098. ((-101.9384 41.07452), (-101.9384 41.0709), (-101…
7 Upper Big Blue 330 ((-97.89998 40.86336), (-97.89758 40.86156))
8 Upper Republican 480 ((-101.8858 40.47776), (-101.6501 40.6012))
So, we can summarize an sf
by group using dplyr::group_by()
and dplyr::summarize()
. One interesting change that happened is the geometry variable. Each NRD now has multipoint
sfg, which is the combination of all the wells (points) located inside the NRD. Now, it is hard to imagine that you need summarized
geometries after group summary. Moreover, it is a very slow operation. If you have lots of free time, try running the above code with wells_sf
instead of wells_sf[1:100, ]
. I never waited for it to finish as it was running for a long long time. It is advised that you simply drop the geometry and turn the sf
object into a data.frame
(or tibble
, data.table
) first and then do group summary.
#--- remove geometry ---#
wells_no_longer_sf <- st_drop_geometry(wells_sf)
#--- take a look ---#
head(wells_no_longer_sf)
wellid ownerid nrdname acres regdate section
1 2 106106 Central Platte 160 12/30/55 3
2 3 14133 South Platte 46 4/29/31 8
3 4 14133 South Platte 46 4/29/31 8
4 5 14133 South Platte 46 4/29/31 8
5 6 15837 Central Platte 160 8/29/32 20
6 7 90248 Central Platte 120 2/15/35 19
We can now do a group summary very quickly:
wells_no_longer_sf %>%
#--- group by nrdname ---#
dplyr::group_by(nrdname) %>%
#--- summarize ---#
dplyr::summarize(tot_acres = sum(acres, na.rm = TRUE))
# A tibble: 9 × 2
nrdname tot_acres
<chr> <dbl>
1 Central Platte 1890918.
2 Little Blue 995900.
3 Lower Republican 543079.
4 Middle Republican 443472.
5 South Platte 216109.
6 Tri-Basin 847058.
7 Twin Platte 452678.
8 Upper Big Blue 1804782.
9 Upper Republican 551906.
2.8.2 Using data.table
The data.table
package provides data wrangling options that are extremely fast (see here for various benchmark results). It particularly shines when datasets are large and is much faster than dplyr
. However, it cannot be as naturally integrated into the workflow involving sf
objects as dplyr
can. Let’s convert an sf
object of points into a data.table
object using data.table()
.
#--- convert an sf to data.table ---#
(
wells_dt <- data.table(wells_sf)
)
wellid ownerid nrdname acres regdate section
<int> <int> <char> <num> <char> <int>
1: 2 106106 Central Platte 160 12/30/55 3
2: 3 14133 South Platte 46 4/29/31 8
3: 4 14133 South Platte 46 4/29/31 8
4: 5 14133 South Platte 46 4/29/31 8
5: 6 15837 Central Platte 160 8/29/32 20
---
105818: 244568 135045 Upper Big Blue NA 8/26/16 30
105819: 244569 105428 Little Blue NA 8/26/16 24
105820: 244570 135045 Upper Big Blue NA 8/26/16 30
105821: 244571 135045 Upper Big Blue NA 8/26/16 25
105822: 244572 105428 Little Blue NA 8/26/16 15
geometry
<sfc_POINT>
1: POINT (-99.58401 40.69825)
2: POINT (-102.6249 41.11699)
3: POINT (-102.6249 41.11699)
4: POINT (-102.6249 41.11699)
5: POINT (-99.6258 40.73268)
---
105818: POINT (-97.58872 40.89017)
105819: POINT (-97.60752 40.13257)
105820: POINT (-97.58294 40.88722)
105821: POINT (-97.59775 40.89639)
105822: POINT (-97.64086 40.1338)
#--- check the class ---#
class(wells_dt)
[1] "data.table" "data.frame"
You see that wells_dt
is no longer an sf
object, but the geometry column still remains in the data.
When you convert an sf
of polygons into a data.table
, then the geometry column appears to have lost the geometry information as all the entries are just <XY[number]>
.
(
nc_dt <- data.table(nc) %>% head()
)
AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74
<num> <num> <num> <num> <char> <char> <num> <int> <num> <num>
1: 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1
2: 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0
3: 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5
4: 0.070 2.968 1831 1831 Currituck 37053 37053 27 508 1
5: 0.153 2.206 1832 1832 Northampton 37131 37131 66 1421 9
6: 0.097 1.670 1833 1833 Hertford 37091 37091 46 1452 7
NWBIR74 BIR79 SID79 NWBIR79 geometry
<num> <num> <num> <num> <sfc_MULTIPOLYGON>
1: 10 1364 0 19 MULTIPOLYGON (((-81.47276 3...
2: 10 542 3 12 MULTIPOLYGON (((-81.23989 3...
3: 208 3616 6 260 MULTIPOLYGON (((-80.45634 3...
4: 123 830 2 145 MULTIPOLYGON (((-76.00897 3...
5: 1066 1606 3 1197 MULTIPOLYGON (((-77.21767 3...
6: 954 1838 5 1237 MULTIPOLYGON (((-76.74506 3...
But, the geometry information is still there as you can see from this:
#--- take a look at what's inside the geometry column ---#
nc_dt$geometry
Geometry set for 6 features
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
Geodetic CRS: NAD27
First 5 geometries:
#--- check the class of the geometry column ---#
class(nc_dt$geometry)
[1] "sfc_MULTIPOLYGON" "sfc"
If you try to run sf
operations on a data.table
obejct with a geometry colum, it will of course give you an error. Like this
st_buffer(nc_dt, dist = 2)
Error in UseMethod("st_buffer"): no applicable method for 'st_buffer' applied to an object of class "c('data.table', 'data.frame')"
However, you can apply sf
spatial operations only on the geometry like this:
nc_dt$geometry %>%
st_buffer(dist = 2) %>%
head()
Geometry set for 6 features
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -83.74064 ymin: 34.0729 xmax: -73.77316 ymax: 38.58965
Geodetic CRS: NAD27
First 5 geometries:
Finally, it is easy to revert a data.table
object back to an sf
object again by using the st_as_sf()
function.
#--- wells ---#
(
wells_sf_again <- st_as_sf(wells_dt)
)
Simple feature collection with 105822 features and 6 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -104.0531 ymin: 40.00161 xmax: -96.87681 ymax: 41.85942
Geodetic CRS: NAD83
First 10 features:
wellid ownerid nrdname acres regdate section
1 2 106106 Central Platte 160 12/30/55 3
2 3 14133 South Platte 46 4/29/31 8
3 4 14133 South Platte 46 4/29/31 8
4 5 14133 South Platte 46 4/29/31 8
5 6 15837 Central Platte 160 8/29/32 20
6 7 90248 Central Platte 120 2/15/35 19
7 8 48113 South Platte 113 8/7/37 28
8 10 17073 South Platte 160 5/4/38 2
9 11 98432 Middle Republican 807 5/6/38 36
10 12 79294 Middle Republican 148 11/29/77 31
geometry
1 POINT (-99.58401 40.69825)
2 POINT (-102.6249 41.11699)
3 POINT (-102.6249 41.11699)
4 POINT (-102.6249 41.11699)
5 POINT (-99.6258 40.73268)
6 POINT (-99.64524 40.73164)
7 POINT (-103.5257 41.24492)
8 POINT (-103.0284 41.13243)
9 POINT (-101.1193 40.3527)
10 POINT (-101.1146 40.35631)
#--- nc polygons ---#
(
nc_sf_again <- st_as_sf(nc_dt)
)
Simple feature collection with 6 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
Geodetic CRS: NAD27
AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74
1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1
2 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0
3 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5
4 0.070 2.968 1831 1831 Currituck 37053 37053 27 508 1
5 0.153 2.206 1832 1832 Northampton 37131 37131 66 1421 9
6 0.097 1.670 1833 1833 Hertford 37091 37091 46 1452 7
NWBIR74 BIR79 SID79 NWBIR79 geometry
1 10 1364 0 19 MULTIPOLYGON (((-81.47276 3...
2 10 542 3 12 MULTIPOLYGON (((-81.23989 3...
3 208 3616 6 260 MULTIPOLYGON (((-80.45634 3...
4 123 830 2 145 MULTIPOLYGON (((-76.00897 3...
5 1066 1606 3 1197 MULTIPOLYGON (((-77.21767 3...
6 954 1838 5 1237 MULTIPOLYGON (((-76.74506 3...
So, this means that if you need fast data transformation, you can first turn an sf
to a data.table
, transform the data using the data.table
functionality, and then revert back to sf
. However, for most economists, the geometry variable itself is not of interest in the sense that it never enters econometric models. For most of us, the geographic information contained in the geometry variable is just a glue to tie two datasets together by geographic referencing. Once we get values of spatial variables of interest, there is no point in keeping your data as an sf object. Personally, whenever I no longer need to carry around the geometry variable, I immediately turn an sf object into a data.table
for fast data transformation especially when the data is large.
Those who know the dtplyr
package (it takes advantage of the speed of data.table
while you can keep using dplyr
syntax and functions) may wonder if it works well with sf
objects. Nope:
library(dtplyr)
#--- convert an "lazy" data.table ---#
wells_ldt <- lazy_dt(wells_sf)
#--- try ---#
st_buffer(wells_ldt, dist = 2)
Error in UseMethod("st_buffer"): no applicable method for 'st_buffer' applied to an object of class "c('dtplyr_step_first', 'dtplyr_step')"
By the way, this package is awesome if you really love dplyr
, but want the speed of data.table
. dtplyr
is of course slightly slower than data.table
because internal translations of dplyr
language to data.table
language have to happen first.57
2.9 Non-interactive geometrical operations
There are various geometrical operations that are particularly useful for economists. Here, some of the most commonly used geometrical operations are introduced58. You can see the practical use of some of these functions in Chapter ??.
2.9.1 st_buffer
st_buffer()
creates a buffer around points, lines, or the border of polygons.
Let’s create buffers around points. First, we read well locations data.
#--- read wells location data ---#
urnrd_wells_sf <-
readRDS("Data/urnrd_wells.rds") %>%
#--- project to UTM 14N WGS 84 ---#
st_transform(32614)
Here is the spatial distribution of the wells (Figure 2.4).
tm_shape(urnrd_wells_sf) +
tm_symbols(col = "red", size = 0.1) +
tm_layout(frame = FALSE)
Let’s create buffers around the wells.
#--- create a one-mile buffer around the wells ---#
wells_buffer <- st_buffer(urnrd_wells_sf, dist = 1600)
As you can see, there are many circles around wells with the radius of \(1,600\) meters (Figure 2.5).
tm_shape(wells_buffer) +
tm_polygons(alpha = 0) +
tm_shape(urnrd_wells_sf) +
tm_symbols(col = "red", size = 0.1) +
tm_layout(frame = NA)
A practical application of buffer creation can be seen in Chapter 1.1.
We now create buffers around polygons. First, read NE county boundary data and select three counties (Chase, Dundy, and Perkins).
NE_counties <-
readRDS("Data/NE_county_borders.rds") %>%
filter(NAME %in% c("Perkins", "Dundy", "Chase")) %>%
st_transform(32614)
Here is what they look like (Figure 2.6):
tm_shape(NE_counties) +
tm_polygons("NAME", palette = "RdYlGn", contrast = .3, title = "County") +
tm_layout(frame = NA)
The following code creates buffers around polygons (see the results in Figure 2.7):
NE_buffer <- st_buffer(NE_counties, dist = 2000)
tm_shape(NE_buffer) +
tm_polygons(col = "blue", alpha = 0.2) +
tm_shape(NE_counties) +
tm_polygons("NAME", palette = "RdYlGn", contrast = .3, title = "County") +
tm_layout(
legend.outside = TRUE,
frame = FALSE
)
For example, this can be useful to identify observations which are close to the border of political boundaries when you want to take advantage of spatial discontinuity of policies across adjacent political boundaries.
2.9.2 st_area
The st_area()
function calculates the area of polygons.
#--- generate area by polygon ---#
(
NE_counties <- mutate(NE_counties, area = st_area(NE_counties))
)
Simple feature collection with 3 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 239494.1 ymin: 4430632 xmax: 310778.1 ymax: 4543676
Projected CRS: WGS 84 / UTM zone 14N
STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME LSAD ALAND
1 31 135 00835889 0500000US31135 31135 Perkins 06 2287828025
2 31 029 00835836 0500000US31029 31029 Chase 06 2316533447
3 31 057 00835850 0500000US31057 31057 Dundy 06 2381956151
AWATER geometry area
1 2840176 MULTIPOLYGON (((243340.2 45... 2302174854 [m^2]
2 7978172 MULTIPOLYGON (((241201.4 44... 2316908196 [m^2]
3 3046331 MULTIPOLYGON (((240811.3 44... 2389890530 [m^2]
Now, as you can see below, the default class of the results of st_area()
is units
, which does not accept numerical operations.
class(NE_counties$area)
[1] "units"
So, let’s turn it into double.
(
NE_counties <- mutate(NE_counties, area = as.numeric(area))
)
Simple feature collection with 3 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 239494.1 ymin: 4430632 xmax: 310778.1 ymax: 4543676
Projected CRS: WGS 84 / UTM zone 14N
STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME LSAD ALAND
1 31 135 00835889 0500000US31135 31135 Perkins 06 2287828025
2 31 029 00835836 0500000US31029 31029 Chase 06 2316533447
3 31 057 00835850 0500000US31057 31057 Dundy 06 2381956151
AWATER geometry area
1 2840176 MULTIPOLYGON (((243340.2 45... 2302174854
2 7978172 MULTIPOLYGON (((241201.4 44... 2316908196
3 3046331 MULTIPOLYGON (((240811.3 44... 2389890530
st_area()
is useful when you want to find area-weighted average of characteristics after spatially joining two polygon layers using the st_intersection()
function (See Chapter 3.3.3).
2.9.3 st_centroid
The st_centroid()
function finds the centroid of each polygon.
#--- create centroids ---#
(
NE_centroids <- st_centroid(NE_counties)
)
Simple feature collection with 3 features and 10 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 271156.7 ymin: 4450826 xmax: 276594.1 ymax: 4525635
Projected CRS: WGS 84 / UTM zone 14N
STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME LSAD ALAND
1 31 135 00835889 0500000US31135 31135 Perkins 06 2287828025
2 31 029 00835836 0500000US31029 31029 Chase 06 2316533447
3 31 057 00835850 0500000US31057 31057 Dundy 06 2381956151
AWATER geometry area
1 2840176 POINT (276594.1 4525635) 2302174854
2 7978172 POINT (271469.9 4489429) 2316908196
3 3046331 POINT (271156.7 4450826) 2389890530
Here’s the map of the output (Figure 2.8).
tm_shape(NE_counties) +
tm_polygons() +
tm_shape(NE_centroids) +
tm_symbols(size = 0.5) +
tm_layout(
legend.outside = TRUE,
frame = FALSE
)
It can be useful when creating a map with labels because the centroid of polygons tend to be a good place to place labels (Figure 2.9).59
tm_shape(NE_counties) +
tm_polygons() +
tm_shape(NE_centroids) +
tm_text("NAME") +
tm_layout(
legend.outside = TRUE,
frame = FALSE
)
It may be also useful when you need to calculate the “distance” between polygons.
2.9.4 st_length
We can use st_length()
to calculate great circle distances60 of LINESTRING
and MULTILINESTRING
when they are represented in geodetic coordinates. On the other hand, if they are projected and use a Cartesian coordinate system, it will calculate Euclidean distance. We use U.S. railroad data for a demonstration.
#--- import US railroad data and take only the first 10 of it ---#
(
a_railroad <- rail_roads <- st_read(dsn = "Data", layer = "tl_2015_us_rails")[1:10, ]
)
Reading layer `tl_2015_us_rails' from data source
`/Users/tmieno2/Dropbox/TeachingUNL/R_as_GIS/Data' using driver `ESRI Shapefile'
Simple feature collection with 180958 features and 3 fields
Geometry type: MULTILINESTRING
Dimension: XY
Bounding box: xmin: -165.4011 ymin: 17.95174 xmax: -65.74931 ymax: 65.00006
Geodetic CRS: NAD83
Simple feature collection with 10 features and 3 fields
Geometry type: MULTILINESTRING
Dimension: XY
Bounding box: xmin: -79.74031 ymin: 35.0571 xmax: -79.2377 ymax: 35.51776
Geodetic CRS: NAD83
LINEARID FULLNAME MTFCC geometry
1 11020239500 Norfolk Southern Rlwy R1011 MULTILINESTRING ((-79.47058...
2 11020239501 Norfolk Southern Rlwy R1011 MULTILINESTRING ((-79.46687...
3 11020239502 Norfolk Southern Rlwy R1011 MULTILINESTRING ((-79.6682 ...
4 11020239503 Norfolk Southern Rlwy R1011 MULTILINESTRING ((-79.46687...
5 11020239504 Norfolk Southern Rlwy R1011 MULTILINESTRING ((-79.74031...
6 11020239575 Seaboard Coast Line RR R1011 MULTILINESTRING ((-79.43695...
7 11020239576 Seaboard Coast Line RR R1011 MULTILINESTRING ((-79.47852...
8 11020239577 Seaboard Coast Line RR R1011 MULTILINESTRING ((-79.43695...
9 11020239589 Aberdeen and Rockfish RR R1011 MULTILINESTRING ((-79.38736...
10 11020239591 Aberdeen and Briar Patch RR R1011 MULTILINESTRING ((-79.53848...
#--- check CRS ---#
st_crs(a_railroad)
Coordinate Reference System:
User input: NAD83
wkt:
GEOGCRS["NAD83",
DATUM["North American Datum 1983",
ELLIPSOID["GRS 1980",6378137,298.257222101,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["latitude",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["longitude",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4269]]
It uses geodetic coordinate system. Let’s calculate the great circle distance of the lines (Chapter ?? for a practical use case of this function).
(
a_railroad <- mutate(a_railroad, length = st_length(a_railroad))
)
Simple feature collection with 10 features and 4 fields
Geometry type: MULTILINESTRING
Dimension: XY
Bounding box: xmin: -79.74031 ymin: 35.0571 xmax: -79.2377 ymax: 35.51776
Geodetic CRS: NAD83
LINEARID FULLNAME MTFCC geometry
1 11020239500 Norfolk Southern Rlwy R1011 MULTILINESTRING ((-79.47058...
2 11020239501 Norfolk Southern Rlwy R1011 MULTILINESTRING ((-79.46687...
3 11020239502 Norfolk Southern Rlwy R1011 MULTILINESTRING ((-79.6682 ...
4 11020239503 Norfolk Southern Rlwy R1011 MULTILINESTRING ((-79.46687...
5 11020239504 Norfolk Southern Rlwy R1011 MULTILINESTRING ((-79.74031...
6 11020239575 Seaboard Coast Line RR R1011 MULTILINESTRING ((-79.43695...
7 11020239576 Seaboard Coast Line RR R1011 MULTILINESTRING ((-79.47852...
8 11020239577 Seaboard Coast Line RR R1011 MULTILINESTRING ((-79.43695...
9 11020239589 Aberdeen and Rockfish RR R1011 MULTILINESTRING ((-79.38736...
10 11020239591 Aberdeen and Briar Patch RR R1011 MULTILINESTRING ((-79.53848...
length
1 661.3381 [m]
2 657.4261 [m]
3 19982.5998 [m]
4 13888.3385 [m]
5 7194.7745 [m]
6 1061.2335 [m]
7 7824.0945 [m]
8 31756.9803 [m]
9 4547.1970 [m]
10 17103.0691 [m]
2.9.5 st_union and st_combine
While st_combine()
combines multiple sfg
s in a single sf
object into a single sfg
without resolving the internal borders, st_union()
does so with the internal borders resolved. Let’s take a look at what st_combine()
does first using NE_counties
.
NE_counties
has three features as you can see below.
NE_counties
Simple feature collection with 3 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 239494.1 ymin: 4430632 xmax: 310778.1 ymax: 4543676
Projected CRS: WGS 84 / UTM zone 14N
STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME LSAD ALAND
1 31 135 00835889 0500000US31135 31135 Perkins 06 2287828025
2 31 029 00835836 0500000US31029 31029 Chase 06 2316533447
3 31 057 00835850 0500000US31057 31057 Dundy 06 2381956151
AWATER geometry area
1 2840176 MULTIPOLYGON (((243340.2 45... 2302174854
2 7978172 MULTIPOLYGON (((241201.4 44... 2316908196
3 3046331 MULTIPOLYGON (((240811.3 44... 2389890530
After applying st_combine()
, the resulting object has only a single feature.
(
NE_counties_combined <- st_combine(NE_counties)
)
Geometry set for 1 feature
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 239494.1 ymin: 4430632 xmax: 310778.1 ymax: 4543676
Projected CRS: WGS 84 / UTM zone 14N
As you can see in the figure below, the internal borders did not get resolved. It is just that all the three counties are now recognized as a single feature rathern than three separate features.
ggplot(NE_counties_combined) +
geom_sf() +
theme_void()
On the other hand, st_union()
resovle the internal borders, while combining all the features (A use case of st_union()
can be seen in section 1.6).
(
NE_counties_union <- st_union(NE_counties)
)
Geometry set for 1 feature
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 239494.1 ymin: 4430632 xmax: 310778.1 ymax: 4543676
Projected CRS: WGS 84 / UTM zone 14N
ggplot(data = NE_counties_union) +
geom_sf() +
theme_void()
Note also that the resulting objects of st_combine()
and st_union()
are no logner sf
, rather they are sfc
, which do not retain the original variables in NE_counties
.
class(NE_counties_combined)
[1] "sfc_MULTIPOLYGON" "sfc"
class(NE_counties_union)
[1] "sfc_POLYGON" "sfc"
2.9.6 st_simplify and ms_simplify
Sometimes, a vector data is excessively detailed for your purpose. For example, when showing the relative location of Japan to the surrounding countries in a map, the spatial object representing Japan does not have to be very accurate in its coaslines, say at the meter level. In such cases, you can simplify the spatial object using st_simplify()
so that you can render the map much faster.
Here, we use IL County borders for illustration.
IL_counties <- tigris::counties(state = "Illinois", cb = TRUE, progress_bar = FALSE)
Here is the object size of IL_counties
.
object.size(IL_counties)
419672 bytes
Let’s plot this.
ggplot(IL_counties) +
geom_sf() +
theme_void()
However, if you focus on a single county, you can see that county borders in this data are fairly detaield. Let’s look at the Cook county.
Cook <- filter(IL_counties, NAME == "Cook")
(
g_cook_original <-
ggplot(Cook) +
geom_sf() +
coord_sf(lims_method = "geometry_bbox") +
theme_void()
)
You can see that its border with Lake Michigan has lots of non-linear segments. Let’s simplify this using st_simplify()
. We can use the dTolerance
option to specify the degree of simplification.
Cook_simplify <- st_simplify(Cook, dTolerance = 1000)
g_cook_simplified <-
ggplot(Cook_simplify) +
geom_sf() +
coord_sf(lims_method = "geometry_bbox") +
theme_void()
The figure below compares the original (left) and simplified (right) versions.
g_cook_original | g_cook_simplified
Let’s now simplify IL_counties
using st_simplify()
to make its map light-weight.
IL_counties_simplified <- st_simplify(IL_counties, dTolerance = 1000)
ggplot(IL_counties_simplified) +
geom_sf(fill = "blue", alpha = 0.5) +
theme_void()
You probably notice we now have some gaps between some counties. This is because st_simplify()
does not respect the internal borders. To simplify only the outer borders, you can use rmapshaper::ms_simplify()
. The keep
option controls the degree of simplifcation (the lower, the more simplified).
IL_counties_mssimplified <- rmapshaper::ms_simplify(IL_counties, keep = 0.01)
ggplot(IL_counties_mssimplified) +
geom_sf(fill = "blue", alpha = 0.5) +
theme_void()
You can see that internal borders are simplified in a way that does not create holes (see section 1.6 for an application that makes use of ms_simplify()
).