2.8 Non-spatial transformation of sf
2.8.1 Using dplyr
An important feature of an sf
object is that it is basically a data.frame
with geometric information stored as a variable (column). This means that transforming an sf
object works just like transforming a data.frame
. Basically, everything you can do to a data.frame
, you can do to an sf
as well. The code below just provides an example of basic operations including dplyr::select()
, dplyr::filter()
, and dplyr::mutate()
in action with an sf
object to just confirm that dplyr
operations works with an sf
object just like a data.frame
.
#--- here is what the data looks like ---#
::select(wells_sf, wellid, nrdname, acres, regdate, nrdname) dplyr
Simple feature collection with 105822 features and 4 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -104.0531 ymin: 40.00161 xmax: -96.87681 ymax: 41.85942
Geodetic CRS: NAD83
First 10 features:
wellid nrdname acres regdate geometry
1 2 Central Platte 160 12/30/55 POINT (-99.58401 40.69825)
2 3 South Platte 46 4/29/31 POINT (-102.6249 41.11699)
3 4 South Platte 46 4/29/31 POINT (-102.6249 41.11699)
4 5 South Platte 46 4/29/31 POINT (-102.6249 41.11699)
5 6 Central Platte 160 8/29/32 POINT (-99.6258 40.73268)
6 7 Central Platte 120 2/15/35 POINT (-99.64524 40.73164)
7 8 South Platte 113 8/7/37 POINT (-103.5257 41.24492)
8 10 South Platte 160 5/4/38 POINT (-103.0284 41.13243)
9 11 Middle Republican 807 5/6/38 POINT (-101.1193 40.3527)
10 12 Middle Republican 148 11/29/77 POINT (-101.1146 40.35631)
Notice that geometry
column will be retained after dplyr::select()
even if you did not tell R to keep it above.
Let’s apply dplyr::select()
, dplyr::filter()
, and dplyr::mutate()
to the dataset.
#--- do some transformations ---#
%>%
wells_sf #--- select variables (geometry will always remain after select) ---#
::select(wellid, nrdname, acres, regdate, nrdname) %>%
dplyr#--- removes observations with acre < 30 ---#
::filter(acres > 30) %>%
dplyr#--- hectare instead of acre ---#
::mutate(hectare = acres * 0.404686) dplyr
Simple feature collection with 63271 features and 5 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -104.0529 ymin: 40.00161 xmax: -96.87681 ymax: 41.73599
Geodetic CRS: NAD83
First 10 features:
wellid nrdname acres regdate geometry hectare
1 2 Central Platte 160 12/30/55 POINT (-99.58401 40.69825) 64.74976
2 3 South Platte 46 4/29/31 POINT (-102.6249 41.11699) 18.61556
3 4 South Platte 46 4/29/31 POINT (-102.6249 41.11699) 18.61556
4 5 South Platte 46 4/29/31 POINT (-102.6249 41.11699) 18.61556
5 6 Central Platte 160 8/29/32 POINT (-99.6258 40.73268) 64.74976
6 7 Central Platte 120 2/15/35 POINT (-99.64524 40.73164) 48.56232
7 8 South Platte 113 8/7/37 POINT (-103.5257 41.24492) 45.72952
8 10 South Platte 160 5/4/38 POINT (-103.0284 41.13243) 64.74976
9 11 Middle Republican 807 5/6/38 POINT (-101.1193 40.3527) 326.58160
10 12 Middle Republican 148 11/29/77 POINT (-101.1146 40.35631) 59.89353
Now, let’s try to get a summary of a variable by group using the group_by()
and summarize()
functions. Here, we use only the first 100 observations because dplyr::summarize()
takes just too long.
#--- summary by group ---#
<-
wells_by_nrd 1:100, ] %>%
wells_sf[#--- group by nrdname ---#
::group_by(nrdname) %>%
dplyr#--- summarize ---#
::summarize(tot_acres = sum(acres, na.rm = TRUE))
dplyr
#--- take a look ---#
wells_by_nrd
Simple feature collection with 8 features and 2 fields
Geometry type: MULTIPOINT
Dimension: XY
Bounding box: xmin: -103.6467 ymin: 40.0475 xmax: -97.89758 ymax: 41.24492
Geodetic CRS: NAD83
# A tibble: 8 × 3
nrdname tot_acres geometry
<chr> <dbl> <MULTIPOINT [°]>
1 Central Platte 6690. ((-99.00983 40.73009), (-99.03167 40.71927), (-99…
2 Lower Republican 350 ((-99.30403 40.05116), (-99.21946 40.05126), (-99…
3 Middle Republican 1396 ((-100.9216 40.18866), (-101.1663 40.36351), (-10…
4 South Platte 1381 ((-103.6467 41.2375), (-103.6128 41.23761), (-103…
5 Tri-Basin 480 ((-98.94967 40.64285), (-98.94906 40.65061), (-98…
6 Twin Platte 1098. ((-100.9571 41.18633), (-100.8659 41.14635), (-10…
7 Upper Big Blue 330 ((-97.89758 40.86156), (-97.89998 40.86336))
8 Upper Republican 480 ((-101.8858 40.47776), (-101.6501 40.6012))
So, we can summarize an sf
by group using dplyr::group_by()
and dplyr::summarize()
. One interesting change that happened is the geometry variable. Each NRD now has multipoint
sfg, which is the combination of all the wells (points) located inside the NRD. Now, it is hard to imagine that you need summarized
geometries after group summary. Moreover, it is a very slow operation. If you have lots of free time, try running the above code with wells_sf
instead of wells_sf[1:100, ]
. I never waited for it to finish as it was running for a long long time. It is advised that you simply drop the geometry and turn the sf
object into a data.frame
(or tibble
, data.table
) first and then do group summary.
#--- remove geometry ---#
<- st_drop_geometry(wells_sf)
wells_no_longer_sf
#--- take a look ---#
head(wells_no_longer_sf)
wellid ownerid nrdname acres regdate section
1 2 106106 Central Platte 160 12/30/55 3
2 3 14133 South Platte 46 4/29/31 8
3 4 14133 South Platte 46 4/29/31 8
4 5 14133 South Platte 46 4/29/31 8
5 6 15837 Central Platte 160 8/29/32 20
6 7 90248 Central Platte 120 2/15/35 19
We can now do a group summary very quickly:
%>%
wells_no_longer_sf #--- group by nrdname ---#
::group_by(nrdname) %>%
dplyr#--- summarize ---#
::summarize(tot_acres = sum(acres, na.rm = TRUE)) dplyr
# A tibble: 9 × 2
nrdname tot_acres
<chr> <dbl>
1 Central Platte 1890918.
2 Little Blue 995900.
3 Lower Republican 543079.
4 Middle Republican 443472.
5 South Platte 216109.
6 Tri-Basin 847058.
7 Twin Platte 452678.
8 Upper Big Blue 1804782.
9 Upper Republican 551906.
2.8.2 Using data.table
The data.table
package provides data wrangling options that are extremely fast (see here for various benchmark results). It particularly shines when datasets are large and is much faster than dplyr
. However, it cannot be as naturally integrated into the workflow involving sf
objects as dplyr
can. Let’s convert an sf
object of points into a data.table
object using data.table()
.
#--- convert an sf to data.table ---#
(<- data.table(wells_sf)
wells_dt )
wellid ownerid nrdname acres regdate section
1: 2 106106 Central Platte 160 12/30/55 3
2: 3 14133 South Platte 46 4/29/31 8
3: 4 14133 South Platte 46 4/29/31 8
4: 5 14133 South Platte 46 4/29/31 8
5: 6 15837 Central Platte 160 8/29/32 20
---
105818: 244568 135045 Upper Big Blue NA 8/26/16 30
105819: 244569 105428 Little Blue NA 8/26/16 24
105820: 244570 135045 Upper Big Blue NA 8/26/16 30
105821: 244571 135045 Upper Big Blue NA 8/26/16 25
105822: 244572 105428 Little Blue NA 8/26/16 15
geometry
1: POINT (-99.58401 40.69825)
2: POINT (-102.6249 41.11699)
3: POINT (-102.6249 41.11699)
4: POINT (-102.6249 41.11699)
5: POINT (-99.6258 40.73268)
---
105818: POINT (-97.58872 40.89017)
105819: POINT (-97.60752 40.13257)
105820: POINT (-97.58294 40.88722)
105821: POINT (-97.59775 40.89639)
105822: POINT (-97.64086 40.1338)
#--- check the class ---#
class(wells_dt)
[1] "data.table" "data.frame"
You see that wells_dt
is no longer an sf
object, but the geometry column still remains in the data.
When you convert an sf
of polygons into a data.table
, then the geometry column appears to have lost the geometry information as all the entries are just <XY[number]>
.
(<- data.table(nc) %>% head()
nc_dt )
AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74
1: 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1
2: 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0
3: 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5
4: 0.070 2.968 1831 1831 Currituck 37053 37053 27 508 1
5: 0.153 2.206 1832 1832 Northampton 37131 37131 66 1421 9
6: 0.097 1.670 1833 1833 Hertford 37091 37091 46 1452 7
NWBIR74 BIR79 SID79 NWBIR79 geometry
1: 10 1364 0 19 <XY[1]>
2: 10 542 3 12 <XY[1]>
3: 208 3616 6 260 <XY[1]>
4: 123 830 2 145 <XY[3]>
5: 1066 1606 3 1197 <XY[1]>
6: 954 1838 5 1237 <XY[1]>
But, the geometry information is still there as you can see from this:
#--- take a look at what's inside the geometry column ---#
$geometry nc_dt
Geometry set for 6 features
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
Geodetic CRS: NAD27
First 5 geometries:
#--- check the class of the geometry column ---#
class(nc_dt$geometry)
[1] "sfc_MULTIPOLYGON" "sfc"
If you try to run sf
operations on a data.table
obejct with a geometry colum, it will of course give you an error. Like this
st_buffer(nc_dt, dist = 2)
Error in UseMethod("st_buffer"): no applicable method for 'st_buffer' applied to an object of class "c('data.table', 'data.frame')"
However, you can apply sf
spatial operations only on the geometry like this:
$geometry %>%
nc_dtst_buffer(dist = 2) %>%
head()
Geometry set for 6 features
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -81.74315 ymin: 36.07122 xmax: -75.77203 ymax: 36.59095
Geodetic CRS: NAD27
First 5 geometries:
Finally, it is easy to revert a data.table
object back to an sf
object again by using the st_as_sf()
function.
#--- wells ---#
(<- st_as_sf(wells_dt)
wells_sf_again )
Simple feature collection with 105822 features and 6 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -104.0531 ymin: 40.00161 xmax: -96.87681 ymax: 41.85942
Geodetic CRS: NAD83
First 10 features:
wellid ownerid nrdname acres regdate section
1 2 106106 Central Platte 160 12/30/55 3
2 3 14133 South Platte 46 4/29/31 8
3 4 14133 South Platte 46 4/29/31 8
4 5 14133 South Platte 46 4/29/31 8
5 6 15837 Central Platte 160 8/29/32 20
6 7 90248 Central Platte 120 2/15/35 19
7 8 48113 South Platte 113 8/7/37 28
8 10 17073 South Platte 160 5/4/38 2
9 11 98432 Middle Republican 807 5/6/38 36
10 12 79294 Middle Republican 148 11/29/77 31
geometry
1 POINT (-99.58401 40.69825)
2 POINT (-102.6249 41.11699)
3 POINT (-102.6249 41.11699)
4 POINT (-102.6249 41.11699)
5 POINT (-99.6258 40.73268)
6 POINT (-99.64524 40.73164)
7 POINT (-103.5257 41.24492)
8 POINT (-103.0284 41.13243)
9 POINT (-101.1193 40.3527)
10 POINT (-101.1146 40.35631)
#--- nc polygons ---#
(<- st_as_sf(nc_dt)
nc_sf_again )
Simple feature collection with 6 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
Geodetic CRS: NAD27
AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74
1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1
2 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0
3 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5
4 0.070 2.968 1831 1831 Currituck 37053 37053 27 508 1
5 0.153 2.206 1832 1832 Northampton 37131 37131 66 1421 9
6 0.097 1.670 1833 1833 Hertford 37091 37091 46 1452 7
NWBIR74 BIR79 SID79 NWBIR79 geometry
1 10 1364 0 19 MULTIPOLYGON (((-81.47276 3...
2 10 542 3 12 MULTIPOLYGON (((-81.23989 3...
3 208 3616 6 260 MULTIPOLYGON (((-80.45634 3...
4 123 830 2 145 MULTIPOLYGON (((-76.00897 3...
5 1066 1606 3 1197 MULTIPOLYGON (((-77.21767 3...
6 954 1838 5 1237 MULTIPOLYGON (((-76.74506 3...
So, this means that if you need fast data transformation, you can first turn an sf
to a data.table
, transform the data using the data.table
functionality, and then revert back to sf
. However, for most economists, the geometry variable itself is not of interest in the sense that it never enters econometric models. For most of us, the geographic information contained in the geometry variable is just a glue to tie two datasets together by geographic referencing. Once we get values of spatial variables of interest, there is no point in keeping your data as an sf object. Personally, whenever I no longer need to carry around the geometry variable, I immediately turn an sf object into a data.table
for fast data transformation especially when the data is large.
Those who know the dtplyr
package (it takes advantage of the speed of data.table
while you can keep using dplyr
syntax and functions) may wonder if it works well with sf
objects. Nope:
library(dtplyr)
#--- convert an "lazy" data.table ---#
<- lazy_dt(wells_sf)
wells_ldt
#--- try ---#
st_buffer(wells_ldt, dist = 2)
Error in UseMethod("st_buffer"): no applicable method for 'st_buffer' applied to an object of class "c('dtplyr_step_first', 'dtplyr_step')"
By the way, this package is awesome if you really love dplyr
, but want the speed of data.table
. dtplyr
is of course slightly slower than data.table
because internal translations of dplyr
language to data.table
language have to happen first.52
I personally use
data.table
unless it is necessary to usedplyr
like when dealing withsf
objects. It is more concise thandplyr
, which is somewhat verbose (yet expressive because of it). Ultimately, it is your personal preference which to use. You might be interested in reading this discussion about the comparative advantages and disadvantages of the two packages.↩︎