2.8 Non-spatial transformation of sf

2.8 Non-spatial transformation of `sf`

2.8.1 Using `dplyr`

An important feature of an sf object is that it is basically a data.frame with geometric information stored as a variable (column). This means that transforming an sf object works just like transforming a data.frame. Basically, everything you can do to a data.frame, you can do to an sf as well. The code below just provides an example of basic operations including dplyr::select(), dplyr::filter(), and dplyr::mutate() in action with an sf object to just confirm that dplyr operations works with an sf object just like a data.frame.

#--- here is what the data looks like ---#
dplyr::select(wells_sf, wellid, nrdname, acres, regdate, nrdname)

Simple feature collection with 105822 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -104.0531 ymin: 40.00161 xmax: -96.87681 ymax: 41.85942
Geodetic CRS:  NAD83
First 10 features:
   wellid           nrdname acres  regdate                   geometry
1       2    Central Platte   160 12/30/55 POINT (-99.58401 40.69825)
2       3      South Platte    46  4/29/31 POINT (-102.6249 41.11699)
3       4      South Platte    46  4/29/31 POINT (-102.6249 41.11699)
4       5      South Platte    46  4/29/31 POINT (-102.6249 41.11699)
5       6    Central Platte   160  8/29/32  POINT (-99.6258 40.73268)
6       7    Central Platte   120  2/15/35 POINT (-99.64524 40.73164)
7       8      South Platte   113   8/7/37 POINT (-103.5257 41.24492)
8      10      South Platte   160   5/4/38 POINT (-103.0284 41.13243)
9      11 Middle Republican   807   5/6/38  POINT (-101.1193 40.3527)
10     12 Middle Republican   148 11/29/77 POINT (-101.1146 40.35631)

Notice that geometry column will be retained after dplyr::select() even if you did not tell R to keep it above.

Let’s apply dplyr::select(), dplyr::filter(), and dplyr::mutate() to the dataset.

#--- do some transformations ---#
wells_sf %>%
  #--- select variables (geometry will always remain after select) ---#
  dplyr::select(wellid, nrdname, acres, regdate, nrdname) %>%
  #--- removes observations with acre < 30  ---#
  dplyr::filter(acres > 30) %>%
  #--- hectare instead of acre ---#
  dplyr::mutate(hectare = acres * 0.404686)

Simple feature collection with 63271 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -104.0529 ymin: 40.00161 xmax: -96.87681 ymax: 41.73599
Geodetic CRS:  NAD83
First 10 features:
   wellid           nrdname acres  regdate                   geometry   hectare
1       2    Central Platte   160 12/30/55 POINT (-99.58401 40.69825)  64.74976
2       3      South Platte    46  4/29/31 POINT (-102.6249 41.11699)  18.61556
3       4      South Platte    46  4/29/31 POINT (-102.6249 41.11699)  18.61556
4       5      South Platte    46  4/29/31 POINT (-102.6249 41.11699)  18.61556
5       6    Central Platte   160  8/29/32  POINT (-99.6258 40.73268)  64.74976
6       7    Central Platte   120  2/15/35 POINT (-99.64524 40.73164)  48.56232
7       8      South Platte   113   8/7/37 POINT (-103.5257 41.24492)  45.72952
8      10      South Platte   160   5/4/38 POINT (-103.0284 41.13243)  64.74976
9      11 Middle Republican   807   5/6/38  POINT (-101.1193 40.3527) 326.58160
10     12 Middle Republican   148 11/29/77 POINT (-101.1146 40.35631)  59.89353

Now, let’s try to get a summary of a variable by group using the group_by() and summarize() functions. Here, we use only the first 100 observations because dplyr::summarize() takes just too long.

#--- summary by group ---#
wells_by_nrd <-
  wells_sf[1:100, ] %>%
  #--- group by nrdname ---#
  dplyr::group_by(nrdname) %>%
  #--- summarize ---#
  dplyr::summarize(tot_acres = sum(acres, na.rm = TRUE))

#--- take a look ---#
wells_by_nrd

Simple feature collection with 8 features and 2 fields
Geometry type: MULTIPOINT
Dimension:     XY
Bounding box:  xmin: -103.6467 ymin: 40.0475 xmax: -97.89758 ymax: 41.24492
Geodetic CRS:  NAD83
# A tibble: 8 × 3
  nrdname           tot_acres                                           geometry
  <chr>                 <dbl>                                   <MULTIPOINT [°]>
1 Central Platte        6690. ((-99.00983 40.73009), (-99.03167 40.71927), (-99…
2 Lower Republican       350  ((-99.30403 40.05116), (-99.21946 40.05126), (-99…
3 Middle Republican     1396  ((-100.9216 40.18866), (-101.1663 40.36351), (-10…
4 South Platte          1381  ((-103.6467 41.2375), (-103.6128 41.23761), (-103…
5 Tri-Basin              480  ((-98.94967 40.64285), (-98.94906 40.65061), (-98…
6 Twin Platte           1098. ((-100.9571 41.18633), (-100.8659 41.14635), (-10…
7 Upper Big Blue         330        ((-97.89758 40.86156), (-97.89998 40.86336))
8 Upper Republican       480         ((-101.8858 40.47776), (-101.6501 40.6012))

So, we can summarize an sf by group using dplyr::group_by() and dplyr::summarize(). One interesting change that happened is the geometry variable. Each NRD now has multipoint sfg, which is the combination of all the wells (points) located inside the NRD. Now, it is hard to imagine that you need summarized geometries after group summary. Moreover, it is a very slow operation. If you have lots of free time, try running the above code with wells_sf instead of wells_sf[1:100, ]. I never waited for it to finish as it was running for a long long time. It is advised that you simply drop the geometry and turn the sf object into a data.frame (or tibble, data.table) first and then do group summary.

#--- remove geometry ---#
wells_no_longer_sf <- st_drop_geometry(wells_sf)

#--- take a look ---#
head(wells_no_longer_sf)

  wellid ownerid        nrdname acres  regdate section
1      2  106106 Central Platte   160 12/30/55       3
2      3   14133   South Platte    46  4/29/31       8
3      4   14133   South Platte    46  4/29/31       8
4      5   14133   South Platte    46  4/29/31       8
5      6   15837 Central Platte   160  8/29/32      20
6      7   90248 Central Platte   120  2/15/35      19

We can now do a group summary very quickly:

wells_no_longer_sf %>%
  #--- group by nrdname ---#
  dplyr::group_by(nrdname) %>%
  #--- summarize ---#
  dplyr::summarize(tot_acres = sum(acres, na.rm = TRUE))

# A tibble: 9 × 2
  nrdname           tot_acres
  <chr>                 <dbl>
1 Central Platte     1890918.
2 Little Blue         995900.
3 Lower Republican    543079.
4 Middle Republican   443472.
5 South Platte        216109.
6 Tri-Basin           847058.
7 Twin Platte         452678.
8 Upper Big Blue     1804782.
9 Upper Republican    551906.

2.8.2 Using `data.table`

The data.table package provides data wrangling options that are extremely fast (see here for various benchmark results). It particularly shines when datasets are large and is much faster than dplyr. However, it cannot be as naturally integrated into the workflow involving sf objects as dplyr can. Let’s convert an sf object of points into a data.table object using data.table().

#--- convert an sf to data.table ---#
(
  wells_dt <- data.table(wells_sf)
)

        wellid ownerid        nrdname acres  regdate section
     1:      2  106106 Central Platte   160 12/30/55       3
     2:      3   14133   South Platte    46  4/29/31       8
     3:      4   14133   South Platte    46  4/29/31       8
     4:      5   14133   South Platte    46  4/29/31       8
     5:      6   15837 Central Platte   160  8/29/32      20
    ---                                                     
105818: 244568  135045 Upper Big Blue    NA  8/26/16      30
105819: 244569  105428    Little Blue    NA  8/26/16      24
105820: 244570  135045 Upper Big Blue    NA  8/26/16      30
105821: 244571  135045 Upper Big Blue    NA  8/26/16      25
105822: 244572  105428    Little Blue    NA  8/26/16      15
                          geometry
     1: POINT (-99.58401 40.69825)
     2: POINT (-102.6249 41.11699)
     3: POINT (-102.6249 41.11699)
     4: POINT (-102.6249 41.11699)
     5:  POINT (-99.6258 40.73268)
    ---                           
105818: POINT (-97.58872 40.89017)
105819: POINT (-97.60752 40.13257)
105820: POINT (-97.58294 40.88722)
105821: POINT (-97.59775 40.89639)
105822:  POINT (-97.64086 40.1338)

#--- check the class ---#
class(wells_dt)

[1] "data.table" "data.frame"

You see that wells_dt is no longer an sf object, but the geometry column still remains in the data.

When you convert an sf of polygons into a data.table, then the geometry column appears to have lost the geometry information as all the entries are just <XY[number]>.

(
  nc_dt <- data.table(nc) %>% head()
)

    AREA PERIMETER CNTY_ CNTY_ID        NAME  FIPS FIPSNO CRESS_ID BIR74 SID74
1: 0.114     1.442  1825    1825        Ashe 37009  37009        5  1091     1
2: 0.061     1.231  1827    1827   Alleghany 37005  37005        3   487     0
3: 0.143     1.630  1828    1828       Surry 37171  37171       86  3188     5
4: 0.070     2.968  1831    1831   Currituck 37053  37053       27   508     1
5: 0.153     2.206  1832    1832 Northampton 37131  37131       66  1421     9
6: 0.097     1.670  1833    1833    Hertford 37091  37091       46  1452     7
   NWBIR74 BIR79 SID79 NWBIR79 geometry
1:      10  1364     0      19  <XY[1]>
2:      10   542     3      12  <XY[1]>
3:     208  3616     6     260  <XY[1]>
4:     123   830     2     145  <XY[3]>
5:    1066  1606     3    1197  <XY[1]>
6:     954  1838     5    1237  <XY[1]>

But, the geometry information is still there as you can see from this:

#--- take a look at what's inside the geometry column ---#
nc_dt$geometry

Geometry set for 6 features 
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
Geodetic CRS:  NAD27
First 5 geometries:

#--- check the class of the geometry column ---#
class(nc_dt$geometry)

[1] "sfc_MULTIPOLYGON" "sfc"

If you try to run sf operations on a data.table obejct with a geometry colum, it will of course give you an error. Like this

st_buffer(nc_dt, dist = 2)

Error in UseMethod("st_buffer"): no applicable method for 'st_buffer' applied to an object of class "c('data.table', 'data.frame')"

However, you can apply sf spatial operations only on the geometry like this:

nc_dt$geometry %>%
  st_buffer(dist = 2) %>%
  head()

Geometry set for 6 features 
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -81.74315 ymin: 36.07122 xmax: -75.77203 ymax: 36.59095
Geodetic CRS:  NAD27
First 5 geometries:

Finally, it is easy to revert a data.table object back to an sf object again by using the st_as_sf() function.

#--- wells ---#
(
  wells_sf_again <- st_as_sf(wells_dt)
)

Simple feature collection with 105822 features and 6 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -104.0531 ymin: 40.00161 xmax: -96.87681 ymax: 41.85942
Geodetic CRS:  NAD83
First 10 features:
   wellid ownerid           nrdname acres  regdate section
1       2  106106    Central Platte   160 12/30/55       3
2       3   14133      South Platte    46  4/29/31       8
3       4   14133      South Platte    46  4/29/31       8
4       5   14133      South Platte    46  4/29/31       8
5       6   15837    Central Platte   160  8/29/32      20
6       7   90248    Central Platte   120  2/15/35      19
7       8   48113      South Platte   113   8/7/37      28
8      10   17073      South Platte   160   5/4/38       2
9      11   98432 Middle Republican   807   5/6/38      36
10     12   79294 Middle Republican   148 11/29/77      31
                     geometry
1  POINT (-99.58401 40.69825)
2  POINT (-102.6249 41.11699)
3  POINT (-102.6249 41.11699)
4  POINT (-102.6249 41.11699)
5   POINT (-99.6258 40.73268)
6  POINT (-99.64524 40.73164)
7  POINT (-103.5257 41.24492)
8  POINT (-103.0284 41.13243)
9   POINT (-101.1193 40.3527)
10 POINT (-101.1146 40.35631)

#--- nc polygons ---#
(
  nc_sf_again <- st_as_sf(nc_dt)
)

Simple feature collection with 6 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
Geodetic CRS:  NAD27
   AREA PERIMETER CNTY_ CNTY_ID        NAME  FIPS FIPSNO CRESS_ID BIR74 SID74
1 0.114     1.442  1825    1825        Ashe 37009  37009        5  1091     1
2 0.061     1.231  1827    1827   Alleghany 37005  37005        3   487     0
3 0.143     1.630  1828    1828       Surry 37171  37171       86  3188     5
4 0.070     2.968  1831    1831   Currituck 37053  37053       27   508     1
5 0.153     2.206  1832    1832 Northampton 37131  37131       66  1421     9
6 0.097     1.670  1833    1833    Hertford 37091  37091       46  1452     7
  NWBIR74 BIR79 SID79 NWBIR79                       geometry
1      10  1364     0      19 MULTIPOLYGON (((-81.47276 3...
2      10   542     3      12 MULTIPOLYGON (((-81.23989 3...
3     208  3616     6     260 MULTIPOLYGON (((-80.45634 3...
4     123   830     2     145 MULTIPOLYGON (((-76.00897 3...
5    1066  1606     3    1197 MULTIPOLYGON (((-77.21767 3...
6     954  1838     5    1237 MULTIPOLYGON (((-76.74506 3...

So, this means that if you need fast data transformation, you can first turn an sf to a data.table, transform the data using the data.table functionality, and then revert back to sf. However, for most economists, the geometry variable itself is not of interest in the sense that it never enters econometric models. For most of us, the geographic information contained in the geometry variable is just a glue to tie two datasets together by geographic referencing. Once we get values of spatial variables of interest, there is no point in keeping your data as an sf object. Personally, whenever I no longer need to carry around the geometry variable, I immediately turn an sf object into a data.table for fast data transformation especially when the data is large.

Those who know the dtplyr package (it takes advantage of the speed of data.table while you can keep using dplyr syntax and functions) may wonder if it works well with sf objects. Nope:

library(dtplyr)

#--- convert an "lazy" data.table ---#
wells_ldt <- lazy_dt(wells_sf)

#--- try  ---#
st_buffer(wells_ldt, dist = 2)

Error in UseMethod("st_buffer"): no applicable method for 'st_buffer' applied to an object of class "c('dtplyr_step_first', 'dtplyr_step')"

By the way, this package is awesome if you really love dplyr, but want the speed of data.table. dtplyr is of course slightly slower than data.table because internal translations of dplyr language to data.table language have to happen first.⁵²

I personally use data.table unless it is necessary to use dplyr like when dealing with sf objects. It is more concise than dplyr, which is somewhat verbose (yet expressive because of it). Ultimately, it is your personal preference which to use. You might be interested in reading this discussion about the comparative advantages and disadvantages of the two packages.↩︎

2.8.1 Using dplyr

2.8.2 Using data.table