Skip to contents

This document intended to get you up to speed with using rsgeo and comfortable branching out from using sf more generally.

Understanding sf

Most R users that do geospatial analysis are familiar with sf and understand spatial workflows in that context only. Many users of sf view it as this magical data frame that lets you do spatial analysis—and that is exactly how it feels! This means that sf has done a great job in making spatial analysis feel a lot easier for the vast majority of R users.

But sf needs to be understood in a more fundamental way. sf is named after the Simple Feature Access Standard. Simple features are an agreed upon way to represent “geometric primitives”—things like points, linestrings, polygons, and their multi- types. The sf package builds a hierarchy off of these. We have sfg, sfc, and sf objects.

sfg objects

At the core is the sfg class which are representations of simple feature geometries. sfg class objects are representations of a simple feature. They are a single geometry at a time. We can think of them as a scalar value (even though R does not have the concept of a scalar).

library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE

# create a point
pnt <- st_point(c(0, 10))

# create a line
ln <- st_linestring(matrix(c(0, 1, 0, 0), ncol = 2))

class(pnt)
#> [1] "XY"    "POINT" "sfg"
class(ln)
#> [1] "XY"         "LINESTRING" "sfg"

Each scalar value can be used independently which is useful in itself.

st_length(ln)
#> [1] 1

sfg objects are very simple objects constructed of numeric vectors, matrices, and lists of matrices. They also have no sense of a coordinate reference system (CRS).

But more often we want to have a vector of geometries. A vector of geometries is stored in an sfc object.

sfc objects

sfc is short for simple feature column. Under the hood, this is just a list of sfg objects. You might think that you could create a vector of geometries by combining them using c() but that would be wrong.

c(pnt, pnt)
#> MULTIPOINT ((0 10), (0 10))
c(pnt, ln)
#> GEOMETRYCOLLECTION (POINT (0 10), LINESTRING (0 0, 1 0))

sfg objects behave like scalars so combining them creates either a multi- type of the sfg or a geometry collection (another type of simple feature).

To create a vector of geometries you must use st_sfc(). st_sfc() is the construct for creating a “simple feature geometry list column.” It lets you also set a number of attributes that are associated with the vector.

st_sfc(pnt, pnt)
#> Geometry set for 2 features 
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 0 ymin: 10 xmax: 0 ymax: 10
#> CRS:           NA
#> POINT (0 10)
#> POINT (0 10)

sfc objects have attributes such as a CRS, bounding box, and precision. sfc objects behave more like the vectors that you are familiar with .

c(
  st_sfc(pnt, pnt), 
  st_sfc(pnt),
  st_sfc(ln)
)
#> Geometry set for 4 features 
#> Geometry type: GEOMETRY
#> Dimension:     XY
#> Bounding box:  xmin: 0 ymin: 0 xmax: 1 ymax: 10
#> CRS:           NA
#> POINT (0 10)
#> POINT (0 10)
#> POINT (0 10)
#> LINESTRING (0 0, 1 0)

Since sfc objects are vectors, they can be included as a column in a data frame.

df <- data.frame(
  geo = st_sfc(pnt, pnt)
)

df
#>       geometry
#> 1 POINT (0 10)
#> 2 POINT (0 10)

class(df)
#> [1] "data.frame"

Having geometry included in a data frame is a huge win for the R community because this means they can included attributes along with the geometries.

For example we can create a vector of points from the x and y columns from the diamonds dataset in ggplot2.

data(diamonds, package = "ggplot2")

pnts <- purrr::map2(
  diamonds$x, 
  diamonds$y, 
  function(.x, .y) st_point(c(.x, .y))
) |> 
  st_sfc()

We can included this in a data frame

library(dplyr, warn.conflicts = FALSE)

dmnd <- diamonds |> 
  select(price, clarity) |> 
  bind_cols(geometry = pnts)

head(dmnd)
#> # A tibble: 6 × 3
#>   price clarity    geometry
#>   <int> <ord>       <POINT>
#> 1   326 SI2     (3.95 3.98)
#> 2   326 SI1     (3.89 3.84)
#> 3   327 VS1     (4.05 4.07)
#> 4   334 VS2      (4.2 4.23)
#> 5   335 SI2     (4.34 4.35)
#> 6   336 VVS2    (3.94 3.96)

Now we have a tibble with price, clarity, and geometry! Huge win! What if we wanted to calculate the average price by clarity and keep the geometries?

dmnd |> 
  group_by(clarity) |> 
  summarise(avg_price = mean(price))
#> # A tibble: 8 × 2
#>   clarity avg_price
#>   <ord>       <dbl>
#> 1 I1          3924.
#> 2 SI2         5063.
#> 3 SI1         3996.
#> 4 VS2         3925.
#> 5 VS1         3839.
#> 6 VVS2        3284.
#> 7 VVS1        2523.
#> 8 IF          2865.

Well, we lose the geometry just like we would for any other column that wasn’t included in the summarise() call. To keep it, we would need to perform an operation on the geometry itself.

dmnd |> 
  group_by(clarity) |> 
  summarise(
    avg_price = mean(price),
    geometry = st_union(geometry)
  )
#> # A tibble: 8 × 3
#>   clarity avg_price                                                     geometry
#>   <ord>       <dbl>                                                 <MULTIPOINT>
#> 1 I1          3924. ((4.33 4.29), (4.33 4.36), (4.36 4.33), (4.38 4.42), (4.39 …
#> 2 SI2         5063. ((0 0), (0 6.62), (3.79 3.75), (3.84 3.82), (3.87 3.85), (3…
#> 3 SI1         3996. ((3.88 3.84), (3.89 3.84), (3.9 3.85), (3.93 3.96), (3.95 3…
#> 4 VS2         3925. ((0 0), (3.73 3.68), (3.73 3.71), (3.74 3.71), (3.76 3.73),…
#> 5 VS1         3839. ((0 0), (3.83 3.85), (3.84 3.87), (3.86 3.89), (3.88 3.9), …
#> 6 VVS2        3284. ((3.83 3.86), (3.85 3.89), (3.85 3.9), (3.85 3.91), (3.86 3…
#> 7 VVS1        2523. ((0 0), (3.83 3.85), (3.87 3.9), (3.88 3.95), (3.88 3.99), …
#> 8 IF          2865. ((3.86 3.88), (3.89 3.9), (3.91 3.95), (3.92 3.94), (3.93 3…

Well, wouldn’t it be nice if the geometry knew to do that? Well, that is exactly what sf objects are.

sf objects

sf objects are simply just data frames with a geometry column that is sticky and smart. We can create an sf object if an sfc column is present in a data frame by using st_as_sf().

dmnd_sf <- st_as_sf(dmnd)

Doing this creates an object of class sf. The two things that make an sf object so special are the class sf and the the attribute sf_column.

attr(dmnd_sf, "sf_column")
#> [1] "geometry"

This attribute tells us which column is the geometry. Because we have this sf can implement its own methods for common functions like select(), mutate(), aggregate(), group_by(), etc which always keep the attr(x, "sf_column") attached to the data frame.

Having the class and attribute allow methods like summarise() to be written for the class itself and handle the things like unioning geometry for us.

dmnd_sf |> 
  group_by(clarity) |> 
  summarise(avg_price = mean(price))
#> Simple feature collection with 8 features and 2 fields
#> Geometry type: MULTIPOINT
#> Dimension:     XY
#> Bounding box:  xmin: 0 ymin: 0 xmax: 10.74 ymax: 58.9
#> CRS:           NA
#> # A tibble: 8 × 3
#>   clarity avg_price                                                     geometry
#>   <ord>       <dbl>                                                 <MULTIPOINT>
#> 1 I1          3924. ((4.33 4.29), (4.33 4.36), (4.36 4.33), (4.38 4.42), (4.39 …
#> 2 SI2         5063. ((0 0), (0 6.62), (3.79 3.75), (3.84 3.82), (3.87 3.85), (3…
#> 3 SI1         3996. ((3.88 3.84), (3.89 3.84), (3.9 3.85), (3.93 3.96), (3.95 3…
#> 4 VS2         3925. ((0 0), (3.73 3.68), (3.73 3.71), (3.74 3.71), (3.76 3.73),…
#> 5 VS1         3839. ((0 0), (3.83 3.85), (3.84 3.87), (3.86 3.89), (3.88 3.9), …
#> 6 VVS2        3284. ((3.83 3.86), (3.85 3.89), (3.85 3.9), (3.85 3.91), (3.86 3…
#> 7 VVS1        2523. ((0 0), (3.83 3.85), (3.87 3.9), (3.88 3.95), (3.88 3.99), …
#> 8 IF          2865. ((3.86 3.88), (3.89 3.9), (3.91 3.95), (3.92 3.94), (3.93 3…

Note that this is just like what we wrote earlier except that we didn’t have to handle the geometry column manually. If we compare these two approaches and ignore attribute difference like sf_column and class we can see that they are identical.

x <- dmnd |> 
  group_by(clarity) |> 
  summarise(
    avg_price = mean(price),
    geometry = st_union(geometry)
  )

y <- dmnd_sf |> 
  group_by(clarity) |> 
  summarise(avg_price = mean(price))

all.equal(x, y, check.attributes = FALSE)
#> [1] TRUE

Freeing yourself

Knowing how sf works can be helpful for understanding how we can ease our reliance on it for all geospatial operations. Instead of thinking of the entire sf data frame as the thing that handles all geometry operations we now know that it is the sfc geometry column.

In R there are different ways of representing geometry besides sfc vectors. The packages s2, geos, wk, and rsgeo all provide different vectors of geometries that can be used.

These libraries are very handy for doing geometric operations. Each of these packages tend to be better at one thing than another and each have their place. You will often find big speed improvements if you use these libraries instead of sf for certain things.

Take for example calculating the length of linestrings on a geodesic. We can use the roxel dataset from sfnetworks.

data(roxel, package = "sfnetworks")

To illustrate we can extract the geometry column and cast it as an rsgeo class object.

library(rsgeo)
geo <- roxel$geometry
rs <- as_rsgeo(geo)

We can then use the functions st_length() and length_haversine() respectively to compute the length of linestrings.

bench::mark(
  st_length(geo),
  length_haversine(rs),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 st_length(geo)         2.28ms   2.37ms      409.    2.89MB     17.1
#> 2 length_haversine(rs) 322.55µs 756.37µs     1301.     8.8KB      0

This is markedly faster when using rsgeo. We also do not have to extract the vector and work with it as its own object. Any vector can be included in a data frame, remember?

roxel |> 
    as_tibble() |> 
    mutate(
      geometry = as_rsgeo(geometry),
      length = length_haversine(geometry)
    )
#> # A tibble: 851 × 4
#>    name                  type       
#>    <chr>                 <fct>      
#>  1 Havixbecker Strasse   residential
#>  2 Pienersallee          secondary  
#>  3 Schulte-Bernd-Strasse residential
#>  4 NA                    path       
#>  5 Welsingheide          residential
#>  6 NA                    footway    
#>  7 NA                    footway    
#>  8 NA                    path       
#>  9 NA                    track      
#> 10 NA                    track      
#> # ℹ 841 more rows
#> # ℹ 2 more variables: geometry <r_LINEST>, length <dbl>