Writing about visualization, demographics, dashboards, and spatial data science.

Interested in learning more? Hire me for a workshop or to consult on your next project. See the Services page for more details.

Generating metropolitan subsets of Census data with R and tigris

· by Kyle Walker · Read in about 4 min · (759 Words)
r census tigris

Need help working with Census data in your project? Contact me at kwalkerdata@gmail.com to discuss consulting support or a training workshop!

Commonly, studies that use US Census data focus on topics at the scale of the metropolitan area. However, subsetting Census geographic data by metropolitan area is not always straightforward. Such a workflow for Census tracts might look something like:

  1. Manually downloading Census tract shapefiles (often available by state);
  2. Looking up the counties in a given metropolitan area, along with their FIPS codes;
  3. Subsetting the data by those FIPS codes.
  4. Repeating the process for each state included in the metropolitan area and then merging the result (if tracts are obtained by state).

This task - which can be tedious - is well-suited for R and the tigris package. R, by way of the sf package, has powerful functionality for spatial subsetting. Further, with a little custom code, we can set up a framework for rapidly retrieving Census tracts for any metropolitan area we want. To get started, let’s load the required packages and set options to cache data from tigris and load the data as sf objects.

library(tigris)
library(sf)
library(tidyverse)
options(tigris_class = "sf")
options(tigris_use_cache = TRUE)

Let’s say we want to obtain Census tracts for the Portland, Oregon metropolitan area, which includes areas in both Oregon and Washington. We can fetch Census tracts for Oregon and Washington and combine them with the rbind_tigris function.

orwa <- rbind_tigris(
  tracts("OR", cb = TRUE), 
  tracts("WA", cb = TRUE)
)

ggplot(orwa) + geom_sf()

We can then subset these tracts spatially by locating the boundary of the Portland metropolitan area available in the core_based_statistical_areas function. Use the View function in RStudio and browse the dataset if you are unsure of how to format the string used to filter the data.

cb <- core_based_statistical_areas(cb = TRUE)

pdx <- filter(cb, grepl("Portland-Vancouver", NAME))

ggplot(pdx) + geom_sf()

The simplest way to do spatial subsetting in sf is by indexing the data you’d like to subset by the spatial overlay. This will return all tracts that intersect the metropolitan area.

p1 <- orwa[pdx,]

ggplot() + 
  geom_sf(data = p1) + 
  geom_sf(data = pdx, fill = NA, color = "red")

We notice a problem here - we’ve returned all tracts within the metropolitan area, but also those outside it that touch the metropolitan area’s boundary. Fortunately, sf includes a suite of methods for spatial overlay. We’ll use the st_within method to identify those tracts within the boundary of the metropolitan area. Further, as the Census Bureau designs its shapefiles to align cleanly, we shouldn’t have any spatial overlay problems here.

w1 <- st_within(orwa, pdx)

print(length(w1))
## [1] 2282
print(w1[1:5])
## [[1]]
## integer(0)
## 
## [[2]]
## integer(0)
## 
## [[3]]
## integer(0)
## 
## [[4]]
## integer(0)
## 
## [[5]]
## [1] 1

We see that the object w1 is a list with the same length as our orwa data frame. Values either are length-0 integers - representing tracts that are not within the metro area - or 1, representing tracts that are. In turn, we can use purrr to convert this to a logical vector and subset our data accordingly.

w2 <- map_lgl(w1, function(x) {
  if (length(x) == 1) {
    return(TRUE)
  } else {
    return(FALSE)
  }
})

p2 <- orwa[w2,]

ggplot() + 
  geom_sf(data = p2) + 
  geom_sf(data = pdx, fill = NA, color = "red")

Perfect! Now, let’s say you want to make this extensible to other metropolitan areas on-demand. We’ll need a function that does the following:

  1. Takes a metropolitan area as input, and detects the states in which the metro area is located;
  2. Retrieves tracts for those states;
  3. Identifies the tracts located within the metro area’s boundary.

Here’s how to do it:

metro_tracts <- function(metro_name) {
  
  # First, identify which states intersect the metro area using the
  # `states` function in tigris
  st <- states(cb = TRUE)
  cb <- core_based_statistical_areas(cb = TRUE)
  metro <- filter(cb, grepl(metro_name, NAME))
  
  stcodes <- st[metro,]$STATEFP
  
  # Then, fetch the tracts, using rbind_tigris if there is more
  # than one state
  if (length(stcodes) > 1) {
    tr <- rbind_tigris(
      map(stcodes, function(x) {
        tracts(x, cb = TRUE)
      })
    )
  } else {
    tr <- tracts(x, cb = TRUE)
  }
  
  # Now, find out which tracts are within the metro area
  within <- st_within(tr, metro)
  
  within_lgl <- map_lgl(within, function(x) {
    if (length(x) == 1) {
      return(TRUE)
    } else {
      return(FALSE)
    }
  })
  
  # Finally, subset and return the output
  output <- tr[within_lgl,]
  
  return(output)
  
}

Let’s try it out! Here’s Chicago:

chi <- metro_tracts("Chicago")

ggplot(chi) + geom_sf()

And here’s Boston:

bos <- metro_tracts("Boston")

ggplot(bos) + geom_sf()