If requested, tidycensus can return simple feature geometry for geographic units along with variables from the decennial US Census or American Community Survey. When geometry = TRUE is set in a tidycensus function call, tidycensus uses the tigris package to retrieve the corresponding geographic dataset from the US Census Bureau and pre-merges it with the tabular data obtained from the Census API. The following example shows median household income from the 2011-2015 ACS for Census tracts in Orange County, California:
    library(tidycensus)
    library(tidyverse)
    options(tigris_use_cache = TRUE)

    orange <- get_acs(state = "CA", county = "Orange", geography = "tract",
                      variables = "B19013_001", geometry = TRUE)

    head(orange)
    ## Simple feature collection with 6 features and 5 fields
    ## geometry type:  MULTIPOLYGON
    ## dimension:      XY
    ## bbox:           xmin: -117.9594 ymin: 33.8592 xmax: -117.7601 ymax: 33.93685
    ## epsg (SRID):    4269
    ## proj4string:    +proj=longlat +datum=NAD83 +no_defs
    ## # A tibble: 6 x 6
    ##         GEOID                                            NAME   variable
    ##         <chr>                                           <chr>      <chr>
    ## 1 06059001201  Census Tract 12.01, Orange County, California B19013_001
    ## 2 06059001503  Census Tract 15.03, Orange County, California B19013_001
    ## 3 06059001902  Census Tract 19.02, Orange County, California B19013_001
    ## 4 06059011504 Census Tract 115.04, Orange County, California B19013_001
    ## 5 06059021817 Census Tract 218.17, Orange County, California B19013_001
    ## 6 06059021824 Census Tract 218.24, Orange County, California B19013_001
    ## # ... with 3 more variables: estimate <dbl>, moe <dbl>, geometry <S3:
    ## #   sfc_MULTIPOLYGON>
The orange object looks much like the basic tidycensus output, but with a geometry list-column describing the geometry of each feature, using the geographic coordinate system NAD 1983 (EPSG: 4269), which is the default for Census shapefiles. tidycensus uses the Census cartographic boundary shapefiles for faster processing; if you prefer the TIGER/Line shapefiles, set cb = FALSE in the function call.
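As a rough sketch of the cb switch described above, the wrapper below requests the same Orange County tracts with either boundary source. The helper name get_orange_income is hypothetical, and the request only runs when a Census API key is found in the CENSUS_API_KEY environment variable (an assumption about your setup); it also needs network access.

```r
# Hypothetical helper (not part of tidycensus): fetch tract-level median
# household income for Orange County, choosing the boundary source via `cb`.
get_orange_income <- function(cb = TRUE) {
  tidycensus::get_acs(
    state = "CA", county = "Orange", geography = "tract",
    variables = "B19013_001", geometry = TRUE,
    cb = cb  # TRUE: cartographic boundary files; FALSE: TIGER/Line shapefiles
  )
}

# Only contact the Census API when a key appears to be configured.
if (nzchar(Sys.getenv("CENSUS_API_KEY"))) {
  orange_tiger <- get_orange_income(cb = FALSE)
  print(head(orange_tiger))
}
```

The attribute columns should be the same either way; only the boundary geometry differs.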
As the dataset is in a tidy format, it can be quickly visualized with the geom_sf functionality in ggplot2 (in the development version when this was written, and part of released ggplot2 since version 3.0.0):
    library(viridis)

    orange %>%
      ggplot(aes(fill = estimate, color = estimate)) +
      geom_sf() +
      coord_sf(crs = 26911) +
      scale_fill_viridis(option = "magma") +
      scale_color_viridis(option = "magma")
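The coord_sf(crs = 26911) call above draws the map in a projected coordinate system (EPSG:26911 is NAD83 / UTM zone 11N) rather than in raw longitude and latitude. Here is a minimal sketch of the same reprojection using sf directly, on a made-up square standing in for a tract. The coordinates and the toy object are assumptions for illustration, not Census data.

```r
library(sf)

# A made-up square roughly in Orange County, stored in NAD83 (EPSG:4269),
# the CRS described above for tidycensus geometry.
square <- st_polygon(list(rbind(
  c(-117.90, 33.80), c(-117.80, 33.80), c(-117.80, 33.90),
  c(-117.90, 33.90), c(-117.90, 33.80)
)))
toy <- st_sf(GEOID = "toy", geometry = st_sfc(square, crs = 4269))

# Reproject to NAD83 / UTM zone 11N; coordinates are now in meters.
toy_utm <- st_transform(toy, 26911)

st_crs(toy)$epsg
st_crs(toy_utm)$epsg
st_bbox(toy_utm)
```

With coord_sf, ggplot2 does this transformation at draw time, so the stored data stay in EPSG:4269.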
One of the most powerful features of ggplot2 is its support for small multiples, which works very well with the tidy data format returned by tidycensus. Many Census and ACS variables return counts, however, which are generally inappropriate for choropleth mapping. To address this, get_decennial and get_acs have an optional argument, summary_var, that can work as a multi-group denominator when appropriate. Let’s use the racial geography of Harris County, Texas as an example. First, we’ll request data for non-Hispanic whites, non-Hispanic blacks, non-Hispanic Asians, and Hispanics by Census tract for the 2010 Census, and specify total population as the summary variable. Specifying year is not necessary here, as the default is 2010.
    racevars <- c("P0050003", "P0050004", "P0050006", "P0040003")

    harris <- get_decennial(geography = "tract", variables = racevars,
                            state = "TX", county = "Harris County",
                            geometry = TRUE, summary_var = "P0010001")

    head(harris)
    ## Simple feature collection with 6 features and 5 fields
    ## geometry type:  MULTIPOLYGON
    ## dimension:      XY
    ## bbox:           xmin: -95.37457 ymin: 29.74486 xmax: -95.32409 ymax: 29.80907
    ## epsg (SRID):    4269
    ## proj4string:    +proj=longlat +datum=NAD83 +no_defs
    ## # A tibble: 6 x 6
    ##         GEOID              NAME variable value summary_value
    ##         <chr>             <chr>    <chr> <dbl>         <dbl>
    ## 1 48201100000 Census Tract 1000 P0050003  2082          4690
    ## 2 48201100000 Census Tract 1000 P0050004  1047          4690
    ## 3 48201100000 Census Tract 1000 P0050006   134          4690
    ## 4 48201100000 Census Tract 1000 P0040003  1070          4690
    ## 5 48201210900 Census Tract 2109 P0050003    35          1620
    ## 6 48201210900 Census Tract 2109 P0050004  1195          1620
    ## # ... with 1 more variables: geometry <S3: sfc_MULTIPOLYGON>
We notice that there are four entries for each Census tract, with each entry representing one of our requested variables. The summary_value column represents the value of the summary variable, which is total population in this instance. When a summary variable is specified in get_acs, both summary_est and summary_moe columns will be returned.
With this information, we can set up an analysis pipeline in which we calculate a new percent-of-total column; recode the Census variable names into more intuitive labels; and visualize the result for each group in a faceted plot.
    library(forcats)

    harris %>%
      mutate(pct = 100 * (value / summary_value),
             variable = fct_recode(variable,
                                   White = "P0050003",
                                   Black = "P0050004",
                                   Asian = "P0050006",
                                   Hispanic = "P0040003")) %>%
      ggplot(aes(fill = pct, color = pct)) +
      facet_wrap(~variable) +
      geom_sf() +
      coord_sf(crs = 26915) +
      scale_fill_viridis() +
      scale_color_viridis()
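To make the percent-of-total step concrete, here is a small sketch on a plain tibble, with no geometry and no API call. The four rows copy the Census Tract 1000 values from the head(harris) output shown earlier; the object names tract_1000 and shares are made up for illustration.

```r
library(dplyr)
library(forcats)

# One tract, four groups, with total population as the shared denominator.
tract_1000 <- tibble(
  variable = c("P0050003", "P0050004", "P0050006", "P0040003"),
  value = c(2082, 1047, 134, 1070),
  summary_value = 4690
)

# Same two steps as the pipeline above: compute the share, then relabel.
shares <- tract_1000 %>%
  mutate(pct = 100 * (value / summary_value),
         variable = fct_recode(variable,
                               White = "P0050003",
                               Black = "P0050004",
                               Asian = "P0050006",
                               Hispanic = "P0040003"))

shares
```

The four shares need not sum to 100, since these groups do not cover the whole population of the tract.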
Beyond this, you might be interested in writing your dataset to a shapefile or GeoJSON for use in external GIS or visualization applications. You can accomplish this with the st_write function in the sf package:
    library(sf)
    st_write(orange, "orange.shp")
Your tidycensus-obtained dataset can now be used in ArcGIS, QGIS, Tableau, or any other application that reads shapefiles.
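For GeoJSON output, st_write can pick the driver from the file extension, so the same call works with a .geojson path. Here is a minimal sketch on a toy sf object; the polygon, the estimate value and the temporary file path are all made up so that the example runs without an API key.

```r
library(sf)

# Toy stand-in for a tidycensus result: one feature with attribute columns.
square <- st_polygon(list(rbind(
  c(-117.90, 33.80), c(-117.80, 33.80), c(-117.80, 33.90),
  c(-117.90, 33.90), c(-117.90, 33.80)
)))
toy <- st_sf(GEOID = "06059001201", estimate = 50000,
             geometry = st_sfc(square, crs = 4269))

# The .geojson extension selects the GeoJSON driver.
path <- file.path(tempdir(), "toy.geojson")
st_write(toy, path, delete_dsn = TRUE, quiet = TRUE)

# Read the file back to confirm the attributes survived the round trip.
toy_back <- st_read(path, quiet = TRUE)
toy_back
```

With a real result, replacing toy with orange and the temporary path with a file such as "orange.geojson" should behave the same way.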
There is a lot more you can do with the spatial functionality in tidycensus, including more sophisticated visualization and spatial analysis; look for updates on my blog and in this space.