To get started working with tidycensus, users should load the package along with the tidyverse package, and set their Census API key. A key can be obtained from http://api.census.gov/data/key_signup.html.
library(tidycensus)
library(tidyverse)
census_api_key("YOUR API KEY GOES HERE")
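Before making requests, it can help to confirm that a key is actually visible to the session. The sketch below assumes the key is stored in the CENSUS_API_KEY environment variable, which is where census_api_key places it; has_census_key is a hypothetical helper name and is not part of tidycensus.

```r
# Hypothetical helper (not part of tidycensus): report whether a Census API
# key is visible to the current R session.
# Assumption: the key lives in the CENSUS_API_KEY environment variable.
has_census_key <- function() {
  nzchar(Sys.getenv("CENSUS_API_KEY"))
}

if (!has_census_key()) {
  message("No Census API key found; call census_api_key() first.")
}
```

To avoid setting the key in every session, census_api_key also accepts install = TRUE, which writes the key to your .Renviron file.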
There are two major functions implemented in tidycensus:
get_decennial, which grants access to the 1990, 2000, and 2010 decennial US Census APIs, and
get_acs, which grants access to the 5-year American Community Survey APIs.
In this basic example, let’s look at median gross rent by state in 1990:
m90 <- get_decennial(geography = "state", variables = "H043A001", year = 1990)
head(m90)
## # A tibble: 6 x 4
##   GEOID NAME       variable value
##   <chr> <chr>      <chr>    <dbl>
## 1 01    Alabama    H043A001   325
## 2 02    Alaska     H043A001   559
## 3 04    Arizona    H043A001   438
## 4 05    Arkansas   H043A001   328
## 5 06    California H043A001   620
## 6 08    Colorado   H043A001   418
The function returns a tibble with four columns by default:
GEOID, which is an identifier for the geographical unit associated with the row;
NAME, which is a descriptive name of the geographical unit;
variable, which is the Census variable represented in the row; and
value, which is the value of the variable for that unit.
By default, tidycensus functions return tidy data frames in which rows represent unit-variable combinations; for a wide data frame with Census variable names in the columns, set output = "wide" in the function call.
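To make the difference between the two shapes concrete, the sketch below retypes the first three rows of the output above by hand and reshapes them with tidyr::pivot_wider. It is only an offline illustration of the layout, assumed to match what output = "wide" returns; in practice you would pass that argument to get_decennial directly.

```r
library(tidyr)

# First three rows of the tidy result shown above, typed in by hand
tidy_rent <- data.frame(
  GEOID    = c("01", "02", "04"),
  NAME     = c("Alabama", "Alaska", "Arizona"),
  variable = "H043A001",
  value    = c(325, 559, 438),
  stringsAsFactors = FALSE
)

# One row per state, one column per Census variable
wide_rent <- pivot_wider(tidy_rent, names_from = variable, values_from = value)

names(wide_rent)
```

With several variables requested, each would get its own column, which is convenient for computing ratios or differences between variables within a row.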
As the function has returned a tidy object, we can visualize it quickly with ggplot2:
m90 %>%
  ggplot(aes(x = value, y = reorder(NAME, value))) +
  geom_point()
Getting variables from the Census or ACS requires knowing the variable ID, and there are thousands of these IDs across the different Census files. To rapidly search for variables, use the load_variables function. The function takes two required arguments: the year of the Census or endyear of the ACS sample, and the dataset, one of "sf1", "sf3", or "acs5". For ideal functionality, I recommend assigning the result of this function to a variable, setting cache = TRUE to store the result on your computer for future access, and using the View function in RStudio to interactively browse for variables.
v15 <- load_variables(2016, "acs5", cache = TRUE)
View(v15)
By filtering for “median age” I can quickly view the variable IDs that correspond to my query.
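The same search can also be done in code rather than in the viewer. The sketch below assumes the data frame returned by load_variables has label and concept columns describing each variable; search_variables is a hypothetical helper, written in base R so that it needs no extra packages.

```r
# Hypothetical helper: keep rows whose label or concept mentions `pattern`,
# ignoring case. Assumes `vars` has character columns `label` and `concept`.
search_variables <- function(vars, pattern) {
  hit <- grepl(pattern, vars$label, ignore.case = TRUE) |
    grepl(pattern, vars$concept, ignore.case = TRUE)
  vars[hit, , drop = FALSE]
}

# Usage, given the v15 object created above:
# search_variables(v15, "median age")
```

Because pattern is passed to grepl, it is treated as a regular expression, so a query such as "median (age|income)" would match either phrase.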
American Community Survey data differ from decennial Census data in that ACS data are based on an annual sample of approximately 3 million households, rather than a more complete enumeration of the US population. As a result, ACS data points are estimates characterized by a margin of error. tidycensus will always return the estimate and margin of error together for any requested variables, so when requesting ACS data with tidycensus, it is not necessary to specify the "E" or "M" suffix for a variable name. Let’s fetch median household income data from the 2011-2015 ACS for counties in Vermont; the endyear is not necessary here as the function defaults to 2015.
vt <- get_acs(geography = "county", variables = c(medincome = "B19013_001"), state = "VT")
vt
## # A tibble: 14 x 5
##    GEOID NAME                       variable  estimate   moe
##    <chr> <chr>                      <chr>        <dbl> <dbl>
##  1 50001 Addison County, Vermont    medincome    61020  2194
##  2 50003 Bennington County, Vermont medincome    51489  3350
##  3 50005 Caledonia County, Vermont  medincome    46931  1876
##  4 50007 Chittenden County, Vermont medincome    66414  2090
##  5 50009 Essex County, Vermont      medincome    39467  2540
##  6 50011 Franklin County, Vermont   medincome    58884  2002
##  7 50013 Grand Isle County, Vermont medincome    64295  2932
##  8 50015 Lamoille County, Vermont   medincome    53316  4047
##  9 50017 Orange County, Vermont     medincome    54263  1743
## 10 50019 Orleans County, Vermont    medincome    43959  2047
## 11 50021 Rutland County, Vermont    medincome    50029  1717
## 12 50023 Washington County, Vermont medincome    58171  1989
## 13 50025 Windham County, Vermont    medincome    50917  1775
## 14 50027 Windsor County, Vermont    medincome    54763  2123
The output is similar to a call to get_decennial, but instead of a value column, get_acs returns estimate and moe columns for the ACS estimate and margin of error, respectively. moe represents the default 90 percent confidence level around the estimate; this can be changed to 95 or 99 percent with the moe_level parameter in get_acs if desired.
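The relationship between confidence levels can be sketched with the standard conversion for ACS margins of error: divide by the z-score of the current level and multiply by the z-score of the target level. The z-scores below (1.645, 1.960, 2.576) are the conventional values; rescale_moe is a hypothetical helper, and whether tidycensus rounds its results in the same way is not confirmed here.

```r
# Hypothetical helper: convert a margin of error between confidence levels.
# Assumes the conventional z-scores for 90, 95 and 99 percent.
rescale_moe <- function(moe, from = 90, to = 95) {
  z <- c("90" = 1.645, "95" = 1.960, "99" = 2.576)
  stopifnot(as.character(from) %in% names(z), as.character(to) %in% names(z))
  moe * z[[as.character(to)]] / z[[as.character(from)]]
}

# Addison County's 90 percent margin of error from the table above,
# widened to the 95 percent level
rescale_moe(2194, from = 90, to = 95)
```

In practice, passing moe_level = 95 to get_acs requests the wider interval directly, so a helper like this is mainly useful for understanding how the levels relate.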
As we have the margin of error, we can visualize the uncertainty around the estimate:
vt %>%
  mutate(NAME = gsub(" County, Vermont", "", NAME)) %>%
  ggplot(aes(x = estimate, y = reorder(NAME, estimate))) +
  geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) +
  geom_point(color = "red", size = 3) +
  labs(title = "Household income by county in Vermont",
       subtitle = "2012-2016 American Community Survey",
       y = "",
       x = "ACS estimate (bars represent margin of error)")