class: center, middle, inverse, title-slide

# Exploring US Census data in R with tidycensus
## ISDS R Users Group Webinar
### Kyle Walker, TCU
### February 19, 2018

---

## Follow along!

Materials for the webinar are found at https://github.com/walkerke/isds-webinar, and include:

* A link to the webinar slides (https://walkerke.github.io/isds-webinar)
* [The R Markdown source code (which you can run in RStudio)](https://raw.githubusercontent.com/walkerke/isds-webinar/master/index.Rmd)

---

## About me

* Work: professor at TCU/spatial data science consultant
* Research: urban geography, spatial demography, open data science
* Software: tidycensus, tigris, idbr
* Forthcoming book: _Analyzing the US Census with R_

---

## Disclaimer

This webinar uses Census Bureau data but is not endorsed or certified by the Census Bureau.

---

## What we'll cover

* Census data: the basics
* Acquiring data with tidycensus
* Margins of error in tidycensus
* Mapping with tidycensus and ggplot2

---

## Census data: tables and features

Work with US Census data commonly includes two components:

* __Data tables__ obtained from the decennial Census, American Community Survey, or other sources
* __Geographic features__ obtained from the Census's TIGER/Line database

---

## Census tables: American FactFinder

<img src=img/factfinder.PNG style="width: 700px">

---

## Census geography: TIGER/Line shapefiles

<img src=img/dropdown.PNG style="width: 400px">

---

## tidycensus: get Census data in R

* R package first released in mid-2017
* Allows R users to obtain decennial Census and ACS data pre-formatted for use with tidyverse tools (dplyr, ggplot2, etc.)
* Optionally returns geographic data as simple feature geometry for common Census geographies

---

## What tidycensus can do

```r
library(tidycensus)
library(tidyverse)  # ggplot2 here; dplyr and stringr on later slides

tx <- get_acs(geography = "county", state = "TX",
              variables = "B19013_001", geometry = TRUE)

ggplot(tx, aes(fill = estimate)) + geom_sf()
```

<img src="index_files/figure-html/unnamed-chunk-1-1.png" width="432" />

---

## tidycensus: the basics

* The `get_decennial()` and `get_acs()` functions give access to the decennial Census and American Community Survey, respectively
* Required arguments: `geography` and `variables`
* Default `year`: 2010 (decennial Census) and 2012-2016 (ACS)

---

## How tidycensus works

* tidycensus formats your arguments to make a request to the appropriate Census or ACS Application Programming Interface (API)
* A Census API key is required: obtain one from https://api.census.gov/data/key_signup.html and set it with:

```r
census_api_key("YOUR KEY", install = TRUE)
```

* A __tibble__ (or sf tibble) containing the requested data is returned, in tidy (long) format by default

---

## Census geography

<img src=img/hierarchy.PNG style="width: 550px">

---

## Census variables

* __Variables__ in tidycensus are identified by their Census ID, e.g. `B19013_001`
* Entire __tables__ of variables can be requested with the `table` argument, e.g.
`table = "B19001"`
* Users can request multiple variables at a time, and set custom names with a named vector

---

## Searching for Census variables

* Variable definitions for a given dataset can be loaded into R with the `load_variables()` function, and explored in RStudio with `View()`

```r
v16 <- load_variables(2016, "acs5", cache = TRUE)

View(v16)
```

---

## "Tidy" Census data

```r
income <- get_acs(geography = "state", table = "B19001")

income
```

```
## # A tibble: 884 x 5
##    GEOID NAME    variable   estimate   moe
##    <chr> <chr>   <chr>         <dbl> <dbl>
##  1 01    Alabama B19001_001  1851061  5444
##  2 01    Alabama B19001_002   179345  2820
##  3 01    Alabama B19001_003   126347  2394
##  4 01    Alabama B19001_004   118846  2556
##  5 01    Alabama B19001_005   114040  2021
##  6 01    Alabama B19001_006   105954  2142
##  7 01    Alabama B19001_007   101692  1983
##  8 01    Alabama B19001_008    93118  2053
##  9 01    Alabama B19001_009    90043  1989
## 10 01    Alabama B19001_010    75980  1629
## # ... with 874 more rows
```

---

## "Wide" Census data

```r
inc_wide <- get_acs(geography = "state", table = "B19001", output = "wide")

inc_wide
```

```
## # A tibble: 52 x 36
##    GEOID NAME  B19001_001E B19001_001M B19001_002E B19001_002M B19001_003E
##    <chr> <chr>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
##  1 01    Alab~     1851061        5444      179345        2820      126347
##  2 02    Alas~      250235        1399        9216         576        8418
##  3 04    Ariz~     2448919        7179      183573        2815      125264
##  4 05    Arka~     1141480        3883      101637        2052       82519
##  5 06    Cali~    12807387       18852      728895        6027      629347
##  6 08    Colo~     2051616        4112      114990        2492       80302
##  7 09    Conn~     1354713        3509       75787        1961       50565
##  8 10    Dela~      348051        1568       20271         829       12613
##  9 11    Dist~      276546        1294       28179        1034       12205
## 10 12    Flor~     7393262       22885      556637        4436      398394
## # ...
with 42 more rows, and 29 more variables: B19001_003M <dbl>,
## #   B19001_004E <dbl>, B19001_004M <dbl>, B19001_005E <dbl>,
## #   B19001_005M <dbl>, B19001_006E <dbl>, B19001_006M <dbl>,
## #   B19001_007E <dbl>, B19001_007M <dbl>, B19001_008E <dbl>,
## #   B19001_008M <dbl>, B19001_009E <dbl>, B19001_009M <dbl>,
## #   B19001_010E <dbl>, B19001_010M <dbl>, B19001_011E <dbl>,
## #   B19001_011M <dbl>, B19001_012E <dbl>, B19001_012M <dbl>,
## #   B19001_013E <dbl>, B19001_013M <dbl>, B19001_014E <dbl>,
## #   B19001_014M <dbl>, B19001_015E <dbl>, B19001_015M <dbl>,
## #   B19001_016E <dbl>, B19001_016M <dbl>, B19001_017E <dbl>,
## #   B19001_017M <dbl>
```

---

class: middle, center, inverse

## Margins of error in tidycensus

---

## Margins of error in the ACS

* American Community Survey: _sample_ of approximately 3 million American households per year
* Geographies with populations of 65,000 or greater available in the 1-year ACS; all geographies (down to block groups) available in the 5-year ACS
* ACS __estimates__ associated with __margins of error__; default confidence level of 90 percent

---

## Margins of error in `get_acs()`

```r
az <- get_acs(geography = "county", variables = "B19013_001", state = "AZ")

head(az)
```

```
## # A tibble: 6 x 5
##   GEOID NAME                     variable   estimate   moe
##   <chr> <chr>                    <chr>         <dbl> <dbl>
## 1 04001 Apache County, Arizona   B19013_001    32460  1381
## 2 04003 Cochise County, Arizona  B19013_001    45383  1470
## 3 04005 Coconino County, Arizona B19013_001    51106  1578
## 4 04007 Gila County, Arizona     B19013_001    40593  1420
## 5 04009 Graham County, Arizona   B19013_001    47422  3914
## 6 04011 Greenlee County, Arizona B19013_001    51813  5787
```

---

## Visualizing margins of error

```r
az %>%
  mutate(NAME = str_replace(NAME, " County, Arizona", "")) %>%
  ggplot(aes(x = estimate, y = reorder(NAME, estimate))) +
  geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) +
  geom_point(color = "red", size = 3) +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "Household income
by county in Arizona",
       subtitle = "2012-2016 American Community Survey",
       y = "",
       x = "ACS estimate (bars represent margin of error)")
```

---

## Visualizing margins of error

<img src=img/az.png style="width: 800px">

---

## Derived margins of error in tidycensus

* Margins of error for derived estimates are available with the `moe_sum()`, `moe_prop()`, `moe_ratio()`, and `moe_product()` functions
* When possible, attempt to locate pre-computed derived estimates in the ACS before computing them yourself (e.g. in Data Profile or Subject Tables)

---

class: middle, center, inverse

## Mapping data obtained with tidycensus

---

## Census "geometry" in R

* __tigris__ package: enables users to obtain and load Census geography as R objects
* __sf__ package: next-generation model for representing vector spatial data in R as _simple features_

Example:

```r
library(tigris)

mi <- counties("MI", cb = TRUE, class = "sf")
```

---

## Census "geometry" in R

```r
plot(mi$geometry)
```

<img src="index_files/figure-html/unnamed-chunk-9-1.png" width="672" />

---

## Geometry in tidycensus

* For common geographies (`"state"`, `"county"`, `"tract"`, `"block group"`, `"block"`, and `"zcta"`) tidycensus can load simple feature geometry with the argument `geometry = TRUE`

Example:

```r
cook <- get_acs(geography = "tract", state = "IL", county = "Cook",
                variables = c(hhincome = "B19013_001"),
                geometry = TRUE)
```

---

## Geometry in tidycensus

```r
head(cook)
```

```
## Simple feature collection with 6 features and 5 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -87.69738 ymin: 41.72902 xmax: -87.62394 ymax: 42.00677
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
##         GEOID                                       NAME variable estimate
## 1 17031842400   Census Tract 8424, Cook County, Illinois hhincome    39652
## 2 17031843700   Census Tract 8437, Cook County, Illinois hhincome    99821
## 3 17031010502 Census Tract 105.02, Cook County, Illinois hhincome    29870
## 4 17031020602 Census Tract 206.02, Cook
County, Illinois hhincome    45349
## 5 17031030701 Census Tract 307.01, Cook County, Illinois hhincome    34671
## 6 17031031100    Census Tract 311, Cook County, Illinois hhincome    58947
##     moe                       geometry
## 1 11722 MULTIPOLYGON (((-87.63932 4...
## 2 21074 MULTIPOLYGON (((-87.69676 4...
## 3  4328 MULTIPOLYGON (((-87.66573 4...
## 4  7176 MULTIPOLYGON (((-87.69738 4...
## 5  8540 MULTIPOLYGON (((-87.66007 4...
## 6 11074 MULTIPOLYGON (((-87.66842 4...
```

---

## Mapping Census data: __ggplot2__

* The `geom_sf()` function in ggplot2 (development version) allows for the mapping of simple features objects

Example:

```r
library(viridis)

ggplot(cook, aes(fill = estimate, color = estimate)) +
  geom_sf() +
  theme_minimal() +
  coord_sf(crs = 26916, datum = NA) +
  scale_color_viridis(option = "cividis", guide = FALSE) +
  scale_fill_viridis(option = "cividis", labels = scales::dollar) +
  labs(title = "Median household income, 2012-2016 ACS",
       subtitle = "Census tracts in Cook County, Illinois",
       fill = "")
```

---

## Mapping Census data: __ggplot2__

<img src=img/cook.png style="width: 700px">

---

## Mapping Census data: __mapview__

```r
library(mapview)

mapview(cook, zcol = "estimate", legend = TRUE)
```
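---

## Mapping Census data: checking reliability first

Tract-level margins of error can be large relative to the estimates being mapped, as the `head(cook)` output suggests. The following is a minimal sketch of one way to flag shaky estimates before mapping them. It assumes the common convention of dividing a 90 percent ACS margin of error by 1.645 to approximate a standard error; the `cv_cutoff` value and the `unreliable` column name are hypothetical choices for illustration, not part of tidycensus.

```r
# Illustrative only: the six Cook County tracts printed by head(cook),
# typed in by hand so the sketch runs without a Census API key.
tracts <- data.frame(
  GEOID    = c("17031842400", "17031843700", "17031010502",
               "17031020602", "17031030701", "17031031100"),
  estimate = c(39652, 99821, 29870, 45349, 34671, 58947),
  moe      = c(11722, 21074, 4328, 7176, 8540, 11074),
  stringsAsFactors = FALSE
)

# ACS margins of error are published at the 90 percent confidence level,
# so dividing by 1.645 gives an approximate standard error.
tracts$se <- tracts$moe / 1.645

# Coefficient of variation: standard error relative to the estimate.
tracts$cv <- tracts$se / tracts$estimate

# Hypothetical cutoff (an assumption, not a Census Bureau or tidycensus rule).
cv_cutoff <- 0.15
tracts$unreliable <- tracts$cv > cv_cutoff

tracts[, c("GEOID", "estimate", "moe", "cv", "unreliable")]
```

The same two derived columns could be added to an sf object such as `cook`, and the flag used to filter or visually de-emphasize tracts on a map.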
---

## Interactive exploration: __leaflet__

<iframe src="https://walkerke.github.io/urbanslides/chicago/img/il_income.html" height = "450" width = "800" frameborder="0" scrolling="no"></iframe>

.footnote[[Click here for source code](https://gist.github.com/walkerke/2d534dc0dd638ccdbaeef1ca83f4fe86)]

---

## Interactive exploration: __plotly__

<iframe src="brushing.html" height = "500" width = "100%" frameborder="0" scrolling="no"></iframe>

.footnote[[Click here for source code](https://gist.github.com/walkerke/93bfe80bb7735aa6265a61013eaed3fa)]

---

## Thank you!

For more:

* Hire me as a consultant: <kwalkerdata@gmail.com>
* Take my [DataCamp](https://www.datacamp.com/) course on US Census data in R - coming this spring
* Join my mailing list: http://eepurl.com/cPGKZD
* Follow me on Twitter: [@kyle_e_walker](https://twitter.com/kyle_e_walker)

<style>
h1, h2, h3 {
  color: #386890;
}
a {
  color: #90b4d2;
}
.inverse {
  background-color: #386890;
}
.remark-code-line {
  font-size: 90%;
}
</style>