Visualizing the U.S. Hispanic population by state

by Kyle Walker
R rCharts

Welcome! This is the first post of my effort to document my foray into developing interactive data visualizations for use in my teaching. Hopefully these examples will be of use to some readers who are interested in creating their own visualizations.

I’ll first provide a bit of background. I’m a geography professor at Texas Christian University in Fort Worth, and started getting interested in data visualization when putting together materials for a course I taught in population geography last spring. Visuals are essential for social science instructors; however, it is not always easy to find publicly available images that are both effective and suitable for what you want to teach (I imagine many of you, like me, have gone on many a failed Google image search). So, I started looking into developing my own materials. I happened upon the incredible D3 JavaScript library by Mike Bostock and the stunning graphics from the Institute for Health Metrics and Evaluation and was immediately blown away. Interactive examples like this population pyramid can be very effective for conveying social science concepts.

As a relative newcomer to JavaScript, however, I looked into what other resources were available for creating these types of interactive visualizations, especially in languages I am more familiar with (R and Python, specifically). Thus far, I’ve been using the fantastic rCharts R package by Ramnath Vaidyanathan, which provides an R wrapper for several JavaScript charting libraries, and the googleVis package, which is an R interface to the Google Charts API and, among other things, allows R users to create Hans Rosling-style motion charts. I’ve also started looking into tools like plotly, which has both R and Python APIs, and hope to create my own D3 visualizations from scratch eventually.

When possible, my examples on this site will use open data/open source tools and will be available on GitHub, so that anyone interested can use and adapt these examples as they need.

My first example is a visualization of the composition of Hispanic populations by state in the U.S., for the ten states with the largest Hispanic populations in 2010. In my introductory geography course, I will soon be discussing shifts in the racial and ethnic composition of the U.S. Sometime between 2040 and 2050, non-Hispanic whites are projected to become a minority in the U.S., in large part due to continued growth of the Hispanic population. However, I find that many discussions of this demographic shift in the media tend to homogenize the Hispanic population, which is not what I want to convey to my students; as such, I wanted to find a way to visualize its diversity. I came across these interesting interactive maps from the Pew Research Center, and downloaded the data they made available to see what I might do with it.

The data made available by the Pew Research Center are in Excel format. There are many libraries for reading Excel data into R, but they often are not straightforward to use, so I first opened the file in Excel and saved it as a CSV for ease of use. In their original form, the data are not in a great format for visualization in R; as such, I needed to do some munging, with help from Hadley Wickham’s excellent packages.

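# Load the packages used in the code below
library(stringr)   # str_split
library(plyr)      # ldply, ddply, numcolwise
library(reshape2)  # melt
library(rCharts)   # dPlot

# First, download the Excel file from the Pew Research Center, and save it as a CSV in your working directory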


dat <- read.csv("all_counties_by_top_six_groups.csv")

keep <- seq(1, 25, 3)

dat <- dat[,keep]

nms <- c('Name', 'Total.Hisp', 'Mexican', 'Puerto.Rican', 'Cuban', 'Salvadoran', 'Dominican', 'Guatemalan', 'Other')

names(dat) <- nms

dat <- dat[-c(1:3),]

The above code simply cleans up the data to shape it into a nicely formatted data frame, and subsets it to get the population counts that we need. However, there are still some steps to take before the data can be visualized. Given the original data format, R has read in all my numeric data as factors, which wouldn’t let me make the kinds of manipulations I needed to do to aggregate the data by state. Such aggregation also required some string manipulation, so that I could identify which counts correspond to each state (given that the geographic identifiers in the data are presented as ‘County, State’). The code that follows cleans up the data even further and aggregates each numeric column by state.

dat <- cbind(dat, ldply(str_split(dat$Name, ", ")))

names(dat) <- c(nms, 'County', 'State')

convCols <- 2:9

dat[,convCols] <- apply(dat[,convCols], 2, function(x) as.numeric(as.character(gsub(",", "", x))))

sums <- ddply(dat, .(State), numcolwise(sum))

sorteddf <- sums[order(-sums$Total.Hisp),][1:10,]
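To see what these cleaning steps do in isolation, here is a minimal sketch in base R on a made-up three-row data frame. The county names and counts are invented for illustration and are not taken from the Pew file, and I use base R’s strsplit and aggregate in place of the stringr and plyr calls above, so treat it as an approximation of the same idea and not the code behind the chart.

```r
# Toy data mimicking the 'County, State' identifiers and comma-formatted counts
# (all values are hypothetical, not from the Pew data)
toy <- data.frame(
  Name = c("Alpha County, Texas", "Beta County, Texas", "Gamma County, New York"),
  Total.Hisp = c("1,200", "300", "2,500"),
  Mexican = c("1,000", "250", "400"),
  stringsAsFactors = TRUE  # reproduce the factor columns described above
)

# Split 'County, State' into two new columns
parts <- do.call(rbind, strsplit(as.character(toy$Name), ", ", fixed = TRUE))
toy$County <- parts[, 1]
toy$State <- parts[, 2]

# Strip thousands separators, then convert factor -> character -> numeric
num_cols <- c("Total.Hisp", "Mexican")
toy[num_cols] <- lapply(toy[num_cols], function(x) {
  as.numeric(gsub(",", "", as.character(x)))
})

# Sum every count column within each state, then sort by total, largest first
toy_sums <- aggregate(toy[num_cols], by = list(State = toy$State), FUN = sum)
toy_sums <- toy_sums[order(-toy_sums$Total.Hisp), ]
print(toy_sums)
```

Converting through as.character matters here: calling as.numeric directly on a factor returns the internal level codes, not the numbers that were printed in the file.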

I’ve now identified the 10 states with the largest Hispanic populations, and aggregated the different ancestry columns by state. I now needed to decide how to visualize these data. I elected to use the dimple D3 library, which is available through rCharts. My hope was to create something like this horizontal 100% chart, which would allow direct comparison of the Hispanic population composition of these states. Fortunately, rCharts makes this straightforward. I first created a new data frame that held percentages instead of raw counts, reshaped it into a suitable format, and called rCharts’ dPlot function to create the chart.

newdf <- data.frame(sorteddf$State)

vals <- c('Mexican', 'Puerto.Rican', 'Cuban', 'Salvadoran', 'Dominican', 'Guatemalan', 'Other')

for (v in vals) {
  newdf[[v]] <- round(((sorteddf[[v]] / sorteddf$Total.Hisp) * 100), 1)
}

names(newdf) <- c('State', vals)

df.melt <- melt(newdf, id.vars = 'State', variable.name = 'Ancestry', value.name = 'Share')

d1 <- dPlot(
  x = "Share", 
  y = "State", 
  groups = "Ancestry", 
  data = df.melt, 
  type = 'bar')

#Here, set the chart options to tell rCharts how to format the visualization  
d1$xAxis(type = "addPctAxis")
d1$yAxis(type = "addCategoryAxis", orderRule = "State")

d1$legend( x = 60, y = 10, width = 700, height = 20, horizontalAlign = "left", orderRule = "Ancestry")
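The percentage and reshaping steps are the part of this code that is easiest to get wrong, so here is a small base R sketch of the same idea on made-up numbers. The state labels and counts are hypothetical, and the wide-to-long step is written out by hand instead of calling melt, purely to show the shape of the data that a grouped bar chart expects.

```r
# Made-up counts for two hypothetical states (illustrative only)
counts <- data.frame(
  State = c("State A", "State B"),
  Total.Hisp = c(1000, 800),
  Mexican = c(750, 200),
  Other = c(250, 600)
)

groups <- c("Mexican", "Other")

# Convert each group's count to a percentage share of the state total
shares <- data.frame(State = counts$State)
for (g in groups) {
  shares[[g]] <- round(counts[[g]] / counts$Total.Hisp * 100, 1)
}

# Wide -> long: one row per State/Ancestry pair
long <- data.frame(
  State = rep(shares$State, times = length(groups)),
  Ancestry = rep(groups, each = nrow(shares)),
  Share = unlist(shares[groups], use.names = FALSE)
)
print(long)
```

Each state’s shares should sum to roughly 100 in the long table, which is a quick sanity check worth running before passing the data to a plotting function.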

Below, you can see the result:

The plot has accomplished what I hoped: it displays the considerable diversity of the Hispanic population across different states in the U.S. The key to the chart is its interactivity; while I could have produced a static visualization just like this, each component of the chart provides a tooltip on hover that gives specific information about its content. I now have a more interactive document that I can explore with my students.

As the chart reveals, Hispanics are generally of Mexican origin in several states, including my state of Texas, where 84% of Hispanics are of Mexican ancestry; this will be the frame of reference for my students. However, I can show students how in other parts of the country, such as my old home of New York, ‘Hispanic’ means something very different, as individuals of Mexican heritage only make up 13% of the state’s Hispanic population. It is also interesting to see how the ‘Other’ category varies by state. In some states (New York, New Jersey, New Mexico, Florida), this category is very large. The Pew Center report provides some additional information on this; for example, many Colombians, Hondurans, and Peruvians live in the Miami area, and Queens, NYC has a large Ecuadorian population. In New Mexico, the data reflect the Spanish and Native American heritage of many Hispanics in the state.

There are still some improvements that could be made; for example, in some browsers, the y-axis title is partially hidden, which I need to look into further. Also, in order to get the effects to work correctly, I had to modify the version of Dimple in the HTML to point to version 1.1.3 (rCharts is still on 1.1.1).

To create this chart on your computer, follow these steps:

  1. Visit the Pew Research Center site, download the Excel file they make available, and save it as a CSV in your working directory (don’t change the name, just the type).
  2. Be sure that you have the following R packages installed: stringr, plyr, rCharts, and reshape2. rCharts is not yet on CRAN, so you’ll need to install it from GitHub with the devtools package. I use the dev branch of rCharts, which has the latest updates; you can install this with the command devtools::install_github('ramnathv/rCharts', ref = 'dev') (older versions of devtools used the form devtools::install_github('rCharts', 'ramnathv', ref='dev')).
  3. Run the following command: source("https://raw.github.com/walkerke/teaching-with-datavis/master/hispanics-by-state.R")
  4. Type d1 in your console, and you’ll have your chart!
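
Before running step 3, a quick check like the following can confirm that the packages from step 2 are in place. This is only a convenience sketch and is not part of the script on GitHub; the variable names are my own.

```r
# Packages the script relies on
needed <- c("stringr", "plyr", "reshape2", "rCharts")

# requireNamespace() returns FALSE instead of raising an error when a package is absent
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Install before sourcing the script: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are available.")
}
```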

Alternatively, feel free to grab the code from GitHub and modify it as you wish.

I’d love to hear your feedback; you can send me an email at kyle.walker@tcu.edu, or connect with me on Twitter.

Thanks to: