Multivariate data exploration

GEOG 30323

February 13, 2024

Why visualize data?

The greatest value of a picture is when it forces us to notice what we never expected to see.

  • Tukey (1977) quoted in Yau (2013)

Exploring data visually

Source: Yau, Data Points p. 137

Our schedule:

  • Current activities: data exploration through visualization with common chart types
  • Weeks 9-12: deep dive into data visualization
    • More complex chart types
    • How to customize your seaborn plots
    • Best practices in data visualization
    • Interactive web-based graphics
    • Maps!

Exploratory chart types

  • Comparing categories: bar chart, dot plot
  • Part-to-whole: pie chart
  • Change over time: line chart
  • Connections and relationships: scatter plot

Many, many more in these categories - these are just our focus for today!

Python and the web

  • A brief aside: with Python, data on the web is at your fingertips (our topic for Week 8)
  • This week, you will get a preview: pandas can read a CSV file directly from a URL

import pandas as pd

mx_csv = "http://personal.tcu.edu/kylewalker/mexico.csv"
mx = pd.read_csv(mx_csv)
mx.head()

Comparing categories

How about sorting our data?

mx_sorted = mx.sort_values(by = 'gdp08', ascending = False)
mx_sorted.head()
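
To make the behavior of sort_values concrete, here is a minimal sketch on a made-up table. The DataFrame `toy` and its values are hypothetical and are not taken from mexico.csv.

```python
import pandas as pd

# Hypothetical data for illustration only (not from mexico.csv)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "value": [3.5, 9.1, 1.2, 6.4],
})

# Sort from largest to smallest; sort_values returns a new DataFrame
# and leaves `toy` itself unchanged
toy_sorted = toy.sort_values(by="value", ascending=False)

# Rows keep their original index labels after sorting;
# reset_index(drop=True) renumbers them from 0 if that is more convenient
toy_sorted = toy_sorted.reset_index(drop=True)

print(toy_sorted)
```

Because the result is a new object, the sorted and unsorted versions can be kept side by side, as with mx and mx_sorted.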

Bar charts

Source: FiveThirtyEight.com

Bar charts

  • Length or height of bars proportional to data values, allowing for comparisons between categories
  • The value axis of bar charts must start at zero!!!
  • Recommendation: sort your data values for ease of interpretation
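
A rough way to see why the zero origin matters: the sketch below compares how long two bars are drawn relative to each other when the value axis starts at zero versus at a higher baseline. The function name `bar_length_ratio` and the numbers are hypothetical, chosen only for illustration.

```python
def bar_length_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar lengths for values b and a
    when the value axis starts at `baseline`."""
    return (b - baseline) / (a - baseline)

# Hypothetical values: b is 10% larger than a
a, b = 50.0, 55.0

zero_origin = bar_length_ratio(a, b, baseline=0.0)   # axis starts at zero
truncated = bar_length_ratio(a, b, baseline=45.0)    # axis starts at 45

print(f"Zero origin:  bar b is {zero_origin:.2f}x as long as bar a")
print(f"Origin at 45: bar b is {truncated:.2f}x as long as bar a")
```

With a zero origin the bar lengths keep the same ratio as the data; with a truncated axis the visual difference is exaggerated.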

Bar chart with non-zero origin

Source: Fox News via FlowingData.com

Bar charts in Python

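import seaborn as sns
sns.set_theme(style = "darkgrid")  # sns.set() is an older alias for set_theme()

# pandas' built-in plotting: one vertical bar per row of mx
mx.plot(x = 'name', y = 'gdp08', kind = 'bar')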

Bar charts in seaborn

sns.barplot(x = 'gdp08', y = 'name', data = mx_sorted)

Dot plots

Source: FiveThirtyEight.com

Dot plots

  • Can be preferable to bar charts: values are read from a point's position along the axis rather than from the length of a bar
  • As a result, a zero origin is not strictly necessary (though consider the context)
  • Sorted data also preferable for dot plots

Dot plots in seaborn

sns.stripplot(x = 'gdp08', y = 'name', data = mx_sorted)

Part-to-whole

  • Categories in relationship to the entire population of values
  • Examples: pie chart, waffle chart, 100% bar chart, tree map
  • Must sum to 100%!
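
The "must sum to 100%" rule can be checked before plotting. The sketch below converts made-up category counts to percentage shares and verifies the total; the category names, counts, and the helper `sums_to_whole` are hypothetical.

```python
# Hypothetical category counts (illustration only)
counts = {"group_a": 120, "group_b": 300, "group_c": 580}

# Convert each count to its share of the total, in percent
total = sum(counts.values())
shares = {name: 100 * n / total for name, n in counts.items()}

def sums_to_whole(percentages, tol=1e-6):
    """True if the percentages add up to 100 (within a small tolerance)."""
    return abs(sum(percentages) - 100) < tol

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")

print("Suitable for a part-to-whole chart:", sums_to_whole(shares.values()))
```

Percentages that come from a question where respondents can pick several answers can add up to more than 100% and would fail this check.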

Pie charts in Python

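# Select the Zacatecas row, drop the columns that should not become slices,
# and squeeze the one-row DataFrame into a Series
zac = mx[mx['name'] == 'Zacatecas'].drop(['name', 'FID', 'gdp08', 'mus09'], axis = 1).squeeze()
zac.name = 'Zacatecas'  # the Series name is used to label the plot
zac.plot(kind = 'pie', figsize = (6, 6))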

Problems with pie charts

Source: Fox Chicago via FlowingData.com

Problems with pie charts

Source: Data to Viz

Line charts

Source: FiveThirtyEight.com

Line charts in seaborn

dfw = pd.read_csv('http://personal.tcu.edu/kylewalker/data/pct_college.csv')

sns.lineplot(x = "year", y = "pct_college", 
             hue = "county", data = dfw)

Scatter plots

  • Question: how do the values in two columns covary?
  • Scatter plot: each observation represented by a point; position along x axis dictated by one column value; position along y axis dictated by other column value
  • Regression line: visual representation of estimated statistical relationship between X and Y
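
As a minimal sketch of what a straight regression line represents, the code below computes an ordinary least-squares slope and intercept by hand. The function `fit_line` and the data points are hypothetical; in practice a plotting library does this fitting for you.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations (illustration only)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

The slope estimates how much Y changes, on average, for a one-unit increase in X.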

Scatter plots

Source: FiveThirtyEight.com

Scatter plots in seaborn

sns.scatterplot(x = "mus09", y = "pri10", data = mx)

Scatter plots in seaborn

  • Scatter plots are also available through the lmplot and regplot functions, which add a fitted regression line by default

sns.lmplot(data = mx, x = 'mus09', y = 'pri10')

Correlation

  • Correlation coefficient: statistical measure of how strongly two variables covary linearly; ranges from -1 (perfect negative correlation) through 0 (no linear relationship) to +1 (perfect positive correlation)
  • In pandas: .corr(), which computes the Pearson coefficient by default
  • Beware of spurious correlations! http://tylervigen.com/spurious-correlations

mx['mus09'].corr(mx['pri10'])

0.41639990565936902 # the result
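
For intuition about what .corr() reports, here is a minimal sketch of the Pearson correlation coefficient computed by hand. The function `pearson_r` and the sample values are hypothetical and unrelated to the mx data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-deviation of x and y, scaled by
    the spread of each variable so the result lies in [-1, 1]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical samples (illustration only)
rising = pearson_r([1, 2, 3], [2, 4, 6])    # y increases exactly with x
falling = pearson_r([1, 2, 3], [6, 4, 2])   # y decreases exactly with x

print(rising, falling)
```

Values near the extremes indicate a tight linear relationship; values near 0 indicate a weak or absent one.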