Data wrangling

GEOG 30323

February 20, 2024

Data wrangling

In real-world data analysis, your data will likely:

- Have missing/possibly incorrect values
- Be in a format unsuitable for data analysis
- Be spread across multiple files, possibly of different types
- Need re-shaping or summarization to draw meaningful conclusions

Fortunately, pandas can help you with all of this!

Subsetting

Frequently, you’ll have way more data than you need!
Datasets can be reduced in size by indexing and subsetting
Let’s read in the colleges dataset as a demo

import pandas as pd

full_url = 'http://personal.tcu.edu/kylewalker/data/colleges.csv'
full_df = pd.read_csv(full_url, encoding = 'latin_1')
full_df.shape

By column name

Let’s drop most of the columns in the dataset with .filter()

cols_to_keep = ['INSTNM', 'STABBR', 'GRAD_DEBT_MDN_SUPP']
debt_df = full_df.filter(cols_to_keep)
debt_df.columns = ['name', 'state', 'debt']
debt_df.head()

By row position

Data frames can be sliced like lists and strings

# Retain row 0 up until but not including row 10
debt_df[0:10]
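
As a minimal sketch of position-based slicing, the example below uses a small made-up data frame (`toy` is hypothetical, standing in for `debt_df`). It shows that plain bracket slicing and the explicit `.iloc[]` indexer return the same rows.

```python
import pandas as pd

# Hypothetical toy data standing in for debt_df
toy = pd.DataFrame({'name': ['A', 'B', 'C', 'D', 'E'],
                    'state': ['TX', 'OK', 'TX', 'NM', 'LA'],
                    'debt': [10, 20, 30, 40, 50]})

# Rows 0, 1, and 2 (the stop position is excluded, as with lists)
first_three = toy[0:3]

# The explicit position-based equivalent
same_rows = toy.iloc[0:3]

# Negative positions count from the end, as with lists
last_two = toy.iloc[-2:]
```

`.iloc[]` makes it explicit that rows are selected by position rather than by label, which can help once a data frame has a non-default index.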

By row or column index

Selection by row or column index is available through .loc[] (note the square brackets)

example1 = debt_df.set_index('name')
example1.loc['Amridge University':'Alabama State University']
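
The sketch below illustrates label-based selection on a small made-up data frame (the institution names and values are hypothetical). Unlike position slices, label slices with `.loc[]` include both endpoints.

```python
import pandas as pd

# Hypothetical toy data; names and values are made up for illustration
toy = pd.DataFrame({'name': ['Alpha College', 'Beta University',
                             'Gamma Institute', 'Delta College'],
                    'state': ['TX', 'OK', 'NM', 'LA'],
                    'debt': [12000.0, 15000.0, 9000.0, 20000.0]}).set_index('name')

# Label slice: both endpoints are included
rows = toy.loc['Beta University':'Gamma Institute']

# A single cell: row label first, then column label
cell = toy.loc['Delta College', 'debt']

# Lists of labels select specific rows and columns
subset = toy.loc[['Alpha College', 'Delta College'], ['debt']]
```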

By value

Often, you’ll want to keep rows that have a certain column value, or exclude rows based on that value
The data frame method .query() will “query” your dataset based on an expression
Expressions use conditional operators; can be combined with & (and) and | (or)

By value

debt_df1 = debt_df.query("debt != 'PrivacySuppressed'")
debt_df1.head()

By value

tx_debt_df = debt_df.query('debt != "PrivacySuppressed" & state == "TX" ')
tx_debt_df.head()

# Alternatively, use an index-based method

# tx_debt_df = debt_df.loc[(debt_df['debt'] != 'PrivacySuppressed') & (debt_df['state'] == 'TX')]

By value

states = ['OK', 'NM', 'TX', 'LA']
# Use `in @states` to get values in the list
# @ operator allows for use of variables in the query
sw_debt_df = debt_df.query("debt != 'PrivacySuppressed' & state in @states")

sw_debt_df.head()
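
To make the link between `.query()` and index-based filtering concrete, here is a small sketch on made-up data (`toy` is hypothetical). It builds the same subset twice: once with a query expression and once with a boolean mask that uses `.isin()`.

```python
import pandas as pd

# Hypothetical toy data standing in for debt_df
toy = pd.DataFrame({'name': ['A', 'B', 'C', 'D', 'E'],
                    'state': ['TX', 'OK', 'CA', 'NM', 'TX'],
                    'debt': ['10000', 'PrivacySuppressed', '12000',
                             '9000', 'PrivacySuppressed']})
states = ['OK', 'NM', 'TX', 'LA']

# Query expression; @states refers to the Python list defined above
via_query = toy.query("debt != 'PrivacySuppressed' & state in @states")

# Equivalent boolean mask; each condition is wrapped in parentheses or a method call
mask = (toy['debt'] != 'PrivacySuppressed') & toy['state'].isin(states)
via_mask = toy.loc[mask]
```

Both approaches keep only rows A and D here; which one to use is largely a matter of readability.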

Creating new columns

New columns can be created based on specified values, or as derivatives of other columns, using mathematical operators or the .assign() method
Let’s demo with a simulated data frame:

import numpy as np
np.random.seed(1983)

df1 = pd.DataFrame({'col1': np.random.randint(1, 100, 10), 
                    'col2': np.random.randint(1, 100, 10), 
                    'col3': np.random.randint(1, 100, 10)})

Creating new columns

# With .assign()
df2 = df1.assign(col4 = df1.col1 + df1.col2)

# With index-based labeling
df2['col5'] = df2['col3'] / df2['col4']

df2.head()

`dtype` conversion

To do numerical analysis, our numeric data have to be stored as numbers!
To convert: use the .astype() method

sw_debt_num = sw_debt_df.assign(debtnum = sw_debt_df.debt.astype(float))

sw_debt_num.head()
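
`.astype(float)` raises an error if any value cannot be parsed as a number. As a hedged alternative (not the approach used on the slide), `pd.to_numeric()` with `errors = 'coerce'` converts unparseable values to `NaN` instead. The series below is made up for illustration.

```python
import pandas as pd

# Hypothetical raw values: numbers stored as text, plus a non-numeric flag
raw = pd.Series(['10000', 'PrivacySuppressed', '12500.5', None])

# Values that cannot be parsed become NaN rather than raising an error
converted = pd.to_numeric(raw, errors = 'coerce')
```

The trade-off is that coercion silently turns bad values into missing data, so it is worth counting the `NaN` values afterwards.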

Missing data

Commonly, not all of the data you need will be found in your dataset!
Possible solutions:
- Delete all rows that have missing data
- Fill in missing data with a specified value
- Interpolate missing values

Missing data

.dropna() method: delete all rows (or columns) that have any missing values (NaN in pandas)

sw_debt_clean = sw_debt_num.dropna()

sw_debt_clean.head()

Missing data

.fillna() method: fill in missing data with a specified value

# numeric_only = True skips the text columns when computing medians
sw_debt_fill = sw_debt_num.fillna(sw_debt_num.median(numeric_only = True))

sw_debt_fill.head()
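
The three options for missing data can be compared side by side on a small made-up series (the values are hypothetical). This sketch also shows `.interpolate()`, which by default fills gaps linearly between neighboring values.

```python
import pandas as pd
import numpy as np

# Hypothetical series with gaps
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

# Option 1: drop the missing values
dropped = s.dropna()

# Option 2: fill with a specified value (here, the median of the observed values)
filled = s.fillna(s.median())

# Option 3: interpolate linearly between neighboring observed values
interp = s.interpolate()
```

Interpolation only makes sense when the row order is meaningful (for example, a time series); for unordered records like the colleges data, dropping or filling is usually the more defensible choice.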

Method chaining

pandas data wrangling methods can be “chained” together to run an entire data wrangling workflow in a single expression

cols_to_keep = ['INSTNM', 'STABBR', 'GRAD_DEBT_MDN_SUPP']
states = ['OK', 'NM', 'TX', 'LA']

sw_debt_clean = (full_df
  .filter(cols_to_keep)
  .set_axis(['name', 'state', 'debt'], axis = 'columns')
  .query("debt != 'PrivacySuppressed' & state in @states")
  .assign(debtnum = lambda x: x.debt.astype(float))
  .dropna()
)
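
Inside a chain, the intermediate data frame has no name, which is why `.assign()` takes a `lambda`: the function receives the data frame as it exists at that step. A minimal sketch on made-up data (`toy` and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with numbers stored as text
toy = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                    'value': ['1', '2', '3', '4']})

result = (toy
  .query("group == 'b'")
  # x is the filtered data frame, not the original toy
  .assign(value_num = lambda x: x['value'].astype(float))
  # value_num exists only in the chain, so a lambda is needed to reach it
  .assign(doubled = lambda x: x['value_num'] * 2)
)
```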

Group-wise data analysis

Thus far, we’ve focused on characteristics of data within a particular group
Common question: how do characteristics vary by group?
In pandas: .groupby() method!

Split-apply-combine

Wickham (2011): the “split-apply-combine” model of data analysis

Process:

- Data are split by some characteristic into groups
- We apply a function to each of the groups
- The resultant data are combined back into a single dataset
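
The three steps can be sketched by hand on a small made-up data frame (the values are hypothetical) and compared with the one-line `.groupby()` equivalent. This is an illustration of the idea, not how pandas implements it internally.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'state': ['TX', 'TX', 'OK', 'OK', 'NM'],
                    'debtnum': [10.0, 20.0, 30.0, 50.0, 40.0]})

# Split: one data frame per state
pieces = {state: grp for state, grp in toy.groupby('state')}

# Apply: compute the mean within each piece
means = {state: grp['debtnum'].mean() for state, grp in pieces.items()}

# Combine: collect the results into a single Series
manual = pd.Series(means, name = 'debtnum')

# The same result in one step
shortcut = toy.groupby('state')['debtnum'].mean()
```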

`.groupby()` in `pandas`

sw_grouped = sw_debt_clean.groupby('state')

sw_grouped.debtnum.mean()

# Result

state
LA    15876.255319
NM    16237.466667
OK    17030.860759
TX    15009.426582
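
A grouped column can also be summarized with several functions at once using `.agg()`. The sketch below uses made-up data (`toy` is hypothetical), not the colleges dataset.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'state': ['TX', 'TX', 'OK', 'OK', 'NM'],
                    'debtnum': [10.0, 20.0, 30.0, 50.0, 40.0]})

# One row per state, one column per summary function
summary = toy.groupby('state')['debtnum'].agg(['mean', 'median', 'count'])
```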

Grouped visualization in `seaborn`

import seaborn as sns
sns.set(style = "darkgrid")

sns.boxplot(x = 'state', y = 'debtnum', data = sw_debt_clean)

Grouped visualization in `seaborn`

Faceting or small multiples: breaking down a plot by a grouping variable into multiple plots

grid = sns.FacetGrid(data = sw_debt_clean, col = 'state', col_wrap = 2)
grid.map(sns.kdeplot, 'debtnum')

Merging data

Commonly, you’ll have data in two - or multiple! - datasets that you’ll want to combine into one
Simulated data:

np.random.seed(123456)

m1 = pd.DataFrame({'type': ['a', 'b', 'c', 'd', 'e', 'f'], 
                  'ind1': np.random.randint(1, 100, 6), 
                  'ind2': np.random.randint(1, 100, 6)})

m2 = pd.DataFrame({'type': ['a', 'b', 'c', 'd', 'e', 'f'], 
                  'ind3': np.random.randint(1, 100, 6), 
                  'ind4': np.random.randint(1, 100, 6)})

The `.merge()` method in `pandas`

m3 = m1.merge(m2, on = 'type')

Types of merges in `pandas`

Options for merging (the how parameter): 'inner' (default), 'left', 'right', and 'outer'
Simulated data:

m4 = pd.DataFrame({'type': ['d', 'e', 'f', 'g', 'h', 'i'], 
                  'ind5': np.random.randint(1, 100, 6), 
                  'ind6': np.random.randint(1, 100, 6)})

Inner merges

m5 = m1.merge(m4, on = 'type', how = 'inner')

Left merges

m5 = m1.merge(m4, on = 'type', how = 'left')

Right merges

m5 = m1.merge(m4, on = 'type', how = 'right')

Outer merges

m5 = m1.merge(m4, on = 'type', how = 'outer')
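
One way to see how the four merge types differ is to count the rows each one returns. The sketch below rebuilds two small data frames with the same `type` keys as `m1` and `m4` (the other column values are placeholders), and uses the `indicator` parameter to label where each row came from.

```python
import pandas as pd

# Same keys as m1 and m4; the ind1 and ind5 values are placeholders
left = pd.DataFrame({'type': ['a', 'b', 'c', 'd', 'e', 'f'],
                     'ind1': range(6)})
right = pd.DataFrame({'type': ['d', 'e', 'f', 'g', 'h', 'i'],
                      'ind5': range(6)})

# Number of rows returned by each merge type
counts = {how: len(left.merge(right, on = 'type', how = how))
          for how in ['inner', 'left', 'right', 'outer']}

# indicator = True adds a _merge column: 'both', 'left_only', or 'right_only'
tagged = left.merge(right, on = 'type', how = 'outer', indicator = True)
```

Only `d`, `e`, and `f` appear in both data frames, so the inner merge keeps 3 rows, the left and right merges keep 6 each, and the outer merge keeps all 9 distinct keys. Rows without a match get `NaN` in the columns from the other data frame.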

The “shape” of data

Long (“tidy”) data:
- Each variable forms a column;
- Each observation forms a row;
- Each type of observational unit forms a table
Wide data: column headers represent values, not variable names
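
To illustrate the difference, here are the same four values stored in both shapes. The country names, years, and numbers are made up for illustration.

```python
import pandas as pd

# Wide: the column headers 'Brazil' and 'Mexico' are values of a country variable
wide = pd.DataFrame({'year': [2000, 2001],
                     'Brazil': [2.3, 2.2],
                     'Mexico': [2.7, 2.6]})

# Long: one column per variable, one row per country-year observation
long = pd.DataFrame({'year': [2000, 2001, 2000, 2001],
                     'country': ['Brazil', 'Brazil', 'Mexico', 'Mexico'],
                     'tfr': [2.3, 2.2, 2.7, 2.6]})
```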

Example: World Bank data

Long format:

from pandas_datareader import wb
countries = ['ZA', 'BR', 'US']
tfr = wb.download(indicator = 'SP.DYN.TFRT.IN', 
                    country = countries, start = 1960, 
                    end = 2019).reset_index()
tfr.head()

Long to wide

.pivot() method in pandas

tfr_wide = tfr.pivot(index = 'year', columns = 'country',
                    values = 'SP.DYN.TFRT.IN')

tfr_wide.head()

Plotting “wide” data

tfr_wide.plot()

Wide to long

pd.melt() function in pandas

tfr_long = pd.melt(tfr_wide.reset_index(), id_vars = 'year', 
                   var_name = 'country', value_name = 'tfr')

tfr_long.head()
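
Because the World Bank download needs a network connection, the round trip can also be sketched offline on a small made-up data frame (the values are hypothetical). Pivoting to wide and melting back recovers the same observations.

```python
import pandas as pd

# Hypothetical long-form data
long = pd.DataFrame({'year': [2000, 2001, 2000, 2001],
                     'country': ['Brazil', 'Brazil', 'Mexico', 'Mexico'],
                     'tfr': [2.3, 2.2, 2.7, 2.6]})

# Long to wide: one row per year, one column per country
wide = long.pivot(index = 'year', columns = 'country', values = 'tfr')

# Wide to long: reset_index() turns year back into a regular column first
back = pd.melt(wide.reset_index(), id_vars = 'year',
               var_name = 'country', value_name = 'tfr')
```

Note that `.pivot()` raises an error if any index/column pair appears more than once; `.pivot_table()` is the usual alternative when duplicates need to be aggregated.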

Plotting long-form data

tfr_long['year'] = tfr_long['year'].astype(int)
sns.lineplot(x = "year", y = "tfr",
            hue = "country", data = tfr_long)
