Modules, packages, and EDA in Python

GEOG 30323

February 6, 2024

Time for data!

Source: bigdatapix.tumblr.com

The data analysis process

Source: Wickham and Grolemund, R for Data Science

Exploratory data analysis

“Detective work” to summarize and explore datasets

Includes:

Data acquisition and input
Data cleaning and wrangling (“tidying”)
Data transformation and summarization
Data visualization

Your core Python tools for EDA: NumPy, pandas, and seaborn/matplotlib

Modules and packages

Module: file containing variables, functions, etc. that can be imported into a Python session with the import statement
Package: directory of modules that perform similar tasks (e.g. data visualization, statistics, etc.)
Thousands upon thousands of Python packages available - that do just about anything!

Built-in packages

Many modules and packages are included in the standard library (stdlib) that ships with Python
Popular modules: re for regular expressions; os for operating system functions; random for random-number generation; and many more. Full list: https://docs.python.org/3/library/
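A minimal sketch of the three modules named above, in case a concrete call helps. The "data" folder name is a hypothetical placeholder, not something from the course materials.

```python
import os
import random
import re

# re: pull every run of digits out of a string
course = "GEOG 30323"
digits = re.findall(r"\d+", course)

# os: build a file path that works on any operating system
# ("data" is a hypothetical folder name used only for illustration)
csv_path = os.path.join("data", "grad_rates.csv")

# random: seed the generator so the draw is reproducible, then roll a die
random.seed(42)
roll = random.randint(1, 6)

print(digits, csv_path, roll)
```

None of these require installation; they are available as soon as Python is.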

The PyData ecosystem

Source: Jake VanderPlas, SciPy 2015 Keynote

NumPy

Extension to Python; the core Python package for numerical computing
Standard import: import numpy as np
Data structure: the NumPy array. Similar to a list, but with more methods, a single data type for all elements, and support for multiple dimensions

import numpy as np

y = np.array([[2, 4, 6, 8, 10, 12], 
             [1, 3, 5, 7, 9, 11], 
             [10, 12, 14, 18, 22, 14], 
             [9, 3, 3, 3, 3, 1]])
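To see what the array gives you beyond a list, here is a short sketch using the same y array. The variable names other than y are illustrative choices.

```python
import numpy as np

y = np.array([[2, 4, 6, 8, 10, 12],
              [1, 3, 5, 7, 9, 11],
              [10, 12, 14, 18, 22, 14],
              [9, 3, 3, 3, 3, 1]])

# shape is (number of rows, number of columns)
print(y.shape)

# Index with [row, column]; counting starts at 0
first_row_third_col = y[0, 2]

# Methods can work along an axis: axis=1 sums across each row
row_sums = y.sum(axis=1)

# Arithmetic applies to every element at once, with no loop needed
doubled = y * 2

print(first_row_third_col, row_sums)
```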

Pandas

Built on top of NumPy; adds support for table-like data structures in Python
Standard import: import pandas as pd
Sequences of data are stored as Series objects, which collectively form DataFrames

import pandas as pd

df = pd.DataFrame(y, columns = ['x' + str(num) for num in range(1, 7)])
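A quick sketch of how Series and DataFrames relate, rebuilding the df from above so the block runs on its own. The names col_names and x1 are illustrative.

```python
import numpy as np
import pandas as pd

y = np.array([[2, 4, 6, 8, 10, 12],
              [1, 3, 5, 7, 9, 11],
              [10, 12, 14, 18, 22, 14],
              [9, 3, 3, 3, 3, 1]])

# The list comprehension generates the column names 'x1' through 'x6'
df = pd.DataFrame(y, columns=['x' + str(num) for num in range(1, 7)])

col_names = list(df.columns)

# Pulling out a single column returns a Series
x1 = df['x1']

print(col_names)
print(type(x1))
print(df.shape)
```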

The pandas DataFrame

Commonly, DataFrames are created by reading in external data, like CSV files
Download link: https://personal.tcu.edu/kylewalker/data/grad_rates.csv

# To read in CSV files, we use the pd.read_csv function 
grad = pd.read_csv("grad_rates.csv")

Tutorial: working with external data in Colab

The pandas DataFrame

Each observation forms a row, defined by an index; attributes of those observations are found in the columns of the DataFrame

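Columns are accessible as indices, e.g. grad['rate'], or as attributes of the data frame, e.g. grad.rate (attribute access works only when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method)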
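The sketch below shows both access styles on a small stand-in data frame. Only the rate column name comes from the slides; the school column and all values are made up for illustration and are not the contents of grad_rates.csv.

```python
import pandas as pd

# Hypothetical stand-in for the grad data (values invented for illustration)
grad_demo = pd.DataFrame({
    "school": ["A", "B", "C"],
    "rate": [55.0, 72.5, 90.0],
})

# Two ways to get the same column
by_bracket = grad_demo["rate"]
by_attribute = grad_demo.rate
same_column = by_bracket.equals(by_attribute)

# Rows are looked up by index label; the default index is 0, 1, 2, ...
second_rate = grad_demo.loc[1, "rate"]

print(same_column, second_rate)
```

Bracket access always works, so it is the safer habit when column names contain spaces or punctuation.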

The Python namespace

When you declare variables, define functions, import modules, etc., you are adding objects to the Python namespace
To remove objects from the Python namespace, use the del statement
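A small sketch of adding a name to the namespace and removing it with del. The variable names are illustrative.

```python
# Declaring a variable adds the name x to the namespace
x = 10
before = x * 2

# del removes the name; using x afterwards raises a NameError
del x

try:
    x
    still_defined = True
except NameError:
    still_defined = False

print(before, still_defined)
```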

Imports and the namespace

Imported modules can be referenced in multiple ways:

# All of these call the same function; they differ only in which names are added to the namespace

import pandas
pandas.read_csv("grad_rates.csv")

import pandas as pd
pd.read_csv("grad_rates.csv")

from pandas import *
read_csv("grad_rates.csv")
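One way to confirm that the import styles reach the same object is to compare them directly. The sketch below swaps the wildcard import for a named import (from pandas import read_csv), which adds just one name to the namespace instead of every public name in pandas; this is a common alternative, not something the slides prescribe.

```python
import pandas
import pandas as pd
from pandas import read_csv

# All three names point at the very same function object
same_function = (pandas.read_csv is pd.read_csv) and (pd.read_csv is read_csv)

print(same_function)
```

Wildcard imports can silently overwrite names you have already defined, which is one reason the import pandas as pd convention is so widely used.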

Levels of measurement

Nominal: qualitative, descriptive, categories
Ordinal: ordering or ranking; however, no information about distance between ranks
Interval: additive; no natural zero (zero is an arbitrary point on the scale, not an absence of the quantity, e.g. 0 degrees Celsius)
Ratio: multiplicative; natural zero (zero means an absence of the quantity, e.g. a population of 0)

Make sure you know your column types (dtypes) and levels of measurement before doing analysis!
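A sketch of checking dtypes against levels of measurement. The survey data frame, its column names, and its values are all hypothetical and exist only to give one column per level.

```python
import pandas as pd

# Hypothetical data: one column for each level of measurement
survey = pd.DataFrame({
    "region": ["North", "South", "South"],   # nominal
    "rating": ["low", "high", "medium"],     # ordinal
    "temp_c": [12.5, 30.1, 25.0],            # interval
    "population": [1200, 560, 8900],         # ratio
})

# Text columns carry no order by default; an ordered categorical
# tells pandas that low < medium < high
survey["rating"] = pd.Categorical(
    survey["rating"], categories=["low", "medium", "high"], ordered=True
)

# dtypes lists the storage type of every column
print(survey.dtypes)

# Ordering now makes sense for the ordinal column
top_rating = survey["rating"].max()
print(top_rating)
```

Note that dtypes tell you how values are stored, not what they mean: an integer column of ZIP codes is still nominal.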

Measures of central tendency

Mode: the most frequently occurring value in a distribution
Median: the middle value in a sorted distribution (50 percent of observations above and below)
Mean: the arithmetic average of a distribution

The mean of a sample (\(\overline{x}\)) is calculated as follows:

\[\overline{x} = \dfrac{x_1 + x_2 + ... + x_n}{n}\]

where \(n\) is the number of elements in the sample.
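The formula can be checked against pandas on a tiny sample. The five values below are invented for illustration.

```python
import pandas as pd

# Hypothetical sample of five values
rates = pd.Series([40, 55, 55, 60, 90])

# Mean by the formula: sum of the values divided by n
mean_by_hand = rates.sum() / len(rates)

# The same measures via pandas methods
mean_value = rates.mean()
median_value = rates.median()

# mode() returns a Series because a distribution can have more than one mode
modes = rates.mode().tolist()

print(mean_by_hand, mean_value, median_value, modes)
```

Here the mean (60) sits above the median (55) because the single high value of 90 pulls it upward.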

Measures of dispersion

Range: difference between maximum and minimum values in a distribution
Interquartile range: difference between the 75th percentile and the 25th percentile of a distribution
Variance and standard deviation
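A sketch of these measures on a small invented sample. One detail worth knowing: pandas computes the sample variance (dividing by n - 1) by default, while NumPy's var and std divide by n unless you pass ddof=1.

```python
import pandas as pd

# Hypothetical sample of five values
rates = pd.Series([40, 55, 55, 60, 90])

# Range: maximum minus minimum
value_range = rates.max() - rates.min()

# Interquartile range: 75th percentile minus 25th percentile
q1 = rates.quantile(0.25)
q3 = rates.quantile(0.75)
iqr = q3 - q1

# Sample variance and standard deviation (pandas default, ddof=1)
variance = rates.var()
std_dev = rates.std()

print(value_range, iqr, variance, std_dev)
```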

Standard deviation

Computed as the square root of the variance; denoted by \(\sigma\) for a population and \(s\) for a sample.
Offers a standardized way to discuss the spread of a distribution. For example, in a normally distributed sample:
- About 68 percent of the values will be within one standard deviation of the mean
- About 95 percent of the values will be within two standard deviations of the mean
- About 99.7 percent of the values will be within three standard deviations of the mean
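These percentages can be checked by simulation. The sketch below draws values from a normal distribution; the mean of 50, standard deviation of 10, sample size, and seed are arbitrary choices for illustration.

```python
import numpy as np

# Seeded generator so the simulation is reproducible
rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=100_000)

mean = sample.mean()
sd = sample.std()

# Share of values within k standard deviations of the mean
within = {}
for k in (1, 2, 3):
    within[k] = np.mean(np.abs(sample - mean) <= k * sd)
    print(k, round(within[k], 3))
```

The shares should land close to 0.68, 0.95, and 0.997, with small differences from run to run if you change the seed.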

Descriptive statistics in pandas

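Descriptive stats are available in pandas as data frame methods, e.g. grad.mean(), grad.std(); if the data frame contains text columns, pass numeric_only=True (e.g. grad.mean(numeric_only=True)) to avoid an error in recent pandas versions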
Calling .describe() will give you back a number of important descriptive stats at once

grad.describe()
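To show what .describe() returns without needing the CSV, here is a sketch on a small stand-in data frame. Only the rate column name comes from the slides; the school column and all values are invented.

```python
import pandas as pd

# Hypothetical stand-in for the grad data (values invented for illustration)
grad_demo = pd.DataFrame({
    "school": ["A", "B", "C", "D", "E"],
    "rate": [40, 55, 55, 60, 90],
})

# describe() summarizes the numeric columns by default
summary = grad_demo.describe()
print(summary)

# Individual statistics can be pulled out by row and column label
mean_rate = summary.loc["mean", "rate"]

# numeric_only=True skips the text column when averaging the whole data frame
means = grad_demo.mean(numeric_only=True)
print(means)
```

The rows of the summary are count, mean, std, min, the 25/50/75 percent quantiles, and max.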

Exploratory visualization

Often, when exploring a dataset, you’ll want to use graphical representations of your data to help reveal insights/trends
Visualization: Graphical representation of data

Visualization in Python

Core visualization package in Python: matplotlib
seaborn: extension to matplotlib to make your graphics look nicer! Standard import: import seaborn as sns.

Histograms

Histogram: graphical representation of a frequency distribution
Observations are organized into bins, and plotted with values along the x-axis and the number of observations in each bin along the y-axis
Normal distribution: histogram is approximately symmetrical (a “bell curve”)
Histograms are built into pandas
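The binning step behind a histogram can be seen directly with NumPy's np.histogram, which returns the count in each bin and the bin edges. This is a sketch of the idea on invented values, not a description of how any particular plotting function is implemented; the choice of five bins from 40 to 90 is arbitrary.

```python
import numpy as np

# Hypothetical sample of five values
values = np.array([40, 55, 55, 60, 90])

# Five equal-width bins spanning 40 to 90
counts, edges = np.histogram(values, bins=5, range=(40, 90))

# Each bin includes its left edge; only the last bin includes its right edge
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```

A histogram plot draws one bar per bin, with the bar height equal to that bin's count.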

Example histogram (pandas method)

import seaborn as sns
sns.set_style("darkgrid")

grad.rate.hist()

Example histogram (seaborn function)

# Try the `bins` argument to modify the plot appearance
sns.histplot(data = grad, x = "rate")

Density plots

Smooth representations of your data can be produced with kernel density plots
Accessible from both pandas and seaborn

# `fill` replaces the deprecated `shade` argument in recent seaborn versions
sns.kdeplot(data = grad, x = "rate", fill = True)

Box plots

Also termed “box and whisker plots” - alternative way to show distribution of values graphically

sns.boxplot(data = grad, y = "rate", color = "green")

Anatomy of a box plot

Dots beyond the whiskers: outliers
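A sketch of how outliers are commonly identified in a box plot, assuming the usual convention that whiskers extend at most 1.5 times the interquartile range beyond the box (the default in matplotlib and seaborn, though it can be changed). The values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample of five values
rates = pd.Series([40, 55, 55, 60, 90])

# The box spans the 25th to 75th percentiles
q1 = rates.quantile(0.25)
q3 = rates.quantile(0.75)
iqr = q3 - q1

# Assumed convention: points more than 1.5 * IQR beyond the box are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = rates[(rates < lower_fence) | (rates > upper_fence)]

print(lower_fence, upper_fence, outliers.tolist())
```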

Violin plots

Combinations of box plots and kernel density plots

sns.violinplot(data = grad, x = "rate", color = "cyan")