Modules, packages, and EDA in Python

GEOG 30323

February 6, 2024

Time for data!

Source: bigdatapix.tumblr.com

The data analysis process

Source: Wickham and Grolemund, R for Data Science

Exploratory data analysis

  • “Detective work” to summarize and explore datasets

Includes:

  • Data acquisition and input
  • Data cleaning and wrangling (“tidying”)
  • Data transformation and summarization
  • Data visualization

Your core Python tools for EDA: NumPy, pandas, and seaborn/matplotlib

Modules and packages

  • Module: file containing variables, functions, etc. that can be imported into a Python session with the import statement
  • Package: directory of modules that perform similar tasks (e.g. data visualization, statistics, etc.)
  • Thousands upon thousands of Python packages available - that do just about anything!

Built-in packages

  • Many packages are included in stdlib, the standard library that ships with Python
  • Popular modules: re for regular expressions; os for operating system functions; random for random-number generation; and many more. Full list: https://docs.python.org/3/library/
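
A minimal sketch of the three standard-library modules named above, with no installation needed. The example string and the file path are made up for illustration.

```python
import os
import random
import re

# re: pull every run of digits out of a string
numbers = re.findall(r"\d+", "GEOG 30323, Spring 2024")
print(numbers)

# os: build a file path in a platform-independent way
path = os.path.join("data", "grad_rates.csv")  # hypothetical location
filename = os.path.basename(path)
print(filename)

# random: seed the generator so the draw is reproducible
random.seed(42)
roll = random.randint(1, 6)  # integer between 1 and 6, inclusive
print(roll)
```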

The PyData ecosystem

Source: Jake VanderPlas, SciPy 2015 Keynote

NumPy

  • Extension to Python; the core Python package for numerical computing
  • Standard import: import numpy as np
  • Data structure: the NumPy array. Sort of like a list - but with more methods, and can be multidimensional
import numpy as np

y = np.array([[2, 4, 6, 8, 10, 12], 
             [1, 3, 5, 7, 9, 11], 
             [10, 12, 14, 18, 22, 14], 
             [9, 3, 3, 3, 3, 1]])
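
A short sketch of what makes an array different from a list, using the same values as y above: it knows its shape, it can be indexed by row and column, and it can summarize along either axis.

```python
import numpy as np

y = np.array([[2, 4, 6, 8, 10, 12],
              [1, 3, 5, 7, 9, 11],
              [10, 12, 14, 18, 22, 14],
              [9, 3, 3, 3, 3, 1]])

print(y.shape)  # (rows, columns)
print(y.ndim)   # number of dimensions
print(y[0, 2])  # first row, third column

# axis=1 summarizes across each row; axis=0 summarizes down each column
row_means = y.mean(axis=1)
col_sums = y.sum(axis=0)
print(row_means)
print(col_sums)
```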
            

Pandas

  • Built on top of NumPy; adds support for table-like data structures in Python
  • Standard import: import pandas as pd
  • Sequences of data are stored as Series objects, which collectively form DataFrames
import pandas as pd

df = pd.DataFrame(y, columns = ['x' + str(num) for num in range(1, 7)])
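
To see how Series and DataFrames relate, the sketch below rebuilds df from the same array and inspects it. Pulling out one column returns a Series.

```python
import numpy as np
import pandas as pd

y = np.array([[2, 4, 6, 8, 10, 12],
              [1, 3, 5, 7, 9, 11],
              [10, 12, 14, 18, 22, 14],
              [9, 3, 3, 3, 3, 1]])

df = pd.DataFrame(y, columns=["x" + str(num) for num in range(1, 7)])

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column labels x1 through x6
print(list(df.index))    # default row index starts at 0

# A single column is a Series; the DataFrame is a collection of them
x1 = df["x1"]
print(type(x1).__name__)
```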

The pandas DataFrame

# To read in CSV files, we use the pd.read_csv function 
grad = pd.read_csv("grad_rates.csv")

Tutorial: working with external data in Colab

The pandas DataFrame

  • Each observation forms a row, defined by an index; attributes of those observations are found in the columns of the DataFrame

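  • Columns are accessible by label with square brackets, e.g. grad['rate'], or as attributes of the DataFrame, e.g. grad.rate (attribute access works only when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method)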
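
A small sketch of both access styles. Since grad_rates.csv is not reproduced here, the DataFrame below is a stand-in with invented school names and rates; only the column name rate comes from the slides.

```python
import pandas as pd

# Hypothetical stand-in for the grad DataFrame (values are invented)
grad = pd.DataFrame({
    "school": ["School A", "School B", "School C"],
    "rate": [62.5, 71.0, 88.3],
})

by_label = grad["rate"]   # bracket access works for any column name
by_attribute = grad.rate  # attribute access needs a valid identifier
print(by_label.equals(by_attribute))

# A list of labels selects several columns and returns a DataFrame
subset = grad[["school", "rate"]]
print(type(subset).__name__)
```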

The Python namespace

  • When you declare variables, define functions, import modules, etc., you are adding objects to the Python namespace
  • To remove objects from the Python namespace, use the del statement
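
A minimal sketch of adding a name to the namespace and removing it again with del. The variable names are arbitrary.

```python
import math

radius = 2.0                  # adds the name "radius" to the namespace
area = math.pi * radius ** 2  # "math" was added by the import above

del radius                    # removes the name again

# Using a deleted name raises NameError
try:
    radius
    still_defined = True
except NameError:
    still_defined = False

print(still_defined)
```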

Imports and the namespace

  • Imported modules can be referenced in multiple ways:
# All three calls run the same function; only the name used to reach it differs

import pandas
pandas.read_csv("grad_rates.csv")

import pandas as pd
pd.read_csv("grad_rates.csv")

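# Wildcard import: works, but it copies every public pandas name into your
# namespace and can overwrite existing names, so it is generally discouraged
from pandas import *
read_csv("grad_rates.csv")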

Levels of measurement

  • Nominal: qualitative, descriptive, categories
  • Ordinal: ordering or ranking; however, no information about distance between ranks
  • Interval: additive; differences are meaningful, but there is no natural zero (zero is just another value on the scale, e.g. 0°F)
  • Ratio: multiplicative; natural zero (zero means an absence of a value)

Make sure you know your column types (dtypes) and levels of measurement before doing analysis!
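
One way to check column types is the dtypes attribute. The sketch below uses an invented four-column table, one column per level of measurement, and marks the ordinal column as an ordered categorical so that pandas respects its ranking. All column names and values are hypothetical.

```python
import pandas as pd

# Invented data: one column for each level of measurement
survey = pd.DataFrame({
    "region": ["North", "South", "South"],         # nominal
    "size_class": ["small", "large", "medium"],    # ordinal
    "temp_f": [32.0, 75.5, 68.0],                  # interval
    "population": [1200, 56000, 8300],             # ratio
})

# Tell pandas the ranking of the ordinal categories
survey["size_class"] = pd.Categorical(
    survey["size_class"],
    categories=["small", "medium", "large"],
    ordered=True,
)

print(survey.dtypes)

# With an ordered categorical, max() follows the ranking, not the alphabet
largest = survey["size_class"].max()
print(largest)
```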

Measures of central tendency

  • Mode: the most frequently occurring value in a distribution
  • Median: the middle value of a sorted distribution (50 percent of observations fall above it and 50 percent below)
  • Mean: the arithmetic average of a distribution

The mean of a sample (\(\overline{x}\)) is calculated as follows:

\[\overline{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{1}{n}\sum_{i=1}^{n} x_i\]

where \(n\) is the number of elements in the sample.
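
A worked example of the three measures on a small invented sample, using Python's built-in statistics module. pandas Series offer the same measures as the methods .mean(), .median(), and .mode().

```python
import statistics

values = [70, 82, 82, 90, 96]  # invented sample, n = 5

# Mean by the formula above: sum of the elements divided by n
mean_by_hand = sum(values) / len(values)

mean = statistics.mean(values)
median = statistics.median(values)  # middle value of the sorted sample
mode = statistics.mode(values)      # most frequent value

print(mean_by_hand, mean, median, mode)
```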

Measures of dispersion

  • Range: difference between maximum and minimum values in a distribution
  • Interquartile range: difference between the values at the 75 percent and 25 percent points (the third and first quartiles) in a distribution
  • Variance and standard deviation
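
The measures above can be sketched with NumPy on a small invented sample. One detail to watch: NumPy's var and std divide by n by default, while pandas divides by n - 1 (the sample versions), so ddof=1 is passed here to match pandas.

```python
import numpy as np

values = np.array([70, 82, 82, 90, 96])  # invented sample

value_range = values.max() - values.min()

# Quartiles via percentiles (NumPy's default linear interpolation)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Sample variance and standard deviation (divide by n - 1)
variance = values.var(ddof=1)
std_dev = values.std(ddof=1)

print(value_range, iqr, variance, std_dev)
```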

Standard deviation

  • Computed as the square root of the variance; denoted by \(\sigma\) for a population (and by \(s\) for a sample).
  • Offers a standardized way to discuss the spread of a distribution. For example, in a normally distributed sample:
    • About 68 percent of the values will be within one standard deviation of the mean
    • About 95 percent of the values will be within two standard deviations of the mean
    • About 99.7 percent of the values will be within three standard deviations of the mean
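
These shares can be checked by simulation. The sketch below draws a large normally distributed sample (the mean of 50 and standard deviation of 10 are arbitrary choices) and measures the share of values within one, two, and three standard deviations of the mean. The results vary slightly with the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility
sample = rng.normal(loc=50, scale=10, size=100_000)

mean = sample.mean()
sd = sample.std(ddof=1)

shares = {}
for k in (1, 2, 3):
    # True where a value lies within k standard deviations of the mean
    within = np.abs(sample - mean) <= k * sd
    shares[k] = within.mean()  # the mean of booleans is a proportion
    print(f"within {k} sd: {shares[k]:.3f}")
```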

Descriptive statistics in pandas

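  • Descriptive stats are available in pandas as DataFrame and Series methods, e.g. grad.rate.mean(), grad.rate.std(); on a whole DataFrame that includes text columns, recent pandas versions need numeric_only=True, e.g. grad.mean(numeric_only=True)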
  • Calling .describe() will give you back a number of important descriptive stats at once
grad.describe()
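
A sketch of what .describe() returns, run on an invented stand-in for grad (the school names and rates are hypothetical). By default only numeric columns are summarized.

```python
import pandas as pd

# Hypothetical stand-in for the grad DataFrame (values are invented)
grad = pd.DataFrame({
    "school": ["School A", "School B", "School C", "School D"],
    "rate": [55.0, 65.0, 75.0, 85.0],
})

summary = grad.describe()
print(summary)

# Individual statistics can be pulled out of the summary by label
mean_rate = summary.loc["mean", "rate"]
median_rate = summary.loc["50%", "rate"]
print(mean_rate, median_rate)
```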

Exploratory visualization

  • Often, when exploring a dataset, you’ll want to use graphical representations of your data to help reveal insights/trends
  • Visualization: Graphical representation of data

Visualization in Python

  • Core visualization package in Python: matplotlib
  • seaborn: extension to matplotlib to make your graphics look nicer! Standard import: import seaborn as sns.

Histograms

  • Histogram: graphical representation of a frequency distribution
  • Observations are organized into bins, and plotted with values along the x-axis and the number of observations in each bin along the y-axis
  • Normal distribution: histogram is approximately symmetrical (a “bell curve”)
  • Histograms are built into pandas
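
To make the binning step concrete, the sketch below counts observations per bin with np.histogram, which is the tabulation a histogram then draws as bars. The rates and the bin edges are invented for illustration.

```python
import numpy as np

rates = np.array([52, 58, 61, 67, 70, 73, 75, 78, 84, 91])  # invented
edges = [50, 60, 70, 80, 90, 100]                           # chosen bin edges

# Each bin includes its left edge and excludes its right edge,
# except the last bin, which includes both
counts, bin_edges = np.histogram(rates, bins=edges)

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left}-{right}: {count}")
```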

Example histogram (pandas method)

import seaborn as sns
sns.set_style("darkgrid")

grad.rate.hist()

Example histogram (seaborn function)

# Try the `bins` argument to modify the plot appearance
sns.histplot(data = grad, x = "rate")

Density plots

  • Smooth representations of your data can be produced with kernel density plots
  • Accessible from both pandas and seaborn
# `fill` replaces the deprecated `shade` argument in recent seaborn versions
sns.kdeplot(data = grad, x = "rate", fill = True)

Box plots

  • Also termed “box and whisker plots” - alternative way to show distribution of values graphically
sns.boxplot(data = grad, y = "rate", color = "green")

Anatomy of a box plot

  • Dots beyond the whiskers: outliers
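
A sketch of how those dots are typically identified, assuming the default rule used by matplotlib and seaborn: the whiskers reach the most extreme observations that lie within 1.5 times the interquartile range of the box, and anything beyond is drawn as an outlier. The values are invented.

```python
import numpy as np

values = np.array([52, 58, 61, 67, 70, 73, 75, 78, 84, 120])  # invented

# The box spans the first to the third quartile
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box (the default whisker rule)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence, outliers)
```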

Violin plots

  • Combinations of box plots and kernel density plots
sns.violinplot(data = grad, x = "rate", color = "cyan")