Introduction to machine learning

GEOG 30323

April 23, 2024

Data Science

  • Data science: new(ish) field that has emerged to address the challenges of working with modern data
  • Fuses statistics, computer science, visualization, graphic design, and the humanities/social sciences/natural sciences…

The data analysis process

Visualization vs. modeling

Hadley Wickham (paraphrased):

Visualization can surprise you, but it doesn’t scale well. Modeling scales well, but it doesn’t (fundamentally) surprise.

Statistical modeling

  • What is the mathematical relationship between an outcome variable \(Y\) and one or more other “predictor” variables \(X_{1}, \ldots, X_{n}\)?
  • Recall our use of lmplot in seaborn - lm stands for linear model

Statistical modeling

The linear model:

\[ Y = Xb + e \]

where \(Y\) represents the outcome variable, \(X\) is a matrix of predictors, \(b\) represents the “parameters”, and \(e\) represents the errors, or “residuals”

  • Linear models will not always be appropriate for modeling relationships between variables!
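As a rough illustration of the linear model, the sketch below simulates data from \(Y = Xb + e\) and recovers \(b\) by least squares with NumPy. The sample size, the "true" parameter values, and all variable names are assumptions made up for this example; none of it comes from the course data.

```python
import numpy as np

rng = np.random.default_rng(1983)

n = 500
# Design matrix X: a column of ones (intercept) plus two simulated predictors
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])

# Hypothetical "true" parameters, chosen only for this illustration
b_true = np.array([2.0, 0.5, -1.5])

# Errors e: random noise around the linear relationship
e = rng.normal(scale=0.1, size=n)
Y = X @ b_true + e

# Least-squares estimate of b
b_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Residuals: the part of Y the fitted model does not explain
residuals = Y - X @ b_hat

print(b_hat)
```

The estimates in b_hat should land close to b_true because the simulated relationship really is linear; with real data that is not guaranteed.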

Statistics in Python

  • The statsmodels package, available in CoCalc, provides substantial statistical functionality

  • Example: statistical modeling for inference

Inferential statistics in Python

Let’s get an example ready:

import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf

colleges = pd.read_csv("http://personal.tcu.edu/kylewalker/data/us_colleges_2019.csv")

Linear regression

f1 = smf.ols(formula = 'median_earn ~ grad_rate', data = colleges).fit()

f1.summary()

Multiple regression

formula = 'median_earn ~ grad_rate + sat_avg + adm_rate + family_income'
f2 = smf.ols(formula = formula, data = colleges).fit()
f2.summary()
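Beyond the summary table, individual pieces of a fitted model can be pulled out as attributes. The sketch below fits a formula-based model to simulated data; the column names y, x1, and x2 and the coefficients used to generate the data are hypothetical and stand in for the colleges dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1983)

# Simulated data with hypothetical column names
df = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
df["y"] = 10 + 3 * df["x1"] - 2 * df["x2"] + rng.normal(scale=0.5, size=300)

fit = smf.ols(formula="y ~ x1 + x2", data=df).fit()

print(fit.params)    # estimated intercept and slopes
print(fit.pvalues)   # p-value for each parameter
print(fit.rsquared)  # share of the variance in y explained by the model
```

Each slope is read as the expected change in the outcome for a one-unit change in that predictor, holding the other predictors constant.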

Machine learning

  • “The science of getting computers to act without being explicitly programmed”
  • Types of machine learning algorithms: supervised and unsupervised
  • Topics in machine learning: classification, clustering, regression

Visual introduction to machine learning: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

Supervised learning

  • Supervised machine learning: model “trained” and optimized for predictive power

  • Regression problem: predicting a numeric outcome

  • Classification problem: predicting a categorical outcome

Regression for prediction

  • Task: predict the median earnings of graduates 10 years after graduation based on a series of college characteristics

  • Method: train a model on a subset of the data, then test the model on the remaining subset

  • Example method: random forest regression

Preparing the model

  • The train_test_split() function splits your data randomly into training (75 percent, by default) and test datasets
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
np.random.seed(1983)

colleges.dropna(inplace = True)
colleges.reset_index(inplace = True, drop = True)

split = train_test_split(colleges)

train = split[0]
test = split[1]
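To see the default 75/25 split in isolation, here is a minimal sketch on a made-up 100-row table. The DataFrame and its column names are assumptions for illustration. Passing random_state is one alternative to seeding NumPy globally.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# A small made-up table with 100 rows
df = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2})

# random_state fixes the shuffle so the split is reproducible
train_demo, test_demo = train_test_split(df, random_state=1983)

print(len(train_demo), len(test_demo))
```

With 100 rows this gives 75 training rows and 25 test rows, and no row appears in both.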

Preparing the model

features = colleges.columns[2:6].tolist() + \
  colleges.columns[7:].tolist()
  
rf1 = RandomForestRegressor(oob_score = True)
    
rf1.fit(X = train[features], y = train["median_earn"]) 

Model diagnostics

# Out-of-bag score: R-squared computed for each training row using
# only the trees that did not see that row during fitting
print(rf1.oob_score_)

# Feature importance plot
fip = pd.DataFrame(data = {'importance': rf1.feature_importances_,
                           'feature': features})

fip.sort_values('importance', ascending = False, inplace = True)

sns.barplot(x = 'importance', y = 'feature', data = fip)

Model diagnostics

# How does our model perform on our test dataset?
test['predictions'] = rf1.predict(test[features])

print(test['median_earn'].corr(test['predictions']))

sns.lmplot(data = test, x = 'predictions', y = 'median_earn')
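Correlation is one way to summarize test performance. Two other common summaries are R-squared and root mean squared error (RMSE). The sketch below computes both on simulated data, not the colleges data; the predictors, the outcome formula, and the variable names are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1983)

# Simulated predictors and a numeric outcome (hypothetical data)
X = rng.normal(size=(400, 3))
y = 5 * X[:, 0] + 2 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1983)

rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=1983)
rf.fit(X_train, y_train)

pred = rf.predict(X_test)

r2 = r2_score(y_test, pred)                       # 1.0 would be a perfect fit
rmse = np.sqrt(mean_squared_error(y_test, pred))  # typical error, in units of y

print(rf.oob_score_, r2, rmse)
```

The out-of-bag score and the test-set R-squared are both estimates of performance on unseen data, so they should usually be in the same neighborhood.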

Random Forest Classifiers

  • Task: predict whether a non-profit college is public or private
from sklearn.ensemble import RandomForestClassifier

features2 = colleges.columns[2:15]

rf2 = RandomForestClassifier(oob_score = True)
    
rf2.fit(X = train[features2], y = train["is_private"]) 

Model diagnostics

from sklearn.metrics import confusion_matrix
predicted_class = rf2.predict(test[features2])

# Prediction accuracy on test set
rf2.score(test[features2], test['is_private'])

# "Confusion" matrix
# (rows: actual class, columns: predicted class)
confusion_matrix(test['is_private'], predicted_class)

# What did we get wrong?
nomatch = test[test['is_private'] != predicted_class]
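Reading a confusion matrix is easier on a tiny example. The labels below are made up for illustration (0 for public, 1 for private). scikit-learn's confusion_matrix takes the actual labels first and the predicted labels second.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 0 = public, 1 = private
actual    = [0, 0, 0, 1, 1]
predicted = [0, 1, 1, 1, 1]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(actual, predicted)
print(cm)

# Share of predictions that match the actual label
acc = accuracy_score(actual, predicted)
print(acc)
```

Here two public colleges are misclassified as private (top-right cell), no private college is misclassified as public (bottom-left cell), and the accuracy is 3 out of 5.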

Feature importance

fip2 = pd.DataFrame(data = {'importance': rf2.feature_importances_,
                           'feature': features2})

fip2.sort_values('importance', ascending = False, inplace = True)

sns.barplot(x = 'importance', y = 'feature', data = fip2)

Making predictions


Unsupervised learning in Python

  • Imports and setup:
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import scale

# Convert each column to z-scores 
# (mean of 0, units in standard deviations from the mean)
features_scaled = scale(colleges[features2])
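Scaling matters here because K-means and nearest-neighbor search both rely on distances, so a column measured in large units would otherwise dominate. A minimal sketch with made-up numbers shows what scale() does to each column:

```python
import numpy as np
from sklearn.preprocessing import scale

# Two made-up columns on very different scales
data = np.array([[1000.0, 0.1],
                 [2000.0, 0.2],
                 [3000.0, 0.6]])

scaled = scale(data)

print(scaled.mean(axis=0))  # each column mean is (approximately) 0
print(scaled.std(axis=0))   # each column standard deviation is 1
```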

Example: K-means clustering

np.random.seed(1983)

km = KMeans(n_clusters = 7).fit(features_scaled)

colleges['clusters'] = km.labels_

# Check TCU's cluster
colleges[colleges.name == 'Texas Christian University'] 
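To see what K-means returns without the colleges data, the sketch below clusters three made-up, well-separated groups of points. The group centers, group sizes, and the name km_demo are assumptions for illustration. Setting random_state on the estimator is an alternative to seeding NumPy globally.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1983)

# Three made-up, well-separated groups of 20 points each
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# random_state makes the result reproducible; n_init=10 tries 10 starting points
km_demo = KMeans(n_clusters=3, n_init=10, random_state=1983).fit(points)

# Number of points assigned to each cluster label
counts = np.bincount(km_demo.labels_)
print(counts)
```

The cluster labels themselves (0, 1, 2) are arbitrary; only the grouping is meaningful. Real data is rarely this cleanly separated, so the choice of n_clusters usually takes some experimentation.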

Example: K-means clustering

def glimpse_clusters(cluster_id):
    sub = colleges[colleges.clusters == cluster_id]
    print(sub.head(20))
    
glimpse_clusters(2)
    
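# Nearest-neighbor search: ask for 6 neighbors, because each college's
# closest match is itself, which leaves 5 other colleges
neigh = NearestNeighbors(n_neighbors = 6)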

# "Training" the model
neigh.fit(features_scaled) 

# Searching for neighbors
model = neigh.kneighbors(features_scaled, 
                         return_distance = False)

Example: nearest-neighbor search

def find_neighbors(university): 
    # Get the index of the university
    uni_index = colleges[colleges.name == university].index[0]
    # Get the indices of the neighboring universities
    neighbors = list(model[uni_index])[1:]
    # Identify the names of the neighboring universities
    for idx in neighbors:
        nname = colleges.iloc[idx]['name']
        print(nname)

find_neighbors("Texas Christian University")
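The reason for skipping the first neighbor is easier to see on a toy example. The four points below are made up for illustration. When the query points are the same points the model was fit on, each point's nearest neighbor is itself at distance 0.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four made-up points along a line
pts = np.array([[0.0], [1.0], [3.0], [7.0]])

nn = NearestNeighbors(n_neighbors=3).fit(pts)

# Query the same points the model was fit on
distances, indices = nn.kneighbors(pts)
print(indices)

# Column 0 is each point itself (distance 0), so drop it
others = indices[:, 1:]
print(others)
```

Each row of others lists the two closest other points, nearest first. For example, the point at 7.0 (index 3) is closest to index 2 and then index 1.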

How to learn more

  • Take statistics and machine learning courses here at TCU!
  • Check out DataCamp for hundreds of courses on data science in Python and R