Introduction to machine learning

GEOG 30323

April 23, 2024

Data Science

Data science: new(ish) field that has emerged to address the challenges of working with modern data
Fuses statistics, computer science, visualization, graphic design, and the humanities/social sciences/natural sciences…

The data analysis process

Visualization vs. modeling

Hadley Wickham (paraphrased):

Visualization can surprise you, but it doesn’t scale well. Modeling scales well, but it doesn’t (fundamentally) surprise.

Statistical modeling

What is the mathematical relationship between an outcome variable \(Y\) and one or more other “predictor” variables \(X_{1}, \dots, X_{n}\)?
Recall our use of lmplot in seaborn - lm stands for linear model

Statistical modeling

The linear model:

\[ Y = Xb + e \]

where \(Y\) represents the outcome variable, \(X\) is a matrix of predictors, \(b\) represents the “parameters”, and \(e\) represents the errors, or “residuals”

Linear models will not always be appropriate for modeling relationships between variables!
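
As a minimal sketch of what "fitting" the linear model means, the snippet below simulates data from known parameters and then recovers \(b\) by ordinary least squares with NumPy. The data, the seed, and the parameter values (intercept 2, slope 3) are invented for illustration and do not come from the course dataset.

```python
import numpy as np

# Simulated data (an assumption for illustration): Y = 2 + 3*x + e
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
e = rng.normal(0, 1, size=n)  # the errors
Y = 2 + 3 * x + e

# X is the matrix of predictors; the column of ones estimates the intercept
X = np.column_stack([np.ones(n), x])

# Least-squares estimate of b in Y = Xb + e
b, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Residuals: the part of Y the fitted model does not explain
residuals = Y - X @ b

print(b)                 # estimated intercept and slope
print(residuals.mean())  # essentially zero when an intercept is included
```

The estimates should land close to the values used to simulate the data, which is the sense in which the model "recovers" the relationship.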

Statistics in Python

Substantial statistical functionality is provided by the statsmodels package, which is available in CoCalc
Example: statistical modeling for inference

Inferential statistics in Python

Let’s get an example ready:

import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf

colleges = pd.read_csv("http://personal.tcu.edu/kylewalker/data/us_colleges_2019.csv")

Linear regression

f1 = smf.ols(formula = 'median_earn ~ grad_rate', data = colleges).fit()

f1.summary()

Multiple regression

formula = 'median_earn ~ grad_rate + sat_avg + adm_rate + family_income'
f2 = smf.ols(formula = formula, data = colleges).fit()
f2.summary()
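
Beyond reading the summary table, individual results can be pulled from a fitted model as attributes. The sketch below uses a small synthetic data frame as a stand-in (the column names x1, x2, and y and all values are assumptions, not the colleges data) so that it runs without a download.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (an assumption; not the colleges dataset)
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + rng.normal(scale=0.5, size=300)

fit = smf.ols(formula="y ~ x1 + x2", data=df).fit()

print(fit.params)    # estimated coefficients, including the Intercept
print(fit.pvalues)   # p-value for each coefficient
print(fit.rsquared)  # share of the variance in y explained by the model
```

The same attributes should be available on f1 and f2, since they are produced by the same smf.ols(...).fit() call.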

Machine learning

“The science of getting computers to act without being explicitly programmed”
Types of machine learning algorithms: supervised and unsupervised
Topics in machine learning: classification, clustering, regression

Visual introduction to machine learning: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

Supervised learning

Supervised machine learning: model “trained” and optimized for predictive power
Regression problem: predicting a numeric outcome
Classification problem: predicting a categorical outcome

Regression for prediction

Task: predict the median earnings of graduates 10 years after graduation based on a series of college characteristics
Method: train a model on a subset of the data, then test the model on the remaining subset
Example method: random forest regression

Preparing the model

The train_test_split() function splits your data randomly into training (75 percent, by default) and test datasets

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
np.random.seed(1983)

colleges.dropna(inplace = True)
colleges.reset_index(inplace = True, drop = True)

split = train_test_split(colleges)

train = split[0]
test = split[1]
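
A small sketch of the split itself, using a toy data frame (an assumption for illustration). It unpacks the two pieces directly and passes random_state, which is one way to make the split reproducible without relying on a global seed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data frame with 100 rows (an assumption for illustration)
df = pd.DataFrame({"x": range(100), "y": range(100, 200)})

# With the default settings, 75 percent of rows go to training
train_demo, test_demo = train_test_split(df, random_state=1983)

print(len(train_demo), len(test_demo))  # 75 25

# No row appears in both pieces
overlap = set(train_demo.index) & set(test_demo.index)
print(len(overlap))  # 0
```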

Preparing the model

features = colleges.columns[2:6].tolist() + \
  colleges.columns[7:].tolist()
  
rf1 = RandomForestRegressor(oob_score = True)
    
rf1.fit(X = train[features], y = train["median_earn"])

Model diagnostics

# Out-of-bag score: R-squared from predicting each observation using only the trees that did not train on it
print(rf1.oob_score_)

# Feature importance plot
fip = pd.DataFrame(data = {'importance': rf1.feature_importances_,
                           'feature': features})

fip.sort_values('importance', ascending = False, inplace = True)

sns.barplot(x = 'importance', y = 'feature', data = fip)

Model diagnostics

# How does our model perform on our test dataset?
test = test.copy()  # work on a copy so adding a column avoids pandas' SettingWithCopyWarning
test['predictions'] = rf1.predict(test[features])

print(test['median_earn'].corr(test['predictions']))

sns.lmplot(data = test, x = 'predictions', y = 'median_earn')
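
Correlation is one way to summarize test-set performance; R-squared and mean absolute error are two others that scikit-learn provides. The sketch below fits a random forest to synthetic data (the data-generating formula and all names are assumptions for illustration) and reports both, along with the feature importances.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): the outcome depends on two of three features
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 3))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)

r2 = r2_score(y_test, pred)              # 1.0 would be a perfect fit
mae = mean_absolute_error(y_test, pred)  # typical error, in the units of y

print(r2)
print(mae)
print(rf.feature_importances_)  # importances sum to 1
```

Because the third feature plays no role in generating y, its importance should come out much smaller than the other two.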

Random Forest Classifiers

Task: predict whether a non-profit college is public or private

from sklearn.ensemble import RandomForestClassifier

features2 = colleges.columns[2:15].tolist()

rf2 = RandomForestClassifier(oob_score = True)
    
rf2.fit(X = train[features2], y = train["is_private"])

Model diagnostics

from sklearn.metrics import confusion_matrix
predicted_class = rf2.predict(test[features2])

# Prediction accuracy on test set
rf2.score(test[features2], test['is_private'])

# "Confusion" matrix
# True labels go first: rows are true classes, columns are predicted classes
confusion_matrix(test['is_private'], predicted_class)

# What did we get wrong?
nomatch = test[test['is_private'] != predicted_class]
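
To make the confusion matrix easier to read, here is a tiny hand-made example (the labels are invented for illustration, with 1 standing for private and 0 for public). In scikit-learn, confusion_matrix() takes the true labels first, so rows correspond to true classes and columns to predicted classes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels (an assumption for illustration): 1 = private, 0 = public
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]
#  [1 2]]

# Accuracy is the share of correct predictions: the diagonal over the total
acc = accuracy_score(y_true, y_pred)
print(acc)
```

Reading the first row: of the three public colleges, two were predicted public and one was predicted private.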

Feature importance

fip2 = pd.DataFrame(data = {'importance': rf2.feature_importances_,
                           'feature': features2})

fip2.sort_values('importance', ascending = False, inplace = True)

sns.barplot(x = 'importance', y = 'feature', data = fip2)

Making predictions

Unsupervised learning in Python

Imports and setup:

from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import scale

# Convert each column to z-scores 
# (mean of 0, units in standard deviations from the mean)
features_scaled = scale(colleges[features2])
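
A quick check of what scale() does, using two made-up columns on very different scales (the values are assumptions for illustration). Without this step, distance-based methods would be dominated by whichever column has the largest numbers.

```python
import numpy as np
from sklearn.preprocessing import scale

# Two made-up columns on very different scales (an assumption)
raw = np.array([[1000.0, 0.10],
                [2000.0, 0.20],
                [3000.0, 0.60]])

z = scale(raw)

print(z.mean(axis=0))  # approximately 0 for each column
print(z.std(axis=0))   # 1 for each column
```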

Example: K-means clustering

np.random.seed(1983)

km = KMeans(n_clusters = 7).fit(features_scaled)

colleges['clusters'] = km.labels_

# Check TCU's cluster
colleges[colleges.name == 'Texas Christian University']

Example: K-means clustering

def glimpse_clusters(cluster_id):
    sub = colleges[colleges.clusters == cluster_id]
    print(sub.head(20))
    
glimpse_clusters(2)
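
Two other quick ways to inspect a fitted k-means model are the cluster sizes and the cluster centers. The sketch below uses synthetic points from make_blobs (an assumption for illustration; the name km_demo is hypothetical) rather than the colleges data.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic points in three groups (an assumption for illustration)
points, _ = make_blobs(n_samples=150, centers=3, random_state=0)

km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# Number of observations assigned to each cluster
sizes = pd.Series(km_demo.labels_).value_counts().sort_index()
print(sizes)

# One row per cluster: the cluster's mean on each feature
print(km_demo.cluster_centers_)
```

When the features have been converted to z-scores first, each center can be read as how far above or below average that cluster sits on each feature.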

Example: nearest-neighbor search

neigh = NearestNeighbors(n_neighbors = 6)

# "Training" the model
neigh.fit(features_scaled) 

# Searching for neighbors
model = neigh.kneighbors(features_scaled, 
                         return_distance = False)

Example: nearest-neighbor search

def find_neighbors(university): 
    # Get the index of the university
    uni_index = colleges[colleges.name == university].index[0]
    # Get the indices of the neighboring universities
    neighbors = list(model[uni_index])[1:]
    # Identify the names of the neighboring universities
    for idx in neighbors:
        nname = colleges.iloc[idx]['name']
        print(nname)

find_neighbors("Texas Christian University")
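
The logic of the neighbor search can be seen on a toy example with five made-up points (names and coordinates are assumptions for illustration). Each row of the result lists the point itself first, at distance 0, which is why the first position is skipped.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up names and already-scaled features (assumptions for illustration)
names = ["A", "B", "C", "D", "E"]
feats = np.array([[0.0, 0.0],
                  [0.1, 0.0],
                  [0.0, 0.3],
                  [5.0, 5.0],
                  [5.2, 5.0]])

nn = NearestNeighbors(n_neighbors=3).fit(feats)
distances, indices = nn.kneighbors(feats)

# Each row starts with the point itself (distance 0), so skip position 0
closest_to_a = [names[i] for i in indices[0][1:]]
print(closest_to_a)  # ['B', 'C']
```

Skipping the first position assumes no two rows have identical feature values; with exact duplicates, the point itself is not guaranteed to come first.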

How to learn more

Take statistics and machine learning courses here at TCU!
Check out DataCamp for hundreds of courses on data science in Python and R