import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
from scipy.linalg import cholesky, solve_triangular
import seaborn as sns
This notebook provides a quick-start guide to building a Gaussian process model using only numpy, scipy, pandas, and matplotlib.
In this tutorial, we will use the Olympic gold dataset that we have used quite a few times in lecture. First, we shall use pandas to retrieve the data.
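The snippet below is a minimal sketch of that step, not the notebook's original loading cell: the file name olympic_marathon.csv is an assumption (point it at your copy of the dataset), while the Year and Time column names match the plotting code at the end of this section.
df = pd.read_csv('olympic_marathon.csv')   # assumed file name; adjust the path or URL as needed
x_data = df['Year'].values.astype(float)   # Olympic years (model inputs)
y_data = df['Time'].values.astype(float)   # winning times (model targets)
df.head()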
The three code blocks below define the kernel, a utility function for tiling, and the posterior calculation. To clarify, this is the predictive posterior distribution evaluated at some test locations, \(\mathbf{X}_{\ast}\). The expressions for the predictive posterior mean and covariance are given by:
\[ \begin{aligned} \mathbb{E}\left[ \mathbf{y}_{\ast} \mid \mathbf{X}_{\ast} \right] & = \mathbf{K}\left( \mathbf{X}_{\ast}, \mathbf{X} \right) \left[ \mathbf{K}\left( \mathbf{X}, \mathbf{X} \right) + \sigma_{n}^{2} \mathbf{I} \right]^{-1} \mathbf{y} \\ \operatorname{Cov}\left[ \mathbf{y}_{\ast} \mid \mathbf{X}_{\ast} \right] & = \mathbf{K}\left( \mathbf{X}_{\ast}, \mathbf{X}_{\ast} \right) - \mathbf{K}\left( \mathbf{X}_{\ast}, \mathbf{X} \right) \left[ \mathbf{K}\left( \mathbf{X}, \mathbf{X} \right) + \sigma_{n}^{2} \mathbf{I} \right]^{-1} \mathbf{K}\left( \mathbf{X}, \mathbf{X}_{\ast} \right) \end{aligned} \]
def kernel(xa, xb, amp, ll):
    # Squared-exponential (RBF) kernel with amplitude `amp` and lengthscale `ll`.
    Xa, Xb = get_tiled(xa, xb)
    return amp**2 * np.exp(-0.5 * (Xa - Xb)**2 / ll**2)
def get_tiled(xa, xb):
    # Tile the two 1-D input arrays into matching (m, n) grids so that the
    # kernel can be evaluated for every pair of points in one vectorised step.
    m, n = len(xa), len(xb)
    xa, xb = xa.reshape(m, 1), xb.reshape(n, 1)
    Xa = np.tile(xa, (1, n))     # row i repeats xa[i]
    Xb = np.tile(xb.T, (m, 1))   # column j repeats xb[j]
    return Xa, Xb
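As a quick sanity check on these two helpers (not from the original notebook, and with arbitrary hyperparameter values), the kernel evaluated between an m-point array and an n-point array should return an m x n matrix, with amp**2 on the diagonal when the two arrays coincide:
xa = np.array([1.0, 2.0, 3.0])
xb = np.array([1.5, 2.5])
print(kernel(xa, xb, amp=1.0, ll=1.0).shape)     # (3, 2)
print(np.diag(kernel(xa, xa, amp=2.0, ll=1.0)))  # [4. 4. 4.] = amp**2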
def get_posterior(amp, ll, x, x_data, y_data, noise):
    u = y_data.shape[0]
    # Centre the targets so we can work with a zero-mean GP; the mean is added back at the end.
    mu_y = np.mean(y_data)
    y = (y_data - mu_y).reshape(u, 1)
    Sigma = noise * np.eye(u)                # observation noise, sigma_n^2 * I
    Kxx = kernel(x_data, x_data, amp, ll)    # K(X, X)
    Kxpx = kernel(x, x_data, amp, ll)        # K(X_*, X)
    Kxpxp = kernel(x, x, amp, ll)            # K(X_*, X_*)
    # Apply the inverse of K(X, X) + sigma_n^2 I via a Cholesky factorisation rather than
    # forming the inverse explicitly. scipy's cholesky returns an upper-triangular factor L
    # with L.T @ L = Kxx + Sigma + jitter, so we solve against L.T (lower triangular).
    jitter = np.eye(u) * 1e-12               # small diagonal term for numerical stability
    L = cholesky(Kxx + Sigma + jitter)
    S1 = solve_triangular(L.T, y, lower=True)
    S2 = solve_triangular(L.T, Kxpx.T, lower=True).T
    mu = S2 @ S1 + mu_y                      # predictive mean
    cov = Kxpxp - S2 @ S2.T                  # predictive covariance
    return mu, cov
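The plotting cell below expects test inputs Xt and a posterior (mu, cov) to already exist. A minimal sketch of that missing call is given here; the test grid and the hyperparameter values (amplitude, lengthscale, noise variance) are placeholder assumptions rather than tuned values from the lecture.
Xt = np.linspace(df['Year'].min() - 4, df['Year'].max() + 4, 200)   # test locations spanning the data
amp, ll, noise = 1.0, 20.0, 0.05                                    # placeholder hyperparameters; tune for your data
mu, cov = get_posterior(amp, ll, Xt, df['Year'].values.astype(float),
                        df['Time'].values.astype(float), noise)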
# Flatten to 1-D for plotting and take the pointwise standard deviation from the diagonal of the covariance.
Xt = Xt.flatten()
mu = mu.flatten()
std = np.sqrt(np.diag(cov)).flatten()
# Plot the posterior mean with a one-standard-deviation band, together with the observed data.
fig = plt.figure(figsize=(8, 5))
plt.plot(Xt, mu, '-', label=r'$\mu$', color='navy')
plt.fill_between(Xt, mu + std, mu - std, color='blue', alpha=0.2, label=r'$\sigma$')
plt.plot(df['Year'].values, df['Time'].values, 'go', label='Data', ms=8)
plt.xlabel('Years')
plt.ylabel('Winning times')
plt.legend()
plt.show()