Lecture 12

Hyperparameter inference

Hyperparameters

Much of the content today will be based on Chapter 5 in RW. Please read sections 5.1 to 5.4 (inclusive).

Hyperparameters

  • Almost all covariance functions have unknown hyperparameters. In many cases, the choice of hyperparameters and the values they take after optimization can provide useful insight about the underlying function.

  • Consider the squared exponential kernel k \left( \mathbf{x}, \mathbf{x}' \right) = \sigma_{f}^2 exp \left( -\frac{1}{2l^2} \left\Vert \mathbf{x} - \mathbf{x}' \right\Vert _{2}^{2} \right), which has two hyperparameters, \sigma_{f} and l (a short NumPy sketch of this kernel is given below).

  • From a Bayesian perspective, we would like to assign priors to each hyperparameter, and then compute their posteriors.
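
As a concrete illustration of the squared exponential kernel referenced above, here is a minimal NumPy sketch. The function name sq_exp_kernel and the argument names sigma_f and ell are illustrative choices, not notation from RW.

import numpy as np

def sq_exp_kernel(X1, X2, sigma_f=1.0, ell=1.0):
    # Squared exponential kernel: k(x, x') = sigma_f^2 * exp(-||x - x'||^2 / (2 ell^2)).
    # X1: (N, D) array, X2: (M, D) array; returns the (N, M) covariance matrix.
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return sigma_f ** 2 * np.exp(-0.5 * sq_dists / ell ** 2)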

Hyperparameters

  • Recall from before the marginal likelihood for a Gaussian process: p \left( \mathbf{t} | \sigma^2 \right) = \int p \left( \mathbf{t} | \mathbf{f}, \sigma^2 \right) p \left( \mathbf{f} \right) d \mathbf{f} = \mathcal{N} \left(\mathbf{t} | \mathbf{0}, \mathbf{C} + \sigma^2 \mathbf{I}_{N} \right)

  • Now, let \boldsymbol{\theta} = \left[ \sigma_f, l, \ldots \right]. Then, we want to work out p \left( \boldsymbol{\theta} | \mathbf{t} \right) = \frac{p \left( \mathbf{t} | \boldsymbol{\theta} \right) p \left( \boldsymbol{\theta} \right) }{p \left( \mathbf{t} \right)}

  • However, in this case the marginal likelihood is usually intractable! p \left( \mathbf{t} \right) = \int p \left( \mathbf{t} | \boldsymbol{\theta} \right) p \left( \boldsymbol{\theta} \right) d \boldsymbol{\theta}

MAP

  • The workaround is to introduce an approximation. We will use the MAP (maximum a posteriori) estimate. Note that p\left( \mathbf{t} \right) is constant with respect to \boldsymbol{\theta}. p \left( \boldsymbol{\theta} | \mathbf{t} \right) = \frac{p \left( \mathbf{t} | \boldsymbol{\theta} \right) p \left( \boldsymbol{\theta} \right) }{p \left( \mathbf{t} \right)} \propto p \left( \mathbf{t} | \boldsymbol{\theta} \right) p \left( \boldsymbol{\theta} \right)

  • The MAP estimate is given by \hat{\boldsymbol{\theta}}_{MAP} = \underset{\boldsymbol{\theta}}{argmax} \; log \; p \left( \boldsymbol{\theta} | \mathbf{t} \right) = \underset{\boldsymbol{\theta}}{argmax} \; log \; p \left( \mathbf{t} | \boldsymbol{\theta} \right) + log \; p \left( \boldsymbol{\theta} \right)

MAP vs Maximum likelihood

  • Note that if the priors p \left( \boldsymbol{\theta} \right) are uniform, then \begin{aligned} \hat{\boldsymbol{\theta}}_{MAP} & = \underset{\boldsymbol{\theta}}{argmax} \; log \; p \left( \mathbf{t} | \boldsymbol{\theta} \right) + log \; p \left( \boldsymbol{\theta} \right) \\ & = \underset{\boldsymbol{\theta}}{argmax} \; log \; p \left( \mathbf{t} | \boldsymbol{\theta} \right) + log \; \textrm{constant} \\ & = \underset{\boldsymbol{\theta}}{argmax} \; log \; p \left( \mathbf{t} | \boldsymbol{\theta} \right) \\ & = \hat{\boldsymbol{\theta}}_{ML} \end{aligned} since adding a constant does not change the maximizer. In this case, the MAP estimate coincides with the maximum (marginal) likelihood estimate.

Marginal likelihood

  • Once again, we can write the marginal likelihood as p \left( \mathbf{t} | \sigma^2, \boldsymbol{\theta} \right) = \int p \left( \mathbf{t} | \mathbf{f}, \sigma^2 \right) p \left( \mathbf{f} | \boldsymbol{\theta} \right) d \mathbf{f} = \mathcal{N} \left(\mathbf{t} | \mathbf{0}, \mathbf{C}\left( \boldsymbol{\theta} \right) + \sigma^2 \mathbf{I}_{N} \right).

  • From this, we can write \begin{aligned} log \left(p \left( \mathbf{t} | \sigma^2, \boldsymbol{\theta} \right) \right) & = log \; \mathcal{N} \left(\mathbf{t} | \mathbf{0}, \mathbf{C}\left( \boldsymbol{\theta} \right) + \sigma^2 \mathbf{I}_{N} \right) \\ & = log \left[ \left( 2 \pi \right)^{-\frac{N}{2}} \left| \sigma^2 \mathbf{I} + \mathbf{C} \right|^{-\frac{1}{2}} exp \left( - \frac{1}{2} \mathbf{t}^{T} \left(\sigma^2 \mathbf{I} + \mathbf{C} \right)^{-1} \mathbf{t} \right) \right] \\ & = - \frac{N}{2} log \left( 2 \pi \right) - \frac{1}{2} log \left| \sigma^2 \mathbf{I} + \mathbf{C} \right| - \frac{1}{2} \mathbf{t}^{T} \left(\sigma^2 \mathbf{I} + \mathbf{C} \right)^{-1} \mathbf{t} \end{aligned} A quick numerical check of this expression is given below.
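
As a sanity check (not from RW), the closed-form expression above can be compared against scipy.stats.multivariate_normal.logpdf. The toy covariance and targets below are purely illustrative.

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
N = 5
A = rng.standard_normal((N, N))
C = A @ A.T                      # stand-in for a kernel matrix C(theta)
sigma2 = 0.1
t = rng.standard_normal(N)

S = C + sigma2 * np.eye(N)
# Closed-form log marginal likelihood from the slide above.
manual = (-0.5 * N * np.log(2 * np.pi)
          - 0.5 * np.linalg.slogdet(S)[1]
          - 0.5 * (t @ np.linalg.solve(S, t)))
# Should agree with scipy's multivariate normal log density.
print(np.allclose(manual, multivariate_normal(mean=np.zeros(N), cov=S).logpdf(t)))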

Marginal likelihood

  • From the prior slide log \left(p \left( \mathbf{t} | \sigma^2, \boldsymbol{\theta} \right) \right) = \underbrace{-\frac{N}{2} log \left( 2 \pi \right)}_{\textsf{constant}} - \underbrace{\frac{1}{2} log \left| \sigma^2 \mathbf{I} + \mathbf{C} \right|}_{\textsf{complexity}} - \underbrace{\frac{1}{2} \mathbf{t}^{T} \left(\sigma^2 \mathbf{I} + \mathbf{C} \right)^{-1} \mathbf{t}}_{\textsf{data fit}}
  • The standard approach is to optimize this expression using gradient-based methods, by computing \nabla_{\boldsymbol{\theta}} log \left(p \left( \mathbf{t} | \sigma^2, \boldsymbol{\theta} \right) \right); a sketch of this optimization is given below.
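
A minimal end-to-end sketch of this optimization, assuming a squared exponential kernel plus noise and scipy.optimize.minimize with finite-difference gradients. The hyperparameters are optimized on the log scale to keep them positive; the data, function names, and initialization are illustrative, and in practice the slogdet/solve calls would be replaced by the Cholesky-based computation on the following slides.

import numpy as np
from scipy.optimize import minimize

def sq_exp_kernel(X1, X2, sigma_f, ell):
    # Squared exponential kernel (same form as earlier in the lecture).
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return sigma_f ** 2 * np.exp(-0.5 * sq_dists / ell ** 2)

def neg_log_marginal_likelihood(log_theta, X, t):
    # log_theta = [log sigma_f, log ell, log sigma]; the log parameterization
    # keeps the hyperparameters positive during unconstrained optimization.
    sigma_f, ell, sigma = np.exp(log_theta)
    N = X.shape[0]
    S = sq_exp_kernel(X, X, sigma_f, ell) + sigma ** 2 * np.eye(N)
    # Direct translation of the formula above; in practice, prefer the
    # Cholesky-based computation described on the next slides.
    logdet = np.linalg.slogdet(S)[1]
    quad = t @ np.linalg.solve(S, t)
    return 0.5 * (N * np.log(2 * np.pi) + logdet + quad)

# Toy data, purely illustrative.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

# With no analytic gradient supplied, scipy approximates the gradient by
# finite differences; an analytic gradient of the log marginal likelihood
# could instead be passed via the jac argument.
res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(X, t))
print('fitted sigma_f, ell, sigma:', np.exp(res.x))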

Practical computation

  • For all intents and purposes, avoid directly computing inverses and determinants!

  • For instance:

import numpy as np

# Determinant of a 200 x 200 diagonal matrix with 0.1 on the diagonal:
# 0.1^200 = 1e-200, already close to the limits of double precision.
det = np.linalg.det(np.eye(200) * 0.1)
print('determinant ' + str(det))
print('log determinant ' + str(np.log(det)))

# Output:
# determinant 1.0000000000012853e-200
# log determinant -460.51701859880785
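
For a somewhat larger matrix the problem becomes fatal rather than merely inconvenient: the determinant underflows to zero in double precision, so its logarithm comes out as -inf even though the true log determinant is perfectly representable.

import numpy as np

det = np.linalg.det(np.eye(800) * 0.1)
print('determinant ' + str(det))             # 0.0 (underflow; the true value is 1e-800)
print('log determinant ' + str(np.log(det))) # -inf, although 800 * log(0.1) is about -1842.07
# np.linalg.slogdet avoids this particular failure, but for the GP calculation
# the Cholesky-based route on the next slides is preferred anyway.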

Practical computation

We can break up the calculation into a few steps:

  1. Compute the Cholesky factorization of \mathbf{S} = \mathbf{C} + \sigma^2 \mathbf{I} to yield \mathbf{S} = \mathbf{LL}^{T}.

  2. Compute the log determinant via log \left| \mathbf{S} \right| = log \left| \mathbf{L}\mathbf{L}^{T} \right| = 2 log \left| \mathbf{L} \right| = 2 \sum_{i=1}^{N} log \; \mathbf{L}_{ii}

  3. Compute the quadratic term via \mathbf{t}^{T} \mathbf{S}^{-1} \mathbf{t} = \mathbf{t}^{T} \mathbf{L}^{-T} \mathbf{L}^{-1} \mathbf{t} = \left( \mathbf{L}^{-1} \mathbf{t} \right)^{T} \left( \mathbf{L}^{-1} \mathbf{t}\right) = \mathbf{a}^{T} \mathbf{a}, where \mathbf{a} = \mathbf{L}^{-1} \mathbf{t} is obtained by forward substitution (solving \mathbf{L} \mathbf{a} = \mathbf{t}).

Practical computation

  4. Summing it all up: log \left(p \left( \mathbf{t} | \sigma^2, \boldsymbol{\theta} \right) \right) = - \frac{N}{2} log \left( 2 \pi \right) - \sum_{i=1}^{N} log \; \mathbf{L}_{ii} - \frac{1}{2} \mathbf{a}^{T} \mathbf{a}; a short code sketch combining these steps follows below.

Note: never compute the determinant or the inverse of \mathbf{S} directly!
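
Putting steps 1-4 together, a minimal NumPy/SciPy sketch might look like the following; the function name and argument layout are illustrative.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def log_marginal_likelihood(t, C, sigma2):
    # log N(t | 0, C + sigma^2 I), computed via the Cholesky factorization so
    # that the determinant and inverse of S are never formed explicitly.
    N = t.shape[0]
    S = C + sigma2 * np.eye(N)
    # Step 1: Cholesky factorization S = L L^T, with L lower triangular.
    L = cholesky(S, lower=True)
    # Step 2: log|S| = 2 * sum_i log L_ii.
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    # Step 3: a = L^{-1} t by forward substitution, so t^T S^{-1} t = a^T a.
    a = solve_triangular(L, t, lower=True)
    # Step 4: assemble the three terms.
    return -0.5 * N * np.log(2 * np.pi) - 0.5 * log_det - 0.5 * (a @ a)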