import numpy as np

# Even for a well-conditioned 200 x 200 matrix, the determinant sits at the
# edge of double-precision underflow, while its log is a perfectly ordinary number.
det = np.linalg.det(np.eye(200) * 0.1)
print('determinant ' + str(det))
print('log determinant ' + str(np.log(det)))
determinant 1.0000000000012853e-200
log determinant -460.51701859880785
Hyperparameter inference
Much of today's content is based on Chapter 5 of Rasmussen and Williams (RW). Please read Sections 5.1 to 5.4 (inclusive).
Almost all covariance functions have unknown hyperparameters. In many cases, the choice of hyperparameters, and the values they take after optimization, can provide useful insight into the underlying function.
Consider the squared exponential kernel k \left( \mathbf{x}, \mathbf{x}' \right) = \sigma_{f}^2 \exp \left( -\frac{1}{2l^2} \left\Vert \mathbf{x} - \mathbf{x}' \right\Vert _{2}^{2} \right), which has two hyperparameters, \sigma_{f} and l.
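As a concrete illustration, here is a minimal NumPy sketch of this kernel (the function name sq_exp_kernel and its signature are ours, not from RW):

import numpy as np

def sq_exp_kernel(X1, X2, sigma_f=1.0, l=1.0):
    # Pairwise squared Euclidean distances via broadcasting:
    # X1 is (N, D), X2 is (M, D), so the result is the (N, M) Gram matrix.
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return sigma_f**2 * np.exp(-0.5 * sq_dists / l**2)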
From a Bayesian perspective, we would like to assign priors to each hyperparameter, and then compute their posteriors.
Recall from before the marginal likelihood for a Gaussian process: p \left( \mathbf{t} | \sigma^2 \right) = \int p \left( \mathbf{t} | \mathbf{f}, \sigma^2 \right) p \left( \mathbf{f} \right) d \mathbf{f} = \mathcal{N} \left( \mathbf{t} | \mathbf{0}, \mathbf{C} + \sigma^2 \mathbf{I}_{N} \right)
Now, let \boldsymbol{\theta} = \left[ \sigma_f, l, \ldots \right]. Then, we want to work out p \left( \boldsymbol{\theta} | \mathbf{t} \right) = \frac{p \left( \mathbf{t} | \boldsymbol{\theta} \right) p \left( \boldsymbol{\theta} \right) }{p \left( \mathbf{t} \right)}
However, in this case the denominator (the marginal likelihood of the data, now with the hyperparameters integrated out) is usually intractable: p \left( \mathbf{t} \right) = \int p \left( \mathbf{t} | \boldsymbol{\theta} \right) p \left( \boldsymbol{\theta} \right) d \boldsymbol{\theta}
The workaround is to introduce an approximation. We will use the MAP (maximum a posteriori) estimate. Note that p\left( \mathbf{t} \right) is constant with respect to \boldsymbol{\theta}. p \left( \boldsymbol{\theta} | \mathbf{t} \right) = \frac{p \left( \mathbf{t} | \boldsymbol{\theta} \right) p \left( \boldsymbol{\theta} \right) }{p \left( \mathbf{t} \right)} \propto p \left( \mathbf{t} | \boldsymbol{\theta} \right) p \left( \boldsymbol{\theta} \right)
The MAP estimate is given by \hat{\boldsymbol{\theta}}_{MAP} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}} \; \log p \left( \boldsymbol{\theta} | \mathbf{t} \right) = \underset{\boldsymbol{\theta}}{\operatorname{argmax}} \left[ \log p \left( \mathbf{t} | \boldsymbol{\theta} \right) + \log p \left( \boldsymbol{\theta} \right) \right]
Once again, we can write the marginal likelihood as p \left( \mathbf{t} | \sigma^2, \boldsymbol{\theta} \right) = \int p \left( \mathbf{t} | \mathbf{f}, \sigma^2 \right) p \left( \mathbf{f} | \boldsymbol{\theta} \right) d \mathbf{f} = \mathcal{N} \left(\mathbf{t} | \mathbf{0}, \mathbf{C}\left( \boldsymbol{\theta} \right) + \sigma^2 \mathbf{I}_{N} \right), where the dependence on \boldsymbol{\theta} enters through the covariance matrix \mathbf{C}\left( \boldsymbol{\theta} \right).
From this, we can write \begin{aligned} \log p \left( \mathbf{t} | \sigma^2, \boldsymbol{\theta} \right) & = \log \mathcal{N} \left(\mathbf{t} | \mathbf{0}, \mathbf{C}\left( \boldsymbol{\theta} \right) + \sigma^2 \mathbf{I}_{N} \right) \\ & = \log \left[ \left( 2 \pi \right)^{-\frac{N}{2}} \left| \sigma^2 \mathbf{I} + \mathbf{C} \right|^{-\frac{1}{2}} \exp \left( - \frac{1}{2} \mathbf{t}^{T} \left(\sigma^2 \mathbf{I} + \mathbf{C} \right)^{-1} \mathbf{t} \right) \right] \\ & = - \frac{N}{2} \log \left( 2 \pi \right) - \frac{1}{2} \log \left| \sigma^2 \mathbf{I} + \mathbf{C} \right| - \frac{1}{2} \mathbf{t}^{T} \left(\sigma^2 \mathbf{I} + \mathbf{C} \right)^{-1} \mathbf{t} \end{aligned}
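Translated literally into NumPy, this formula would read as the sketch below (the naming is ours). It is shown only to make the expression concrete; as the next point stresses, this is exactly how not to compute it:

import numpy as np

def log_marginal_likelihood_naive(t, C, sigma2):
    # Direct transcription of the formula above -- numerically fragile!
    N = t.shape[0]
    S = C + sigma2 * np.eye(N)
    return (-0.5 * N * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(S))   # |S| under/overflows easily
            - 0.5 * t @ np.linalg.inv(S) @ t)  # explicit inverse: avoid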
For all intents and purposes, avoid directly computing inverses and determinants!
For instance, the determinant computed at the top of this section already sits at the edge of double-precision underflow; for a slightly larger matrix it would underflow to zero and its log would be -\infty. Instead, we can break the calculation up into a few steps (a code sketch follows the note below):
Compute the Cholesky factorization of \mathbf{S} = \mathbf{C} + \sigma^2 \mathbf{I} to yield \mathbf{S} = \mathbf{LL}^{T}, where \mathbf{L} is lower triangular.
Compute the log determinant via \log \left| \mathbf{S} \right| = \log \left| \mathbf{L}\mathbf{L}^{T} \right| = 2 \log \left| \mathbf{L} \right| = 2 \sum_{i=1}^{N} \log \mathbf{L}_{ii}
Compute the quadratic term via \mathbf{t}^{T} \mathbf{S}^{-1} \mathbf{t} = \mathbf{t}^{T} \mathbf{L}^{-T} \mathbf{L}^{-1} \mathbf{t} = \left( \mathbf{L}^{-1} \mathbf{t} \right)^{T} \left( \mathbf{L}^{-1} \mathbf{t} \right) = \mathbf{a}^{T} \mathbf{a}, where \mathbf{a} = \mathbf{L}^{-1} \mathbf{t} is obtained by a triangular solve (forward substitution) rather than by forming \mathbf{L}^{-1} explicitly.
Note: never compute the determinant or the inverse of \mathbf{S} directly!
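Putting the three steps together, a stable implementation might look like the following sketch (the function names are ours; the triangular solve uses SciPy):

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def log_marginal_likelihood(t, C, sigma2):
    N = t.shape[0]
    S = C + sigma2 * np.eye(N)
    L = cholesky(S, lower=True)                  # step 1: S = L L^T
    log_det = 2.0 * np.sum(np.log(np.diag(L)))   # step 2: log|S| = 2 sum_i log L_ii
    a = solve_triangular(L, t, lower=True)       # step 3: a = L^{-1} t by forward substitution
    return -0.5 * N * np.log(2 * np.pi) - 0.5 * log_det - 0.5 * a @ a

For maximum (marginal) likelihood estimation, or MAP estimation once a log-prior term is added, this function can then be handed to a generic optimizer. A hypothetical usage, assuming data arrays X of shape (N, D) and t of shape (N,), and the sq_exp_kernel helper sketched earlier:

from scipy.optimize import minimize

def neg_lml(log_theta, X, t):
    # Optimize in log space so that sigma_f, l, and sigma^2 stay positive.
    sigma_f, l, sigma2 = np.exp(log_theta)
    C = sq_exp_kernel(X, X, sigma_f=sigma_f, l=l)
    return -log_marginal_likelihood(t, C, sigma2)

# result = minimize(neg_lml, x0=np.zeros(3), args=(X, t))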