Recall from last time that we demonstrated that the maximum likelihood solution under the Gaussian noise model is equivalent to the least squares solution for the unknown weights, \mathbf{w}.
Recall also that we defined the likelihood as a function describing how likely our data are, given the model parameters.
We shall now study the benefits of explicitly modelling noise:
quantifying uncertainty in the parameters;
expressing uncertainty in predictions.
Gaussian noise model
Consider a scenario where rather than solving
\hat{\mathbf{w}} = \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1} \mathbf{X}^{T} \mathbf{t}
we randomly perturb \mathbf{t} and work out \hat{\mathbf{w}} for each perturbed value, i.e.,
\hat{\mathbf{w}}_{j} = \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1} \mathbf{X}^{T} \left( \mathbf{t} + \boldsymbol{\epsilon}_{j} \right), \; \; \; \; \text{where} \; \; \boldsymbol{\epsilon}_{j} \sim \mathcal{N} \left( \mathbf{0}, \sigma^2 \mathbf{I} \right).
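As a minimal sketch of this perturbation experiment in Python (the design matrix, true weights, and noise level below are invented for illustration; they are not the lecture's dataset):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix: a column of ones and N input locations.
N, sigma2 = 20, 0.05
x = np.linspace(0.0, 1.0, N)
X = np.column_stack([np.ones(N), x])

# Illustrative "true" weights and noise-free targets.
w_true = np.array([1.0, -2.0])
t = X @ w_true

# Solve the least squares problem for several perturbed copies of t.
A = np.linalg.solve(X.T @ X, X.T)                  # (X^T X)^{-1} X^T
w_hats = []
for j in range(500):
    eps = rng.normal(0.0, np.sqrt(sigma2), size=N) # epsilon_j ~ N(0, sigma^2 I)
    w_hats.append(A @ (t + eps))
w_hats = np.array(w_hats)
print(w_hats[:5])                                  # each row is one perturbed estimate \hat{w}_j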
It will be useful to understand how this change to the data, now contaminated by noise, affects the model parameters. Recall that last time we expressed our model, which was a product of univariate Gaussians, as a single multivariate Gaussian of the form
p \left( \mathbf{t} | \mathbf{X}, \mathbf{w}, \sigma^2 \right) = \mathcal{N}\left( \mathbf{Xw}, \sigma^2 \mathbf{I} \right)
In what follows, we would like to better describe the scatter we saw on the previous slide by computing the mean and covariance of \mathbf{\hat{w}}. Computing the expectation and covariance with respect to the generating distribution p \left( \mathbf{t} | \mathbf{X}, \mathbf{w}, \sigma^2 \right) gives
\mathbb{E} \left[ \mathbf{\hat{w}} \right] = \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1} \mathbf{X}^{T} \, \mathbb{E} \left[ \mathbf{t} \right] = \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1} \mathbf{X}^{T} \mathbf{X} \mathbf{w} = \mathbf{w}, \qquad
Cov \left[ \mathbf{\hat{w}} \right] = \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1} \mathbf{X}^{T} \, Cov \left[ \mathbf{t} \right] \, \mathbf{X} \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1} = \sigma^2 \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1}.
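A small self-contained check, again with invented numbers, that the empirical scatter of the perturbed estimates matches this closed-form mean and covariance:

import numpy as np

rng = np.random.default_rng(1)
N, sigma2 = 20, 0.05
X = np.column_stack([np.ones(N), np.linspace(0.0, 1.0, N)])   # hypothetical design matrix
t = X @ np.array([1.0, -2.0])                                 # illustrative noise-free targets

A = np.linalg.solve(X.T @ X, X.T)                             # (X^T X)^{-1} X^T
w_hats = np.array([A @ (t + rng.normal(0.0, np.sqrt(sigma2), N))
                   for _ in range(20000)])                    # perturbed estimates \hat{w}_j

print(w_hats.mean(axis=0))                                    # approx. w = [1, -2]
print(np.cov(w_hats.T))                                       # empirical covariance of \hat{w}_j
print(sigma2 * np.linalg.inv(X.T @ X))                        # sigma^2 (X^T X)^{-1} for comparison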
This covariance matrix, \sigma^2 \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1}, is important! In fact, you can show that
Cov \left[ \mathbf{\hat{w}} \right] = - \left( \frac{\partial^2 \mathcal{L} }{\partial \mathbf{w} \partial \mathbf{w}^{T}} \right)^{-1}.
This tells us that our confidence in the parameters is linked directly to the second derivative of the log-likelihood function.
Low curvature corresponds to high uncertainty in \mathbf{\hat{w}}.
We now have an expression that tells us how much information our data gives us about the parameter estimates \mathbf{\hat{w}}.
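For completeness, the relation quoted above follows from differentiating the Gaussian log-likelihood twice; this standard computation is included here for reference:
\begin{aligned}
\mathcal{L} &= \log p \left( \mathbf{t} | \mathbf{X}, \mathbf{w}, \sigma^2 \right) = \text{const} - \frac{1}{2\sigma^2} \left( \mathbf{t} - \mathbf{Xw} \right)^{T} \left( \mathbf{t} - \mathbf{Xw} \right) \\
\frac{\partial \mathcal{L}}{\partial \mathbf{w}} &= \frac{1}{\sigma^2} \mathbf{X}^{T} \left( \mathbf{t} - \mathbf{Xw} \right), \qquad
\frac{\partial^2 \mathcal{L}}{\partial \mathbf{w} \partial \mathbf{w}^{T}} = -\frac{1}{\sigma^2} \mathbf{X}^{T} \mathbf{X} \\
- \left( \frac{\partial^2 \mathcal{L}}{\partial \mathbf{w} \partial \mathbf{w}^{T}} \right)^{-1} &= \sigma^2 \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1} = Cov \left[ \mathbf{\hat{w}} \right].
\end{aligned}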
Gaussian noise model
The matrix \sigma^2 \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1} is the inverse of the Fisher Information matrix, \mathcal{I}, i.e.,
\mathcal{I} = \frac{1}{\sigma^2} \mathbf{X}^{T} \mathbf{X}.
The elements of this matrix tell us how much information the data provide: the larger the entries of \mathcal{I} (equivalently, the more negative the curvature of the log-likelihood), the more information there is.
If the data are very noisy (\sigma^2 is large), the information content is lower.
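A small numerical sketch (with an invented design matrix) of how the noise level enters \mathcal{I} and, through its inverse, the parameter variances:

import numpy as np

X = np.column_stack([np.ones(5), np.arange(5.0)])    # hypothetical 5 x 2 design matrix

for sigma2 in (0.1, 1.0):                            # low-noise vs. high-noise data
    I_fisher = (X.T @ X) / sigma2                    # Fisher information matrix
    print(sigma2, np.diag(np.linalg.inv(I_fisher)))  # parameter variances grow with sigma^2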
We have quantified the mean and covariance of the model parameters \mathbf{\hat{w}}, but have not yet shown how they can be used to obtain the mean and covariance of the predictions.
To aid this exposition, let us introduce \mathbf{X}_{new} \in \mathbb{R}^{S \times 2} where S is the number of new x locations (in this case, times) at which our model needs to be evaluated, i.e.,
\mathbf{X}_{new}=\left[\begin{array}{cc}
1 & x^{new}_{1}\\
1 & x^{new}_{2}\\
\vdots & \vdots\\
1 & x^{new}_{S}
\end{array}\right]
To predict \mathbf{t}_{new}, we simply multiply \mathbf{X}_{new} by the best set of model parameters \mathbf{\hat{w}}, i.e., \mathbf{t}_{new} = \mathbf{X}_{new} \mathbf{\hat{w}}
Using the formulas for the mean and covariance we can write
\begin{aligned}
\mathbb{E}_{p \left( \mathbf{t} | \mathbf{X}, \mathbf{w}, \sigma^2 \right)} \left[ \mathbf{t}_{new} \right] & = \mathbf{X}_{new} \mathbb{E}_{p \left( \mathbf{t} | \mathbf{X}, \mathbf{w}, \sigma^2 \right)} \left[ \mathbf{\hat{w}} \right] = \mathbf{X}_{new} \mathbf{w} \\
Cov_{p \left( \mathbf{t} | \mathbf{X}, \mathbf{w}, \sigma^2 \right)} \left[ \mathbf{t}_{new} \right] & = \sigma^2 \mathbf{X}_{new} \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1} \mathbf{X}^{T}_{new} \\
& = \mathbf{X}_{new} Cov \left[ \mathbf{\hat{w}} \right] \mathbf{X}^{T}_{new}
\end{aligned}
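A minimal Python sketch of these predictive formulas, assuming an invented training set and invented new input locations:

import numpy as np

rng = np.random.default_rng(2)

# Hypothetical training data (invented numbers, not the lecture's dataset).
N, sigma2 = 20, 0.05
X = np.column_stack([np.ones(N), np.linspace(0.0, 1.0, N)])
w_true = np.array([1.0, -2.0])
t = X @ w_true + rng.normal(0.0, np.sqrt(sigma2), N)

# Least squares fit and parameter covariance.
w_hat = np.linalg.solve(X.T @ X, X.T @ t)
cov_w = sigma2 * np.linalg.inv(X.T @ X)

# New input locations at which to predict.
x_new = np.array([1.1, 1.2, 1.3])
X_new = np.column_stack([np.ones_like(x_new), x_new])

t_new_mean = X_new @ w_hat                 # predictive mean X_new \hat{w}
t_new_cov = X_new @ cov_w @ X_new.T        # X_new Cov[\hat{w}] X_new^T
print(t_new_mean)
print(np.sqrt(np.diag(t_new_cov)))         # per-prediction standard deviations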
Recall that, earlier, we established that uncertainty in the model parameters \mathbf{\hat{w}} can be used to explain uncertainty in the data \mathbf{t}.
We now introduce the notion of a prior over \mathbf{w}. For now, and for no better reason, we will assume a Gaussian distribution, i.e.,
p \left( \mathbf{w} | \boldsymbol{\mu}_{0}, \boldsymbol{\Sigma}_{0} \right) = \mathcal{N} \left( \boldsymbol{\mu}_{0}, \boldsymbol{\Sigma}_{0} \right)
where the mean \boldsymbol{\mu}_{0} and covariance \boldsymbol{\Sigma}_{0} are typically chosen by the user / practitioner.
From a notation standpoint, note that we will not explicitly condition on \boldsymbol{\mu}_{0} and \boldsymbol{\Sigma}_{0}, i.e., we will use p \left( \mathbf{w} | \mathbf{t}, \mathbf{X}, \sigma^2 \right) instead of p \left( \mathbf{w} | \mathbf{t}, \mathbf{X}, \sigma^2, \boldsymbol{\mu}_{0}, \boldsymbol{\Sigma}_{0} \right).
Figure: scatter in the model parameters, \mathbf{w}.
Bayesian Model
The likelihood, p \left( \mathbf{t} | \mathbf{w}, \mathbf{X}, \sigma^2 \right), is the quantity that we maximized before.
Note that our model is of the form \mathbf{t} = \mathbf{Xw} + \boldsymbol{\epsilon}, where \boldsymbol{\epsilon} \sim \mathcal{N}\left(\mathbf{0}, \sigma^2 \mathbf{I} \right).
As we have shown before, a Gaussian random variable plus a constant is equivalent to a Gaussian random variable with a shifted mean, i.e.,
p \left( \mathbf{t} | \mathbf{w}, \mathbf{X}, \sigma^2 \right) = \mathcal{N}\left( \mathbf{Xw}, \sigma^2 \mathbf{I} \right)
The likelihood is sometimes called the noise model.
Bayesian Model
Notionally, we can express the posterior density as
\textsf{posterior} \propto \textsf{likelihood} \times \textsf{prior}
where the symbol \propto hides the proportionality factor \int \textsf{likelihood} \times \textsf{prior}, with the integral taken over the space of the prior (i.e., over \mathbf{w}).
This proportionality factor is also known as the:
marginal likelihood
evidence
prior predictive
partition function.
The posterior is written using Bayes’ rule via
p \left( \mathbf{w} | \mathbf{t}, \mathbf{X}, \sigma^2 \right) = \frac{p \left( \mathbf{t} | \mathbf{w}, \mathbf{X}, \sigma^2 \right) p \left( \mathbf{w} \right) }{\int p \left( \mathbf{t} | \mathbf{w}, \mathbf{X}, \sigma^2 \right) p \left( \mathbf{w} \right) d\mathbf{w} } = \frac{p \left( \mathbf{t} | \mathbf{w}, \mathbf{X}, \sigma^2 \right) p \left( \mathbf{w} \right) }{p \left( \mathbf{t} | \mathbf{X}, \sigma^2 \right)}
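Although we will not need it immediately, for the Gaussian prior and likelihood assumed here this denominator is available in closed form (a standard result, quoted without derivation):
p \left( \mathbf{t} | \mathbf{X}, \sigma^2 \right) = \int \mathcal{N} \left( \mathbf{t}; \mathbf{Xw}, \sigma^2 \mathbf{I} \right) \mathcal{N} \left( \mathbf{w}; \boldsymbol{\mu}_{0}, \boldsymbol{\Sigma}_{0} \right) d\mathbf{w} = \mathcal{N} \left( \mathbf{t}; \mathbf{X} \boldsymbol{\mu}_{0}, \mathbf{X} \boldsymbol{\Sigma}_{0} \mathbf{X}^{T} + \sigma^2 \mathbf{I} \right).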
Bayesian Model
Prior
p \left( \mathbf{w} \right)
Likelihood
p \left( \mathbf{t} | \mathbf{w}, \mathbf{X}, \sigma^2 \right)
Posterior
p \left( \mathbf{w} | \mathbf{t}, \mathbf{X}, \sigma^2 \right)
where
\mathbf{X} encodes the model attributes (or inputs or covariates),
\mathbf{t} are the observed values (or outputs), and
\sigma^2 in this case is an assumed (known) noise variance.
Figure: a graphical representation of the model.
Bayesian Model
We shall now solve for this posterior, assuming that both the prior and likelihood are Gaussian.
In this case, we know that the posterior will be Gaussian!
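As a preview, here is a minimal numerical sketch of that Gaussian posterior in Python; it uses the standard conjugate-Gaussian expressions that the derivation to follow will establish, and the data, prior mean, and prior covariance below are invented for illustration only:

import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data and noise level.
N, sigma2 = 20, 0.05
X = np.column_stack([np.ones(N), np.linspace(0.0, 1.0, N)])
t = X @ np.array([1.0, -2.0]) + rng.normal(0.0, np.sqrt(sigma2), N)

# Gaussian prior p(w) = N(mu0, Sigma0), chosen by the practitioner.
mu0 = np.zeros(2)
Sigma0 = np.eye(2)

# Conjugate-Gaussian posterior p(w | t, X, sigma^2) = N(mu_post, Sigma_post), where
# Sigma_post = (Sigma0^{-1} + X^T X / sigma^2)^{-1}
# mu_post    = Sigma_post (Sigma0^{-1} mu0 + X^T t / sigma^2)
Sigma_post = np.linalg.inv(np.linalg.inv(Sigma0) + (X.T @ X) / sigma2)
mu_post = Sigma_post @ (np.linalg.solve(Sigma0, mu0) + (X.T @ t) / sigma2)
print(mu_post)
print(Sigma_post)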