These notes attempt to describe kernels. The hope is that, after going through them, the reader appreciates just how powerful kernels are, and the role they play in Gaussian process models.
```python
### Data
import plotly.graph_objects as go
import plotly.figure_factory as ff
import plotly.express as px
import numpy as np
import pandas as pd
import plotly.io as pio
from IPython.display import display, HTML

pio.renderers.default = 'iframe'
```
Minimum norm solutions
One way to motivate the study of kernels is to consider a linear regression problem where one has more unknowns than observational data. Let \(\mathbf{X} = \left[\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{N}\right]^{T}\) be the \(N \times d\) data matrix corresponding to \(N\) observations of \(d\)-dimensional inputs. These input observations are accompanied by an output observational vector, \(\mathbf{y} \in \mathbb{R}^{N}\). Let \(\boldsymbol{\Phi} \left( \mathbf{X} \right) \in \mathbb{R}^{N \times M}\) be a parameterized matrix comprising \(M\) basis functions evaluated at the \(N\) inputs, i.e.,
\[
\boldsymbol{\Phi}\left( \mathbf{X} \right) = \left[\begin{array}{ccc}
\phi_{1}\left( \mathbf{x}_{1} \right) & \cdots & \phi_{M}\left( \mathbf{x}_{1} \right)\\
\vdots & \ddots & \vdots\\
\phi_{1}\left( \mathbf{x}_{N} \right) & \cdots & \phi_{M}\left( \mathbf{x}_{N} \right)
\end{array}\right].
\]
If we are interested in approximating \(f \left( \mathbf{X} \right) \approx \hat{f} \left( \mathbf{X} \right) = \mathbf{y} = \boldsymbol{\Phi} \left( \mathbf{X} \right) \boldsymbol{\alpha}\), we can determine the unknown coefficients \(\boldsymbol{\alpha} \in \mathbb{R}^{M}\) via least squares. This leads to the solution via the normal equations
\[
\boldsymbol{\alpha} = \left( \boldsymbol{\Phi}^{T} \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^{T} \mathbf{y}.
\]
Now, strictly speaking, one cannot use the normal equations to solve a problem where there are more unknowns than observations, because \(\left( \boldsymbol{\Phi}^{T} \boldsymbol{\Phi}\right)\) is not full rank. Recognizing that in such a situation there will likely be numerous solutions to \(\boldsymbol{\Phi}\left( \mathbf{X} \right) \boldsymbol{\alpha} = \mathbf{y}\), we seek the solution with the lowest \(L_2\) norm. This can be more conveniently formulated as a minimum norm problem, written as
\[
\begin{aligned}
\underset{\boldsymbol{\alpha}}{\textrm{minimize}} \; \; & \boldsymbol{\alpha}^{T} \boldsymbol{\alpha} \\
\textrm{subject to} \; \; & \boldsymbol{\Phi}\left( \mathbf{X} \right) \boldsymbol{\alpha} = \mathbf{y}.
\end{aligned}
\]
Introducing a vector of Lagrange multipliers \(\boldsymbol{\lambda}\), the Lagrangian is \(\mathcal{L}\left( \boldsymbol{\alpha}, \boldsymbol{\lambda} \right) = \boldsymbol{\alpha}^{T} \boldsymbol{\alpha} + \boldsymbol{\lambda}^{T} \left( \boldsymbol{\Phi} \boldsymbol{\alpha} - \mathbf{y} \right)\), and setting \(\partial \mathcal{L} / \partial \boldsymbol{\alpha} = \mathbf{0}\) gives \(2 \boldsymbol{\alpha} + \boldsymbol{\Phi}^{T} \boldsymbol{\lambda} = \mathbf{0}\).
This leads to \(\boldsymbol{\alpha} = - \boldsymbol{\Phi}^{T} \boldsymbol{\lambda} / 2\). Substituting this into the second expression above (the constraint) yields \(\boldsymbol{\lambda} = -2 \left(\boldsymbol{\Phi} \boldsymbol{\Phi}^{T} \right)^{-1} \mathbf{y}\). This leads to the minimum norm solution
\[
\boldsymbol{\alpha} = \boldsymbol{\Phi}^{T} \left( \boldsymbol{\Phi} \boldsymbol{\Phi}^{T} \right)^{-1} \mathbf{y}.
\]
Note that unlike \(\left( \boldsymbol{\Phi}^{T} \boldsymbol{\Phi} \right)\), the matrix \(\left( \boldsymbol{\Phi} \boldsymbol{\Phi}^{T} \right)\) does have full rank (provided the rows of \(\boldsymbol{\Phi}\) are linearly independent). Each entry of the latter is an inner product between feature vectors. To see this, define the two-point kernel function
\[
k\left( \mathbf{x}, \mathbf{x}' \right) = \phi\left( \mathbf{x} \right)^{T} \phi\left( \mathbf{x}' \right),
\]
where \(\phi\left( \mathbf{x} \right) \in \mathbb{R}^{M}\) is the vector of basis functions evaluated at \(\mathbf{x}\), so that \(\left[ \mathbf{K}\left( \mathbf{X}, \mathbf{X} \right) \right]_{ij} = k\left( \mathbf{x}_{i}, \mathbf{x}_{j} \right) = \left[ \boldsymbol{\Phi} \boldsymbol{\Phi}^{T} \right]_{ij}\).
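As a quick numerical sanity check of the derivation above, the sketch below builds a random stand-in for \(\boldsymbol{\Phi}\left( \mathbf{X} \right)\) with \(M \gg N\) (the sizes are arbitrary), confirms that \(\boldsymbol{\Phi}^{T}\boldsymbol{\Phi}\) is rank deficient, and checks that the minimum norm solution interpolates the data and agrees with the Moore-Penrose pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 50                        # more basis functions than observations (illustrative sizes)
Phi = rng.standard_normal((N, M))   # random stand-in for the feature matrix Phi(X)
y = rng.standard_normal(N)          # output observations

# Phi^T Phi is M x M but its rank is at most N, so the normal equations cannot be used
print(np.linalg.matrix_rank(Phi.T @ Phi))           # -> 5, not 50

# Minimum norm solution: alpha = Phi^T (Phi Phi^T)^{-1} y
alpha = Phi.T @ np.linalg.solve(Phi @ Phi.T, y)

print(np.allclose(Phi @ alpha, y))                  # interpolates the observations
print(np.allclose(alpha, np.linalg.pinv(Phi) @ y))  # matches the pseudoinverse solution
```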
The kernel trick (feature maps \(\rightarrow\) kernels)
From the coefficients \(\boldsymbol{\alpha}\) computed via the minimum norm solution, it should be clear that approximate values of the true function at new locations \(\mathbf{X}_{\ast}\) are given by
\[
\hat{f}\left( \mathbf{X}_{\ast} \right) = \boldsymbol{\Phi}\left( \mathbf{X}_{\ast} \right) \boldsymbol{\alpha} = \boldsymbol{\Phi}\left( \mathbf{X}_{\ast} \right) \boldsymbol{\Phi}^{T} \left( \boldsymbol{\Phi} \boldsymbol{\Phi}^{T} \right)^{-1} \mathbf{y} = \mathbf{K}\left( \mathbf{X}_{\ast}, \mathbf{X} \right) \mathbf{K}\left( \mathbf{X}, \mathbf{X} \right)^{-1} \mathbf{y}.
\]
The form of the expression above is exactly that of the posterior predictive mean of a noise-free Gaussian process model.
One need not compute the full \(N \times M\) feature matrix \(\boldsymbol{\Phi} \left( \mathbf{X} \right)\) explicitly to work out the \(N \times N\) matrix \(\mathbf{K}\left( \mathbf{X}, \mathbf{X} \right)\).
This latter point is why this is called the kernel trick: for a very large number of features \(M \gg N\) (possibly infinite), it is far more computationally efficient to work with \(\mathbf{K}\) directly.
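To make this concrete, here is a minimal sketch of the prediction formula above, assuming an RBF kernel and a toy sine function for the observations (both choices are purely illustrative); note that the feature matrix never appears.

```python
import numpy as np

def rbf_kernel(A, B, l=0.3):
    """k(a, b) = exp(-||a - b||^2 / (2 l^2)) for all pairs of rows of A and B."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * l**2))

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (10, 1))             # training inputs
y = np.sin(3 * X[:, 0])                     # noise-free observations of a toy function
X_star = np.linspace(-1, 1, 100)[:, None]   # prediction locations

# f_hat(X_*) = K(X_*, X) K(X, X)^{-1} y
K = rbf_kernel(X, X) + 1e-10 * np.eye(len(X))   # tiny jitter for numerical stability
K_star = rbf_kernel(X_star, X)
f_star = K_star @ np.linalg.solve(K, y)
```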
Another way to interpret the kernel trick is to consider the example that was discussed in lecture, with regard to the data in the plot below.
A lifting analogy
Consider a quadratic kernel in \(\mathbb{R}^{2}\), where \(\mathbf{x} = \left(x_1, x_2 \right)^{T}\) and \(\mathbf{v} = \left(v_1, v_2 \right)^{T}\). We can express this kernel as \[
\begin{aligned}
k \left( \mathbf{x}, \mathbf{v} \right) = \left( \mathbf{x}^{T} \mathbf{v} \right)^2 & = \left( \left[\begin{array}{cc}
x_{1} & x_{2}\end{array}\right]\left[\begin{array}{c}
v_{1}\\
v_{2}
\end{array}\right] \right)^2 \\
& = \left( x_1^2 v_1^2 + 2 x_1 x_2 v_1 v_2 + x_2^2 v_2^2\right) \\
& = \left[\begin{array}{ccc}
x^2_{1} & \sqrt{2} x_1 x_2 & x_2^2 \end{array}\right]\left[\begin{array}{c}
v_{1}^2\\
\sqrt{2}v_1 v_2 \\
v_{2}^2
\end{array}\right] \\
& = \phi \left( \mathbf{x} \right)^{T} \phi \left( \mathbf{v} \right).
\end{aligned}
\] where \(\phi \left( \cdot \right) \in \mathbb{R}^{3}\).
Now let us tabulate the number of operations required depending on which route one takes. Computing \(\left( \mathbf{x}^{T} \mathbf{v} \right)^2\) requires two multiplications (i.e., \(x_1 \times v_1\) and \(x_2 \times v_2\)), one sum (i.e., \(s = x_1 v_1 + x_2 v_2\)), and one product (i.e., \(s^2\)). This leads to a total of four operations.
Now consider the number of operations required for computing \(\phi \left( \mathbf{x} \right)^{T} \phi \left( \mathbf{v} \right)\). Assembling \(\phi \left( \mathbf{x} \right)\) and \(\phi \left( \mathbf{v} \right)\) requires three products each; their inner product incurs another three products and one sum, leading to a total of 10 operations (9 multiplications and one sum). Thus, computationally, it is cheaper to use the original form for calculating the product.
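The sketch below verifies this identity numerically for two arbitrary points (the particular values carry no significance); the comments record the operation counts discussed above.

```python
import numpy as np

def phi(x):
    """Feature map associated with the quadratic kernel in R^2."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([0.3, -1.2])
v = np.array([2.0, 0.7])

lhs = (x @ v)**2             # kernel evaluated directly: 4 scalar operations
rhs = phi(x) @ phi(v)        # kernel evaluated via the lifted features: 10 scalar operations
print(np.isclose(lhs, rhs))  # -> True
```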
There is, however, another perspective to this. Data that is not linearly separable in \(\mathbb{R}^{2}\) can be lifted up to \(\mathbb{R}^{3}\), where a separation may be more easily inferred. In this particular case, \(k\) is a polynomial kernel and \(\phi \left( \mathbf{x} \right)\) is its associated feature map.
To visualize this, consider a red and a green set of random points. Points that lie at a relatively greater radius are shown in red, whilst points that are closer to the origin are shown in green.
```python
t = np.random.rand(40, 1) * 2 * np.pi   # random angles for the outer (red) points
r = np.random.rand(40, 1) * 0.2 + 2     # radii between 2 and 2.2
u = r * np.cos(t)
v = r * np.sin(t)
xy = np.random.rand(20, 2) * 2 - 1      # inner (green) points in [-1, 1]^2
xy2 = np.hstack([u, v])                 # outer (red) points on an annulus
```
It is not possible to separate these two sets using a line (or more generally, a hyperplane). However, when the same data is lifted to \(\mathbb{R}^{3}\), the two sets are linearly separable.
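Below is a minimal check of this claim, reusing the xy (green) and xy2 (red) arrays generated above together with the quadratic feature map from the previous subsection; the separating plane \(z_1 + z_3 = 2\) is just one convenient choice.

```python
def lift(points):
    """Map points in R^2 to R^3 via the quadratic feature map phi."""
    return np.column_stack([points[:, 0]**2,
                            np.sqrt(2) * points[:, 0] * points[:, 1],
                            points[:, 1]**2])

green_3d = lift(xy)    # inner (green) points, squared radius < 2
red_3d = lift(xy2)     # outer (red) points, squared radius >= 4

# In the lifted space, the plane z_1 + z_3 = 2 (i.e., x_1^2 + x_2^2 = 2) separates the sets
print((green_3d[:, 0] + green_3d[:, 2]).max() < 2 < (red_3d[:, 0] + red_3d[:, 2]).min())
```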
Kernels can also be motivated by considering infinitely many basis functions. Consider, for instance, bell-shaped (Gaussian) basis functions of the form
\[
\phi_{j}\left( x \right) = \exp\left( -\frac{\left( x - c_{j} \right)^{2}}{2l^{2}} \right),
\]
where \(c_j\) represents the center of the bell-shaped basis function; we assume that there are infinitely many centers across the domain of interest and thus there exist infinitely many basis terms. To visualize this, see the code below.
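One possible sketch of such a family of basis functions, using the plotting libraries imported at the top (the domain, the number of centers, and the length scale \(l\) are arbitrary choices):

```python
xg = np.linspace(-3, 3, 300)        # domain of interest
centers = np.linspace(-3, 3, 25)    # a dense (but finite) set of centers c_j
l = 0.4                             # assumed length scale

fig = go.Figure()
for c in centers:
    fig.add_trace(go.Scatter(x=xg, y=np.exp(-(xg - c)**2 / (2 * l**2)),
                             mode='lines', showlegend=False))
fig.update_layout(title='Bell-shaped basis functions across the domain',
                  xaxis_title='x', yaxis_title='phi_j(x)')
fig.show()
```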
The two-point covariance matrix can be written as the sum of outer products of the feature vectors evaluated at all points \(\mathbf{X}\) (which are individually rank-one matrices):
\[
\mathbf{K}\left( \mathbf{X}, \mathbf{X} \right) = \boldsymbol{\Phi} \boldsymbol{\Phi}^{T} = \sum_{j=1}^{M} \boldsymbol{\phi}_{j}\left( \mathbf{X} \right) \boldsymbol{\phi}_{j}\left( \mathbf{X} \right)^{T}, \qquad \textrm{where} \; \; \boldsymbol{\phi}_{j}\left( \mathbf{X} \right) = \left[\begin{array}{c}
\phi_{j}\left( x_{1} \right)\\
\vdots\\
\phi_{j}\left( x_{N} \right)
\end{array}\right].
\]
As the number of centers goes to infinity, each entry of this sum becomes an integral over the centers \(c\):
\[
k\left( x, x' \right) = \int_{-\infty}^{\infty} \exp\left( -\frac{\left( x - c \right)^{2}}{2l^{2}} \right) \exp\left( -\frac{\left( x' - c \right)^{2}}{2l^{2}} \right) dc = \sqrt{\pi}\, l \exp\left( -\frac{\left( x - x' \right)^{2}}{4l^{2}} \right).
\]
The last expression is easily recognizable as an RBF kernel with an amplitude of \(\sqrt{\pi}\, l\) and a slightly amended length scale of \(\sqrt{2}\, l\). It is straightforward to adapt this to multivariate \(\mathbf{x}\).
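As a numerical sanity check of that limit, the sketch below compares a brute-force quadrature over the centers with the closed form (the length scale and the two inputs are arbitrary):

```python
import numpy as np

l = 0.7                 # assumed length scale
x, x_prime = 0.3, -0.5  # two arbitrary inputs

# Approximate the integral over centers c by a Riemann sum on a wide, dense grid
c = np.linspace(-20, 20, 20001)
dc = c[1] - c[0]
integral = np.sum(np.exp(-(x - c)**2 / (2 * l**2)) *
                  np.exp(-(x_prime - c)**2 / (2 * l**2))) * dc

# Closed form: sqrt(pi) * l * exp(-(x - x')^2 / (4 l^2))
closed_form = np.sqrt(np.pi) * l * np.exp(-(x - x_prime)**2 / (4 * l**2))
print(np.isclose(integral, closed_form))  # -> True
```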
Note the utility of this representation: we essentially have infinitely many basis terms, yet the size of our covariance matrix is driven by the number of data points \(N\).