( Basic parts and less important content are skipped )

6. Kernel Methods

Sometimes, the training data ( or a subset of them ) are kept and used during the prediction phase!

Memory based methods

  • involve storing the entire training set in order to make future predictions

    ex) Kernel function, KNN

  • generally FAST to train, SLOW to make predictions on test data


Kernel function

  • \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\phi(\mathbf{x})^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right)\).
  • symmetric function : \(k\left(\mathrm{x}, \mathrm{x}^{\prime}\right)=k\left(\mathrm{x}^{\prime}, \mathrm{x}\right)\)
  • used in SVMs ( Boser et al., 1992 )


Kernel Trick ( = Kernel Substitution )

  • ex) can be applied to…

    • nonlinear variant of PCA
    • kernel Fisher discriminant


Different types of kernel

  • linear kernel : \(\phi(\mathrm{x})=\mathrm{x}\) \(\rightarrow\) \(k\left(\mathrm{x}, \mathrm{x}^{\prime}\right)=\mathrm{x}^{\mathrm{T}} \mathrm{x}^{\prime}\)

  • (property) stationary : \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=k\left(\mathbf{x}-\mathbf{x}^{\prime}\right)\)

    • ex) radial basis functions ( also known as homogeneous kernels ), which depend only on the magnitude of the distance

      \(k\left(\mathrm{x}, \mathrm{x}^{\prime}\right)=k\left(\left\|\mathrm{x}-\mathrm{x}^{\prime}\right\|\right)\).

6-1. Dual Representations

Minimize the regularized SSE ( sum-of-squares error ) function :

  • \(J(\mathbf{w})=\frac{1}{2} \sum_{n=1}^{N}\left\{\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)-t_{n}\right\}^{2}+\frac{\lambda}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}\).


[step 1] take the derivative w.r.t. \(\mathbf{w}\) & set it to zero :

  • \(\mathbf{w}=-\frac{1}{\lambda} \sum_{n=1}^{N}\left\{\mathbf{w}^{\mathrm{T}} \phi\left(\mathbf{x}_{n}\right)-t_{n}\right\} \phi\left(\mathbf{x}_{n}\right)=\sum_{n=1}^{N} a_{n} \phi\left(\mathbf{x}_{n}\right)=\mathbf{\Phi}^{\mathrm{T}} \mathbf{a}\).

    where \(a_{n}=-\frac{1}{\lambda}\left\{\mathbf{w}^{\mathrm{T}} \phi\left(\mathbf{x}_{n}\right)-t_{n}\right\}\)


[step 2] Instead of working with \(\mathbf{w}\), use \(\mathbf{a}\) ( = substitute \(\mathbf{w}=\boldsymbol{\Phi}^{\mathrm{T}} \mathbf{a}\) into \(J(\mathbf{w})\) )

  • \(J(\mathbf{a})=\frac{1}{2} \mathbf{a}^{\mathrm{T}} \Phi \Phi^{\mathrm{T}} \Phi \Phi^{\mathrm{T}} \mathbf{a}-\mathbf{a}^{\mathrm{T}} \Phi \Phi^{\mathrm{T}} \mathbf{t}+\frac{1}{2} \mathbf{t}^{\mathrm{T}} \mathbf{t}+\frac{\lambda}{2} \mathbf{a}^{\mathrm{T}} \Phi \Phi^{\mathrm{T}} \mathbf{a}\) .


[step 3] Use Gram matrix \(\mathbf{K}=\Phi \Phi^{\mathrm{T}}\)

  • \(J(\mathbf{a})=\frac{1}{2} \mathbf{a}^{\mathrm{T}} \mathbf{K} \mathbf{K} \mathbf{a}-\mathbf{a}^{\mathrm{T}} \mathbf{K} \mathbf{t}+\frac{1}{2} \mathbf{t}^{\mathrm{T}} \mathbf{t}+\frac{\lambda}{2} \mathbf{a}^{\mathrm{T}} \mathbf{K} \mathbf{a}\).


[step 4] Solution ( \(\mathbf{a}\) )

  • \(\mathbf{a}=\left(\mathbf{K}+\lambda \mathbf{I}_{N}\right)^{-1} \mathbf{t}\).


[step 5] Prediction for new input \(\mathbf{x}\)

  • substitute this back into the linear regression model
  • \(y(\mathbf{x})=\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x})=\mathbf{a}^{\mathrm{T}} \boldsymbol{\Phi} \boldsymbol{\phi}(\mathbf{x})=\mathbf{k}(\mathbf{x})^{\mathrm{T}}\left(\mathbf{K}+\lambda \mathbf{I}_{N}\right)^{-1} \mathbf{t}\).
    • \(\mathrm{k}(\mathrm{x})\) with elements \(k_{n}(\mathrm{x})=k\left(\mathrm{x}_{n}, \mathrm{x}\right)\)


The dual formulation allows us to express the solution entirely in terms of the kernel function \(k(x,x')\).
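Below is a minimal NumPy sketch of this dual solution ( i.e., kernel ridge regression ). The Gaussian kernel, the toy sine data, and the value of \(\lambda\) are illustrative assumptions, not part of the derivation above.

```python
import numpy as np

# Sketch of the dual solution a = (K + lambda I)^{-1} t and y(x) = k(x)^T a.
# The Gaussian kernel and the toy data are illustrative choices.
def gaussian_kernel(X1, X2, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), computed pairwise
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))             # training inputs
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)  # noisy targets

lam = 0.1
K = gaussian_kernel(X, X)                        # Gram matrix K = Phi Phi^T
a = np.linalg.solve(K + lam * np.eye(len(X)), t) # dual variables a

X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
y_new = gaussian_kernel(X_new, X) @ a            # y(x) = k(x)^T (K + lam I)^{-1} t
print(y_new)
```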

6-2. Constructing Kernels

\[k\left(x, x^{\prime}\right)=\phi(x)^{\mathrm{T}} \phi\left(x^{\prime}\right)=\sum_{i=1}^{M} \phi_{i}(x) \phi_{i}\left(x^{\prime}\right)\]
  • \(\phi_{i}(x)\) are the basis functions

  • example) \(k(\mathbf{x}, \mathbf{z})=\left(\mathbf{x}^{\mathrm{T}} \mathbf{z}\right)^{2}\)

    \(\begin{aligned} k(\mathbf{x}, \mathbf{z}) &=\left(\mathbf{x}^{\mathrm{T}} \mathbf{z}\right)^{2}=\left(x_{1} z_{1}+x_{2} z_{2}\right)^{2} \\ &=x_{1}^{2} z_{1}^{2}+2 x_{1} z_{1} x_{2} z_{2}+x_{2}^{2} z_{2}^{2} \\ &=\left(x_{1}^{2}, \sqrt{2} x_{1} x_{2}, x_{2}^{2}\right)\left(z_{1}^{2}, \sqrt{2} z_{1} z_{2}, z_{2}^{2}\right)^{\mathrm{T}} \\ &=\phi(\mathbf{x})^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{z}) \end{aligned}\).

    where \(\phi(\mathrm{x})=\left(x_{1}^{2}, \sqrt{2} x_{1} x_{2}, x_{2}^{2}\right)^{\mathrm{T}}\)
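A quick numerical check of this example ( the test points below are arbitrary ) :

```python
import numpy as np

# Verify that k(x, z) = (x^T z)^2 equals phi(x)^T phi(z) with
# phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)^T.
def phi(v):
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.5, -0.7])
z = np.array([0.3, 2.0])

print((x @ z)**2)          # kernel evaluated directly
print(phi(x) @ phi(z))     # same value via the explicit feature map
```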


Necessary and sufficient condition for \(k(x,x')\) :

  • the Gram matrix \(\mathbf{K}\) ( with elements \(k(\mathbf{x}_n, \mathbf{x}_m)\) ) must be positive semidefinite for all possible choices of the set \(\{\mathbf{x}_n\}\)!


Can build new kernels by using simpler kernels as building blocks :

\[\begin{aligned} k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=c k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=f(\mathbf{x}) k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) f\left(\mathbf{x}^{\prime}\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=q\left(k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=\exp \left(k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)+k_{2}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) k_{2}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{3}\left(\boldsymbol{\phi}(\mathbf{x}), \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right)\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=\mathbf{x}^{\mathrm{T}} \mathbf{A} \mathbf{x}^{\prime} \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{a}\left(\mathbf{x}_{a}, \mathbf{x}_{a}^{\prime}\right)+k_{b}\left(\mathbf{x}_{b}, \mathbf{x}_{b}^{\prime}\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{a}\left(\mathbf{x}_{a}, \mathbf{x}_{a}^{\prime}\right) k_{b}\left(\mathbf{x}_{b}, \mathbf{x}_{b}^{\prime}\right) \end{aligned}\]
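A small numerical sanity check ( not a proof ) of a few of these rules, assuming a linear and a Gaussian kernel as building blocks :

```python
import numpy as np

# Build kernels from the rules above (k1 + k2, k1 * k2, exp(k1)) and confirm
# the resulting Gram matrices on random points have no significantly
# negative eigenvalues.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))

K1 = X @ X.T                                   # linear kernel
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * K1
K2 = np.exp(-0.5 * sq)                         # Gaussian kernel

for K in (K1 + K2, K1 * K2, np.exp(K1)):
    eigvals = np.linalg.eigvalsh(K)
    print(eigvals.min() >= -1e-8)              # True: PSD up to rounding error
```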


Requires the kernel function \(k(x,x')\) to be…

  • 1) symmetric
  • 2) positive semidefinite
  • 3) express the appropriate form of similarity between \(x\) and \(x'\)


Gaussian kernel

  • \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\exp \left(-\left\|\mathbf{x}-\mathbf{x}^{\prime}\right\|^{2} / 2 \sigma^{2}\right)\).

  • \(\left\|\mathrm{x}-\mathrm{x}^{\prime}\right\|^{2}=\mathrm{x}^{\mathrm{T}} \mathrm{x}+\left(\mathrm{x}^{\prime}\right)^{\mathrm{T}} \mathrm{x}^{\prime}-2 \mathrm{x}^{\mathrm{T}} \mathrm{x}^{\prime}\).

  • \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\exp \left(-\mathbf{x}^{\mathrm{T}} \mathbf{x} / 2 \sigma^{2}\right) \exp \left(\mathbf{x}^{\mathrm{T}} \mathbf{x}^{\prime} / \sigma^{2}\right) \exp \left(-\left(\mathbf{x}^{\prime}\right)^{\mathrm{T}} \mathbf{x}^{\prime} / 2 \sigma^{2}\right)\).

  • not restricted to Euclidean distance! ( see the sketch below )

    \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\exp \left\{-\frac{1}{2 \sigma^{2}}\left(\kappa(\mathbf{x}, \mathbf{x})+\kappa\left(\mathbf{x}^{\prime}, \mathbf{x}^{\prime}\right)-2 \kappa\left(\mathbf{x}, \mathbf{x}^{\prime}\right)\right)\right\}\).
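A minimal sketch of this generalized form, where the squared Euclidean distance is replaced by the distance induced by another kernel \(\kappa\). Taking \(\kappa\) to be the quadratic kernel \((\mathbf{x}^{\mathrm{T}}\mathbf{x}')^2\) is an illustrative assumption.

```python
import numpy as np

# Generalized Gaussian kernel: substitute a nonlinear kernel kappa for the
# usual inner product inside the squared distance.
def kappa(x, xp):
    return (x @ xp)**2                         # illustrative choice of kappa

def generalized_gaussian(x, xp, sigma=1.0):
    d2 = kappa(x, x) + kappa(xp, xp) - 2 * kappa(x, xp)  # "distance" in feature space
    return np.exp(-d2 / (2 * sigma**2))

x = np.array([1.0, 2.0])
xp = np.array([0.5, -1.0])
print(generalized_gaussian(x, xp))
```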


Generative vs Discriminative models

  • Generative : can deal with missing data
  • Discriminative : better performance on discriminative tasks

\(\rightarrow\) combine the two! Use a generative model to define a kernel, then use this kernel in a discriminative approach


Generative model (1) Intro & HMM

  • \(k\left(\mathrm{x}, \mathrm{x}^{\prime}\right)=p(\mathrm{x}) p\left(\mathrm{x}^{\prime}\right)\).

  • extend this class of kernels!

    \(k\left(\mathrm{x}, \mathrm{x}^{\prime}\right)=\sum_{i} p(\mathrm{x} \mid i) p\left(\mathrm{x}^{\prime} \mid i\right) p(i)\).

    • mixture distribution
    • \(i\) : role of ‘latent variable’
  • limit of infinite sum :

    \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\int p(\mathbf{x} \mid \mathbf{z}) p\left(\mathbf{x}^{\prime} \mid \mathbf{z}\right) p(\mathbf{z}) \mathrm{d} \mathbf{z}\).

    ( \(z\) : continuous latent variable )

  • popular generative model for sequence : HMM (Hidden Markov Model)

    \(k\left(\mathbf{X}, \mathbf{X}^{\prime}\right)=\sum_{\mathbf{Z}} p(\mathbf{X} \mid \mathbf{Z}) p\left(\mathbf{X}^{\prime} \mid \mathbf{Z}\right) p(\mathbf{Z})\).

    • expresses the distribution \(p(\mathbf{X})\)

    • hidden states \(\mathrm{Z}=\left\{\mathrm{z}_{1}, \ldots, \mathrm{z}_{L}\right\}\)


Generative model (2) Fisher Kernel

  • Fisher score : \(\mathrm{g}(\boldsymbol{\theta}, \mathbf{x})=\nabla_{\boldsymbol{\theta}} \ln p(\mathbf{x} \mid \boldsymbol{\theta})\)

  • Fisher Kernel : \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\mathbf{g}(\boldsymbol{\theta}, \mathbf{x})^{\mathrm{T}} \mathbf{F}^{-1} \mathbf{g}\left(\boldsymbol{\theta}, \mathbf{x}^{\prime}\right)\)

    • Fisher Information Matrix : \(\mathbf{F}=\mathbb{E}_{\mathbf{x}}\left[\mathbf{g}(\boldsymbol{\theta}, \mathbf{x}) \mathbf{g}(\boldsymbol{\theta}, \mathbf{x})^{\mathrm{T}}\right]\)

      ( in practice, it is infeasible to evaluate. Thus use sample average! )

      \(\mathbf{F} \simeq \frac{1}{N} \sum_{n=1}^{N} \mathbf{g}\left(\boldsymbol{\theta}, \mathbf{x}_{n}\right) \mathbf{g}\left(\boldsymbol{\theta}, \mathbf{x}_{n}\right)^{\mathrm{T}}\).

  • Fisher Information = covariance matrix of Fisher score

  • more simply, we can omit \(\mathbf{F}\) ( see the sketch after this block )

    \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\mathbf{g}(\boldsymbol{\theta}, \mathbf{x})^{\mathrm{T}} \mathbf{g}\left(\boldsymbol{\theta}, \mathbf{x}^{\prime}\right)\).
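A minimal sketch of the Fisher kernel for a toy generative model: a univariate Gaussian with unknown mean \(\theta\) and unit variance, so that \(\mathrm{g}(\theta,x)=x-\theta\). The model choice and the value of \(\theta\) are illustrative assumptions.

```python
import numpy as np

# Toy Fisher kernel: p(x | theta) = N(x | theta, 1), so the Fisher score is
# g(theta, x) = d/dtheta ln p(x | theta) = x - theta.
theta = 0.0
def score(x):
    return np.atleast_1d(x - theta)            # Fisher score g(theta, x)

rng = np.random.default_rng(2)
X = rng.normal(loc=theta, scale=1.0, size=100)

# Sample-average approximation of the Fisher information matrix F
G = np.stack([score(x) for x in X])            # (N, 1) matrix of scores
F = (G.T @ G) / len(X)

def fisher_kernel(x, xp):
    return float(score(x) @ np.linalg.inv(F) @ score(xp))

print(fisher_kernel(1.2, -0.4))
```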


Sigmoidal Kernel

  • \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\tanh \left(a \mathbf{x}^{\mathrm{T}} \mathbf{x}^{\prime}+b\right)\).

  • its Gram matrix is in general not positive semidefinite ( see the check below )

  • but it has been used in practice!

    \(\rightarrow\) gives kernel expansions ( e.g., the SVM ) a superficial resemblance to neural network models

  • in the limit of an infinite number of basis functions, a Bayesian neural network (BNN) becomes a Gaussian process (GP)
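A small check that the Gram matrix can indeed fail to be positive semidefinite; the values of \(a\), \(b\) and the inputs below are just one choice that exhibits this.

```python
import numpy as np

# tanh ("sigmoidal") kernel: for some a, b and inputs, the Gram matrix
# has negative eigenvalues, so it is not a valid Mercer kernel.
a, b = 2.0, -1.0
X = np.array([[0.0], [1.0], [2.0]])

K = np.tanh(a * (X @ X.T) + b)
print(np.linalg.eigvalsh(K))   # at least one eigenvalue is negative
```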

6-3. Radial Basis Function Networks

RBF ( Radial Basis Functions )

  • depends only on the radial distance from a center

    that is, \(\phi_{j}(\mathbf{x})=h\left(\left\|\mathbf{x}-\boldsymbol{\mu}_{j}\right\|\right)\)


First introduced for “interpolation”

  • goal is to find a smooth function \(f(\mathbf{x})\) that fits every target value exactly

  • express with a linear combination of RBF

    \(f(\mathrm{x})=\sum_{n=1}^{N} w_{n} h\left(\left\|\mathrm{x}-\mathrm{x}_{n}\right\|\right)\).
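A minimal sketch of exact interpolation with radial basis functions; the Gaussian radial function \(h\), its width, and the toy data are illustrative choices.

```python
import numpy as np

# Exact RBF interpolation: solve for weights w so that f(x_n) = t_n at
# every training point.
rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 1, size=10))
t = np.sin(2 * np.pi * X)

def h(r, s=0.1):
    return np.exp(-r**2 / (2 * s**2))          # radial function h(||x - x_n||)

H = h(np.abs(X[:, None] - X[None, :]))         # N x N interpolation matrix
w = np.linalg.solve(H, t)                      # weights giving exact interpolation

x_test = 0.37
f = np.sum(w * h(np.abs(x_test - X)))          # f(x) = sum_n w_n h(||x - x_n||)
print(f, np.sin(2 * np.pi * x_test))           # interpolant vs. underlying function
```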


Other applications

  • regularization theory, interpolation problem when the input is noisy….

6-4. Gaussian Process

(until now) apply duality to a non-probabilistic model for regression

(from now) extend it to “probabilistic discriminative model”

  • (1) prior over \(w\)
  • (2) find posterior over \(w\)
  • (3) predictive distribution \(p(t\mid x)\)


GP : dispense with parametric model. Instead, define a “prior over functions” directly!!

6-4-1. Linear Regression revisited

\[y(\mathrm{x})=\mathrm{w}^{\mathrm{T}} \phi(\mathrm{x})\]
  • prior : \(p(\mathbf{w})=\mathcal{N}\left(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}\right)\).


joint distribution of \(y(\mathbf{x}_1), \ldots, y(\mathbf{x}_N)\), i.e., of \(\mathbf{y}=\Phi \mathbf{w}\) ( a sampling sketch follows this list )

  • \(\Phi\) : design matrix, with elements \(\Phi_{n k}=\phi_{k}\left(\mathrm{x}_{n}\right)\)
  • mean and covariance
    • mean : \(\mathbb{E}[\mathbf{y}] =\Phi \mathbb{E}[\mathbf{w}]=0\)
    • covariance : \(\operatorname{cov}[\mathbf{y}] =\mathbb{E}\left[\mathbf{y} \mathbf{y}^{\mathrm{T}}\right]=\mathbf{\Phi} \mathbb{E}\left[\mathbf{w} \mathbf{w}^{\mathrm{T}}\right] \mathbf{\Phi}^{\mathrm{T}}=\frac{1}{\alpha} \mathbf{\Phi} \Phi^{\mathrm{T}}=\mathbf{K}\)
      • \(K\) : Gram matrix with elements \(K_{n m}=k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)=\frac{1}{\alpha} \phi\left(\mathbf{x}_{n}\right)^{\mathrm{T}} \phi\left(\mathbf{x}_{m}\right)\)
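A short sketch of this view: draw \(\mathbf{w}\) from its prior and look at the induced sample \(\mathbf{y}=\Phi\mathbf{w}\), whose covariance is \(\mathbf{K}=\alpha^{-1}\Phi\Phi^{\mathrm{T}}\). The Gaussian basis functions and the input grid are illustrative assumptions.

```python
import numpy as np

# Sample w ~ N(0, alpha^{-1} I) and form y = Phi w; the covariance of y is
# K = (1/alpha) Phi Phi^T.
alpha = 2.0
x = np.linspace(-1, 1, 50)
centers = np.linspace(-1, 1, 9)
Phi = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * 0.2**2))   # design matrix

K = Phi @ Phi.T / alpha                        # K_nm = (1/alpha) phi(x_n)^T phi(x_m)

rng = np.random.default_rng(4)
w = rng.normal(scale=1 / np.sqrt(alpha), size=centers.size)
y = Phi @ w                                    # one sample function from the prior
print(y[:5], np.diag(K)[:5])
```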


Define kernel function “directly”

( rather than indirectly through the choice of basis function )

ex) \(k(x, x^{\prime})=\exp (-\theta\mid x-x^{\prime}\mid)\)

6-4-2. Gaussian Processes for Regression ( GPR )

observed target variable : \(t_{n}=y_{n}+\epsilon_{n}\), where \(y_{n}=y\left(\mathbf{x}_{n}\right)\) and \(\epsilon_{n}\) is random noise.


\(p\left(t_{n} \mid y_{n}\right)=\mathcal{N}\left(t_{n} \mid y_{n}, \beta^{-1}\right)\).

\(p(\mathbf{t} \mid \mathbf{y})=\mathcal{N}\left(\mathbf{t} \mid \mathbf{y}, \beta^{-1} \mathbf{I}_{N}\right)\).

  • precision of the noise : \(\beta\)


Marginal distribution

  • \(p(\mathbf{y})=\mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{K})\).

  • \(\mathbf{K}\) : Gram matrix


In order to find the marginal distribution of \(p(\mathbf{t})\):

  • need to integrate over \(y\)
  • \(p(\mathbf{t})=\int p(\mathbf{t} \mid \mathbf{y}) p(\mathbf{y}) \mathrm{d} \mathbf{y}=\mathcal{N}(\mathbf{t} \mid \mathbf{0}, \mathbf{C})\).
    • covariance matrix \(\mathbf{C}\) : \(C\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)=k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)+\beta^{-1} \delta_{n m}\)


widely used kernel for GPR : exponential of a quadratic form

\[k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)=\theta_{0} \exp \left\{-\frac{\theta_{1}}{2}\left\|\mathbf{x}_{n}-\mathbf{x}_{m}\right\|^{2}\right\}+\theta_{2}+\theta_{3} \mathbf{x}_{n}^{\mathrm{T}} \mathbf{x}_{m}\]


Predictive Distribution

  • goal in regression : make prediction of target variables for “new inputs”
  • require that we evaluate predictive distribution, \(p\left(t_{N+1} \mid \mathbf{t}_{N}\right)\)


How to find \(p\left(t_{N+1} \mid \mathbf{t}_{N}\right)\) ?

[step 1] joint distribution \(p\left(\mathbf{t}_{N+1}\right)\)

( where \(\mathbf{t}_{N+1}\) is \(\left(t_{1}, \ldots, t_{N}, t_{N+1}\right)^{\mathrm{T}}\) )

\(p\left(\mathbf{t}_{N+1}\right)=\mathcal{N}\left(\mathbf{t}_{N+1} \mid \mathbf{0}, \mathbf{C}_{N+1}\right)\). where \(\mathbf{C}_{N+1}=\left(\begin{array}{cc} \mathbf{C}_{N} & \mathbf{k} \\ \mathbf{k}^{\mathrm{T}} & c \end{array}\right)\)

  • vector \(\mathrm{k}\) has elements \(k\left(\mathrm{x}_{n}, \mathrm{x}_{N+1}\right)\)
  • \(c=k\left(\mathrm{x}_{N+1}, \mathrm{x}_{N+1}\right)+\beta^{-1}\).


[step 2] find \(p\left(t_{N+1} \mid \mathbf{t}_{N}\right)\)

Mean and variance

  • mean : \(m\left(\mathrm{x}_{N+1}\right) =\mathrm{k}^{\mathrm{T}} \mathrm{C}_{N}^{-1} \mathrm{t}\)
  • variance : \(\sigma^{2}\left(\mathrm{x}_{N+1}\right) =c-\mathrm{k}^{\mathrm{T}} \mathrm{C}_{N}^{-1} \mathrm{k}\)


Can also rewrite mean as…

\[m\left(\mathbf{x}_{N+1}\right)=\sum_{n=1}^{N} a_{n} k\left(\mathbf{x}_{n}, \mathbf{x}_{N+1}\right)\]
  • \(a_n\) : \(n^{th}\) component of \(\mathrm{C}_{N}^{-1} \mathrm{t}\).
  • if \(k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)\) depends only on the distance \(\left\|\mathbf{x}_{n}-\mathbf{x}_{m}\right\|\), then we obtain an expansion in radial basis functions ( see the sketch below )
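A minimal sketch of GP regression prediction with the exponential-of-a-quadratic kernel above: build \(\mathbf{C}_N\), then compute \(m(\mathbf{x}_{N+1})=\mathbf{k}^{\mathrm{T}}\mathbf{C}_N^{-1}\mathbf{t}\) and \(\sigma^2(\mathbf{x}_{N+1})=c-\mathbf{k}^{\mathrm{T}}\mathbf{C}_N^{-1}\mathbf{k}\). The hyperparameter values and toy data are illustrative assumptions.

```python
import numpy as np

theta0, theta1, theta2, theta3 = 1.0, 4.0, 0.0, 0.0
beta = 25.0                                    # noise precision

def kernel(X1, X2):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return theta0 * np.exp(-0.5 * theta1 * sq) + theta2 + theta3 * X1 @ X2.T

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, size=(20, 1))
t = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=1 / np.sqrt(beta), size=20)

C_N = kernel(X, X) + np.eye(len(X)) / beta     # C(x_n, x_m) = k(x_n, x_m) + beta^{-1} delta_nm

x_new = np.array([[0.5]])
k = kernel(X, x_new)[:, 0]                     # vector k with elements k(x_n, x_{N+1})
c = kernel(x_new, x_new)[0, 0] + 1 / beta

Cinv_t = np.linalg.solve(C_N, t)
mean = k @ Cinv_t                              # m(x_{N+1}) = k^T C_N^{-1} t
var = c - k @ np.linalg.solve(C_N, k)          # sigma^2(x_{N+1}) = c - k^T C_N^{-1} k
print(mean, var)
```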


Computational Operation

  • GP : \(O(N^3)\).

    • inversion of a matrix of size \(N \times N\)
  • Basis function model : \(O(M^3)\)

    • inversion of a matrix of size \(M \times M\)
  • both need a matrix inversion

  • at test time..

    • GP : \(O(N^2)\)
    • BF : \(O(M^2)\)
  • If \(M < N\), it is more efficient to work in the basis function framework

    but a GP can use covariance functions that could only be expressed with an infinite number of basis functions!


For large training datasets, direct application of GP is infeasible

\(\rightarrow\) approximations have been developed

6-4-3. Learning the hyperparameters

predictions of a GP depend on the choice of covariance function

  • rather than fixing the covariance function,

  • use a parametric family of functions & infer the parameters from the data

    ( parameters : length scale of the correlation, precision of the noise … )


\[p(t \mid \theta)\]
  • \(\theta\) : hyperparameters of the GP
  • simplest approach : maximize \(p(\mathbf{t}\mid \theta)\) w.r.t. \(\theta\) ( type-2 maximum likelihood )

  • \(\ln p(\mathbf{t} \mid \theta)=-\frac{1}{2} \ln \mid \mathbf{C}_{N}\mid-\frac{1}{2} \mathbf{t}^{\mathrm{T}} \mathbf{C}_{N}^{-1} \mathbf{t}-\frac{N}{2} \ln (2 \pi)\).

    \(\frac{\partial}{\partial \theta_{i}} \ln p(\mathbf{t} \mid \boldsymbol{\theta})=-\frac{1}{2} \operatorname{Tr}\left(\mathbf{C}_{N}^{-1} \frac{\partial \mathbf{C}_{N}}{\partial \theta_{i}}\right)+\frac{1}{2} \mathbf{t}^{\mathrm{T}} \mathbf{C}_{N}^{-1} \frac{\partial \mathbf{C}_{N}}{\partial \theta_{i}} \mathbf{C}_{N}^{-1} \mathbf{t}\).

    \(\rightarrow\) \(\ln p(\mathbf{t} \mid \boldsymbol{\theta})\) is in general non-convex, so it can have multiple maxima! ( a sketch of evaluating it follows )
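A sketch of evaluating \(\ln p(\mathbf{t}\mid\boldsymbol{\theta})\) and its gradient for a single hyperparameter. Assuming a one-parameter Gaussian kernel \(k=\exp(-\tfrac{\theta_1}{2}\|x-x'\|^2)\) and toy data; this is an illustrative setup, not a full optimizer.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.uniform(0, 1, size=(15, 1))
t = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=15)
beta = 100.0

x = X[:, 0]
sq = (x[:, None] - x[None, :])**2              # pairwise squared distances

def log_marginal(theta1):
    # ln p(t|theta) = -1/2 ln|C_N| - 1/2 t^T C_N^{-1} t - N/2 ln(2 pi)
    C = np.exp(-0.5 * theta1 * sq) + np.eye(len(X)) / beta
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * logdet - 0.5 * t @ np.linalg.solve(C, t) - 0.5 * len(X) * np.log(2 * np.pi)

def grad(theta1):
    C = np.exp(-0.5 * theta1 * sq) + np.eye(len(X)) / beta
    dC = -0.5 * sq * np.exp(-0.5 * theta1 * sq)       # dC_N / dtheta1
    Cinv_t = np.linalg.solve(C, t)
    # d/dtheta ln p = -1/2 Tr(C^{-1} dC) + 1/2 t^T C^{-1} dC C^{-1} t
    return -0.5 * np.trace(np.linalg.solve(C, dC)) + 0.5 * Cinv_t @ dC @ Cinv_t

print(log_marginal(4.0), grad(4.0))
```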


In Bayesian treatment, exact marginalization is intractable

\(\rightarrow\) need approximation

6-4-4. Automatic relevance determination (ARD)

in 6-4-3, we maximized the likelihood to find the hyperparameters!

this can be extended by “incorporating a separate parameter for each input variable”


Automatic relevance determination (ARD)

  • originally formulated in the framework of neural networks

  • example with 2d input \(\mathbf{x}=(x_1,x_2)\)

    \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\theta_{0} \exp \left\{-\frac{1}{2} \sum_{i=1}^{2} \eta_{i}\left(x_{i}-x_{i}^{\prime}\right)^{2}\right\}\).


Selecting important variables ( = discarding unimportant ones )

  • have \(\eta_i\) ( one for each input variable )

  • small \(\eta_i\) \(\rightarrow\) the function becomes relatively insensitive to the corresponding input variable \(x_i\)

  • It becomes possible to detect the input variables that have little effect on predictive distribution!

    ( since \(\eta_i\) will be small )

  • unimportant input (variables) will be discarded


ARD framework can be easily incorporated into the exponential-quadratic kernel

\[k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)=\theta_{0} \exp \left\{-\frac{1}{2} \sum_{i=1}^{D} \eta_{i}\left(x_{n i}-x_{m i}\right)^{2}\right\}+\theta_{2}+\theta_{3} \sum_{i=1}^{D} x_{n i} x_{m i}\]
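A minimal sketch of this ARD kernel with a separate precision \(\eta_i\) per input dimension; the \(\eta\) values below are chosen only to show how a small \(\eta_i\) makes the kernel insensitive to that input.

```python
import numpy as np

theta0, theta2, theta3 = 1.0, 0.0, 0.0
eta = np.array([10.0, 0.01])                   # x_2 is nearly irrelevant

def ard_kernel(x, xp):
    return (theta0 * np.exp(-0.5 * np.sum(eta * (x - xp)**2))
            + theta2 + theta3 * np.sum(x * xp))

x = np.array([0.2, 0.0])
print(ard_kernel(x, np.array([0.2, 5.0])))     # large change in x_2: kernel barely moves
print(ard_kernel(x, np.array([0.7, 0.0])))     # small change in x_1: kernel drops a lot
```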

6-4-5. Gaussian Process for Classification

probabilities must lie in the interval (0,1)

\(\rightarrow\) achieved by transforming the output of the GP with an appropriate nonlinear activation function

( = logistic sigmoid \(y=\sigma(a)\) )


model ( Bernoulli distribution ) : \(p(t \mid a)=\sigma(a)^{t}(1-\sigma(a))^{1-t}\)


GP prior : \(p\left(\mathbf{a}_{N+1}\right)=\mathcal{N}\left(\mathbf{a}_{N+1} \mid \mathbf{0}, \mathbf{C}_{N+1}\right)\)

  • \(C\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)=k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)+\nu \delta_{n m}\).
  • \(k\left(\mathrm{x}_{n}, \mathrm{x}_{m}\right)\) is any positive semidefinite kernel function
  • \(\nu\) is usually fixed in advance


Predictive distribution :

  • \(p\left(t_{N+1}=1 \mid \mathbf{t}_{N}\right)=\int p\left(t_{N+1}=1 \mid a_{N+1}\right) p\left(a_{N+1} \mid \mathbf{t}_{N}\right) \mathrm{d} a_{N+1}\).

    where \(p\left(t_{N+1}=1 \mid a_{N+1}\right)=\sigma\left(a_{N+1}\right)\) ( a Monte Carlo sketch follows )
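A sketch of only this last step, assuming a Gaussian approximation \(q(a_{N+1})=\mathcal{N}(a_{N+1}\mid\mu, s^2)\) has already been obtained by one of the methods listed below; the values of \(\mu\) and \(s^2\) are placeholders, and the 1-D integral is approximated by Monte Carlo.

```python
import numpy as np

# Predictive probability p(t_{N+1}=1 | t_N) ≈ E_q[ sigma(a_{N+1}) ], with
# q a Gaussian approximation of p(a_{N+1} | t_N).
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

mu, s2 = 0.8, 0.5                              # placeholder Gaussian approximation
rng = np.random.default_rng(7)
samples = rng.normal(mu, np.sqrt(s2), size=100_000)
p_class1 = sigmoid(samples).mean()             # Monte Carlo estimate of the integral
print(p_class1)
```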


3 different approaches to obtain a Gaussian approximation of \(p\left(a_{N+1} \mid \mathbf{t}_{N}\right)\)

  • (1) Variational Inference
  • (2) Expectation Propagation
  • (3) Laplace Approximation