( Basic parts and less important content are skipped )

6. Kernel Methods

Sometimes, the training data ( or a subset of them ) are kept and used during the prediction phase!

Memory based methods

  • involve storing the entire training set in order to make future predictions

    ex) Kernel function, KNN

  • generally FAST to train, SLOW to make predictions on test data


Kernel function

  • \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\phi(\mathbf{x})^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right)\).
  • symmetric function : \(k\left(\mathrm{x}, \mathrm{x}^{\prime}\right)=k\left(\mathrm{x}^{\prime}, \mathrm{x}\right)\)
  • used in SVMs ( Boser et al., 1992 )


Kernel Trick ( = Kernel Substitution )

  • ex) can be applied to…

    • nonlinear variant of PCA
    • kernel Fisher discriminant


Different types of kernel

  • linear kernel : \(\phi(\mathrm{x})=\mathrm{x}\) \(\rightarrow\) \(k\left(\mathrm{x}, \mathrm{x}^{\prime}\right)=\mathrm{x}^{\mathrm{T}} \mathrm{x}^{\prime}\)

  • (property) stationary : \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=k\left(\mathbf{x}-\mathbf{x}^{\prime}\right)\)

    • ex) radial basis functions ( also known as homogeneous kernels ), which depend only on the magnitude of the distance

      \(k\left(\mathrm{x}, \mathrm{x}^{\prime}\right)=k\left(\left\|\mathrm{x}-\mathrm{x}^{\prime}\right\|\right)\).

6-1. Dual Representations

Minimize the regularized SSE ( sum-of-squares error ) function :

  • \(J(\mathbf{w})=\frac{1}{2} \sum_{n=1}^{N}\left\{\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)-t_{n}\right\}^{2}+\frac{\lambda}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}\).


[step 1] take the derivative w.r.t. \(\mathbf{w}\) & set it to zero :

  • \(\mathbf{w}=-\frac{1}{\lambda} \sum_{n=1}^{N}\left\{\mathbf{w}^{\mathrm{T}} \phi\left(\mathbf{x}_{n}\right)-t_{n}\right\} \phi\left(\mathbf{x}_{n}\right)=\sum_{n=1}^{N} a_{n} \phi\left(\mathbf{x}_{n}\right)=\mathbf{\Phi}^{\mathrm{T}} \mathbf{a}\).

    where \(a_{n}=-\frac{1}{\lambda}\left\{\mathbf{w}^{\mathrm{T}} \phi\left(\mathbf{x}_{n}\right)-t_{n}\right\}\)


[step 2] Instead of working with \(\mathbf{w}\), use \(\mathbf{a}\) ( = substitute \(\mathbf{w}=\boldsymbol{\Phi}^{\mathrm{T}} \mathbf{a}\) into \(J(\mathbf{w})\) )

  • \(J(\mathbf{a})=\frac{1}{2} \mathbf{a}^{\mathrm{T}} \Phi \Phi^{\mathrm{T}} \Phi \Phi^{\mathrm{T}} \mathbf{a}-\mathbf{a}^{\mathrm{T}} \Phi \Phi^{\mathrm{T}} \mathbf{t}+\frac{1}{2} \mathbf{t}^{\mathrm{T}} \mathbf{t}+\frac{\lambda}{2} \mathbf{a}^{\mathrm{T}} \Phi \Phi^{\mathrm{T}} \mathbf{a}\) .


[step 3] Use Gram matrix \(\mathbf{K}=\Phi \Phi^{\mathrm{T}}\)

  • \(J(\mathbf{a})=\frac{1}{2} \mathbf{a}^{\mathrm{T}} \mathbf{K} \mathbf{K} \mathbf{a}-\mathbf{a}^{\mathrm{T}} \mathbf{K} \mathbf{t}+\frac{1}{2} \mathbf{t}^{\mathrm{T}} \mathbf{t}+\frac{\lambda}{2} \mathbf{a}^{\mathrm{T}} \mathbf{K} \mathbf{a}\).


[step 4] Solution ( \(\mathbf{a}\) )

  • \(\mathbf{a}=\left(\mathbf{K}+\lambda \mathbf{I}_{N}\right)^{-1} \mathbf{t}\).


[step 5] Prediction for new input \(\mathbf{x}\)

  • substitute this back into the linear regression model
  • \(y(\mathbf{x})=\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x})=\mathbf{a}^{\mathrm{T}} \boldsymbol{\Phi} \boldsymbol{\phi}(\mathbf{x})=\mathbf{k}(\mathbf{x})^{\mathrm{T}}\left(\mathbf{K}+\lambda \mathbf{I}_{N}\right)^{-1} \mathbf{t}\).
    • \(\mathrm{k}(\mathrm{x})\) with elements \(k_{n}(\mathrm{x})=k\left(\mathrm{x}_{n}, \mathrm{x}\right)\)


The dual formulation allows us to express the solution entirely in terms of the kernel function \(k(x,x')\).
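Below is a minimal NumPy sketch of this dual solution ( i.e., kernel ridge regression ). The Gaussian kernel, the toy sine data, and the value of \(\lambda\) are illustrative assumptions, not part of the derivation above.

```python
import numpy as np

# Sketch of the dual solution a = (K + lambda I)^{-1} t and y(x) = k(x)^T a.
# The Gaussian kernel and the toy data are illustrative choices.
def gaussian_kernel(X1, X2, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), computed pairwise
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))             # training inputs
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)  # noisy targets

lam = 0.1
K = gaussian_kernel(X, X)                        # Gram matrix K = Phi Phi^T
a = np.linalg.solve(K + lam * np.eye(len(X)), t) # dual variables a

X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
y_new = gaussian_kernel(X_new, X) @ a            # y(x) = k(x)^T (K + lam I)^{-1} t
print(y_new)
```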

6-2. Constructing Kernels

\[k\left(x, x^{\prime}\right)=\phi(x)^{\mathrm{T}} \phi\left(x^{\prime}\right)=\sum_{i=1}^{M} \phi_{i}(x) \phi_{i}\left(x^{\prime}\right)\]
  • \(\phi_{i}(x)\) are the basis functions

  • example) \(k(\mathbf{x}, \mathbf{z})=\left(\mathbf{x}^{\mathrm{T}} \mathbf{z}\right)^{2}\)

    \(\begin{aligned} k(\mathbf{x}, \mathbf{z}) &=\left(\mathbf{x}^{\mathrm{T}} \mathbf{z}\right)^{2}=\left(x_{1} z_{1}+x_{2} z_{2}\right)^{2} \\ &=x_{1}^{2} z_{1}^{2}+2 x_{1} z_{1} x_{2} z_{2}+x_{2}^{2} z_{2}^{2} \\ &=\left(x_{1}^{2}, \sqrt{2} x_{1} x_{2}, x_{2}^{2}\right)\left(z_{1}^{2}, \sqrt{2} z_{1} z_{2}, z_{2}^{2}\right)^{\mathrm{T}} \\ &=\phi(\mathbf{x})^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{z}) \end{aligned}\).

    where \(\phi(\mathrm{x})=\left(x_{1}^{2}, \sqrt{2} x_{1} x_{2}, x_{2}^{2}\right)^{\mathrm{T}}\)
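A quick numerical check of this example ( the test points below are arbitrary ) :

```python
import numpy as np

# Verify that k(x, z) = (x^T z)^2 equals phi(x)^T phi(z) with
# phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)^T.
def phi(v):
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.5, -0.7])
z = np.array([0.3, 2.0])

print((x @ z)**2)          # kernel evaluated directly
print(phi(x) @ phi(z))     # same value via the explicit feature map
```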


Necessary and sufficient condition for \(k(x,x')\) :

  • the Gram matrix \(\mathbf{K}\) ( with elements \(k(\mathbf{x}_n, \mathbf{x}_m)\) ) must be positive semidefinite for all possible choices of the set \(\{\mathbf{x}_n\}\)!


Can build new kernels by using simpler kernels as building blocks :

\[\begin{aligned} k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=c k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=f(\mathbf{x}) k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) f\left(\mathbf{x}^{\prime}\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=q\left(k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=\exp \left(k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)+k_{2}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) k_{2}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{3}\left(\boldsymbol{\phi}(\mathbf{x}), \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right)\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=\mathbf{x}^{\mathrm{T}} \mathbf{A} \mathbf{x}^{\prime} \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{a}\left(\mathbf{x}_{a}, \mathbf{x}_{a}^{\prime}\right)+k_{b}\left(\mathbf{x}_{b}, \mathbf{x}_{b}^{\prime}\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{a}\left(\mathbf{x}_{a}, \mathbf{x}_{a}^{\prime}\right) k_{b}\left(\mathbf{x}_{b}, \mathbf{x}_{b}^{\prime}\right) \end{aligned}\]
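A small numerical sanity check ( not a proof ) of a few of these rules, assuming a linear and a Gaussian kernel as building blocks :

```python
import numpy as np

# Build kernels from the rules above (k1 + k2, k1 * k2, exp(k1)) and confirm
# the resulting Gram matrices on random points have no significantly
# negative eigenvalues.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))

K1 = X @ X.T                                   # linear kernel
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * K1
K2 = np.exp(-0.5 * sq)                         # Gaussian kernel

for K in (K1 + K2, K1 * K2, np.exp(K1)):
    eigvals = np.linalg.eigvalsh(K)
    print(eigvals.min() >= -1e-8)              # True: PSD up to rounding error
```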


Requires the kernel function \(k(x,x')\) to be…

  • 1) symmetric
  • 2) positive semidefinite
  • 3) express the appropriate form of similarity between \(x\) and \(x'\)


Gaussian kernel

  • \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\exp \left(-\left\|\mathbf{x}-\mathbf{x}^{\prime}\right\|^{2} / 2 \sigma^{2}\right)\).

  • \(\left\|\mathrm{x}-\mathrm{x}^{\prime}\right\|^{2}=\mathrm{x}^{\mathrm{T}} \mathrm{x}+\left(\mathrm{x}^{\prime}\right)^{\mathrm{T}} \mathrm{x}^{\prime}-2 \mathrm{x}^{\mathrm{T}} \mathrm{x}^{\prime}\).

  • \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\exp \left(-\mathbf{x}^{\mathrm{T}} \mathbf{x} / 2 \sigma^{2}\right) \exp \left(\mathbf{x}^{\mathrm{T}} \mathbf{x}^{\prime} / \sigma^{2}\right) \exp \left(-\left(\mathbf{x}^{\prime}\right)^{\mathrm{T}} \mathbf{x}^{\prime} / 2 \sigma^{2}\right)\).

  • not restricted to Euclidean distance! ( see the sketch below )

    \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\exp \left\{-\frac{1}{2 \sigma^{2}}\left(\kappa(\mathbf{x}, \mathbf{x})+\kappa\left(\mathbf{x}^{\prime}, \mathbf{x}^{\prime}\right)-2 \kappa\left(\mathbf{x}, \mathbf{x}^{\prime}\right)\right)\right\}\).
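A minimal sketch of this generalized form, where the squared Euclidean distance is replaced by the distance induced by another kernel \(\kappa\). Taking \(\kappa\) to be the quadratic kernel \((\mathbf{x}^{\mathrm{T}}\mathbf{x}')^2\) is an illustrative assumption.

```python
import numpy as np

# Generalized Gaussian kernel: substitute a nonlinear kernel kappa for the
# usual inner product inside the squared distance.
def kappa(x, xp):
    return (x @ xp)**2                         # illustrative choice of kappa

def generalized_gaussian(x, xp, sigma=1.0):
    d2 = kappa(x, x) + kappa(xp, xp) - 2 * kappa(x, xp)  # "distance" in feature space
    return np.exp(-d2 / (2 * sigma**2))

x = np.array([1.0, 2.0])
xp = np.array([0.5, -1.0])
print(generalized_gaussian(x, xp))
```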


Generative vs Discriminative models

  • Generative : can deal with missing data
  • Discriminative : better performance on discriminative tasks

\(\rightarrow\) combine the two! Use a generative model to define a kernel, then use this kernel in a discriminative approach


Generative model (1) Intro & HMM

  • \(k\left(\mathrm{x}, \mathrm{x}^{\prime}\right)=p(\mathrm{x}) p\left(\mathrm{x}^{\prime}\right)\).

  • extend this class of kernels!

    \(k\left(\mathrm{x}, \mathrm{x}^{\prime}\right)=\sum_{i} p(\mathrm{x} \mid i) p\left(\mathrm{x}^{\prime} \mid i\right) p(i)\).

    • mixture distribution
    • \(i\) : role of ‘latent variable’
  • limit of infinite sum :

    \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\int p(\mathbf{x} \mid \mathbf{z}) p\left(\mathbf{x}^{\prime} \mid \mathbf{z}\right) p(\mathbf{z}) \mathrm{d} \mathbf{z}\).

    ( \(z\) : continuous latent variable )

  • popular generative model for sequence : HMM (Hidden Markov Model)

    \(k\left(\mathbf{X}, \mathbf{X}^{\prime}\right)=\sum_{\mathbf{Z}} p(\mathbf{X} \mid \mathbf{Z}) p\left(\mathbf{X}^{\prime} \mid \mathbf{Z}\right) p(\mathbf{Z})\).

    • expresses the distribution \(p(\mathbf{X})\)

    • hidden states \(\mathrm{Z}=\left\{\mathrm{z}_{1}, \ldots, \mathrm{z}_{L}\right\}\)


Generative model (2) Fisher Kernel

  • Fisher score : \(\mathrm{g}(\boldsymbol{\theta}, \mathbf{x})=\nabla_{\boldsymbol{\theta}} \ln p(\mathbf{x} \mid \boldsymbol{\theta})\)

  • Fisher Kernel : \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\mathbf{g}(\boldsymbol{\theta}, \mathbf{x})^{\mathrm{T}} \mathbf{F}^{-1} \mathbf{g}\left(\boldsymbol{\theta}, \mathbf{x}^{\prime}\right)\)

    • Fisher Information Matrix : \(\mathbf{F}=\mathbb{E}_{\mathbf{x}}\left[\mathbf{g}(\boldsymbol{\theta}, \mathbf{x}) \mathbf{g}(\boldsymbol{\theta}, \mathbf{x})^{\mathrm{T}}\right]\)

      ( in practice, it is infeasible to evaluate. Thus use sample average! )

      \(\mathbf{F} \simeq \frac{1}{N} \sum_{n=1}^{N} \mathbf{g}\left(\boldsymbol{\theta}, \mathbf{x}_{n}\right) \mathbf{g}\left(\boldsymbol{\theta}, \mathbf{x}_{n}\right)^{\mathrm{T}}\).

  • Fisher Information = covariance matrix of Fisher score

  • more simply, we can omit \(\mathbf{F}\) ( see the sketch after this block )

    \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\mathbf{g}(\boldsymbol{\theta}, \mathbf{x})^{\mathrm{T}} \mathbf{g}\left(\boldsymbol{\theta}, \mathbf{x}^{\prime}\right)\).
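A minimal sketch of the Fisher kernel for a toy generative model: a univariate Gaussian with unknown mean \(\theta\) and unit variance, so that \(\mathrm{g}(\theta,x)=x-\theta\). The model choice and the value of \(\theta\) are illustrative assumptions.

```python
import numpy as np

# Toy Fisher kernel: p(x | theta) = N(x | theta, 1), so the Fisher score is
# g(theta, x) = d/dtheta ln p(x | theta) = x - theta.
theta = 0.0
def score(x):
    return np.atleast_1d(x - theta)            # Fisher score g(theta, x)

rng = np.random.default_rng(2)
X = rng.normal(loc=theta, scale=1.0, size=100)

# Sample-average approximation of the Fisher information matrix F
G = np.stack([score(x) for x in X])            # (N, 1) matrix of scores
F = (G.T @ G) / len(X)

def fisher_kernel(x, xp):
    return float(score(x) @ np.linalg.inv(F) @ score(xp))

print(fisher_kernel(1.2, -0.4))
```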


Sigmoidal Kernel

  • \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\tanh \left(a \mathbf{x}^{\mathrm{T}} \mathbf{x}^{\prime}+b\right)\).

  • its Gram matrix is in general not positive semidefinite ( see the check below )

  • but it has been used in practice!

    \(\rightarrow\) gives kernel expansions ( e.g., the SVM ) a superficial resemblance to neural network models

  • in the limit of an infinite number of basis functions, a Bayesian neural network (BNN) becomes a Gaussian process (GP)
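A small check that the Gram matrix can indeed fail to be positive semidefinite; the values of \(a\), \(b\) and the inputs below are just one choice that exhibits this.

```python
import numpy as np

# tanh ("sigmoidal") kernel: for some a, b and inputs, the Gram matrix
# has negative eigenvalues, so it is not a valid Mercer kernel.
a, b = 2.0, -1.0
X = np.array([[0.0], [1.0], [2.0]])

K = np.tanh(a * (X @ X.T) + b)
print(np.linalg.eigvalsh(K))   # at least one eigenvalue is negative
```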

6-3. Radial Basis Function Networks

RBF ( Radial Basis Functions )

  • depends only on the radial distance from a center

    that is, \(\phi_{j}(\mathbf{x})=h\left(\left\|\mathbf{x}-\boldsymbol{\mu}_{j}\right\|\right)\)


First introduced for “interpolation”

  • goal is to find a smooth function \(f(\mathbf{x})\) that fits every target value exactly

  • express with a linear combination of RBF

    \(f(\mathrm{x})=\sum_{n=1}^{N} w_{n} h\left(\left\|\mathrm{x}-\mathrm{x}_{n}\right\|\right)\).
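A minimal sketch of exact interpolation with radial basis functions; the Gaussian radial function \(h\), its width, and the toy data are illustrative choices.

```python
import numpy as np

# Exact RBF interpolation: solve for weights w so that f(x_n) = t_n at
# every training point.
rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 1, size=10))
t = np.sin(2 * np.pi * X)

def h(r, s=0.1):
    return np.exp(-r**2 / (2 * s**2))          # radial function h(||x - x_n||)

H = h(np.abs(X[:, None] - X[None, :]))         # N x N interpolation matrix
w = np.linalg.solve(H, t)                      # weights giving exact interpolation

x_test = 0.37
f = np.sum(w * h(np.abs(x_test - X)))          # f(x) = sum_n w_n h(||x - x_n||)
print(f, np.sin(2 * np.pi * x_test))           # interpolant vs. underlying function
```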


Other applications

  • regularization theory, interpolation problem when the input is noisy….

6-4. Gaussian Process

(until now) apply duality to a non-probabilistic model for regression

(from now) extend it to “probabilistic discriminative model”

  • (1) prior over \(w\)
  • (2) find posterior over \(w\)
  • (3) predictive distribution \(p(t\mid x)\)


GP : dispense with parametric model. Instead, define a “prior over functions” directly!!

6-4-1. Linear Regression revisited

\[y(\mathrm{x})=\mathrm{w}^{\mathrm{T}} \phi(\mathrm{x})\]
  • prior : \(p(\mathbf{w})=\mathcal{N}\left(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}\right)\).


joint distribution of \(y(\mathbf{x}_1), \ldots, y(\mathbf{x}_N)\), i.e., of \(\mathbf{y}=\Phi \mathbf{w}\) ( a sampling sketch follows this list )

  • \(\Phi\) : design matrix, with elements \(\Phi_{n k}=\phi_{k}\left(\mathrm{x}_{n}\right)\)
  • mean and covariance
    • mean : \(\mathbb{E}[\mathbf{y}] =\Phi \mathbb{E}[\mathbf{w}]=0\)
    • covariance : \(\operatorname{cov}[\mathbf{y}] =\mathbb{E}\left[\mathbf{y} \mathbf{y}^{\mathrm{T}}\right]=\mathbf{\Phi} \mathbb{E}\left[\mathbf{w} \mathbf{w}^{\mathrm{T}}\right] \mathbf{\Phi}^{\mathrm{T}}=\frac{1}{\alpha} \mathbf{\Phi} \Phi^{\mathrm{T}}=\mathbf{K}\)
      • \(K\) : Gram matrix with elements \(K_{n m}=k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)=\frac{1}{\alpha} \phi\left(\mathbf{x}_{n}\right)^{\mathrm{T}} \phi\left(\mathbf{x}_{m}\right)\)
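A short sketch of this view: draw \(\mathbf{w}\) from its prior and look at the induced sample \(\mathbf{y}=\Phi\mathbf{w}\), whose covariance is \(\mathbf{K}=\alpha^{-1}\Phi\Phi^{\mathrm{T}}\). The Gaussian basis functions and the input grid are illustrative assumptions.

```python
import numpy as np

# Sample w ~ N(0, alpha^{-1} I) and form y = Phi w; the covariance of y is
# K = (1/alpha) Phi Phi^T.
alpha = 2.0
x = np.linspace(-1, 1, 50)
centers = np.linspace(-1, 1, 9)
Phi = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * 0.2**2))   # design matrix

K = Phi @ Phi.T / alpha                        # K_nm = (1/alpha) phi(x_n)^T phi(x_m)

rng = np.random.default_rng(4)
w = rng.normal(scale=1 / np.sqrt(alpha), size=centers.size)
y = Phi @ w                                    # one sample function from the prior
print(y[:5], np.diag(K)[:5])
```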


Define kernel function “directly”

( rather than indirectly through the choice of basis function )

ex) \(k(x, x^{\prime})=\exp (-\theta\mid x-x^{\prime}\mid)\)

6-4-2. Gaussian Processes for Regression ( GPR )

observed target variable : \(t_{n}=y_{n}+\epsilon_{n}\), where \(y_{n}=y\left(\mathbf{x}_{n}\right)\) and \(\epsilon_{n}\) is random noise.


\(p\left(t_{n} \mid y_{n}\right)=\mathcal{N}\left(t_{n} \mid y_{n}, \beta^{-1}\right)\).

\(p(\mathbf{t} \mid \mathbf{y})=\mathcal{N}\left(\mathbf{t} \mid \mathbf{y}, \beta^{-1} \mathbf{I}_{N}\right)\).

  • precision of the noise : \(\beta\)


Marginal distribution

  • \(p(\mathbf{y})=\mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{K})\).

  • \(\mathbf{K}\) : Gram matrix


In order to find the marginal distribution of \(p(\mathbf{t})\):

  • need to integrate over \(y\)
  • \(p(\mathbf{t})=\int p(\mathbf{t} \mid \mathbf{y}) p(\mathbf{y}) \mathrm{d} \mathbf{y}=\mathcal{N}(\mathbf{t} \mid \mathbf{0}, \mathbf{C})\).
    • covariance matrix \(\mathbf{C}\) : \(C\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)=k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)+\beta^{-1} \delta_{n m}\)


widely used kernel for GPR : exponential of a quadratic form

\[k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)=\theta_{0} \exp \left\{-\frac{\theta_{1}}{2}\left\|\mathbf{x}_{n}-\mathbf{x}_{m}\right\|^{2}\right\}+\theta_{2}+\theta_{3} \mathbf{x}_{n}^{\mathrm{T}} \mathbf{x}_{m}\]


Predictive Distribution

  • goal in regression : make prediction of target variables for “new inputs”
  • require that we evaluate predictive distribution, \(p\left(t_{N+1} \mid \mathbf{t}_{N}\right)\)


How to find \(p\left(t_{N+1} \mid \mathbf{t}_{N}\right)\) ?

[step 1] joint distribution \(p\left(\mathbf{t}_{N+1}\right)\)

( where \(\mathbf{t}_{N+1}\) is \(\left(t_{1}, \ldots, t_{N}, t_{N+1}\right)^{\mathrm{T}}\) )

\(p\left(\mathbf{t}_{N+1}\right)=\mathcal{N}\left(\mathbf{t}_{N+1} \mid \mathbf{0}, \mathbf{C}_{N+1}\right)\). where \(\mathbf{C}_{N+1}=\left(\begin{array}{cc} \mathbf{C}_{N} & \mathbf{k} \\ \mathbf{k}^{\mathrm{T}} & c \end{array}\right)\)

  • vector \(\mathrm{k}\) has elements \(k\left(\mathrm{x}_{n}, \mathrm{x}_{N+1}\right)\)
  • \(c=k\left(\mathrm{x}_{N+1}, \mathrm{x}_{N+1}\right)+\beta^{-1}\).


[step 2] find \(p\left(t_{N+1} \mid \mathbf{t}_{N}\right)\)

Mean and variance

  • mean : \(m\left(\mathrm{x}_{N+1}\right) =\mathrm{k}^{\mathrm{T}} \mathrm{C}_{N}^{-1} \mathrm{t}\)
  • variance : \(\sigma^{2}\left(\mathrm{x}_{N+1}\right) =c-\mathrm{k}^{\mathrm{T}} \mathrm{C}_{N}^{-1} \mathrm{k}\)


Can also rewrite mean as…

\[m\left(\mathbf{x}_{N+1}\right)=\sum_{n=1}^{N} a_{n} k\left(\mathbf{x}_{n}, \mathbf{x}_{N+1}\right)\]
  • \(a_n\) : \(n^{th}\) component of \(\mathrm{C}_{N}^{-1} \mathrm{t}\).
  • if \(k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)\) depends only on the distance \(\left\|\mathbf{x}_{n}-\mathbf{x}_{m}\right\|\), then we obtain an expansion in radial basis functions ( see the sketch below )
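A minimal sketch of GP regression prediction with the exponential-of-a-quadratic kernel above: build \(\mathbf{C}_N\), then compute \(m(\mathbf{x}_{N+1})=\mathbf{k}^{\mathrm{T}}\mathbf{C}_N^{-1}\mathbf{t}\) and \(\sigma^2(\mathbf{x}_{N+1})=c-\mathbf{k}^{\mathrm{T}}\mathbf{C}_N^{-1}\mathbf{k}\). The hyperparameter values and toy data are illustrative assumptions.

```python
import numpy as np

theta0, theta1, theta2, theta3 = 1.0, 4.0, 0.0, 0.0
beta = 25.0                                    # noise precision

def kernel(X1, X2):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return theta0 * np.exp(-0.5 * theta1 * sq) + theta2 + theta3 * X1 @ X2.T

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, size=(20, 1))
t = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=1 / np.sqrt(beta), size=20)

C_N = kernel(X, X) + np.eye(len(X)) / beta     # C(x_n, x_m) = k(x_n, x_m) + beta^{-1} delta_nm

x_new = np.array([[0.5]])
k = kernel(X, x_new)[:, 0]                     # vector k with elements k(x_n, x_{N+1})
c = kernel(x_new, x_new)[0, 0] + 1 / beta

Cinv_t = np.linalg.solve(C_N, t)
mean = k @ Cinv_t                              # m(x_{N+1}) = k^T C_N^{-1} t
var = c - k @ np.linalg.solve(C_N, k)          # sigma^2(x_{N+1}) = c - k^T C_N^{-1} k
print(mean, var)
```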


Computational Operation

  • GP : \(O(N^3)\).

    • inversion of a matrix of size \(N \times N\)
  • Basis function model : \(O(M^3)\)

    • inversion of a matrix of size \(M \times M\)
  • both need a matrix inversion

  • at test time..

    • GP : \(O(N^2)\)
    • BF : \(O(M^2)\)
  • If \(M < N\), it is more efficient to work in the basis function framework

    but a GP can use covariance functions that could only be expressed with an infinite number of basis functions!


For large training datasets, direct application of GP is infeasible

\(\rightarrow\) approximations have been developed

6-4-3. Learning the hyperparameters

predictions of a GP depend on the choice of covariance function

  • rather than fixing the covariance function,

  • use a parametric family of functions & infer the parameters from the data

    ( parameters : length scale of the correlation, precision of the noise … )


\[p(t \mid \theta)\]
  • \(\theta\) : hyperparameters of the GP
  • simplest approach : maximize \(p(\mathbf{t}\mid \theta)\) w.r.t. \(\theta\) ( type-2 maximum likelihood )

  • \(\ln p(\mathbf{t} \mid \theta)=-\frac{1}{2} \ln \mid \mathbf{C}_{N}\mid-\frac{1}{2} \mathbf{t}^{\mathrm{T}} \mathbf{C}_{N}^{-1} \mathbf{t}-\frac{N}{2} \ln (2 \pi)\).

    \(\frac{\partial}{\partial \theta_{i}} \ln p(\mathbf{t} \mid \boldsymbol{\theta})=-\frac{1}{2} \operatorname{Tr}\left(\mathbf{C}_{N}^{-1} \frac{\partial \mathbf{C}_{N}}{\partial \theta_{i}}\right)+\frac{1}{2} \mathbf{t}^{\mathrm{T}} \mathbf{C}_{N}^{-1} \frac{\partial \mathbf{C}_{N}}{\partial \theta_{i}} \mathbf{C}_{N}^{-1} \mathbf{t}\).

    \(\rightarrow\) \(\ln p(\mathbf{t} \mid \boldsymbol{\theta})\) is in general non-convex, so it can have multiple maxima! ( a sketch of evaluating it follows )
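A sketch of evaluating \(\ln p(\mathbf{t}\mid\boldsymbol{\theta})\) and its gradient for a single hyperparameter. Assuming a one-parameter Gaussian kernel \(k=\exp(-\tfrac{\theta_1}{2}\|x-x'\|^2)\) and toy data; this is an illustrative setup, not a full optimizer.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.uniform(0, 1, size=(15, 1))
t = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=15)
beta = 100.0

x = X[:, 0]
sq = (x[:, None] - x[None, :])**2              # pairwise squared distances

def log_marginal(theta1):
    # ln p(t|theta) = -1/2 ln|C_N| - 1/2 t^T C_N^{-1} t - N/2 ln(2 pi)
    C = np.exp(-0.5 * theta1 * sq) + np.eye(len(X)) / beta
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * logdet - 0.5 * t @ np.linalg.solve(C, t) - 0.5 * len(X) * np.log(2 * np.pi)

def grad(theta1):
    C = np.exp(-0.5 * theta1 * sq) + np.eye(len(X)) / beta
    dC = -0.5 * sq * np.exp(-0.5 * theta1 * sq)       # dC_N / dtheta1
    Cinv_t = np.linalg.solve(C, t)
    # d/dtheta ln p = -1/2 Tr(C^{-1} dC) + 1/2 t^T C^{-1} dC C^{-1} t
    return -0.5 * np.trace(np.linalg.solve(C, dC)) + 0.5 * Cinv_t @ dC @ Cinv_t

print(log_marginal(4.0), grad(4.0))
```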


In Bayesian treatment, exact marginalization is intractable

\(\rightarrow\) need approximation

6-4-4. Automatic relevance determination (ARD)

in 6-4-3, we maximized the likelihood to find the hyperparameters!

this can be extended by “incorporating a separate parameter for each input variable”


Automatic relevance determination (ARD)

  • originally formulated in the framework of neural networks

  • example with 2d input \(\mathbf{x}=(x_1,x_2)\)

    \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\theta_{0} \exp \left\{-\frac{1}{2} \sum_{i=1}^{2} \eta_{i}\left(x_{i}-x_{i}^{\prime}\right)^{2}\right\}\).


Selecting important variables ( = discarding unimportant ones )

  • have \(\eta_i\) ( one for each input variable )

  • small \(\eta_i\) \(\rightarrow\) the function becomes relatively insensitive to the corresponding input variable \(x_i\)

  • It becomes possible to detect the input variables that have little effect on predictive distribution!

    ( since \(\eta_i\) will be small )

  • unimportant input (variables) will be discarded


ARD framework can be easily incorporated into the exponential-quadratic kernel

\[k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)=\theta_{0} \exp \left\{-\frac{1}{2} \sum_{i=1}^{D} \eta_{i}\left(x_{n i}-x_{m i}\right)^{2}\right\}+\theta_{2}+\theta_{3} \sum_{i=1}^{D} x_{n i} x_{m i}\]
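A minimal sketch of this ARD kernel with a separate precision \(\eta_i\) per input dimension; the \(\eta\) values below are chosen only to show how a small \(\eta_i\) makes the kernel insensitive to that input.

```python
import numpy as np

theta0, theta2, theta3 = 1.0, 0.0, 0.0
eta = np.array([10.0, 0.01])                   # x_2 is nearly irrelevant

def ard_kernel(x, xp):
    return (theta0 * np.exp(-0.5 * np.sum(eta * (x - xp)**2))
            + theta2 + theta3 * np.sum(x * xp))

x = np.array([0.2, 0.0])
print(ard_kernel(x, np.array([0.2, 5.0])))     # large change in x_2: kernel barely moves
print(ard_kernel(x, np.array([0.7, 0.0])))     # small change in x_1: kernel drops a lot
```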

6-4-5. Gaussian Process for Classification

probabilities must lie in the interval (0,1)

\(\rightarrow\) achieved by transforming the output of the GP with an appropriate nonlinear activation function

( = logistic sigmoid \(y=\sigma(a)\) )


model ( Bernoulli distribution ) : \(p(t \mid a)=\sigma(a)^{t}(1-\sigma(a))^{1-t}\)


GP prior : \(p\left(\mathbf{a}_{N+1}\right)=\mathcal{N}\left(\mathbf{a}_{N+1} \mid \mathbf{0}, \mathbf{C}_{N+1}\right)\)

  • \(C\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)=k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)+\nu \delta_{n m}\).
  • \(k\left(\mathrm{x}_{n}, \mathrm{x}_{m}\right)\) is any positive semidefinite kernel function
  • \(\nu\) is usually fixed in advance


Predictive distribution :

  • \(p\left(t_{N+1}=1 \mid \mathbf{t}_{N}\right)=\int p\left(t_{N+1}=1 \mid a_{N+1}\right) p\left(a_{N+1} \mid \mathbf{t}_{N}\right) \mathrm{d} a_{N+1}\).

    where \(p\left(t_{N+1}=1 \mid a_{N+1}\right)=\sigma\left(a_{N+1}\right)\) ( a Monte Carlo sketch follows )
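A sketch of only this last step, assuming a Gaussian approximation \(q(a_{N+1})=\mathcal{N}(a_{N+1}\mid\mu, s^2)\) has already been obtained by one of the methods listed below; the values of \(\mu\) and \(s^2\) are placeholders, and the 1-D integral is approximated by Monte Carlo.

```python
import numpy as np

# Predictive probability p(t_{N+1}=1 | t_N) ≈ E_q[ sigma(a_{N+1}) ], with
# q a Gaussian approximation of p(a_{N+1} | t_N).
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

mu, s2 = 0.8, 0.5                              # placeholder Gaussian approximation
rng = np.random.default_rng(7)
samples = rng.normal(mu, np.sqrt(s2), size=100_000)
p_class1 = sigmoid(samples).mean()             # Monte Carlo estimate of the integral
print(p_class1)
```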


3 different approaches to obtain a Gaussian approximation of \(p\left(a_{N+1} \mid \mathbf{t}_{N}\right)\)

  • (1) Variational Inference
  • (2) Expectation Propagation
  • (3) Laplace Approximation