( Skipping the basic parts + less important content )
6. Kernel Methods
Sometimes, the training data ( or a subset of it ) is kept and used during the prediction phase!
Memory-based methods
- involve storing the entire training set in order to make future predictions
  ( ex) kernel functions, KNN )
- generally FAST to train, SLOW to predict on test data
Kernel function
- \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\phi(\mathbf{x})^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right)\).
- symmetric function : \(k\left(\mathrm{x}, \mathrm{x}^{\prime}\right)=k\left(\mathrm{x}^{\prime}, \mathrm{x}\right)\)
- used in SVM ( Boser et al., 1992 )
Kernel Trick ( = Kernel Substitution )
- if an algorithm is formulated so that the input vector \(x\) enters only through scalar products, we can replace each scalar product with a kernel \(k(x,x')\)
- ex) can be applied to…
- nonlinear variant of PCA
- kernel Fisher discriminant
Different types of kernel
- linear kernel : \(\phi(\mathrm{x})=\mathrm{x}\) \(\rightarrow\) \(k\left(\mathrm{x}, \mathrm{x}^{\prime}\right)=\mathrm{x}^{\mathrm{T}} \mathrm{x}^{\prime}\)
- stationary kernel : \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=k\left(\mathbf{x}-\mathbf{x}^{\prime}\right)\)
  ( function of the difference only \(\rightarrow\) invariant to translations in input space )
- homogeneous kernel ( radial basis function, RBF ) : \(k\left(\mathrm{x}, \mathrm{x}^{\prime}\right)=k\left(\left\|\mathrm{x}-\mathrm{x}^{\prime}\right\|\right)\)
  ( depends only on the distance between the arguments )
6-1. Dual Representations
minimize the regularized sum-of-squares error (SSE) function :
- \(J(\mathbf{w})=\frac{1}{2} \sum_{n=1}^{N}\left\{\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)-t_{n}\right\}^{2}+\frac{\lambda}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}\).
[step 1] take the derivative w.r.t. \(\mathbf{w}\) and set it to zero :
\(\mathbf{w}=-\frac{1}{\lambda} \sum_{n=1}^{N}\left\{\mathbf{w}^{\mathrm{T}} \phi\left(\mathbf{x}_{n}\right)-t_{n}\right\} \phi\left(\mathbf{x}_{n}\right)=\sum_{n=1}^{N} a_{n} \phi\left(\mathbf{x}_{n}\right)=\mathbf{\Phi}^{\mathrm{T}} \mathbf{a}\).
where \(a_{n}=-\frac{1}{\lambda}\left\{\mathbf{w}^{\mathrm{T}} \phi\left(\mathbf{x}_{n}\right)-t_{n}\right\}\)
[step 2] Instead of working with \(\mathbf{w}\), use \(\mathbf{a}\) ( = substitute \(\mathbf{w}=\Phi^{\mathrm{T}}\mathbf{a}\) into \(J(\mathbf{w})\) )
- \(J(\mathbf{a})=\frac{1}{2} \mathbf{a}^{\mathrm{T}} \Phi \Phi^{\mathrm{T}} \Phi \Phi^{\mathrm{T}} \mathbf{a}-\mathbf{a}^{\mathrm{T}} \Phi \Phi^{\mathrm{T}} \mathbf{t}+\frac{1}{2} \mathbf{t}^{\mathrm{T}} \mathbf{t}+\frac{\lambda}{2} \mathbf{a}^{\mathrm{T}} \Phi \Phi^{\mathrm{T}} \mathbf{a}\) .
[step 3] Use Gram matrix \(\mathbf{K}=\Phi \Phi^{\mathrm{T}}\)
- \(J(\mathbf{a})=\frac{1}{2} \mathbf{a}^{\mathrm{T}} \mathbf{K} \mathbf{K} \mathbf{a}-\mathbf{a}^{\mathrm{T}} \mathbf{K} \mathbf{t}+\frac{1}{2} \mathbf{t}^{\mathrm{T}} \mathbf{t}+\frac{\lambda}{2} \mathbf{a}^{\mathrm{T}} \mathbf{K} \mathbf{a}\).
[step 4] Solve for \(\mathbf{a}\) ( set the gradient of \(J(\mathbf{a})\) w.r.t. \(\mathbf{a}\) to zero )
- \(\mathbf{a}=\left(\mathbf{K}+\lambda \mathbf{I}_{N}\right)^{-1} \mathbf{t}\).
[step 5] Prediction for new input \(\mathbf{x}\)
- substitute this back into the linear regression model
- \(y(\mathbf{x})=\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x})=\mathbf{a}^{\mathrm{T}} \boldsymbol{\Phi} \boldsymbol{\phi}(\mathbf{x})=\mathbf{k}(\mathbf{x})^{\mathrm{T}}\left(\mathbf{K}+\lambda \mathbf{I}_{N}\right)^{-1} \mathbf{t}\).
- \(\mathrm{k}(\mathrm{x})\) with elements \(k_{n}(\mathrm{x})=k\left(\mathrm{x}_{n}, \mathrm{x}\right)\)
The dual formulation allows us to express the solution entirely in terms of the KERNEL FUNCTION \(k(x,x')\)
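As a quick illustration, here is a minimal NumPy sketch of this dual solution ( often called kernel ridge regression ), assuming a Gaussian kernel and made-up 1-d toy data; all names and values here are illustrative only.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    sq_dists = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq_dists / (2 * sigma**2))

# toy training data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)

lam = 0.1                                              # regularization lambda
K = gaussian_kernel(X, X)                              # Gram matrix K = Phi Phi^T
a = np.linalg.solve(K + lam * np.eye(len(X)), t)       # a = (K + lambda I)^{-1} t

# prediction: y(x) = k(x)^T (K + lambda I)^{-1} t
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
y_new = gaussian_kernel(X_new, X) @ a
print(y_new)
```

Note how the training inputs `X` have to be kept around at prediction time, exactly as described for memory-based methods above.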
6-2. Constructing Kernels
\[k\left(x, x^{\prime}\right)=\phi(x)^{\mathrm{T}} \phi\left(x^{\prime}\right)=\sum_{i=1}^{M} \phi_{i}(x) \phi_{i}\left(x^{\prime}\right)\]
- \(\phi_{i}(x)\) are the basis functions
- example) \(k(\mathbf{x}, \mathbf{z})=\left(\mathbf{x}^{\mathrm{T}} \mathbf{z}\right)^{2}\) ( with 2-d inputs )
  \(\begin{aligned} k(\mathbf{x}, \mathbf{z}) &=\left(\mathbf{x}^{\mathrm{T}} \mathbf{z}\right)^{2}=\left(x_{1} z_{1}+x_{2} z_{2}\right)^{2} \\ &=x_{1}^{2} z_{1}^{2}+2 x_{1} z_{1} x_{2} z_{2}+x_{2}^{2} z_{2}^{2} \\ &=\left(x_{1}^{2}, \sqrt{2} x_{1} x_{2}, x_{2}^{2}\right)\left(z_{1}^{2}, \sqrt{2} z_{1} z_{2}, z_{2}^{2}\right)^{\mathrm{T}} \\ &=\phi(\mathbf{x})^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{z}) \end{aligned}\)
where \(\phi(\mathrm{x})=\left(x_{1}^{2}, \sqrt{2} x_{1} x_{2}, x_{2}^{2}\right)^{\mathrm{T}}\)
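A one-off numerical sanity check of this identity ( the two toy vectors are arbitrary ):

```python
import numpy as np

def phi(x):
    # explicit feature map for k(x, z) = (x^T z)^2 in 2-d
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_direct  = (x @ z) ** 2          # kernel evaluated directly
k_feature = phi(x) @ phi(z)       # inner product in feature space
print(k_direct, k_feature)        # both give 1.0
```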
Necessary and sufficient condition for \(k(x,x')\) to be a valid kernel :
- the Gram matrix \(\mathbf{K}\) ( with elements \(k(x_n,x_m)\) ) must be positive semidefinite for all possible choices of \(\{x_n\}\) !
Can build new kernels by using simpler kernels as building blocks :
\[\begin{aligned} k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=c k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=f(\mathbf{x}) k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) f\left(\mathbf{x}^{\prime}\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=q\left(k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=\exp \left(k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)+k_{2}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) k_{2}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{3}\left(\boldsymbol{\phi}(\mathbf{x}), \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right)\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=\mathbf{x}^{\mathrm{T}} \mathbf{A} \mathbf{x}^{\prime} \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{a}\left(\mathbf{x}_{a}, \mathbf{x}_{a}^{\prime}\right)+k_{b}\left(\mathbf{x}_{b}, \mathbf{x}_{b}^{\prime}\right) \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{a}\left(\mathbf{x}_{a}, \mathbf{x}_{a}^{\prime}\right) k_{b}\left(\mathbf{x}_{b}, \mathbf{x}_{b}^{\prime}\right) \end{aligned}\]
Requires the kernel function \(k(x,x')\) to be…
- 1) symmetric
- 2) positive semidefinite
- 3) express the appropriate form of similarity between \(x\) and \(x'\)
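A small sketch of a few of the construction rules above ( the toy inputs and helper names are my own choices ), checking empirically that each resulting Gram matrix stays positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 3))             # toy inputs

def gram(kernel, X):
    return np.array([[kernel(x, z) for z in X] for x in X])

k1 = lambda x, z: x @ z                      # linear kernel
k2 = lambda x, z: (x @ z + 1.0) ** 2         # inhomogeneous polynomial kernel

# a few of the construction rules above
combos = {
    "c * k1":       lambda x, z: 2.0 * k1(x, z),
    "k1 + k2":      lambda x, z: k1(x, z) + k2(x, z),
    "k1 * k2":      lambda x, z: k1(x, z) * k2(x, z),
    "exp(k1)":      lambda x, z: np.exp(k1(x, z)),
    "f(x) k1 f(z)": lambda x, z: np.tanh(x[0]) * k1(x, z) * np.tanh(z[0]),
}

for name, k in combos.items():
    eigvals = np.linalg.eigvalsh(gram(k, X))
    print(f"{name:14s} min eigenvalue = {eigvals.min():+.2e}")   # >= 0 up to rounding
```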
Gaussian kernel
- \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\exp \left(-\left\|\mathbf{x}-\mathbf{x}^{\prime}\right\|^{2} / 2 \sigma^{2}\right)\)
- expanding the square : \(\left\|\mathrm{x}-\mathrm{x}^{\prime}\right\|^{2}=\mathrm{x}^{\mathrm{T}} \mathrm{x}+\left(\mathrm{x}^{\prime}\right)^{\mathrm{T}} \mathrm{x}^{\prime}-2 \mathrm{x}^{\mathrm{T}} \mathrm{x}^{\prime}\)
- so \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\exp \left(-\mathbf{x}^{\mathrm{T}} \mathbf{x} / 2 \sigma^{2}\right) \exp \left(\mathbf{x}^{\mathrm{T}} \mathbf{x}^{\prime} / \sigma^{2}\right) \exp \left(-\left(\mathbf{x}^{\prime}\right)^{\mathrm{T}} \mathbf{x}^{\prime} / 2 \sigma^{2}\right)\)
- not restricted to Euclidean distance! can substitute a nonlinear kernel \(\kappa\left(\mathbf{x}, \mathbf{x}^{\prime}\right)\) :
  \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\exp \left\{-\frac{1}{2 \sigma^{2}}\left(\kappa(\mathbf{x}, \mathbf{x})+\kappa\left(\mathbf{x}^{\prime}, \mathbf{x}^{\prime}\right)-2 \kappa\left(\mathbf{x}, \mathbf{x}^{\prime}\right)\right)\right\}\)
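For instance, a tiny sketch that builds this "kernelized" Gaussian kernel on top of a polynomial base kernel \(\kappa\) ( both the base kernel and the inputs are arbitrary choices for illustration ):

```python
import numpy as np

def kappa(x, z):
    # base (non-Euclidean) kernel: inhomogeneous polynomial, chosen for illustration
    return (x @ z + 1.0) ** 3

def gaussian_of_kernel(x, z, sigma=1.0):
    # k(x, x') = exp{ -(kappa(x,x) + kappa(x',x') - 2 kappa(x,x')) / (2 sigma^2) }
    d2 = kappa(x, x) + kappa(z, z) - 2.0 * kappa(x, z)
    return np.exp(-d2 / (2.0 * sigma**2))

x = np.array([0.5, -1.0])
z = np.array([1.5,  0.3])
print(gaussian_of_kernel(x, z))
```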
Generative vs Discriminative models
- Generative : can deal with missing data
- Discriminative : better performance on discriminative tasks
\(\rightarrow\) combine! use a generative model to define a kernel, then use this kernel in a discriminative approach
Generative model (1) Intro & HMM
- simplest choice : \(k\left(\mathrm{x}, \mathrm{x}^{\prime}\right)=p(\mathrm{x}) p\left(\mathrm{x}^{\prime}\right)\)
- extend this class of kernels :
  \(k\left(\mathrm{x}, \mathrm{x}^{\prime}\right)=\sum_{i} p(\mathrm{x} \mid i) p\left(\mathrm{x}^{\prime} \mid i\right) p(i)\)
  - mixture distribution
  - \(i\) : plays the role of a 'latent variable'
- limit of an infinite sum :
  \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\int p(\mathbf{x} \mid \mathbf{z}) p\left(\mathbf{x}^{\prime} \mid \mathbf{z}\right) p(\mathbf{z}) \mathrm{d} \mathbf{z}\)
  ( \(z\) : continuous latent variable )
- popular generative model for sequences : HMM (Hidden Markov Model)
  \(k\left(\mathbf{X}, \mathbf{X}^{\prime}\right)=\sum_{\mathbf{Z}} p(\mathbf{X} \mid \mathbf{Z}) p\left(\mathbf{X}^{\prime} \mid \mathbf{Z}\right) p(\mathbf{Z})\)
  - the HMM expresses the distribution \(p(\mathbf{X})\) as a marginalization over hidden states
  - hidden states \(\mathrm{Z}=\left\{\mathrm{z}_{1}, \ldots, \mathrm{z}_{L}\right\}\)
Generative model (2) Fisher Kernel
- Fisher score : \(\mathrm{g}(\boldsymbol{\theta}, \mathbf{x})=\nabla_{\boldsymbol{\theta}} \ln p(\mathbf{x} \mid \boldsymbol{\theta})\)
- Fisher Kernel : \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\mathbf{g}(\boldsymbol{\theta}, \mathbf{x})^{\mathrm{T}} \mathbf{F}^{-1} \mathbf{g}\left(\boldsymbol{\theta}, \mathbf{x}^{\prime}\right)\)
- Fisher Information Matrix : \(\mathbf{F}=\mathbb{E}_{\mathbf{x}}\left[\mathbf{g}(\boldsymbol{\theta}, \mathbf{x}) \mathbf{g}(\boldsymbol{\theta}, \mathbf{x})^{\mathrm{T}}\right]\)
  ( in practice it is infeasible to evaluate exactly, so use the sample average )
  \(\mathbf{F} \simeq \frac{1}{N} \sum_{n=1}^{N} \mathbf{g}\left(\boldsymbol{\theta}, \mathbf{x}_{n}\right) \mathbf{g}\left(\boldsymbol{\theta}, \mathbf{x}_{n}\right)^{\mathrm{T}}\)
  - Fisher Information = covariance matrix of the Fisher score
- more simply, we can omit \(\mathbf{F}\) :
  \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\mathbf{g}(\boldsymbol{\theta}, \mathbf{x})^{\mathrm{T}} \mathbf{g}\left(\boldsymbol{\theta}, \mathbf{x}^{\prime}\right)\)
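A minimal sketch of the Fisher kernel for a univariate Gaussian generative model with \(\theta=(\mu, \sigma^2)\), showing both the \(\mathbf{F}^{-1}\)-weighted version and the simplified version without \(\mathbf{F}\); the generative model and the toy data are assumptions made purely for illustration.

```python
import numpy as np

def fisher_score(x, mu, var):
    # g(theta, x) = grad_theta ln N(x | mu, var), with theta = (mu, var)
    d_mu  = (x - mu) / var
    d_var = -0.5 / var + 0.5 * (x - mu) ** 2 / var**2
    return np.array([d_mu, d_var])

# fit the generative model on toy data (maximum likelihood)
rng = np.random.default_rng(2)
data = rng.normal(1.0, 2.0, size=200)
mu, var = data.mean(), data.var()

# sample-average estimate of the Fisher information F
G = np.stack([fisher_score(x, mu, var) for x in data])     # N x 2 matrix of scores
F = G.T @ G / len(data)

def fisher_kernel(x, xp):
    g, gp = fisher_score(x, mu, var), fisher_score(xp, mu, var)
    return g @ np.linalg.solve(F, gp)                      # g(x)^T F^{-1} g(x')

print(fisher_kernel(0.5, 1.5))
print(fisher_score(0.5, mu, var) @ fisher_score(1.5, mu, var))   # simplified version without F
```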
Sigmoidal Kernel
- \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\tanh \left(a \mathbf{x}^{\mathrm{T}} \mathbf{x}^{\prime}+b\right)\)
- its Gram matrix is in general not positive semidefinite
- but it has been used in practice!
  \(\rightarrow\) gives kernel expansions a superficial resemblance to neural network models
- in the limit of an infinite number of basis functions, a Bayesian neural network (BNN) with an appropriate prior reduces to a Gaussian process (GP)
6-3. Radial Basis Function Networks
RBF (Radial Basis Functions)
- each basis function depends only on the radial distance from a center \(\boldsymbol{\mu}_{j}\)
  that is, \(\phi_{j}(\mathbf{x})=h\left(\left\|\mathbf{x}-\boldsymbol{\mu}_{j}\right\|\right)\)
First introduced for "exact interpolation"
- goal is to find a smooth function \(f(\mathrm{x})\) that fits every target value exactly
- express it as a linear combination of RBFs, one centered on each data point :
  \(f(\mathrm{x})=\sum_{n=1}^{N} w_{n} h\left(\left\|\mathrm{x}-\mathrm{x}_{n}\right\|\right)\)
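A short sketch of this exact interpolation with Gaussian RBFs centered on the data points ( toy 1-d data, arbitrary width ):

```python
import numpy as np

def h(r, width=0.5):
    # Gaussian radial basis function of the distance r
    return np.exp(-(r / width) ** 2)

# toy data to interpolate exactly
x = np.linspace(0, 1, 8)
t = np.sin(2 * np.pi * x)

# design matrix H[n, m] = h(|x_n - x_m|); solve H w = t for the weights
H = h(np.abs(x[:, None] - x[None, :]))
w = np.linalg.solve(H, t)

def f(x_new):
    # f(x) = sum_n w_n h(|x - x_n|)
    return h(np.abs(np.asarray(x_new)[:, None] - x[None, :])) @ w

print(np.allclose(f(x), t))      # True: the fit passes exactly through every target
```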
Other applications
- regularization theory, interpolation problem when the input is noisy….
6-4. Gaussian Process
(until now) applied duality to a non-probabilistic model for regression
(from now) extend the role of kernels to "probabilistic discriminative models"
- (1) prior over \(w\)
- (2) find posterior over \(w\)
- (3) predictive distribution \(p(t\mid x)\)
GP : dispense with parametric model. Instead, define a “prior over functions” directly!!
6-4-1. Linear Regression revisited
\[y(\mathrm{x})=\mathrm{w}^{\mathrm{T}} \phi(\mathrm{x})\]
- prior : \(p(\mathbf{w})=\mathcal{N}\left(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}\right)\)
joint distribution of \(y(x_1),\ldots,y(x_N)\) : \(\mathbf{y}=\Phi \mathbf{w}\)
- \(\Phi\) : design matrix, with elements \(\Phi_{n k}=\phi_{k}\left(\mathrm{x}_{n}\right)\)
- mean and covariance
- mean : \(\mathbb{E}[\mathbf{y}] =\Phi \mathbb{E}[\mathbf{w}]=0\)
- covariance : \(\operatorname{cov}[\mathbf{y}] =\mathbb{E}\left[\mathbf{y} \mathbf{y}^{\mathrm{T}}\right]=\mathbf{\Phi} \mathbb{E}\left[\mathbf{w} \mathbf{w}^{\mathrm{T}}\right] \mathbf{\Phi}^{\mathrm{T}}=\frac{1}{\alpha} \mathbf{\Phi} \Phi^{\mathrm{T}}=\mathbf{K}\)
- \(K\) : Gram matrix with elements \(K_{n m}=k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)=\frac{1}{\alpha} \phi\left(\mathbf{x}_{n}\right)^{\mathrm{T}} \phi\left(\mathbf{x}_{m}\right)\)
Define kernel function “directly”
( rather than indirectly through the choice of basis function )
ex) \(k(x, x^{\prime})=\exp (-\theta\mid x-x^{\prime}\mid)\)
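What "defining a prior over functions directly" looks like in practice: a sketch that samples a few functions from \(\mathcal{N}(\mathbf{0}, \mathbf{K})\) using the exponential kernel above ( the grid and the value of \(\theta\) are arbitrary ):

```python
import numpy as np

theta = 1.0
x = np.linspace(0, 5, 200)

# covariance matrix from the directly specified kernel k(x, x') = exp(-theta |x - x'|)
K = np.exp(-theta * np.abs(x[:, None] - x[None, :]))

# draw a few sample functions from the GP prior y ~ N(0, K)
rng = np.random.default_rng(3)
samples = rng.multivariate_normal(np.zeros(len(x)), K + 1e-10 * np.eye(len(x)), size=3)
print(samples.shape)     # (3, 200): three functions evaluated on the grid
```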
6-4-2. Gaussian Processes for Regression ( GPR )
observed target variable : \(t_{n}=y_{n}+\epsilon_{n}\) ( where \(y_{n}=y\left(\mathbf{x}_{n}\right)\) )
\(p\left(t_{n} \mid y_{n}\right)=\mathcal{N}\left(t_{n} \mid y_{n}, \beta^{-1}\right)\).
\(p(\mathbf{t} \mid \mathbf{y})=\mathcal{N}\left(\mathbf{t} \mid \mathbf{y}, \beta^{-1} \mathbf{I}_{N}\right)\).
- precision of the noise : \(\beta\)
Marginal distribution over \(\mathbf{y}\)
- \(p(\mathbf{y})=\mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{K})\)
- \(\mathbf{K}\) : Gram matrix
In order to find the marginal distribution \(p(\mathbf{t})\) :
- need to integrate over \(y\)
- \(p(\mathbf{t})=\int p(\mathbf{t} \mid \mathbf{y}) p(\mathbf{y}) \mathrm{d} \mathbf{y}=\mathcal{N}(\mathbf{t} \mid \mathbf{0}, \mathbf{C})\).
- covariance matrix \(\mathbf{C}\) : \(C\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)=k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)+\beta^{-1} \delta_{n m}\)
widely used kernel for GPR : exponential of a quadratic form
\[k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)=\theta_{0} \exp \left\{-\frac{\theta_{1}}{2}\left\|\mathbf{x}_{n}-\mathbf{x}_{m}\right\|^{2}\right\}+\theta_{2}+\theta_{3} \mathbf{x}_{n}^{\mathrm{T}} \mathbf{x}_{m}\]
Predictive Distribution
- goal in regression : make prediction of target variables for “new inputs”
- require that we evaluate predictive distribution, \(p\left(t_{N+1} \mid \mathbf{t}_{N}\right)\)
How to find \(p\left(t_{N+1} \mid \mathbf{t}_{N}\right)\) ?
[step 1] write down the joint distribution \(p\left(\mathbf{t}_{N+1}\right)\)
( where \(\mathbf{t}_{N+1}\) is \(\left(t_{1}, \ldots, t_{N}, t_{N+1}\right)^{\mathrm{T}}\) )
\(p\left(\mathbf{t}_{N+1}\right)=\mathcal{N}\left(\mathbf{t}_{N+1} \mid \mathbf{0}, \mathbf{C}_{N+1}\right)\). where \(\mathbf{C}_{N+1}=\left(\begin{array}{cc} \mathbf{C}_{N} & \mathbf{k} \\ \mathbf{k}^{\mathrm{T}} & c \end{array}\right)\)
- vector \(\mathrm{k}\) has elements \(k\left(\mathrm{x}_{n}, \mathrm{x}_{N+1}\right)\)
- \(c=k\left(\mathrm{x}_{N+1}, \mathrm{x}_{N+1}\right)+\beta^{-1}\).
[step 2] condition on \(\mathbf{t}_{N}\), using the standard results for conditional Gaussian distributions, to find \(p\left(t_{N+1} \mid \mathbf{t}_{N}\right)\)
Mean and variance of \(p\left(t_{N+1} \mid \mathbf{t}_{N}\right)\)
- mean : \(m\left(\mathrm{x}_{N+1}\right) =\mathrm{k}^{\mathrm{T}} \mathrm{C}_{N}^{-1} \mathrm{t}\)
- variance : \(\sigma^{2}\left(\mathrm{x}_{N+1}\right) =c-\mathrm{k}^{\mathrm{T}} \mathrm{C}_{N}^{-1} \mathrm{k}\)
Can also rewrite mean as…
\[m\left(\mathbf{x}_{N+1}\right)=\sum_{n=1}^{N} a_{n} k\left(\mathbf{x}_{n}, \mathbf{x}_{N+1}\right)\]
- \(a_n\) : \(n^{th}\) component of \(\mathrm{C}_{N}^{-1} \mathrm{t}\)
- if \(k\left(\mathrm{x}_{n}, \mathrm{x}_{m}\right)\) depends only on the distance \(\left\|\mathrm{x}_{n}-\mathrm{x}_{m}\right\|\), then we obtain an expansion in radial basis functions
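Putting steps 1 and 2 together, a minimal GP-regression sketch using the exponential-of-quadratic kernel from above ( the toy data, \(\theta\) values and \(\beta\) are assumptions for illustration ):

```python
import numpy as np

def kernel(X1, X2, th0=1.0, th1=4.0, th2=0.0, th3=0.0):
    # k(x_n, x_m) = th0 exp(-th1/2 ||x_n - x_m||^2) + th2 + th3 x_n^T x_m
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return th0 * np.exp(-0.5 * th1 * sq) + th2 + th3 * X1 @ X2.T

beta = 25.0                                     # assumed noise precision
rng = np.random.default_rng(4)
X = rng.uniform(0, 1, (15, 1))
t = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.2, 15)

C_N = kernel(X, X) + np.eye(len(X)) / beta      # C_N = K + beta^{-1} I

def predict(x_new):
    x_new = np.atleast_2d(x_new)
    k = kernel(X, x_new)                        # vector k with elements k(x_n, x_new)
    c = kernel(x_new, x_new) + 1.0 / beta       # scalar c
    mean = k.T @ np.linalg.solve(C_N, t)        # m(x) = k^T C_N^{-1} t
    var  = c - k.T @ np.linalg.solve(C_N, k)    # sigma^2(x) = c - k^T C_N^{-1} k
    return mean.item(), var.item()

print(predict(0.3))                             # predictive mean and variance at x = 0.3
```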
Computational Operation
- training time
  - GP : \(O(N^3)\) ( inversion of a matrix of size \(N \times N\) )
  - basis function model : \(O(M^3)\) ( inversion of a matrix of size \(M \times M\) )
  - both require a matrix inversion
- test time ( per prediction )
  - GP : \(O(N^2)\)
  - basis function model : \(O(M^2)\)
- if \(M<N\), it is more efficient to work in the basis function framework
  but a GP can use covariance functions that correspond to an infinite number of basis functions!
For large training datasets, direct application of GP is infeasible
\(\rightarrow\) approximations have been developed
6-4-3. Learning the hyperparameters
predictions of a GP depend ( in part ) on the choice of covariance function
- instead of fixing the covariance function (X)
- use a parametric family of functions & infer the parameters from the data
  ( parameters : length scale of the correlations, precision of the noise … )
- \(\theta\) : hyperparameters of the GP
- simplest approach : maximize the (log) likelihood \(p(\mathbf{t} \mid \theta)\) w.r.t. \(\theta\)
\(\ln p(\mathbf{t} \mid \theta)=-\frac{1}{2} \ln \left|\mathbf{C}_{N}\right|-\frac{1}{2} \mathbf{t}^{\mathrm{T}} \mathbf{C}_{N}^{-1} \mathbf{t}-\frac{N}{2} \ln (2 \pi)\).
\(\frac{\partial}{\partial \theta_{i}} \ln p(\mathbf{t} \mid \boldsymbol{\theta})=-\frac{1}{2} \operatorname{Tr}\left(\mathbf{C}_{N}^{-1} \frac{\partial \mathbf{C}_{N}}{\partial \theta_{i}}\right)+\frac{1}{2} \mathbf{t}^{\mathrm{T}} \mathbf{C}_{N}^{-1} \frac{\partial \mathbf{C}_{N}}{\partial \theta_{i}} \mathbf{C}_{N}^{-1} \mathbf{t}\).
\(\rightarrow\) in general a non-convex function of \(\theta\) ( can have multiple maxima )!
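A sketch of this maximization, optimizing \(\ln p(\mathbf{t} \mid \theta)\) with a generic optimizer; only \(\theta_0\), \(\theta_1\) and \(\beta\) are optimized here, which is a deliberate simplification, and the toy data are made up.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, (30, 1))
t = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.1, 30)

def neg_log_marginal(log_params):
    th0, th1, beta = np.exp(log_params)         # optimize in log space to keep them positive
    sq = (X[:, None, 0] - X[None, :, 0]) ** 2
    C = th0 * np.exp(-0.5 * th1 * sq) + np.eye(len(X)) / beta
    # -ln p(t|theta) = 1/2 ln|C_N| + 1/2 t^T C_N^{-1} t + N/2 ln(2 pi)
    sign, logdet = np.linalg.slogdet(C)
    return 0.5 * logdet + 0.5 * t @ np.linalg.solve(C, t) + 0.5 * len(t) * np.log(2 * np.pi)

# several restarts would be wise, since the objective can be multimodal
res = minimize(neg_log_marginal, x0=np.log([1.0, 1.0, 10.0]), method="L-BFGS-B")
print(np.exp(res.x))     # learned (theta_0, theta_1, beta)
```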
In Bayesian treatment, exact marginalization is intractable
\(\rightarrow\) need approximation
6-4-4. Automatic relevance determination (ARD)
in 6-4-3, we maximized the likelihood to find the hyperparameters!
this can be extended by “incorporating a separate parameter for each input variable”
Automatic relevance determination (ARD)
- formulated in the framework of neural networks (NN)
- example with a 2-d input \(\mathbf{x}=(x_1,x_2)\) :
  \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\theta_{0} \exp \left\{-\frac{1}{2} \sum_{i=1}^{2} \eta_{i}\left(x_{i}-x_{i}^{\prime}\right)^{2}\right\}\)
Selecting important variables ( = discarding unimportant ones )
- there is one \(\eta_i\) for each input variable
- small \(\eta_i\) \(\rightarrow\) the function becomes insensitive to that input variable
- it therefore becomes possible to detect the input variables that have little effect on the predictive distribution
  ( since their \(\eta_i\) will be driven to small values when the hyperparameters are optimized )
- such unimportant input variables can then be discarded
ARD framework can be easily incorporated into the exponential-quadratic kernel
\[k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)=\theta_{0} \exp \left\{-\frac{1}{2} \sum_{i=1}^{D} \eta_{i}\left(x_{n i}-x_{m i}\right)^{2}\right\}+\theta_{2}+\theta_{3} \sum_{i=1}^{D} x_{n i} x_{m i}\]
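A sketch of this ARD kernel as a plain function; the \(\eta_i\) values below are placeholders, whereas in practice they would be learned by maximizing the marginal likelihood as in 6-4-3.

```python
import numpy as np

def ard_kernel(x, z, theta0=1.0, theta2=0.0, theta3=0.0, eta=(1.0, 0.01)):
    # small eta_i => the kernel (and hence the GP) is almost insensitive to input i
    eta = np.asarray(eta)
    weighted_sq = np.sum(eta * (x - z) ** 2)
    return theta0 * np.exp(-0.5 * weighted_sq) + theta2 + theta3 * np.sum(x * z)

x = np.array([0.2, 5.0])
z = np.array([0.3, -5.0])
print(ard_kernel(x, z))   # changing the 2nd input barely matters, because eta_2 is small
```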
6-4-5. Gaussian Process for Classification
probabilities must lie in the interval (0,1)
\(\rightarrow\) by transforming the output of GP using an appropriate nonlinear activation function
( = logistic sigmoid \(y=\sigma(a)\) )
model (Bernoulli distn) : \(p(t \mid a)=\sigma(a)^{t}(1-\sigma(a))^{1-t}\)
GP prior : \(p\left(\mathbf{a}_{N+1}\right)=\mathcal{N}\left(\mathbf{a}_{N+1} \mid \mathbf{0}, \mathbf{C}_{N+1}\right)\)
- \(C\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)=k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)+\nu \delta_{n m}\).
- \(k\left(\mathrm{x}_{n}, \mathrm{x}_{m}\right)\) is any positive semidefinite kernel function
- \(\nu\) is usually fixed in advance
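A sketch of this model setup: draw the latent function \(a\) from the GP prior and squash it through the logistic sigmoid to get class probabilities ( the kernel, \(\nu\) and the input grid are arbitrary choices ):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

nu = 1e-3                                                # fixed in advance
x = np.linspace(-3, 3, 100)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)        # exponential-of-quadratic kernel
C = K + nu * np.eye(len(x))                              # C(x_n, x_m) = k(x_n, x_m) + nu * delta_nm

rng = np.random.default_rng(6)
a = rng.multivariate_normal(np.zeros(len(x)), C)         # latent function a ~ N(0, C)
p_class1 = sigmoid(a)                                    # p(t = 1 | a) = sigma(a)
print(p_class1[:5])
```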
Predictive distribution :
- \(p\left(t_{N+1}=1 \mid \mathbf{t}_{N}\right)=\int p\left(t_{N+1}=1 \mid a_{N+1}\right) p\left(a_{N+1} \mid \mathbf{t}_{N}\right) \mathrm{d} a_{N+1}\)
  where \(p\left(t_{N+1}=1 \mid a_{N+1}\right)=\sigma\left(a_{N+1}\right)\)
this integral is analytically intractable \(\rightarrow\) 3 different approaches to obtain a Gaussian approximation of \(p\left(a_{N+1} \mid \mathbf{t}_{N}\right)\)
- (1) Variational Inference
- (2) Expectation Propagation
- (3) Laplace Approximation