( Skipping the basic parts and less important contents )
5. Neural Networks
SVM ( Support Vector Machine ) :
- first defines "basis functions" centered on the training data points ( details later, in Ch 7 ),
then selects a subset of these during training!
- advantage : involves non-linear optimization, but the objective function is convex
thus, optimization is straightforward
RVM ( Relevance Vector Machine )
- also chooses a subset from a fixed set of basis functions
- differences from SVM :
- sparser models
- probabilistic outputs
- non-convex optimization
NN ( Neural Network )
- alternative to those two : fix the number of basis functions in advance, but allow them to be adaptive ( their parameters are learned during training )
- ex) MLP ( Multi-Layer Perceptron )
5-1. Feed-forward Network Functions
linear combinations of fixed non-linear basis functions \(\phi_j(x)\)
\(y(\mathbf{x}, \mathbf{w})=f\left(\sum_{j=1}^{M} w_{j} \phi_{j}(\mathbf{x})\right)\).
- \(a_{j}=\sum_{i=1}^{D} w_{j i}^{(1)} x_{i}+w_{j 0}^{(1)}\).
- \(z_{j}=h\left(a_{j}\right)\).
- ex) logistic sigmoid function :
\(y_{k}=\sigma\left(a_{k}\right)\), where \(\sigma(a)=\frac{1}{1+\exp (-a)}\)
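To make the feed-forward equations concrete, a minimal NumPy sketch of a two-layer network ( hidden activation \(h=\tanh\), logistic-sigmoid outputs ); the weight names and shapes here are my own illustrative choices :

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    """Two-layer feed-forward network: x -> hidden z -> output y."""
    a_hidden = W1 @ x + b1          # a_j = sum_i w_ji^(1) x_i + w_j0^(1)
    z = np.tanh(a_hidden)           # z_j = h(a_j), here h = tanh
    a_out = W2 @ z + b2             # second-layer activations a_k
    y = sigmoid(a_out)              # y_k = sigma(a_k)
    return y

# toy example: D = 3 inputs, M = 4 hidden units, K = 2 outputs
rng = np.random.default_rng(0)
D, M, K = 3, 4, 2
W1, b1 = rng.normal(size=(M, D)), np.zeros(M)
W2, b2 = rng.normal(size=(K, M)), np.zeros(K)
print(forward(rng.normal(size=D), W1, b1, W2, b2))
```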
5-2. Network Training
minimize error function
\[E(\mathbf{w})=\frac{1}{2} \sum_{n=1}^{N}\left\|\mathbf{y}\left(\mathbf{x}_{n}, \mathbf{w}\right)-\mathbf{t}_{n}\right\|^{2}\]
However, we can provide a more "general" view with a probabilistic interpretation
5-2-1. Regression
output :
- not deterministic
- stochastic!
\(p(t \mid \mathbf{x}, \mathbf{w})=\mathcal{N}\left(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}\right)\) —————— (1)
where \(\beta^{-1}\) is a precision of the Gaussian noise
likelihood function : \(p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)=\prod_{n=1}^{N} p\left(t_{n} \mid \mathbf{x}_{n}, \mathbf{w}, \beta\right)\) ————- (2)
NLL (Negative Log Likelihood) : \(-\text{log}\;\;p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)\)
by (1) and (2) … \(\frac{\beta}{2} \sum_{n=1}^{N}\left\{y\left(\mathbf{x}_{n}, \mathbf{w}\right)-t_{n}\right\}^{2}-\frac{N}{2} \ln \beta+\frac{N}{2} \ln (2 \pi)\)
Step 1) find \(\mathbf{w}_{ML}\)
Maximizing Likelihood : \(p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)\)
= Minimizing SSE : \(E(\mathbf{w})=\frac{1}{2} \sum_{n=1}^{N}\left\{y\left(\mathbf{x}_{n}, \mathbf{w}\right)-t_{n}\right\}^{2}\)
Step 2) find \(\beta_{ML}\)
\[\frac{1}{\beta_{\mathrm{ML}}}=\frac{1}{N} \sum_{n=1}^{N}\left\{y\left(\mathbf{x}_{n}, \mathbf{w}_{\mathrm{ML}}\right)-t_{n}\right\}^{2}\]
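A minimal sketch ( toy `y_pred` and `t` arrays, my own illustrative values ) of how the two ML steps reduce to simple residual statistics :

```python
import numpy as np

# assume y_pred = y(x_n, w_ML) has already been obtained by minimizing the SSE
y_pred = np.array([0.9, 2.1, 2.9])   # network outputs (toy values)
t      = np.array([1.0, 2.0, 3.0])   # targets
N = len(t)

sse = 0.5 * np.sum((y_pred - t) ** 2)       # E(w), the quantity minimized in step 1
beta_ml = 1.0 / np.mean((y_pred - t) ** 2)  # step 2: 1/beta_ML = mean squared residual
print(sse, beta_ml)
```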
5-2-2. Classification (binary)
single output, with activation function =
logistic sigmoid : \(y=\sigma(a) \equiv \frac{1}{1+\exp (-a)}\)
likelihood : \(p(t \mid \mathbf{x}, \mathbf{w})=y(\mathbf{x}, \mathbf{w})^{t}\{1-y(\mathbf{x}, \mathbf{w})\}^{1-t}\)
NLL : \(-\text{log}\;\;p(\mathbf{t} \mid \mathbf{X}, \mathbf{w})\) = \(E(\mathbf{w})=-\sum_{n=1}^{N}\left\{t_{n} \ln y_{n}+\left(1-t_{n}\right) \ln \left(1-y_{n}\right)\right\}\)
5-2-3. Classification (multiclass)
multiple output, with activation function =
softmax : \(y_{k}(\mathbf{x}, \mathbf{w})=\frac{\exp \left(a_{k}(\mathbf{x}, \mathbf{w})\right)}{\sum_{j} \exp \left(a_{j}(\mathbf{x}, \mathbf{w})\right)}\)
likelihood : \(p(\mathbf{t} \mid \mathbf{x}, \mathbf{w})=\prod_{k=1}^{K} y_{k}(\mathbf{x}, \mathbf{w})^{t_{k}}\)
NLL : \(-\text{log}\;\;p(\mathbf{t} \mid \mathbf{X}, \mathbf{w})\) = \(E(\mathbf{w})=-\sum_{n=1}^{N} \sum_{k=1}^{K} t_{k n} \ln y_{k}\left(\mathbf{x}_{n}, \mathbf{w}\right)\)
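A small sketch of the softmax output and the multiclass cross-entropy error; the toy activations `A` and 1-of-K targets `T` are made-up values :

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)        # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(Y, T):
    """E(w) = - sum_n sum_k t_nk ln y_k(x_n, w)."""
    return -np.sum(T * np.log(Y + 1e-12))

# toy example: N = 2 points, K = 3 classes, raw activations a_k(x_n, w)
A = np.array([[2.0, 0.5, -1.0],
              [0.1, 0.2,  0.3]])
T = np.array([[1, 0, 0],
              [0, 0, 1]])                        # 1-of-K targets
Y = softmax(A)
print(Y, cross_entropy(Y, T))
```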
5-2-4. Parameter Optimization
\[\mathbf{w}^{(\tau+1)}=\mathbf{w}^{(\tau)}+\Delta \mathbf{w}^{(\tau)}\]
5-2-5. Local quadratic approximation
Taylor expansion of \(E(\mathbf{w})\) around some point \(\widehat{\mathbf{w}}\) :
- \[E(\mathbf{w}) \simeq E(\widehat{\mathbf{w}})+(\mathbf{w}-\widehat{\mathbf{w}})^{\mathrm{T}} \mathbf{b}+\frac{1}{2}(\mathbf{w}-\widehat{\mathbf{w}})^{\mathrm{T}} \mathbf{H}(\mathbf{w}-\widehat{\mathbf{w}})\]
- \(\mathbf{b} \equiv \nabla E\mid_{\mathbf{w}=\widehat{\mathbf{w}}}\) : gradient of \(E\), evaluated at \(\widehat{\mathbf{w}}\)
- \(\mathbf{H}\) : Hessian Matrix
- element : \((\mathbf{H})_{i j} \equiv \frac{\partial^{2} E}{\partial w_{i} \partial w_{j}}\mid_{\mathbf{w}=\widehat{\mathbf{w}}}\)
- Local approximation to the gradient : \(\nabla E \simeq \mathbf{b}+\mathbf{H}(\mathbf{w}-\widehat{\mathbf{w}})\)
5-2-6. Gradient Descent optimization
\[\mathbf{w}^{(\tau+1)}=\mathbf{w}^{(\tau)}-\eta \nabla E\left(\mathbf{w}^{(\tau)}\right)\]
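A minimal gradient-descent loop; the quadratic error in `grad_E` and the learning rate are illustrative stand-ins for the network's \(E(\mathbf{w})\) and \(\eta\) :

```python
import numpy as np

def grad_E(w):
    # gradient of a toy quadratic error E(w) = 0.5 * ||w - w_star||^2
    w_star = np.array([1.0, -2.0])
    return w - w_star

eta = 0.1                      # learning rate
w = np.zeros(2)                # w^(0)
for tau in range(100):
    w = w - eta * grad_E(w)    # w^(tau+1) = w^(tau) - eta * grad E(w^(tau))
print(w)                       # converges towards w_star
```

In a real network, the gradient would be supplied by error back-propagation.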
5-3. The Hessian Matrix
Back-prop can also be used to evaluate the "second derivatives of the error"
= \(\frac{\partial^{2} E}{\partial w_{j i} \partial w_{l k}}\)
5-3-1. Diagonal Approximation
sometimes, we are more interested in the "inverse of the Hessian"
\(\rightarrow\) therefore, interested in a "diagonal approximation" to the Hessian
( = replace the off-diagonal elements with zero(0), so the inverse is easy to calculate! )
[Step 1]
diagonal elements of Hessian : \(\frac{\partial^{2} E_{n}}{\partial w_{j i}^{2}}=\frac{\partial^{2} E_{n}}{\partial a_{j}^{2}} z_{i}^{2}\)——–(1)
( \(\because\) feed-forward : \(a_{j}=\sum_{i} w_{j i} z_{i}\))
[Step 2]
apply chain rule in (1)
\(\frac{\partial^{2} E_{n}}{\partial a_{j}^{2}}=h^{\prime}\left(a_{j}\right)^{2} \sum_{k} \sum_{k^{\prime}} w_{k j} w_{k^{\prime} j} \frac{\partial^{2} E_{n}}{\partial a_{k} \partial a_{k^{\prime}}}+h^{\prime \prime}\left(a_{j}\right) \sum_{k} w_{k j} \frac{\partial E_{n}}{\partial a_{k}}\) —— (2)
( \(\because\) \(a_{j}=\sum_{i} w_{j i} z_{i}\) and \(z_j = h(a_j)\) )
[Step 3]
neglect off-diagonal elements in (2)
\[\frac{\partial^{2} E_{n}}{\partial a_{j}^{2}}=h^{\prime}\left(a_{j}\right)^{2} \sum_{k} w_{k j}^{2} \frac{\partial^{2} E_{n}}{\partial a_{k}^{2}}+h^{\prime \prime}\left(a_{j}\right) \sum_{k} w_{k j} \frac{\partial E_{n}}{\partial a_{k}}\]
Result
- before (full Hessian) : \(O\left(W^{2}\right)\)
- after (approximation) : \(O(W)\)
( but in practice, Hessian is strongly non-diagonal )
5-3-2. Outer product approximation
(1) Regression
- error function (SSE) : \(E=\frac{1}{2} \sum_{n=1}^{N}\left(y_{n}-t_{n}\right)^{2}\)
- Hessian matrix : \(\mathbf{H}=\nabla \nabla E=\sum_{n=1}^{N} \nabla y_{n}\left(\nabla y_{n}\right)^{\mathrm{T}}+\sum_{n=1}^{N}\left(y_{n}-t_{n}\right) \nabla \nabla y_{n}\)
- if \(y_n\) and \(t_n\) are close, the second term ( = \(\sum_{n=1}^{N}\left(y_{n}-t_{n}\right) \nabla \nabla y_{n}\) ) is small
\(\rightarrow\) neglect the second term
\(\rightarrow\) \(\mathbf{H} \simeq \sum^{N}_{n=1} \mathbf{b}_{n} \mathbf{b}_{n}^{\mathrm{T}}\) , where \(\mathbf{b}_{n}=\nabla y_{n}=\nabla a_{n}\)
( called "Levenberg-Marquardt approximation", or "outer product approximation" )
(2) Classification
- error function : cross-entropy
- Hessian matrix : same procedure, again neglect the second term
\(\rightarrow\) \(\mathbf{H} \simeq \sum_{n=1}^{N} y_{n}\left(1-y_{n}\right) \mathbf{b}_{n} \mathbf{b}_{n}^{\mathrm{T}}\), where \(\mathbf{b}_{n} \equiv \nabla a_{n}\)
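A sketch of both outer-product approximations, assuming the per-example gradient vectors \(\mathbf{b}_n\) are already stacked into a matrix `B` of shape `(N, W)` ( how `B` is obtained, e.g. by back-prop, is left out ) :

```python
import numpy as np

def outer_product_hessian_regression(B):
    """H ~= sum_n b_n b_n^T  (Levenberg-Marquardt approximation)."""
    return B.T @ B

def outer_product_hessian_classification(B, y):
    """H ~= sum_n y_n (1 - y_n) b_n b_n^T  (logistic-output case)."""
    weights = y * (1.0 - y)                  # shape (N,)
    return (B * weights[:, None]).T @ B

# toy example: N = 5 data points, W = 3 weights
rng = np.random.default_rng(1)
B = rng.normal(size=(5, 3))
y = rng.uniform(0.1, 0.9, size=5)
print(outer_product_hessian_regression(B))
print(outer_product_hessian_classification(B, y))
```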
5-3-3. Inverse Hessian
use outer-product approximation, to efficiently calculate inverse of Hessian!
first, outer product approximation : \(\mathbf{H}_{N}=\sum^{N}_{n=1} \mathbf{b}_{n} \mathbf{b}_{n}^{\mathrm{T}}\)
using the first \(L\) data points, then adding the \((L+1)\)-th data point : \(\mathbf{H}_{L+1}=\mathbf{H}_{L}+\mathbf{b}_{L+1} \mathbf{b}_{L+1}^{\mathrm{T}}\)
then, its inverse will be \(\mathbf{H}_{L+1}^{-1}=\mathbf{H}_{L}^{-1}-\frac{\mathbf{H}_{L}^{-1} \mathbf{b}_{L+1} \mathbf{b}_{L+1}^{\mathrm{T}} \mathbf{H}_{L}^{-1}}{1+\mathbf{b}_{L+1}^{\mathrm{T}} \mathbf{H}_{L}^{-1} \mathbf{b}_{L+1}}\)
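A sketch of this sequential update ( an instance of the Sherman-Morrison identity ), starting from an initial \(\mathbf{H}_0 = \alpha\mathbf{I}\) so the matrix stays invertible; `B` again stacks the vectors \(\mathbf{b}_n\) :

```python
import numpy as np

def sequential_inverse_hessian(B, alpha=1e-3):
    """Build H^{-1} incrementally, where H = alpha*I + sum_n b_n b_n^T."""
    N, W = B.shape
    H_inv = np.eye(W) / alpha              # inverse of the initial H_0 = alpha * I
    for n in range(N):
        b = B[n]                           # b_{L+1}
        Hb = H_inv @ b
        H_inv = H_inv - np.outer(Hb, Hb) / (1.0 + b @ Hb)
    return H_inv

rng = np.random.default_rng(2)
B = rng.normal(size=(20, 4))
H = 1e-3 * np.eye(4) + B.T @ B
print(np.allclose(sequential_inverse_hessian(B), np.linalg.inv(H)))  # True
```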
5-3-4. Finite differences
we can find second derivatives by using finite differences!
perturb each possible pair of weights
- \[\begin{array}{l} \frac{\partial^{2} E}{\partial w_{j i} \partial w_{l k}}=\frac{1}{4 \epsilon^{2}}\left\{E\left(w_{j i}+\epsilon, w_{l k}+\epsilon\right)-E\left(w_{j i}+\epsilon, w_{l k}-\epsilon\right)\right. \\ \left.\quad-E\left(w_{j i}-\epsilon, w_{l k}+\epsilon\right)+E\left(w_{j i}-\epsilon, w_{l k}-\epsilon\right)\right\}+O\left(\epsilon^{2}\right) \end{array}\]
- \(W^2\) elements in the Hessian matrix
- each element requires 4 forward propagations, each needing \(O(W)\)
- result : \(O(W^3)\)
can do it more efficiently
- by applying central differences to the first derivatives
- \[\frac{\partial^{2} E}{\partial w_{j i} \partial w_{l k}}=\frac{1}{2 \epsilon}\left\{\frac{\partial E}{\partial w_{j i}}\left(w_{l k}+\epsilon\right)-\frac{\partial E}{\partial w_{j i}}\left(w_{l k}-\epsilon\right)\right\}+O\left(\epsilon^{2}\right)\]
- \(W^2\) elements in the Hessian matrix
- but only \(W\) weights need to be perturbed ( the first derivatives are already calculated by back-prop )
- the gradients can be evaluated in \(O(W)\) steps
- result : \(O(W^2)\)
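A sketch of the \(O(W^2)\) variant: central differences applied to a gradient function; `grad_E` here is a generic callable standing in for the back-prop gradient :

```python
import numpy as np

def hessian_by_central_differences(grad_E, w, eps=1e-5):
    """H[j, k] ~= (dE/dw_j(w + eps*e_k) - dE/dw_j(w - eps*e_k)) / (2*eps)."""
    W = len(w)
    H = np.zeros((W, W))
    for k in range(W):                      # only W perturbations needed
        e = np.zeros(W); e[k] = eps
        H[:, k] = (grad_E(w + e) - grad_E(w - e)) / (2.0 * eps)
    return 0.5 * (H + H.T)                  # symmetrize to reduce numerical noise

# toy check on E(w) = 0.5 * w^T A w, whose exact Hessian is A
A = np.array([[2.0, 0.3], [0.3, 1.0]])
grad_E = lambda w: A @ w
print(hessian_by_central_differences(grad_E, np.array([0.5, -0.2])))
```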
5-4. Regularization in Neural Networks
weight decay : \(\widetilde{E}(\mathbf{w})=E(\mathbf{w})+\frac{\lambda}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}\)
5-4-1. Consistent Gaussian priors
Limitation of simple weight decay :
- inconsistent with scaling!
Transformation of the input variables
- \(x_{i} \rightarrow \widetilde{x}_{i}=a x_{i}+b\)
- corresponding weights and biases :
\(\begin{aligned} w_{j i} \rightarrow \widetilde{w}_{j i} &=\frac{1}{a} w_{j i} \\ w_{j 0} \rightarrow \widetilde{w}_{j 0} &=w_{j 0}-\frac{b}{a} \sum_{i} w_{j i} \end{aligned}\)
Transformation of the output variables
- \(y_{k} \rightarrow \widetilde{y}_{k}=c y_{k}+d\)
- corresponding weights and biases :
\(\begin{aligned} w_{k j} \rightarrow \widetilde{w}_{k j} &=c w_{k j} \\ w_{k 0} \rightarrow \widetilde{w}_{k 0} &=c w_{k 0}+d \end{aligned}\)
Therefore, we want a regularizer which is "INVARIANT" under these linear transformations!
That is …
\(\frac{\lambda_{1}}{2} \sum_{w \in \mathcal{W}_{1}} w^{2}+\frac{\lambda_{2}}{2} \sum_{w \in \mathcal{W}_{2}} w^{2}\)
- which remains unchanged provided \(\lambda_{1} \rightarrow a^{1 / 2} \lambda_{1}\) and \(\lambda_{2} \rightarrow c^{-1 / 2} \lambda_{2}\)
It corresponds to a prior of…
\[p\left(\mathbf{w} \mid \alpha_{1}, \alpha_{2}\right) \propto \exp \left(-\frac{\alpha_{1}}{2} \sum_{w \in \mathcal{W}_{1}} w^{2}-\frac{\alpha_{2}}{2} \sum_{w \in \mathcal{W}_{2}} w^{2}\right)\]
( more generally : \(p(\mathbf{w}) \propto \exp \left(-\frac{1}{2} \sum_{k} \alpha_{k}\|\mathbf{w}\|_{k}^{2}\right)\) )
If we choose the groups to correspond to the sets of weights associated with each of the input units,
and optimize the marginal likelihood w.r.t the corresponding parameters \(\alpha_k\) :
\(\rightarrow\) ARD ( Automatic Relevance Determination ) … Chapter 7
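A small sketch of the grouped weight-decay term above, with first- and second-layer weights as the two groups \(\mathcal{W}_1, \mathcal{W}_2\) and biases excluded; the layer shapes are my own toy choices :

```python
import numpy as np

def grouped_weight_decay(W1, W2, lam1, lam2):
    """Regularizer (lam1/2) * sum_{w in W_1} w^2 + (lam2/2) * sum_{w in W_2} w^2.

    W1 / W2 : first- and second-layer weight matrices (biases excluded).
    """
    return 0.5 * lam1 * np.sum(W1 ** 2) + 0.5 * lam2 * np.sum(W2 ** 2)

rng = np.random.default_rng(3)
W1 = rng.normal(size=(4, 3))   # first-layer weights
W2 = rng.normal(size=(2, 4))   # second-layer weights
print(grouped_weight_decay(W1, W2, lam1=0.1, lam2=0.01))
```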
5-4-2. Early stopping
( stop training at the point of smallest error on a validation set → acts as a form of regularization )
5-5. Mixture Density networks
practical problems : \(p(t \mid x)\) can be "non-Gaussian" ( e.g., multimodal )
Mixture Density Networks :
use mixture model for \(p(t\mid x)\), where both
- mixing coefficients
- component densities
are flexible functions of the input vector \(x\)
\[p(\mathbf{t} \mid \mathbf{x})=\sum_{k=1}^{K} \pi_{k}(\mathbf{x}) \mathcal{N}\left(\mathbf{t} \mid \boldsymbol{\mu}_{k}(\mathbf{x}), \sigma_{k}^{2}(\mathbf{x})\right)\]
Various parameters in the mixture model
- mixture coefficients : \(\pi_{k}(\mathrm{x})=\frac{\exp \left(a_{k}^{\pi}\right)}{\sum_{l=1}^{K} \exp \left(a_{l}^{\pi}\right)}\)
- means : \(\mu_{k j}(\mathbf{x})=a_{k j}^{\mu}\)
- standard deviation : \(\sigma_{k}(\mathrm{x})=\exp \left(a_{k}^{\sigma}\right)\)
( all are governed by the NN that takes \(x\) as an input )
Same function is used to predict the parameters!
( do not calculate “predicted \(y\)” directly, instead predict the “parameters (\(\pi\) and \(\mu\) and \(\sigma\))” )
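A sketch of how the raw network activations are turned into valid mixture parameters; the layout of the output vector into \(a^{\pi}, a^{\sigma}, a^{\mu}\) blocks is my own illustrative choice :

```python
import numpy as np

def mdn_parameters(a, K, L):
    """Map raw network outputs 'a' (length K + K + K*L) to (pi, sigma, mu).

    pi_k    = softmax(a^pi)   -> mixing coefficients sum to 1
    sigma_k = exp(a^sigma)    -> standard deviations are positive
    mu_kj   = a^mu            -> means are unconstrained
    """
    a_pi, a_sigma, a_mu = a[:K], a[K:2 * K], a[2 * K:]
    a_pi = a_pi - a_pi.max()
    pi = np.exp(a_pi) / np.exp(a_pi).sum()
    sigma = np.exp(a_sigma)
    mu = a_mu.reshape(K, L)
    return pi, sigma, mu

# toy example: K = 3 mixture components, L = 2 target dimensions
a = np.random.default_rng(4).normal(size=3 + 3 + 3 * 2)
print(mdn_parameters(a, K=3, L=2))
```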
Error function ( = NLL )
\[E(\mathbf{w})=-\sum_{n=1}^{N} \ln \left\{\sum_{k=1}^{K} \pi_{k}\left(\mathbf{x}_{n}, \mathbf{w}\right) \mathcal{N}\left(\mathbf{t}_{n} \mid \boldsymbol{\mu}_{k}\left(\mathbf{x}_{n}, \mathbf{w}\right), \sigma_{k}^{2}\left(\mathbf{x}_{n}, \mathbf{w}\right)\right)\right\}\]
view the mixing coefficient \(\pi_k(x)\) as an \(x\)-dependent prior probability!
then, the posterior is :
\(\gamma_{k}(\mathbf{t} \mid \mathbf{x})=\frac{\pi_{k} \mathcal{N}_{n k}}{\sum_{l=1}^{K} \pi_{l} \mathcal{N}_{n l}}\) , where \(\mathcal{N}_{n k}\) denotes \(\mathcal{N}\left(\mathbf{t}_{n} \mid \boldsymbol{\mu}_{k}\left(\mathbf{x}_{n}\right), \sigma_{k}^{2}\left(\mathbf{x}_{n}\right)\right)\)
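Given the mixture parameters for one data point, the per-point error \(E_n\) and the responsibilities \(\gamma_k\) can be computed as below ( isotropic Gaussian components, matching the formulas above; toy values ) :

```python
import numpy as np

def mdn_nll_and_responsibilities(t, pi, sigma, mu):
    """E_n = -ln sum_k pi_k N(t | mu_k, sigma_k^2 I), and gamma_k = posterior of component k."""
    K, L = mu.shape
    sq_dist = np.sum((t - mu) ** 2, axis=1)                       # ||t - mu_k||^2
    log_N = -0.5 * L * np.log(2 * np.pi * sigma ** 2) - sq_dist / (2 * sigma ** 2)
    weighted = pi * np.exp(log_N)                                 # pi_k * N_nk
    E_n = -np.log(weighted.sum())
    gamma = weighted / weighted.sum()
    return E_n, gamma

pi = np.array([0.5, 0.3, 0.2])
sigma = np.array([0.5, 1.0, 2.0])
mu = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
print(mdn_nll_and_responsibilities(np.array([0.2, 0.1]), pi, sigma, mu))
```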
Derivatives w.r.t the output activations, governing the
- 1) mixture coefficients : \(a_{k}^{\pi}\)
\(\rightarrow\) \(\frac{\partial E_{n}}{\partial a_{k}^{\pi}}=\pi_{k}-\gamma_{k}\)
- 2) component means : \(a_{kl}^{\mu}\)
\(\rightarrow\) \(\frac{\partial E_{n}}{\partial a_{k l}^{\mu}}=\gamma_{k}\left\{\frac{\mu_{k l}-t_{l}}{\sigma_{k}^{2}}\right\}\)
- 3) component variances : \(a_{k}^{\sigma}\)
\(\rightarrow\) \(\frac{\partial E_{n}}{\partial a_{k}^{\sigma}}=-\gamma_{k}\left\{\frac{\left\|\mathbf{t}-\boldsymbol{\mu}_{k}\right\|^{2}}{\sigma_{k}^{3}}-\frac{1}{\sigma_{k}}\right\}\)
Conditional Mean and Variance
- Conditional Mean : \(\mathbb{E}[\mathbf{t} \mid \mathbf{x}]=\int \mathbf{t} p(\mathbf{t} \mid \mathbf{x}) \mathrm{d} \mathbf{t}=\sum_{k=1}^{K} \pi_{k}(\mathbf{x}) \boldsymbol{\mu}_{k}(\mathbf{x})\)
- Conditional Variance :
\[\begin{aligned} s^{2}(\mathrm{x}) &=\mathbb{E}\left[\|\mathrm{t}-\mathbb{E}[\mathrm{t} \mid \mathrm{x}]\|^{2} \mid \mathrm{x}\right] \\ &=\sum_{k=1}^{K} \pi_{k}(\mathrm{x})\left\{\sigma_{k}^{2}(\mathrm{x})+\left\|\mu_{k}(\mathrm{x})-\sum_{l=1}^{K} \pi_{l}(\mathrm{x}) \mu_{l}(\mathrm{x})\right\|^{2}\right\} \end{aligned}\]
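A sketch computing these two conditional moments from the mixture parameters of a single input ( same isotropic-Gaussian convention and toy values as above ) :

```python
import numpy as np

def mdn_conditional_moments(pi, sigma, mu):
    """E[t|x] = sum_k pi_k mu_k ;  s^2(x) = sum_k pi_k { sigma_k^2 + ||mu_k - E[t|x]||^2 }."""
    mean = pi @ mu                                           # sum_k pi_k mu_k
    spread = np.sum((mu - mean) ** 2, axis=1)                # ||mu_k - E[t|x]||^2
    variance = np.sum(pi * (sigma ** 2 + spread))
    return mean, variance

pi = np.array([0.5, 0.3, 0.2])
sigma = np.array([0.5, 1.0, 2.0])
mu = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
print(mdn_conditional_moments(pi, sigma, mu))
```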
5-6. Bayesian Neural Network
( until now, we have used MLE … from now on, MAP )
Regularized ML = MAP
but in Bayesian approach, we need to “marginalize” over the parameters
Unlike linear regression, in a multi-layered NN,
an exact Bayesian treatment cannot be found! \(\rightarrow\) Variational Inference ( Ch 10 )
Variational Inference
- 1) factorized Gaussian approximation to the posterior ( Hinton and van Camp, 1993 )
- 2) ….. full covariance Gaussian ( Barber and Bishop, 1998 )
- 3) Laplace approximation ( MacKay, 1992 )
Laplace Approximation
- (1) approximate the posterior by a Gaussian, centered at a mode of the true posterior
- (2) assume covariance of the Gaussian is so small, that network function is approximately linear!
5-6-1. Posterior parameter Distribution
Posterior \(\propto\) Prior \(\times\) Likelihood
- Prior : \(p(\mathbf{w} \mid \alpha)=\mathcal{N}\left(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}\right)\)
- Likelihood ( = NN model ) : \(p(\mathcal{D} \mid \mathbf{w}, \beta)=\prod_{n=1}^{N} \mathcal{N}\left(t_{n} \mid y\left(\mathbf{x}_{n}, \mathbf{w}\right), \beta^{-1}\right)\)
Laplace Approximation
- [step 1] find a (local) maximum of the posterior ( = mode )
\(\rightarrow\) maximize the posterior! ( = \(\mathbf{w}_{MAP}\))
\(\rightarrow\) \(\ln p(\mathbf{w} \mid \mathcal{D})=-\frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}-\frac{\beta}{2} \sum_{n=1}^{N}\left\{y\left(\mathbf{x}_{n}, \mathbf{w}\right)-t_{n}\right\}^{2}+\mathrm{const}\)
( assume \(\alpha\) and \(\beta\) are fixed )
- [step 2] build a local Gaussian approximation by evaluating the matrix of second derivatives
\(\rightarrow\) \(\mathbf{A}=-\nabla \nabla \ln p(\mathbf{w} \mid \mathcal{D}, \alpha, \beta)=\alpha \mathbf{I}+\beta \mathbf{H}\)
- \(H\): Hessian matrix, comprising the second derivatives
- Result ( from step 1 + step 2 ) :
- posterior : \(q(\mathbf{w} \mid \mathcal{D})=\mathcal{N}\left(\mathbf{w} \mid \mathbf{w}_{\mathrm{MAP}}, \mathbf{A}^{-1}\right)\)
- predictive distribution : \(p(t \mid \mathbf{x}, \mathcal{D})=\int p(t \mid \mathbf{x}, \mathbf{w}) q(\mathbf{w} \mid \mathcal{D}) \mathrm{d} \mathbf{w}\)
( this integration is still intractable…)
use Taylor Series expansion!
- Taylor series expansion of the network function around \(\mathbf{w}_{MAP}\)
( retain only linear terms! )
\[y(\mathbf{x}, \mathbf{w}) \simeq y\left(\mathbf{x}, \mathbf{w}_{\mathrm{MAP}}\right)+\mathbf{g}^{\mathrm{T}}\left(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}}\right)\]
- where \(\mathbf{g}=\nabla_{\mathbf{w}} y(\mathbf{x}, \mathbf{w})\mid_{\mathbf{w}=\mathbf{w}_{\mathrm{MAP}}}\)
With the approximation above…
we have “Gaussian” for \(p(w)\) and “Gaussian” for \(p(t \mid w)\)
\(\therefore p(t\mid x, \mathbf{w}, \beta) \simeq \mathcal{N}\left(t \mid y\left(\mathbf{x}, \mathbf{w}_{\mathrm{MAP}}\right)+\mathbf{g}^{\mathbf{T}}\left(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}}\right), \beta^{-1}\right)\) —– (1)
if we marginalize out (1) w.r.t \(w\)
\[p(t \mid \mathbf{x}, \mathcal{D}, \alpha, \beta)=\mathcal{N}\left(t \mid y\left(\mathbf{x}, \mathbf{w}_{\mathrm{MAP}}\right), \sigma^{2}(\mathbf{x})\right)\]
- where \(\sigma^{2}(\mathbf{x})=\beta^{-1}+\mathbf{g}^{\mathrm{T}} \mathbf{A}^{-1} \mathbf{g}\)
- term 1) intrinsic noise on the target variable
- term 2) uncertainty in the interpolant due to the uncertainty in the model parameter
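A sketch of this predictive variance, assuming the output gradient `g` at \(\mathbf{w}_{MAP}\) and a (toy) Hessian `H` are already available :

```python
import numpy as np

def predictive_variance(g, H, alpha, beta):
    """sigma^2(x) = 1/beta + g^T A^{-1} g, with A = alpha*I + beta*H."""
    W = len(g)
    A = alpha * np.eye(W) + beta * H
    return 1.0 / beta + g @ np.linalg.solve(A, g)

rng = np.random.default_rng(5)
g = rng.normal(size=4)                       # gradient of y(x, w) at w_MAP
M = rng.normal(size=(4, 4))
H = M.T @ M                                  # toy positive-definite Hessian
print(predictive_variance(g, H, alpha=0.5, beta=10.0))
```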
5-6-2. Hyperparameter optimization
hyperparameters : \(\alpha\) and \(\beta\)
In order to compare different models, we need to evaluate the model evidence!
Marginal Likelihood (=Evidence) for the hyperparameters :
- integrate over the weights!
\[p(\mathcal{D} \mid \alpha, \beta)=\int p(\mathcal{D} \mid \mathbf{w}, \beta) p(\mathbf{w} \mid \alpha) \mathrm{d} \mathbf{w}\]
- use the Laplace Approximation ( check Eq. (4.135) in Ch 4 )
\[\ln p(\mathcal{D} \mid \alpha, \beta) \simeq-E\left(\mathbf{w}_{\mathrm{MAP}}\right)-\frac{1}{2} \ln |\mathbf{A}|+\frac{W}{2} \ln \alpha+\frac{N}{2} \ln \beta-\frac{N}{2} \ln (2 \pi)\]
where \(E\left(\mathbf{w}_{\mathrm{MAP}}\right)=\frac{\beta}{2} \sum_{n=1}^{N}\left\{y\left(\mathbf{x}_{n}, \mathbf{w}_{\mathrm{MAP}}\right)-t_{n}\right\}^{2}+\frac{\alpha}{2} \mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}} \mathbf{w}_{\mathrm{MAP}}\)
Make point estimates for \(\alpha\) and \(\beta\), by maximizing \(\ln p(\mathcal{D} \mid \alpha, \beta)\)
- \(\beta \mathbf{H} \mathbf{u}_{i}=\lambda_{i} \mathbf{u}_{i}\) ( \(\mathbf{H}\) : Hessian matrix, comprising the second derivatives of the SSE, evaluated at \(\mathbf{w} = \mathbf{w}_{MAP}\) )
- \(\alpha=\frac{\gamma}{\mathrm{w}_{\mathrm{MAP}}^{\mathrm{T}} \mathrm{w}_{\mathrm{MAP}}}\).
- effective number of parameters : \(\gamma=\sum_{i=1}^{W} \frac{\lambda_{i}}{\alpha+\lambda_{i}}\)
\(\therefore\) maximizing the evidence w.r.t \(\beta\):
\[\frac{1}{\beta}=\frac{1}{N-\gamma} \sum_{n=1}^{N}\left\{y\left(\mathbf{x}_{n}, \mathbf{w}_{\mathrm{MAP}}\right)-t_{n}\right\}^{2}\]
alternate between
- 1) updating the posterior distribution
- 2) re-estimation of the hyperparameters \(\alpha,\beta\)
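A sketch of one re-estimation sweep, assuming `H` ( Hessian of the SSE at \(\mathbf{w}_{MAP}\) ), the MAP weights, the predictions, and the targets are already available ( toy values below ) :

```python
import numpy as np

def reestimate_hyperparameters(H, w_map, y_pred, t, alpha, beta):
    """One evidence-framework update of (alpha, beta)."""
    lam = beta * np.linalg.eigvalsh(H)                 # eigenvalues of beta*H
    gamma = np.sum(lam / (alpha + lam))                # effective number of parameters
    alpha_new = gamma / (w_map @ w_map)                # alpha = gamma / (w_MAP^T w_MAP)
    N = len(t)
    beta_new = (N - gamma) / np.sum((y_pred - t) ** 2) # 1/beta = SSE / (N - gamma)
    return alpha_new, beta_new

rng = np.random.default_rng(6)
M = rng.normal(size=(4, 4)); H = M.T @ M
w_map = rng.normal(size=4)
y_pred, t = rng.normal(size=10), rng.normal(size=10)
print(reestimate_hyperparameters(H, w_map, y_pred, t, alpha=1.0, beta=5.0))
```

In practice this update is interleaved with re-finding \(\mathbf{w}_{MAP}\), as the alternation above describes.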
5-6-3. Bayesian Neural Network for Classification
so far, we have used “Laplace Approximation” for Bayesian NN for regression.
How about classification?
example with ‘two-class classification’ ( with single logistic sigmoid )
- log likelihood : \(\ln p(\mathcal{D} \mid \mathbf{w})=\sum_{n=1}^{N}\left\{t_{n} \ln y_{n}+\left(1-t_{n}\right) \ln \left(1-y_{n}\right)\right\}\)
- no hyperparameter \(\beta\) ( \(\because\) data points are assumed to be correctly labeled )
[Step 1] initialize \(\alpha\), and find \(\mathbf{w}_{\text{MAP}}\)
- by maximizing the log posterior distribution
- = by minimizing \(E(\mathbf{w})=-\ln p(\mathcal{D} \mid \mathbf{w})+\frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}\)
[Step 2] find Hessian matrix \(\mathbf{H}\)
- second derivative of negative log posterior
Result ( from step 1 + step 2 ) :
- posterior : \(q(\mathbf{w} \mid \mathcal{D})=\mathcal{N}\left(\mathbf{w} \mid \mathbf{w}_{\mathrm{MAP}}, \mathbf{A}^{-1}\right)\)
where \(\mathbf{A}=-\nabla \nabla \ln p(\mathbf{w} \mid \mathcal{D}, \alpha)=\alpha \mathbf{I}+\mathbf{H}\) ( no \(\beta\) here )
[Step 3] optimize hyperparameter \(\alpha\)
- maximize the marginal likelihood, \(\ln p(\mathcal{D} \mid \alpha) \simeq-E\left(\mathbf{w}_{\mathrm{MAP}}\right)-\frac{1}{2} \ln \mid \mathbf{A}\mid +\frac{W}{2} \ln \alpha+\mathrm{const}\)
- \(E\left(\mathrm{w}_{\mathrm{MAP}}\right)=-\sum_{n=1}^{N}\left\{t_{n} \ln y_{n}+\left(1-t_{n}\right) \ln \left(1-y_{n}\right)\right\}+\frac{\alpha}{2} \mathrm{w}_{\mathrm{MAP}}^{\mathrm{T}} \mathrm{w}_{\mathrm{MAP}}\).
- \(y_{n} \equiv y\left(\mathrm{x}_{n}, \mathrm{w}_{\mathrm{MAP}}\right)\).
- that is, \(\alpha=\frac{\gamma}{\mathrm{w}_{\mathrm{MAP}}^{\mathrm{T}} \mathrm{w}_{\mathrm{MAP}}}\)
[Step 4] find predictive distribution ( refer to Ch4.5.2 )
- \(p(t \mid \mathbf{x}, \mathcal{D})=\int p(t \mid \mathbf{x}, \mathbf{w}) q(\mathbf{w} \mid \mathcal{D}) \mathrm{d} \mathbf{w}\) is intractable
- by Laplace approximation : \(p(t \mid \mathrm{x}, \mathcal{D}) \simeq p\left(t \mid \mathrm{x}, \mathrm{w}_{\mathrm{MAP}}\right)\)
- improve it by considering the "variance of the posterior"
- linear approximation for the output :
\[a(\mathbf{x}, \mathbf{w}) \simeq a_{\mathrm{MAP}}(\mathbf{x})+\mathbf{b}^{\mathrm{T}}\left(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}}\right)\]
where \(a_{\mathrm{MAP}}(\mathrm{x})=a\left(\mathrm{x}, \mathrm{w}_{\mathrm{MAP}}\right)\) and \(\mathrm{b} \equiv \nabla a\left(\mathrm{x}, \mathrm{w}_{\mathrm{MAP}}\right)\)
- the resulting distribution \(p(a \mid \mathbf{x}, \mathcal{D})\) is Gaussian, with
- mean : \(a_{\mathrm{MAP}} \equiv a\left(\mathrm{x}, \mathrm{w}_{\mathrm{MAP}}\right)\)
- variance : \(\sigma_{a}^{2}(\mathrm{x})=\mathrm{b}^{\mathrm{T}}(\mathrm{x}) \mathrm{A}^{-1} \mathrm{~b}(\mathrm{x})\)
[Result] approximate predictive distribution :
To obtain predictive distribution, marginalize over \(a\)
\[p(t=1 \mid \mathrm{x}, \mathcal{D})=\int \sigma(a) p(a \mid \mathrm{x}, \mathcal{D}) \mathrm{d} a\]
after approximation using the probit function …
\[p(t=1 \mid \mathbf{x}, \mathcal{D})=\sigma\left(\kappa\left(\sigma_{a}^{2}\right) a_{\mathrm{MAP}}\right)\]
where \(\kappa\left(\sigma^{2}\right)=\left(1+\pi \sigma^{2} / 8\right)^{-1 / 2}\) ( from Ch 4 )
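A sketch of this final predictive probability; `b` ( gradient of \(a\) at \(\mathbf{w}_{MAP}\) ) and `A` ( posterior precision ) are assumed given, with toy values here :

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_class_probability(a_map, b, A):
    """p(t=1 | x, D) ~= sigma( kappa(sigma_a^2) * a_MAP )."""
    sigma_a2 = b @ np.linalg.solve(A, b)          # sigma_a^2(x) = b^T A^{-1} b
    kappa = 1.0 / np.sqrt(1.0 + np.pi * sigma_a2 / 8.0)
    return sigmoid(kappa * a_map)

rng = np.random.default_rng(7)
b = rng.normal(size=4)
M = rng.normal(size=(4, 4)); A = M.T @ M + np.eye(4)
print(predictive_class_probability(a_map=1.3, b=b, A=A))
# note: kappa < 1 moderates the prediction towards 0.5,
# compared with the plain MAP prediction sigma(a_MAP)
```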