( Skipping the basic parts and less important contents )
5. Neural Networks
SVM ( Support Vector Machine ) :
- first defines "basis functions" centered on the training data points ( details later, in Ch 7 ),
then selects a subset of these during training!
- advantage : involves non-linear optimization, but the objective function is convex
thus, optimization is straightforward
RVM ( Relevance Vector Machine )
- also chooses a subset from a fixed set of basis functions
- differences from SVM :
- sparser models
- probabilistic outputs
- non-convex optimization
NN ( Neural Network )
- alternative to those two : fix the number of basis functions in advance, but allow them to be adaptive ( their parameters are learned during training )
- ex) MLP ( Multi-Layer Perceptron )
5-1. Feed-forward Network Functions
linear combinations of fixed non-linear basis functions \(\phi_j(x)\)
\(y(\mathbf{x}, \mathbf{w})=f\left(\sum_{j=1}^{M} w_{j} \phi_{j}(\mathbf{x})\right)\).
- \(a_{j}=\sum_{i=1}^{D} w_{j i}^{(1)} x_{i}+w_{j 0}^{(1)}\).
- \(z_{j}=h\left(a_{j}\right)\).
- ex) logistic sigmoid function :
\(y_{k}=\sigma\left(a_{k}\right)\), where \(\sigma(a)=\frac{1}{1+\exp (-a)}\)
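To make the feed-forward equations concrete, a minimal NumPy sketch of a two-layer network ( hidden activation \(h=\tanh\), logistic-sigmoid outputs ); the weight names and shapes here are my own illustrative choices :

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    """Two-layer feed-forward network: x -> hidden z -> output y."""
    a_hidden = W1 @ x + b1          # a_j = sum_i w_ji^(1) x_i + w_j0^(1)
    z = np.tanh(a_hidden)           # z_j = h(a_j), here h = tanh
    a_out = W2 @ z + b2             # second-layer activations a_k
    y = sigmoid(a_out)              # y_k = sigma(a_k)
    return y

# toy example: D = 3 inputs, M = 4 hidden units, K = 2 outputs
rng = np.random.default_rng(0)
D, M, K = 3, 4, 2
W1, b1 = rng.normal(size=(M, D)), np.zeros(M)
W2, b2 = rng.normal(size=(K, M)), np.zeros(K)
print(forward(rng.normal(size=D), W1, b1, W2, b2))
```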
5-2. Network Training
minimize error function
\[E(\mathbf{w})=\frac{1}{2} \sum_{n=1}^{N}\left\|\mathbf{y}\left(\mathbf{x}_{n}, \mathbf{w}\right)-\mathbf{t}_{n}\right\|^{2}\]
However, we can provide a more "general" view with a probabilistic interpretation
5-2-1. Regression
output :
- not deterministic
- stochastic!
\(p(t \mid \mathbf{x}, \mathbf{w})=\mathcal{N}\left(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}\right)\) —————— (1)
where \(\beta^{-1}\) is a precision of the Gaussian noise
likelihood function : \(p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)=\prod_{n=1}^{N} p\left(t_{n} \mid \mathbf{x}_{n}, \mathbf{w}, \beta\right)\) ————- (2)
NLL (Negative Log Likelihood) : \(-\text{log}\;\;p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)\)
by (1) and (2) … \(\frac{\beta}{2} \sum_{n=1}^{N}\left\{y\left(\mathbf{x}_{n}, \mathbf{w}\right)-t_{n}\right\}^{2}-\frac{N}{2} \ln \beta+\frac{N}{2} \ln (2 \pi)\)
Step 1) find \(\mathbf{w}_{ML}\)
Maximizing Likelihood : \(p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)\)
= Minimizing SSE : \(E(\mathbf{w})=\frac{1}{2} \sum_{n=1}^{N}\left\{y\left(\mathbf{x}_{n}, \mathbf{w}\right)-t_{n}\right\}^{2}\)
Step 2) find \(\beta_{ML}\)
\[\frac{1}{\beta_{\mathrm{ML}}}=\frac{1}{N} \sum_{n=1}^{N}\left\{y\left(\mathbf{x}_{n}, \mathbf{w}_{\mathrm{ML}}\right)-t_{n}\right\}^{2}\]
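A minimal sketch ( toy `y_pred` and `t` arrays, my own illustrative values ) of how the two ML steps reduce to simple residual statistics :

```python
import numpy as np

# assume y_pred = y(x_n, w_ML) has already been obtained by minimizing the SSE
y_pred = np.array([0.9, 2.1, 2.9])   # network outputs (toy values)
t      = np.array([1.0, 2.0, 3.0])   # targets
N = len(t)

sse = 0.5 * np.sum((y_pred - t) ** 2)       # E(w), the quantity minimized in step 1
beta_ml = 1.0 / np.mean((y_pred - t) ** 2)  # step 2: 1/beta_ML = mean squared residual
print(sse, beta_ml)
```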
5-2-2. Classification (binary)
single output, with activation function =
logistic sigmoid : \(y=\sigma(a) \equiv \frac{1}{1+\exp (-a)}\)
likelihood : \(p(t \mid \mathbf{x}, \mathbf{w})=y(\mathbf{x}, \mathbf{w})^{t}\{1-y(\mathbf{x}, \mathbf{w})\}^{1-t}\)
NLL : \(-\text{log}\;\;p(\mathbf{t} \mid \mathbf{X}, \mathbf{w})\) = \(E(\mathbf{w})=-\sum_{n=1}^{N}\left\{t_{n} \ln y_{n}+\left(1-t_{n}\right) \ln \left(1-y_{n}\right)\right\}\)
5-2-3. Classification (multiclass)
multiple output, with activation function =
softmax : \(y_{k}(\mathbf{x}, \mathbf{w})=\frac{\exp \left(a_{k}(\mathbf{x}, \mathbf{w})\right)}{\sum_{j} \exp \left(a_{j}(\mathbf{x}, \mathbf{w})\right)}\)
likelihood : \(p(\mathbf{t} \mid \mathbf{x}, \mathbf{w})=\prod_{k=1}^{K} y_{k}(\mathbf{x}, \mathbf{w})^{t_{k}}\)
NLL : \(-\text{log}\;\;p(\mathbf{t} \mid \mathbf{X}, \mathbf{w})\) = \(E(\mathbf{w})=-\sum_{n=1}^{N} \sum_{k=1}^{K} t_{k n} \ln y_{k}\left(\mathbf{x}_{n}, \mathbf{w}\right)\)
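A small sketch of the softmax output and the multiclass cross-entropy error; the toy activations `A` and 1-of-K targets `T` are made-up values :

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)        # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(Y, T):
    """E(w) = - sum_n sum_k t_nk ln y_k(x_n, w)."""
    return -np.sum(T * np.log(Y + 1e-12))

# toy example: N = 2 points, K = 3 classes, raw activations a_k(x_n, w)
A = np.array([[2.0, 0.5, -1.0],
              [0.1, 0.2,  0.3]])
T = np.array([[1, 0, 0],
              [0, 0, 1]])                        # 1-of-K targets
Y = softmax(A)
print(Y, cross_entropy(Y, T))
```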
5-2-4. Parameter Optimization
\[\mathbf{w}^{(\tau+1)}=\mathbf{w}^{(\tau)}+\Delta \mathbf{w}^{(\tau)}\]
5-2-5. Local quadratic approximation
Taylor expansion of \(E(\mathbf{w})\) around some point \(\widehat{\mathbf{w}}\) :
- \[E(\mathbf{w}) \simeq E(\widehat{\mathbf{w}})+(\mathbf{w}-\widehat{\mathbf{w}})^{\mathrm{T}} \mathbf{b}+\frac{1}{2}(\mathbf{w}-\widehat{\mathbf{w}})^{\mathrm{T}} \mathbf{H}(\mathbf{w}-\widehat{\mathbf{w}})\]
- \(\mathbf{b} \equiv \nabla E\mid_{\mathbf{w}=\widehat{\mathbf{w}}}\) : gradient of \(E\), evaluated at \(\widehat{\mathbf{w}}\)
- \(\mathbf{H}\) : Hessian Matrix
- element : \((\mathbf{H})_{i j} \equiv \frac{\partial^{2} E}{\partial w_{i} \partial w_{j}}\mid_{\mathbf{w}=\widehat{\mathbf{w}}}\)
- Local approximation to the gradient : \(\nabla E \simeq \mathbf{b}+\mathbf{H}(\mathbf{w}-\widehat{\mathbf{w}})\)
5-2-6. Gradient Descent optimization
\[\mathbf{w}^{(\tau+1)}=\mathbf{w}^{(\tau)}-\eta \nabla E\left(\mathbf{w}^{(\tau)}\right)\]
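A minimal gradient-descent loop; the quadratic error in `grad_E` and the learning rate are illustrative stand-ins for the network's \(E(\mathbf{w})\) and \(\eta\) :

```python
import numpy as np

def grad_E(w):
    # gradient of a toy quadratic error E(w) = 0.5 * ||w - w_star||^2
    w_star = np.array([1.0, -2.0])
    return w - w_star

eta = 0.1                      # learning rate
w = np.zeros(2)                # w^(0)
for tau in range(100):
    w = w - eta * grad_E(w)    # w^(tau+1) = w^(tau) - eta * grad E(w^(tau))
print(w)                       # converges towards w_star
```

In a real network, the gradient would be supplied by error back-propagation.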
5-3. The Hessian Matrix
Back-prop can also be used to evaluate the "second derivatives of the error"
= \(\frac{\partial^{2} E}{\partial w_{j i} \partial w_{l k}}\)
5-3-1. Diagonal Approximation
sometimes, we are more interested in the "inverse of the Hessian"
\(\rightarrow\) therefore, interested in a "diagonal approximation" to the Hessian
( = replace the off-diagonal elements with zero(0), so the inverse is easy to calculate! )
[Step 1]
diagonal elements of Hessian : \(\frac{\partial^{2} E_{n}}{\partial w_{j i}^{2}}=\frac{\partial^{2} E_{n}}{\partial a_{j}^{2}} z_{i}^{2}\)——–(1)
( \(\because\) feed-forward : \(a_{j}=\sum_{i} w_{j i} z_{i}\))
[Step 2]
apply chain rule in (1)
\(\frac{\partial^{2} E_{n}}{\partial a_{j}^{2}}=h^{\prime}\left(a_{j}\right)^{2} \sum_{k} \sum_{k^{\prime}} w_{k j} w_{k^{\prime} j} \frac{\partial^{2} E_{n}}{\partial a_{k} \partial a_{k^{\prime}}}+h^{\prime \prime}\left(a_{j}\right) \sum_{k} w_{k j} \frac{\partial E_{n}}{\partial a_{k}}\) —— (2)
( \(\because\) \(a_{j}=\sum_{i} w_{j i} z_{i}\) and \(z_j = h(a_j)\) )
[Step 3]
neglect off-diagonal elements in (2)
\[\frac{\partial^{2} E_{n}}{\partial a_{j}^{2}}=h^{\prime}\left(a_{j}\right)^{2} \sum_{k} w_{k j}^{2} \frac{\partial^{2} E_{n}}{\partial a_{k}^{2}}+h^{\prime \prime}\left(a_{j}\right) \sum_{k} w_{k j} \frac{\partial E_{n}}{\partial a_{k}}\]
Result
- before (full Hessian) : \(O\left(W^{2}\right)\)
- after (approximation) : \(O(W)\)
( but in practice, Hessian is strongly non-diagonal )
5-3-2. Outer product approximation
(1) Regression
- error function (SSE) : \(E=\frac{1}{2} \sum_{n=1}^{N}\left(y_{n}-t_{n}\right)^{2}\)
- Hessian matrix : \(\mathbf{H}=\nabla \nabla E=\sum_{n=1}^{N} \nabla y_{n}\left(\nabla y_{n}\right)^{\mathrm{T}}+\sum_{n=1}^{N}\left(y_{n}-t_{n}\right) \nabla \nabla y_{n}\)
- if \(y_n\) and \(t_n\) are close, the second term ( = \(\sum_{n=1}^{N}\left(y_{n}-t_{n}\right) \nabla \nabla y_{n}\) ) is small
\(\rightarrow\) neglect the second term
\(\rightarrow\) \(\mathbf{H} \simeq \sum^{N}_{n=1} \mathbf{b}_{n} \mathbf{b}_{n}^{\mathrm{T}}\) , where \(\mathbf{b}_{n}=\nabla y_{n}=\nabla a_{n}\)
( called "Levenberg-Marquardt approximation", or "outer product approximation" )
(2) Classification
- error function : cross-entropy
- Hessian matrix : same procedure, again neglect the second term
\(\rightarrow\) \(\mathbf{H} \simeq \sum_{n=1}^{N} y_{n}\left(1-y_{n}\right) \mathbf{b}_{n} \mathbf{b}_{n}^{\mathrm{T}}\), where \(\mathbf{b}_{n} \equiv \nabla a_{n}\)
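A sketch of both outer-product approximations, assuming the per-example gradient vectors \(\mathbf{b}_n\) are already stacked into a matrix `B` of shape `(N, W)` ( how `B` is obtained, e.g. by back-prop, is left out ) :

```python
import numpy as np

def outer_product_hessian_regression(B):
    """H ~= sum_n b_n b_n^T  (Levenberg-Marquardt approximation)."""
    return B.T @ B

def outer_product_hessian_classification(B, y):
    """H ~= sum_n y_n (1 - y_n) b_n b_n^T  (logistic-output case)."""
    weights = y * (1.0 - y)                  # shape (N,)
    return (B * weights[:, None]).T @ B

# toy example: N = 5 data points, W = 3 weights
rng = np.random.default_rng(1)
B = rng.normal(size=(5, 3))
y = rng.uniform(0.1, 0.9, size=5)
print(outer_product_hessian_regression(B))
print(outer_product_hessian_classification(B, y))
```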
5-3-3. Inverse Hessian
use outer-product approximation, to efficiently calculate inverse of Hessian!
first, outer product approximation : \(\mathbf{H}_{N}=\sum^{N}_{n=1} \mathbf{b}_{n} \mathbf{b}_{n}^{\mathrm{T}}\)
using the first \(L\) data points, then adding the \((L+1)\)-th data point : \(\mathbf{H}_{L+1}=\mathbf{H}_{L}+\mathbf{b}_{L+1} \mathbf{b}_{L+1}^{\mathrm{T}}\)
then, its inverse will be \(\mathbf{H}_{L+1}^{-1}=\mathbf{H}_{L}^{-1}-\frac{\mathbf{H}_{L}^{-1} \mathbf{b}_{L+1} \mathbf{b}_{L+1}^{\mathrm{T}} \mathbf{H}_{L}^{-1}}{1+\mathbf{b}_{L+1}^{\mathrm{T}} \mathbf{H}_{L}^{-1} \mathbf{b}_{L+1}}\)
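A sketch of this sequential update ( an instance of the Sherman-Morrison identity ), starting from an initial \(\mathbf{H}_0 = \alpha\mathbf{I}\) so the matrix stays invertible; `B` again stacks the vectors \(\mathbf{b}_n\) :

```python
import numpy as np

def sequential_inverse_hessian(B, alpha=1e-3):
    """Build H^{-1} incrementally, where H = alpha*I + sum_n b_n b_n^T."""
    N, W = B.shape
    H_inv = np.eye(W) / alpha              # inverse of the initial H_0 = alpha * I
    for n in range(N):
        b = B[n]                           # b_{L+1}
        Hb = H_inv @ b
        H_inv = H_inv - np.outer(Hb, Hb) / (1.0 + b @ Hb)
    return H_inv

rng = np.random.default_rng(2)
B = rng.normal(size=(20, 4))
H = 1e-3 * np.eye(4) + B.T @ B
print(np.allclose(sequential_inverse_hessian(B), np.linalg.inv(H)))  # True
```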
5-3-4. Finite differences
we can find second derivatives by using finite differences!
perturb each possible pair of weights
- \[\begin{array}{l} \frac{\partial^{2} E}{\partial w_{j i} \partial w_{l k}}=\frac{1}{4 \epsilon^{2}}\left\{E\left(w_{j i}+\epsilon, w_{l k}+\epsilon\right)-E\left(w_{j i}+\epsilon, w_{l k}-\epsilon\right)\right. \\ \left.\quad-E\left(w_{j i}-\epsilon, w_{l k}+\epsilon\right)+E\left(w_{j i}-\epsilon, w_{l k}-\epsilon\right)\right\}+O\left(\epsilon^{2}\right) \end{array}\]
- \(W^2\) elements in the Hessian matrix
- each element requires 4 forward propagations, each needing \(O(W)\)
- result : \(O(W^3)\)
can do it more efficiently
- by applying central differences to the first derivatives
- \[\frac{\partial^{2} E}{\partial w_{j i} \partial w_{l k}}=\frac{1}{2 \epsilon}\left\{\frac{\partial E}{\partial w_{j i}}\left(w_{l k}+\epsilon\right)-\frac{\partial E}{\partial w_{j i}}\left(w_{l k}-\epsilon\right)\right\}+O\left(\epsilon^{2}\right)\]
- \(W^2\) elements in the Hessian matrix
- but only \(W\) weights need to be perturbed ( the first derivatives are already calculated by back-prop )
- the gradients can be evaluated in \(O(W)\) steps
- result : \(O(W^2)\)
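A sketch of the \(O(W^2)\) variant: central differences applied to a gradient function; `grad_E` here is a generic callable standing in for the back-prop gradient :

```python
import numpy as np

def hessian_by_central_differences(grad_E, w, eps=1e-5):
    """H[j, k] ~= (dE/dw_j(w + eps*e_k) - dE/dw_j(w - eps*e_k)) / (2*eps)."""
    W = len(w)
    H = np.zeros((W, W))
    for k in range(W):                      # only W perturbations needed
        e = np.zeros(W); e[k] = eps
        H[:, k] = (grad_E(w + e) - grad_E(w - e)) / (2.0 * eps)
    return 0.5 * (H + H.T)                  # symmetrize to reduce numerical noise

# toy check on E(w) = 0.5 * w^T A w, whose exact Hessian is A
A = np.array([[2.0, 0.3], [0.3, 1.0]])
grad_E = lambda w: A @ w
print(hessian_by_central_differences(grad_E, np.array([0.5, -0.2])))
```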
5-4. Regularization in Neural Networks
weight decay : \(\widetilde{E}(\mathbf{w})=E(\mathbf{w})+\frac{\lambda}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}\)
5-4-1. Consistent Gaussian priors
Limitation of simple weight decay :
- inconsistent with scaling!
Transformation of the input variables
- \(x_{i} \rightarrow \widetilde{x}_{i}=a x_{i}+b\)
- corresponding weights and biases :
\(\begin{aligned} w_{j i} \rightarrow \widetilde{w}_{j i} &=\frac{1}{a} w_{j i} \\ w_{j 0} \rightarrow \widetilde{w}_{j 0} &=w_{j 0}-\frac{b}{a} \sum_{i} w_{j i} \end{aligned}\)
Transformation of the output variables
- \(y_{k} \rightarrow \widetilde{y}_{k}=c y_{k}+d\)
- corresponding weights and biases :
\(\begin{aligned} w_{k j} \rightarrow \widetilde{w}_{k j} &=c w_{k j} \\ w_{k 0} \rightarrow \widetilde{w}_{k 0} &=c w_{k 0}+d \end{aligned}\)
Therefore, we want a regularizer which is "INVARIANT" under these linear transformations!
That is …
\(\frac{\lambda_{1}}{2} \sum_{w \in \mathcal{W}_{1}} w^{2}+\frac{\lambda_{2}}{2} \sum_{w \in \mathcal{W}_{2}} w^{2}\)
- which remains unchanged provided \(\lambda_{1} \rightarrow a^{1 / 2} \lambda_{1}\) and \(\lambda_{2} \rightarrow c^{-1 / 2} \lambda_{2}\)
It corresponds to a prior of…
\[p\left(\mathbf{w} \mid \alpha_{1}, \alpha_{2}\right) \propto \exp \left(-\frac{\alpha_{1}}{2} \sum_{w \in \mathcal{W}_{1}} w^{2}-\frac{\alpha_{2}}{2} \sum_{w \in \mathcal{W}_{2}} w^{2}\right)\]
( more generally : \(p(\mathbf{w}) \propto \exp \left(-\frac{1}{2} \sum_{k} \alpha_{k}\|\mathbf{w}\|_{k}^{2}\right)\) )
If we choose the groups to correspond to the sets of weights associated with each of the input units,
and optimize the marginal likelihood w.r.t the corresponding parameters \(\alpha_k\) :
\(\rightarrow\) ARD ( Automatic Relevance Determination ) … Chapter 7
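A small sketch of the grouped weight-decay term above, with first- and second-layer weights as the two groups \(\mathcal{W}_1, \mathcal{W}_2\) and biases excluded; the layer shapes are my own toy choices :

```python
import numpy as np

def grouped_weight_decay(W1, W2, lam1, lam2):
    """Regularizer (lam1/2) * sum_{w in W_1} w^2 + (lam2/2) * sum_{w in W_2} w^2.

    W1 / W2 : first- and second-layer weight matrices (biases excluded).
    """
    return 0.5 * lam1 * np.sum(W1 ** 2) + 0.5 * lam2 * np.sum(W2 ** 2)

rng = np.random.default_rng(3)
W1 = rng.normal(size=(4, 3))   # first-layer weights
W2 = rng.normal(size=(2, 4))   # second-layer weights
print(grouped_weight_decay(W1, W2, lam1=0.1, lam2=0.01))
```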
5-4-2. Early stopping
( stop training at the point of smallest error on a validation set → acts as a form of regularization )
5-5. Mixture Density networks
practical problems : \(p(t \mid x)\) can be "non-Gaussian" ( e.g., multimodal )
Mixture Density Networks :
use mixture model for \(p(t\mid x)\), where both
- mixing coefficients
- component densities
are flexible functions of the input vector \(x\)
\[p(\mathbf{t} \mid \mathbf{x})=\sum_{k=1}^{K} \pi_{k}(\mathbf{x}) \mathcal{N}\left(\mathbf{t} \mid \boldsymbol{\mu}_{k}(\mathbf{x}), \sigma_{k}^{2}(\mathbf{x})\right)\]
Various parameters in the mixture model
- mixture coefficients : \(\pi_{k}(\mathrm{x})=\frac{\exp \left(a_{k}^{\pi}\right)}{\sum_{l=1}^{K} \exp \left(a_{l}^{\pi}\right)}\)
- means : \(\mu_{k j}(\mathbf{x})=a_{k j}^{\mu}\)
- standard deviation : \(\sigma_{k}(\mathrm{x})=\exp \left(a_{k}^{\sigma}\right)\)
( all are governed by the NN that takes \(x\) as an input )
Same function is used to predict the parameters!
( do not calculate “predicted \(y\)” directly, instead predict the “parameters (\(\pi\) and \(\mu\) and \(\sigma\))” )
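A sketch of how the raw network activations are turned into valid mixture parameters; the layout of the output vector into \(a^{\pi}, a^{\sigma}, a^{\mu}\) blocks is my own illustrative choice :

```python
import numpy as np

def mdn_parameters(a, K, L):
    """Map raw network outputs 'a' (length K + K + K*L) to (pi, sigma, mu).

    pi_k    = softmax(a^pi)   -> mixing coefficients sum to 1
    sigma_k = exp(a^sigma)    -> standard deviations are positive
    mu_kj   = a^mu            -> means are unconstrained
    """
    a_pi, a_sigma, a_mu = a[:K], a[K:2 * K], a[2 * K:]
    a_pi = a_pi - a_pi.max()
    pi = np.exp(a_pi) / np.exp(a_pi).sum()
    sigma = np.exp(a_sigma)
    mu = a_mu.reshape(K, L)
    return pi, sigma, mu

# toy example: K = 3 mixture components, L = 2 target dimensions
a = np.random.default_rng(4).normal(size=3 + 3 + 3 * 2)
print(mdn_parameters(a, K=3, L=2))
```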
Error function ( = NLL )
\[E(\mathbf{w})=-\sum_{n=1}^{N} \ln \left\{\sum_{k=1}^{K} \pi_{k}\left(\mathbf{x}_{n}, \mathbf{w}\right) \mathcal{N}\left(\mathbf{t}_{n} \mid \boldsymbol{\mu}_{k}\left(\mathbf{x}_{n}, \mathbf{w}\right), \sigma_{k}^{2}\left(\mathbf{x}_{n}, \mathbf{w}\right)\right)\right\}\]
view the mixing coefficient \(\pi_k(x)\) as an \(x\)-dependent prior probability!
then, the posterior is :
\(\gamma_{k}(\mathbf{t} \mid \mathbf{x})=\frac{\pi_{k} \mathcal{N}_{n k}}{\sum_{l=1}^{K} \pi_{l} \mathcal{N}_{n l}}\) , where \(\mathcal{N}_{n k}\) denotes \(\mathcal{N}\left(\mathbf{t}_{n} \mid \boldsymbol{\mu}_{k}\left(\mathbf{x}_{n}\right), \sigma_{k}^{2}\left(\mathbf{x}_{n}\right)\right)\)
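Given the mixture parameters for one data point, the per-point error \(E_n\) and the responsibilities \(\gamma_k\) can be computed as below ( isotropic Gaussian components, matching the formulas above; toy values ) :

```python
import numpy as np

def mdn_nll_and_responsibilities(t, pi, sigma, mu):
    """E_n = -ln sum_k pi_k N(t | mu_k, sigma_k^2 I), and gamma_k = posterior of component k."""
    K, L = mu.shape
    sq_dist = np.sum((t - mu) ** 2, axis=1)                       # ||t - mu_k||^2
    log_N = -0.5 * L * np.log(2 * np.pi * sigma ** 2) - sq_dist / (2 * sigma ** 2)
    weighted = pi * np.exp(log_N)                                 # pi_k * N_nk
    E_n = -np.log(weighted.sum())
    gamma = weighted / weighted.sum()
    return E_n, gamma

pi = np.array([0.5, 0.3, 0.2])
sigma = np.array([0.5, 1.0, 2.0])
mu = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
print(mdn_nll_and_responsibilities(np.array([0.2, 0.1]), pi, sigma, mu))
```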
Derivatives w.r.t the output activations, governing the
- 1) mixture coefficients : \(a_{k}^{\pi}\)
\(\rightarrow\) \(\frac{\partial E_{n}}{\partial a_{k}^{\pi}}=\pi_{k}-\gamma_{k}\)
- 2) component means : \(a_{kl}^{\mu}\)
\(\rightarrow\) \(\frac{\partial E_{n}}{\partial a_{k l}^{\mu}}=\gamma_{k}\left\{\frac{\mu_{k l}-t_{l}}{\sigma_{k}^{2}}\right\}\)
- 3) component variances : \(a_{k}^{\sigma}\)
\(\rightarrow\) \(\frac{\partial E_{n}}{\partial a_{k}^{\sigma}}=-\gamma_{k}\left\{\frac{\left\|\mathbf{t}-\boldsymbol{\mu}_{k}\right\|^{2}}{\sigma_{k}^{3}}-\frac{1}{\sigma_{k}}\right\}\)
Conditional Mean and Variance
- Conditional Mean : \(\mathbb{E}[\mathbf{t} \mid \mathbf{x}]=\int \mathbf{t} p(\mathbf{t} \mid \mathbf{x}) \mathrm{d} \mathbf{t}=\sum_{k=1}^{K} \pi_{k}(\mathbf{x}) \boldsymbol{\mu}_{k}(\mathbf{x})\)
- Conditional Variance :
\[\begin{aligned} s^{2}(\mathrm{x}) &=\mathbb{E}\left[\|\mathrm{t}-\mathbb{E}[\mathrm{t} \mid \mathrm{x}]\|^{2} \mid \mathrm{x}\right] \\ &=\sum_{k=1}^{K} \pi_{k}(\mathrm{x})\left\{\sigma_{k}^{2}(\mathrm{x})+\left\|\mu_{k}(\mathrm{x})-\sum_{l=1}^{K} \pi_{l}(\mathrm{x}) \mu_{l}(\mathrm{x})\right\|^{2}\right\} \end{aligned}\]
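A sketch computing these two conditional moments from the mixture parameters of a single input ( same isotropic-Gaussian convention and toy values as above ) :

```python
import numpy as np

def mdn_conditional_moments(pi, sigma, mu):
    """E[t|x] = sum_k pi_k mu_k ;  s^2(x) = sum_k pi_k { sigma_k^2 + ||mu_k - E[t|x]||^2 }."""
    mean = pi @ mu                                           # sum_k pi_k mu_k
    spread = np.sum((mu - mean) ** 2, axis=1)                # ||mu_k - E[t|x]||^2
    variance = np.sum(pi * (sigma ** 2 + spread))
    return mean, variance

pi = np.array([0.5, 0.3, 0.2])
sigma = np.array([0.5, 1.0, 2.0])
mu = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
print(mdn_conditional_moments(pi, sigma, mu))
```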
5-6. Bayesian Neural Network
( until now, we have used MLE … from now on, MAP )
Regularized ML = MAP
but in Bayesian approach, we need to “marginalize” over the parameters
Unlike linear regression, in a multi-layered NN,
an exact Bayesian treatment cannot be found! \(\rightarrow\) Variational Inference ( Ch 10 )
Variational Inference
- 1) factorized Gaussian approximation to the posterior ( Hinton and van Camp, 1993 )
- 2) ….. full covariance Gaussian ( Barber and Bishop, 1998 )
- 3) Laplace approximation ( MacKay, 1992 )
Laplace Approximation
- (1) approximate the posterior by a Gaussian, centered at a mode of the true posterior
- (2) assume covariance of the Gaussian is so small, that network function is approximately linear!
5-6-1. Posterior parameter Distribution
Posterior \(\propto\) Prior \(\times\) Likelihood
- Prior : \(p(\mathbf{w} \mid \alpha)=\mathcal{N}\left(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}\right)\)
- Likelihood ( = NN model ) : \(p(\mathcal{D} \mid \mathbf{w}, \beta)=\prod_{n=1}^{N} \mathcal{N}\left(t_{n} \mid y\left(\mathbf{x}_{n}, \mathbf{w}\right), \beta^{-1}\right)\)
Laplace Approximation
- [step 1] find a (local) maximum of the posterior ( = mode )
\(\rightarrow\) maximize the posterior! ( = \(\mathbf{w}_{MAP}\))
\(\rightarrow\) \(\ln p(\mathbf{w} \mid \mathcal{D})=-\frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}-\frac{\beta}{2} \sum_{n=1}^{N}\left\{y\left(\mathbf{x}_{n}, \mathbf{w}\right)-t_{n}\right\}^{2}+\mathrm{const}\)
( assume \(\alpha\) and \(\beta\) are fixed )
- [step 2] build a local Gaussian approximation by evaluating the matrix of second derivatives
\(\rightarrow\) \(\mathbf{A}=-\nabla \nabla \ln p(\mathbf{w} \mid \mathcal{D}, \alpha, \beta)=\alpha \mathbf{I}+\beta \mathbf{H}\)
- \(H\): Hessian matrix, comprising the second derivatives
- Result ( from step 1 + step 2 ) :
- posterior : \(q(\mathbf{w} \mid \mathcal{D})=\mathcal{N}\left(\mathbf{w} \mid \mathbf{w}_{\mathrm{MAP}}, \mathbf{A}^{-1}\right)\)
- predictive distribution : \(p(t \mid \mathbf{x}, \mathcal{D})=\int p(t \mid \mathbf{x}, \mathbf{w}) q(\mathbf{w} \mid \mathcal{D}) \mathrm{d} \mathbf{w}\)
( this integration is still intractable…)
use Taylor Series expansion!
- Taylor series expansion of the network function around \(\mathbf{w}_{MAP}\)
( retain only linear terms! )
\[y(\mathbf{x}, \mathbf{w}) \simeq y\left(\mathbf{x}, \mathbf{w}_{\mathrm{MAP}}\right)+\mathbf{g}^{\mathrm{T}}\left(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}}\right)\]
- where \(\mathbf{g}=\nabla_{\mathbf{w}} y(\mathbf{x}, \mathbf{w})\mid_{\mathbf{w}=\mathbf{w}_{\mathrm{MAP}}}\)
With the approximation above…
we have “Gaussian” for \(p(w)\) and “Gaussian” for \(p(t \mid w)\)
\(\therefore p(t\mid x, \mathbf{w}, \beta) \simeq \mathcal{N}\left(t \mid y\left(\mathbf{x}, \mathbf{w}_{\mathrm{MAP}}\right)+\mathbf{g}^{\mathbf{T}}\left(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}}\right), \beta^{-1}\right)\) —– (1)
if we marginalize out (1) w.r.t \(w\)
\[p(t \mid \mathbf{x}, \mathcal{D}, \alpha, \beta)=\mathcal{N}\left(t \mid y\left(\mathbf{x}, \mathbf{w}_{\mathrm{MAP}}\right), \sigma^{2}(\mathbf{x})\right)\]
- where \(\sigma^{2}(\mathbf{x})=\beta^{-1}+\mathbf{g}^{\mathrm{T}} \mathbf{A}^{-1} \mathbf{g}\)
- term 1) intrinsic noise on the target variable
- term 2) uncertainty in the interpolant due to the uncertainty in the model parameter
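A sketch of this predictive variance, assuming the output gradient `g` at \(\mathbf{w}_{MAP}\) and a (toy) Hessian `H` are already available :

```python
import numpy as np

def predictive_variance(g, H, alpha, beta):
    """sigma^2(x) = 1/beta + g^T A^{-1} g, with A = alpha*I + beta*H."""
    W = len(g)
    A = alpha * np.eye(W) + beta * H
    return 1.0 / beta + g @ np.linalg.solve(A, g)

rng = np.random.default_rng(5)
g = rng.normal(size=4)                       # gradient of y(x, w) at w_MAP
M = rng.normal(size=(4, 4))
H = M.T @ M                                  # toy positive-definite Hessian
print(predictive_variance(g, H, alpha=0.5, beta=10.0))
```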
5-6-2. Hyperparameter optimization
hyperparameters : \(\alpha\) and \(\beta\)
In order to compare different models, we need to evaluate the model evidence!
Marginal Likelihood (=Evidence) for the hyperparameters :
- integrate over the weights!
\[p(\mathcal{D} \mid \alpha, \beta)=\int p(\mathcal{D} \mid \mathbf{w}, \beta) p(\mathbf{w} \mid \alpha) \mathrm{d} \mathbf{w}\]
- use the Laplace Approximation ( check Eq. (4.135) in Ch 4 )
\[\ln p(\mathcal{D} \mid \alpha, \beta) \simeq-E\left(\mathbf{w}_{\mathrm{MAP}}\right)-\frac{1}{2} \ln |\mathbf{A}|+\frac{W}{2} \ln \alpha+\frac{N}{2} \ln \beta-\frac{N}{2} \ln (2 \pi)\]
where \(E\left(\mathbf{w}_{\mathrm{MAP}}\right)=\frac{\beta}{2} \sum_{n=1}^{N}\left\{y\left(\mathbf{x}_{n}, \mathbf{w}_{\mathrm{MAP}}\right)-t_{n}\right\}^{2}+\frac{\alpha}{2} \mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}} \mathbf{w}_{\mathrm{MAP}}\)
Make point estimates for \(\alpha\) and \(\beta\), by maximizing \(\ln p(\mathcal{D} \mid \alpha, \beta)\)
- \(\beta \mathbf{H} \mathbf{u}_{i}=\lambda_{i} \mathbf{u}_{i}\) ( \(\mathbf{H}\) : Hessian matrix, comprising the second derivatives of the SSE, evaluated at \(\mathbf{w} = \mathbf{w}_{MAP}\) )
- \(\alpha=\frac{\gamma}{\mathrm{w}_{\mathrm{MAP}}^{\mathrm{T}} \mathrm{w}_{\mathrm{MAP}}}\).
- effective number of parameters : \(\gamma=\sum_{i=1}^{W} \frac{\lambda_{i}}{\alpha+\lambda_{i}}\)
\(\therefore\) maximizing the evidence w.r.t \(\beta\):
\[\frac{1}{\beta}=\frac{1}{N-\gamma} \sum_{n=1}^{N}\left\{y\left(\mathbf{x}_{n}, \mathbf{w}_{\mathrm{MAP}}\right)-t_{n}\right\}^{2}\]
alternate between
- 1) updating the posterior distribution
- 2) re-estimation of the hyperparameters \(\alpha,\beta\)
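A sketch of one re-estimation sweep, assuming `H` ( Hessian of the SSE at \(\mathbf{w}_{MAP}\) ), the MAP weights, the predictions, and the targets are already available ( toy values below ) :

```python
import numpy as np

def reestimate_hyperparameters(H, w_map, y_pred, t, alpha, beta):
    """One evidence-framework update of (alpha, beta)."""
    lam = beta * np.linalg.eigvalsh(H)                 # eigenvalues of beta*H
    gamma = np.sum(lam / (alpha + lam))                # effective number of parameters
    alpha_new = gamma / (w_map @ w_map)                # alpha = gamma / (w_MAP^T w_MAP)
    N = len(t)
    beta_new = (N - gamma) / np.sum((y_pred - t) ** 2) # 1/beta = SSE / (N - gamma)
    return alpha_new, beta_new

rng = np.random.default_rng(6)
M = rng.normal(size=(4, 4)); H = M.T @ M
w_map = rng.normal(size=4)
y_pred, t = rng.normal(size=10), rng.normal(size=10)
print(reestimate_hyperparameters(H, w_map, y_pred, t, alpha=1.0, beta=5.0))
```

In practice this update is interleaved with re-finding \(\mathbf{w}_{MAP}\), as the alternation above describes.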
5-6-3. Bayesian Neural Network for Classification
so far, we have used “Laplace Approximation” for Bayesian NN for regression.
How about classification?
example with ‘two-class classification’ ( with single logistic sigmoid )
- log likelihood : \(\ln p(\mathcal{D} \mid \mathbf{w})=\sum_{n=1}^{N}\left\{t_{n} \ln y_{n}+\left(1-t_{n}\right) \ln \left(1-y_{n}\right)\right\}\)
- no hyperparameter \(\beta\) ( \(\because\) data points are assumed to be correctly labeled )
[Step 1] initialize \(\alpha\), and find \(\mathbf{w}_{\text{MAP}}\)
- by maximizing the log posterior distribution
- = by minimizing \(E(\mathbf{w})=-\ln p(\mathcal{D} \mid \mathbf{w})+\frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}\)
[Step 2] find Hessian matrix \(\mathbf{H}\)
- second derivative of negative log posterior
Result ( from step 1 + step 2 ) :
- posterior : \(q(\mathbf{w} \mid \mathcal{D})=\mathcal{N}\left(\mathbf{w} \mid \mathbf{w}_{\mathrm{MAP}}, \mathbf{A}^{-1}\right)\)
where \(\mathbf{A}=-\nabla \nabla \ln p(\mathbf{w} \mid \mathcal{D}, \alpha)=\alpha \mathbf{I}+\mathbf{H}\) ( no \(\beta\) here )
[Step 3] optimize hyperparameter \(\alpha\)
- maximize the marginal likelihood, \(\ln p(\mathcal{D} \mid \alpha) \simeq-E\left(\mathbf{w}_{\mathrm{MAP}}\right)-\frac{1}{2} \ln \mid \mathbf{A}\mid +\frac{W}{2} \ln \alpha+\mathrm{const}\)
- \(E\left(\mathrm{w}_{\mathrm{MAP}}\right)=-\sum_{n=1}^{N}\left\{t_{n} \ln y_{n}+\left(1-t_{n}\right) \ln \left(1-y_{n}\right)\right\}+\frac{\alpha}{2} \mathrm{w}_{\mathrm{MAP}}^{\mathrm{T}} \mathrm{w}_{\mathrm{MAP}}\).
- \(y_{n} \equiv y\left(\mathrm{x}_{n}, \mathrm{w}_{\mathrm{MAP}}\right)\).
- that is, \(\alpha=\frac{\gamma}{\mathrm{w}_{\mathrm{MAP}}^{\mathrm{T}} \mathrm{w}_{\mathrm{MAP}}}\)
[Step 4] find predictive distribution ( refer to Ch4.5.2 )
- \(p(t \mid \mathbf{x}, \mathcal{D})=\int p(t \mid \mathbf{x}, \mathbf{w}) q(\mathbf{w} \mid \mathcal{D}) \mathrm{d} \mathbf{w}\) is intractable
- by Laplace approximation : \(p(t \mid \mathrm{x}, \mathcal{D}) \simeq p\left(t \mid \mathrm{x}, \mathrm{w}_{\mathrm{MAP}}\right)\)
- improve it by considering the "variance of the posterior"
- linear approximation for the output :
\[a(\mathbf{x}, \mathbf{w}) \simeq a_{\mathrm{MAP}}(\mathbf{x})+\mathbf{b}^{\mathrm{T}}\left(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}}\right)\]
where \(a_{\mathrm{MAP}}(\mathrm{x})=a\left(\mathrm{x}, \mathrm{w}_{\mathrm{MAP}}\right)\) and \(\mathrm{b} \equiv \nabla a\left(\mathrm{x}, \mathrm{w}_{\mathrm{MAP}}\right)\)
- the resulting distribution \(p(a \mid \mathbf{x}, \mathcal{D})\) is Gaussian, with
- mean : \(a_{\mathrm{MAP}} \equiv a\left(\mathrm{x}, \mathrm{w}_{\mathrm{MAP}}\right)\)
- variance : \(\sigma_{a}^{2}(\mathrm{x})=\mathrm{b}^{\mathrm{T}}(\mathrm{x}) \mathrm{A}^{-1} \mathrm{~b}(\mathrm{x})\)
[Result] approximate predictive distribution :
To obtain predictive distribution, marginalize over \(a\)
\[p(t=1 \mid \mathrm{x}, \mathcal{D})=\int \sigma(a) p(a \mid \mathrm{x}, \mathcal{D}) \mathrm{d} a\]
after approximation using the probit function …
\[p(t=1 \mid \mathbf{x}, \mathcal{D})=\sigma\left(\kappa\left(\sigma_{a}^{2}\right) a_{\mathrm{MAP}}\right)\]
where \(\kappa\left(\sigma^{2}\right)=\left(1+\pi \sigma^{2} / 8\right)^{-1 / 2}\) ( from Ch 4 )
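A sketch of this final predictive probability; `b` ( gradient of \(a\) at \(\mathbf{w}_{MAP}\) ) and `A` ( posterior precision ) are assumed given, with toy values here :

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_class_probability(a_map, b, A):
    """p(t=1 | x, D) ~= sigma( kappa(sigma_a^2) * a_MAP )."""
    sigma_a2 = b @ np.linalg.solve(A, b)          # sigma_a^2(x) = b^T A^{-1} b
    kappa = 1.0 / np.sqrt(1.0 + np.pi * sigma_a2 / 8.0)
    return sigmoid(kappa * a_map)

rng = np.random.default_rng(7)
b = rng.normal(size=4)
M = rng.normal(size=(4, 4)); A = M.T @ M + np.eye(4)
print(predictive_class_probability(a_map=1.3, b=b, A=A))
# note: kappa < 1 moderates the prediction towards 0.5,
# compared with the plain MAP prediction sigma(a_MAP)
```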