[Paper Review] VI/BNN/NF paper 1~10
I have summarized the must-read and advanced papers on the following topics:
- various methods using Variational Inference
- Bayesian Neural Networks
- Probabilistic Deep Learning
- Normalizing Flows
01. A Practical Bayesian Framework for Backpropagation Networks
MacKay, D. J. (1992)
( download paper here : Download )
- how the Bayesian framework can be applied to neural networks
- loss function = train loss + regularizer
- train loss : for a good fit to the data
- regularizer : for generalization
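In MacKay's notation the regularized objective is, roughly (a sketch; \(\alpha, \beta\) are the hyperparameters the evidence framework infers):

\[ M(\mathbf{w}) = \beta E_D(\mathbf{w}) + \alpha E_W(\mathbf{w}), \qquad E_W(\mathbf{w}) = \tfrac{1}{2} \sum_i w_i^2 \]

where \(E_D\) is the train loss and \(E_W\) the regularizer; interpreting \(\exp(-M)\) as a posterior lets \(\alpha\) and \(\beta\) be set by maximizing the evidence instead of by cross-validation.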
- summary : Download
02. Bayesian Learning for Neural Networks (1)
Neal, R. M. (2012)
( download paper here : Download )
- Bayesian view of NNs : find the predictive distribution by "integration", rather than "maximization" ( see the formulas at the end of this section )
- BNN : not only a single guess, but also "UNCERTAINTY"
- MAP (maximum a posteriori) : acts as a PENALIZED likelihood ( penalty term from the "prior" )
- BNN = automatic Occam's Razor
- ARD (Automatic Relevance Determination)
- limits the number of input variables "automatically"
- each input variable has its own hyperparameter that controls its weights
- review of MCMC
- MC approximation : unbiased estimate
- Metropolis–Hastings algorithm, Gibbs sampling, …
- Hybrid Monte Carlo : auxiliary "momentum" variable
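The two key formulas above, written out (a sketch in the paper's spirit; \(\alpha_i\) is the ARD precision for input \(i\)):

\[ p(y \mid x, D) = \int p(y \mid x, w)\, p(w \mid D)\, dw, \qquad w_{ij} \sim \mathcal{N}(0, \alpha_i^{-1}) \]

Prediction integrates over the posterior instead of maximizing it, and if the inferred \(\alpha_i\) grows large, all weights out of input \(i\) are pushed toward zero, pruning that input automatically.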
- summary : Download
02. Bayesian Learning for Neural Networks (2)
- priors on weights? obscure in NNs
- infinite network = "non-parametric model"
- "a network with one hidden layer and an INFINITE number of units is a universal approximator"
- "converges to a GP (Gaussian Process)"
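Neal's argument in one line (a sketch; \(h_j\) are the hidden-unit outputs and the output-weight variance is scaled by \(1/H\)):

\[ f(x) = b + \sum_{j=1}^{H} v_j\, h_j(x), \quad v_j \sim \mathcal{N}\!\left(0, \tfrac{\sigma_v^2}{H}\right) \ \Rightarrow\ f \xrightarrow{\,H \to \infty\,} \text{Gaussian Process} \]

since \(f(x)\) is a sum of \(H\) i.i.d. bounded terms, the central limit theorem gives jointly Gaussian outputs for any finite set of inputs, i.e. a GP.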
- summary : Download
02. Bayesian Learning for Neural Networks (3)
- Hamiltonian Monte Carlo (HMC)
- 1) draw an auxiliary momentum variable
- 2) calculate the derivative (gradient of the log posterior)
- 3) leapfrog integrator
- 4) Metropolis acceptance step
- HMC is the most promising MC method ( see the sketch below )
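A minimal NumPy sketch of one HMC transition, matching the four steps above (illustrative, not Neal's implementation; step size and trajectory length are arbitrary here):

```python
import numpy as np

def hmc_step(w, log_prob, grad_log_prob, step_size=0.1, n_leapfrog=20,
             rng=np.random.default_rng()):
    """One HMC transition: momentum draw -> leapfrog -> Metropolis accept."""
    p = rng.standard_normal(w.shape)                 # 1) auxiliary momentum
    w_new, p_new = w.copy(), p.copy()
    p_new += 0.5 * step_size * grad_log_prob(w_new)  # 2)+3) leapfrog steps
    for _ in range(n_leapfrog - 1):
        w_new += step_size * p_new
        p_new += step_size * grad_log_prob(w_new)
    w_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_prob(w_new)
    # 4) Metropolis acceptance on the Hamiltonian H = -log pi(w) + |p|^2 / 2
    h_old = -log_prob(w) + 0.5 * p @ p
    h_new = -log_prob(w_new) + 0.5 * p_new @ p_new
    return w_new if rng.uniform() < np.exp(h_old - h_new) else w

# toy posterior over two weights: standard Gaussian
log_prob = lambda w: -0.5 * w @ w
grad_log_prob = lambda w: -w
w, samples = np.zeros(2), []
for _ in range(1000):
    w = hmc_step(w, log_prob, grad_log_prob)
    samples.append(w)
```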
- summary : Download
03. Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
Hinton, G. E., & Van Camp, D. (1993)
( download paper here : Download )
- MDL (Minimum Description Length) principle
- NNs generalize better if "LESS information" is in the weights
- loss (cost) = 1) + 2)
- loss 1) cost for describing the model ( = weight penalty )
- loss 2) cost for describing the misfit between model & data ( = train loss )
- bits-back argument
- step 1) sender collapses the weights to a sample drawn from \(Q(w)\)
- step 2) sender sends each sampled weight and the data misfits
- step 3) receiver recovers the exact same posterior \(Q(w)\) from the correct outputs & misfits
- step 4) calculate the true expected description length for a noisy weight
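Written out (a sketch), the expected description length in step 4 is exactly what later papers call the negative ELBO, with prior \(P(w)\):

\[ \mathbb{E}_{Q(w)}\!\left[-\log p(D \mid w)\right] + \mathrm{KL}\!\left(Q(w)\,\|\,P(w)\right) \]

the first term is the misfit cost ( loss 2 ), the second is the model cost after the bits-back refund ( loss 1 ).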
- the start of Variational Inference…?
summary : Download
04. Practical Variational Inference for Neural Networks
Graves, A. (2011)
( download paper here : Download )
- introduces a "Stochastic Variational method"
- key points
- 1) instead of analytical solutions, use numerical integration
- 2) a stochastic method for VI with a diagonal Gaussian posterior
- takes the MDL view
( the ELBO (variational free energy) can be viewed through the MDL principle! )
- ELBO : error (entropy) loss + complexity loss
- MDL : cost of transmitting the data misfit + cost of transmitting the model
- diagonal Gaussian posterior
- each weight requires a separate mean & variance
- the derivative of the loss function (−ELBO) cannot be computed analytically…. use MC integration, as sketched below
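A minimal sketch of that MC integration with a diagonal Gaussian posterior and a standard-normal prior (note: Graves derives derivative estimators directly; the reparameterized form below is the now-common variant, and all names here are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def neg_elbo_mc(mu, rho, log_lik, n_samples=10):
    """MC estimate of -ELBO = KL(Q || P) - E_Q[log p(D|w)] for a diagonal
    Gaussian posterior Q = N(mu, sigma^2) and a standard-normal prior P."""
    sigma = np.log1p(np.exp(rho))          # softplus keeps sigma positive
    # KL between diagonal Gaussians is available in closed form
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))
    # the expected log-likelihood is not -> estimate it by MC integration
    eps = rng.standard_normal((n_samples, mu.size))
    w_samples = mu + sigma * eps           # one sampled weight vector per row
    exp_ll = np.mean([log_lik(w) for w in w_samples])
    return kl - exp_ll

# toy likelihood: weights should sit near 1.5 under unit Gaussian noise
log_lik = lambda w: -0.5 * np.sum((w - 1.5) ** 2)
print(neg_elbo_mc(mu=np.zeros(3), rho=np.zeros(3), log_lik=log_lik))
```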
summary : Download
05. Ensemble Learning in Bayesian Neural Networks
Barber, D., & Bishop, C. M. (1998)
( download paper here : Download )
- Bayesian inference for NNs : 3 approaches
- 1) Gaussian approximation
- known as Laplace's method
- centered at the mode of \(p(w\mid D)\)
- 2) MCMC
- generate samples from the posterior
- computationally expensive
- ex) HMC
- 3) Ensemble learning
- unlike Laplace's method, fitted globally
- Ensemble Learning
- uses the ELBO / variational free energy
- minimizes the KL divergence ( see the identity below )
- choice of \(Q\) ( approximating distribution )
- should be close to the true posterior
- analytically tractable integration
- original) diagonal covariance ( Hinton and van Camp, 1993 )
proposed) full covariance
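The identity behind "minimize KL = maximize the bound" (a standard sketch, not specific to this paper's notation):

\[ \log p(D) = \underbrace{\mathbb{E}_{Q}\!\left[\log p(D, w)\right] - \mathbb{E}_{Q}\!\left[\log Q(w)\right]}_{\text{ELBO / negative variational free energy}} + \mathrm{KL}\!\left(Q(w)\,\|\,p(w \mid D)\right) \]

Since \(\log p(D)\) is a constant, minimizing the KL term to the true posterior is exactly maximizing the ELBO, and a full-covariance \(Q\) can drive that KL lower than a diagonal one.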
summary : Download
06. Weight Uncertainty in Neural Networks
Blundell, C., et al. (2015)
( download paper here : Download )
- Bayes by Backprop (BBB)
- regularizes weights by minimizing a compression cost
- comparable performance to dropout
- uncertainty can be used to improve generalization
- exploration–exploitation in RL
- BBB : instead of a single network, train "ensembles of networks"
( each network has its weights drawn from a shared distribution )
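The cost BBB minimizes, with the reparameterization that lets ordinary backprop train \(\theta = (\mu, \rho)\) ( \(\varepsilon\) is unit Gaussian noise; a sketch following the paper's formulation ):

\[ \mathcal{F}(D, \theta) = \mathrm{KL}\!\left(q(w \mid \theta)\,\|\,p(w)\right) - \mathbb{E}_{q(w \mid \theta)}\!\left[\log p(D \mid w)\right], \qquad w = \mu + \log(1 + e^{\rho}) \odot \varepsilon \]

Sampling a fresh \(\varepsilon\) per step is what makes the trained model an "ensemble": every forward pass draws a different network from \(q(w \mid \theta)\).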
- previous works : early stopping, weight decay, dropout ….
summary : Download
07. Expectation Propagation for Approximate Bayesian Inference
Minka, T. P. (2013)
( download paper here : Download )
- Expectation Propagation (EP)
- EP = ADF + loopy belief propagation
( ADF = online Bayesian learning, moment matching, weak marginalization… )
- ADF : a one-pass, sequential method for computing an approximate posterior
- novel interpretation of ADF
- original ADF) approximates the posterior that includes each observation term \(t_i\)
- new interpretation) uses an exact posterior form with approximate terms \(\tilde{t}_i\) ( = ratio of new & old posteriors )
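One EP iteration in formulas (a standard sketch of the scheme, not the paper's exact notation):

\[ q^{\setminus i}(\theta) \propto \frac{q(\theta)}{\tilde{t}_i(\theta)}, \qquad q^{\text{new}}(\theta) = \operatorname{proj}\!\left[\, q^{\setminus i}(\theta)\, t_i(\theta) \,\right], \qquad \tilde{t}_i(\theta) \propto \frac{q^{\text{new}}(\theta)}{q^{\setminus i}(\theta)} \]

remove one approximate term \(\tilde{t}_i\) (the "cavity"), put the exact term \(t_i\) back in, moment-match (\(\operatorname{proj}\)) to the approximating family, and refresh \(\tilde{t}_i\); unlike one-pass ADF, EP revisits the terms until convergence.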
summary : Download
08. Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks
Hernández-Lobato, J. M., & Adams, R. (2015, June)
( download paper here : Download )
- Probabilistic NN
- disadvantages of backpropagation
- problem 1) large numbers of hyperparameters to tune
- problem 2) lack of calibrated probabilistic predictions
- problem 3) tendency to overfit
\(\rightarrow\) solved by a Bayesian approach, with PBP!
- PBP ( Probabilistic Backpropagation )
- scalable method for learning BNNs
- [step 1] forward propagation of probabilities ( see the sketch at the end of this section )
- [step 2] backward computation of gradients
- provides accurate estimates of the posterior variance
- PBP solves problems 1)~3) by
- 1) automatically inferring hyperparameters ( by marginalizing them out of the posterior )
- 2) accounting for uncertainty
- 3) averaging over parameter values
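A minimal sketch of [step 1], propagating means and variances through one linear layer under the usual independence assumptions (the paper additionally scales by the fan-in and pushes moments through the ReLU analytically; names here are mine):

```python
import numpy as np

def linear_moment_prop(m_w, v_w, m_a, v_a):
    """Moment-match the output z = W a of a linear layer, where the weights
    W are independent Gaussians (means m_w, variances v_w) and the inputs a
    are independent with means m_a and variances v_a."""
    m_z = m_w @ m_a
    v_z = (m_w**2) @ v_a + v_w @ (m_a**2) + v_w @ v_a
    return m_z, v_z

# toy layer: 2 outputs, 3 inputs, unit-variance weight posteriors
m_w = np.array([[0.5, -0.2, 0.1],
                [0.0,  0.3, -0.4]])
v_w = np.ones_like(m_w)
m_z, v_z = linear_moment_prop(m_w, v_w,
                              m_a=np.array([1.0, 2.0, 0.5]),
                              v_a=np.zeros(3))
print(m_z, v_z)   # deterministic input: all output variance comes from W
```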
summary : Download
09. Priors For Infinite Networks
Neal, R. M. (1994)
( part of 2. Bayesian Learning for Neural Networks (2) )
( download paper here : Download )
- infinite network = non-parametric model
- priors over functions reach reasonable limits as the number of hidden units in the network goes to infinity!
- summary : Download
10. Computing with Infinite Networks
Williams, C. K. (1997)
( download paper here : Download )
- when the number of hidden units \(H \rightarrow \infty\), the network is equivalent to a GP (Neal, 1994)
- Neal (1994) : infinite NN = GP, but does not give the covariance function
- this paper : for certain weight priors (Gaussian) and transfer functions (sigmoidal, Gaussian) in NNs,
\(\rightarrow\) the covariance function of the GP can be calculated analytically!
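For instance, for the sigmoidal (erf) transfer function with Gaussian weight prior covariance \(\Sigma\) and augmented input \(\tilde{x} = (1, x_1, \ldots, x_d)^\top\), the paper's result is (up to notation) the arcsine kernel:

\[ V_{\mathrm{erf}}(x, x') = \frac{2}{\pi} \sin^{-1}\! \frac{2\, \tilde{x}^\top \Sigma\, \tilde{x}'}{\sqrt{\left(1 + 2\, \tilde{x}^\top \Sigma\, \tilde{x}\right)\left(1 + 2\, \tilde{x}'^\top \Sigma\, \tilde{x}'\right)}} \]

a closed-form GP covariance, so regression with an infinite network can be carried out exactly with standard GP machinery.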
summary : Download