[Paper Review] VI/BNN/NF papers 1~10

I have summarized the must-read + advanced papers regarding…

  • various methods using Variational Inference

  • Bayesian Neural Networks

  • Probabilistic Deep Learning

  • Normalizing Flows

01. A Practical Bayesian Framework for Backpropagation Networks

MacKay, D. J. (1992)

( download paper here : Download )

  • how the Bayesian framework can be applied to neural networks
  • loss function = train loss + regularizer ( written out after this list )
    • train loss : for a good fit to the data
    • regularizer : for generalization
  • summary : Download
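
For reference, the regularized objective can be written out as follows; this is a sketch in notation close to MacKay’s, where \(\alpha\) and \(\beta\) are the hyperparameters that the Bayesian framework re-estimates from the evidence:

\[ M(w) = \alpha E_W(w) + \beta E_D(w), \qquad E_D(w) = \tfrac{1}{2}\sum_n \bigl( y(x_n; w) - t_n \bigr)^2, \qquad E_W(w) = \tfrac{1}{2}\sum_i w_i^2 \]

Here \(E_D\) is the train loss (data misfit) and \(E_W\) is the regularizer (weight penalty).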



02. Bayesian Learning for Neural Networks (1)

Neal, R. M. (2012)

( download paper here : Download )

  • Bayesian view of NN : find the predictive distribution by “integration” over the weights, rather than “maximization”
  • BNN : not only a single best guess, but also “UNCERTAINTY”
  • MAP (maximum a posteriori) estimation : acts as a PENALIZED likelihood ( the penalty term comes from the “prior” )
  • BNN = automatic Occam’s Razor
  • ARD (Automatic Relevance Determination model)
    • limits the number of relevant input variables “automatically”
    • each input variable has its own hyperparameter that controls the scale of its weights
  • review of MCMC
    • MC approximation : unbiased estimate of the predictive integral ( see the sketch after this list )
    • MH algorithm, Gibbs sampling, …
    • Hybrid Monte Carlo : auxiliary ‘momentum’ variable
  • summary : Download
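
To make the “integration, not maximization” point concrete, here is a minimal sketch (my own toy example, not code from the book) of a Monte Carlo approximation of the predictive mean from posterior weight samples:

```python
import numpy as np

def predictive_mean(x_star, weight_samples, net):
    """Monte Carlo approximation of E[f(x*) | D] from posterior weight samples.

    weight_samples : list of weight vectors drawn from p(w | D) (e.g. by MCMC)
    net            : callable net(x, w) -> prediction
    """
    preds = np.array([net(x_star, w) for w in weight_samples])
    return preds.mean(axis=0)   # unbiased MC estimate of the predictive integral

# toy usage: a "network" that is just a linear model y = w[0] * x + w[1]
toy_net = lambda x, w: w[0] * x + w[1]
samples = [np.random.randn(2) for _ in range(1000)]   # stand-in for MCMC samples
print(predictive_mean(2.0, samples, toy_net))
```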



02. Bayesian Learning for Neural Networks (2)

  • priors over weights? obscure / hard to interpret in a NN
  • Infinite network = ‘non-parametric model’
    • “a network with 1 hidden layer and an INFINITE number of units is a universal approximator”
    • “converges to a GP (Gaussian Process)” ( see the toy illustration after this list )
  • summary : Download
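
A toy illustration of the second point (my own sketch, assuming independent standard Gaussian priors and the \(1/\sqrt{H}\) output-weight scaling used in Neal’s argument): as the number of hidden units \(H\) grows, the prior distribution of the output at a fixed input approaches a Gaussian with a stable variance.

```python
import numpy as np

def random_net_output(x, H, rng):
    """Output of a 1-hidden-layer tanh network with random Gaussian weights.
    Output weights are scaled by 1/sqrt(H) so the prior variance stays finite."""
    W_in = rng.normal(size=(H, x.shape[0]))   # input-to-hidden weights
    b_in = rng.normal(size=H)                 # hidden biases
    v = rng.normal(size=H) / np.sqrt(H)       # scaled hidden-to-output weights
    return v @ np.tanh(W_in @ x + b_in)

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])
for H in (1, 10, 1000):
    outs = np.array([random_net_output(x, H, rng) for _ in range(5000)])
    # the distribution of f(x) under the prior approaches a Gaussian as H grows
    print(H, outs.mean(), outs.std())
```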



02. Bayesian Learning for Neural Networks (3)

  • Hamiltonian Monte Carlo (HMC)
    • draw auxiliary momentum variable
    • calculate derivative
    • leapfrog integrator
    • Metropolis acceptance step
  • HMC is argued to be the most promising MC method ( a minimal sketch follows this list )
  • summary : Download
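
A minimal numpy sketch of one HMC transition following the four steps above (a toy illustration on a Gaussian target, not Neal’s implementation):

```python
import numpy as np

def hmc_step(w, log_prob, grad_log_prob, eps=0.1, n_leapfrog=20, rng=None):
    """One Hamiltonian Monte Carlo transition (toy sketch)."""
    rng = rng or np.random.default_rng()
    p = rng.normal(size=w.shape)                   # 1) draw auxiliary momentum
    w_new, p_new = w.copy(), p.copy()
    p_new += 0.5 * eps * grad_log_prob(w_new)      # 2)-3) leapfrog integration
    for _ in range(n_leapfrog - 1):
        w_new += eps * p_new
        p_new += eps * grad_log_prob(w_new)
    w_new += eps * p_new
    p_new += 0.5 * eps * grad_log_prob(w_new)
    # 4) Metropolis acceptance on the change in total energy (potential + kinetic)
    log_accept = (log_prob(w_new) - 0.5 * p_new @ p_new) - (log_prob(w) - 0.5 * p @ p)
    return w_new if np.log(rng.uniform()) < log_accept else w

# usage on a standard Gaussian target
log_prob = lambda w: -0.5 * w @ w
grad_log_prob = lambda w: -w
w, samples = np.zeros(2), []
for _ in range(1000):
    w = hmc_step(w, log_prob, grad_log_prob)
    samples.append(w)
print(np.mean(samples, axis=0), np.std(samples, axis=0))
```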



03. Keeping Neural Networks Simple by Minimizing the Description Length of the Weights

Hinton, G. E., & Van Camp, D. (1993)

( download paper here : Download )

  • MDL (Minimum Description Length) Principle

    • a NN generalizes better if “LESS information” is contained in the weights

    • loss(cost) = 1) + 2)

      • loss 1) cost for describing the model ( = weight penalty )
      • loss 2) cost for describing the misfit between the model & data ( = train loss )
    • bits back argument

      • step 1) sender collapses \(Q(w)\) to a single weight vector by drawing from it with random bits
      • step 2) sender sends each sampled weight ( coded under the prior ) and sends the data misfits
      • step 3) receiver, given the correct outputs & misfits, can reconstruct the same \(Q(w)\) and recover the random bits ( the “bits back” )
      • step 4) this gives the true expected description length for a noisy weight
    • the start of Variational Inference…? ( see the expression after this list )

  • summary : Download
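
One way to see why this is arguably the start of variational inference: after the bits-back refund, the expected description length of the noisy weights plus the data misfit is (my paraphrase, with \(P\) the prior and \(Q\) the sender’s distribution over weights)

\[ \underbrace{\mathrm{KL}\bigl( Q(w) \,\|\, P(w) \bigr)}_{\text{cost of the weights, after bits back}} \;+\; \underbrace{\mathbb{E}_{Q(w)}\bigl[ -\log p(D \mid w) \bigr]}_{\text{cost of the data misfit}} \]

which is exactly the negative ELBO that the later variational papers minimize.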



04. Practical Variational Inference for Neural Networks

Graves, A. (2011)

( download paper here : Download )

  • introduce “Stochastic Variational method”

  • Key point

    • 1) instead of analytical solutions, use numerical integration
    • 2) stochastic method for VI with a diagonal Gaussian posterior
  • takes a view of MDL

    ( the ELBO (variational free energy) can be viewed through the MDL principle! )

    • ELBO : error ( data-fit ) loss + complexity loss
    • MDL : cost of transmitting the data misfit given the model + cost of transmitting the model
  • Diagonal Gaussian posterior

    • each weight requires a separate mean & variance
    • cannot compute the derivatives of the loss function ( −ELBO ) analytically… use MC integration ( a sketch follows this list )
  • summary : Download
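
A minimal sketch of the stochastic method (my own reparameterised-style version, not Graves’ exact gradient estimator; the single Gaussian prior with std `prior_sigma` is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def neg_elbo_mc(mu, log_sigma, nll, prior_sigma=1.0, n_samples=8):
    """MC estimate of the negative ELBO with a diagonal Gaussian posterior.

    mu, log_sigma : variational mean and log-std for each weight
    nll           : callable nll(w) -> negative log-likelihood of the data
    """
    sigma = np.exp(log_sigma)
    # complexity term: KL( N(mu, sigma^2) || N(0, prior_sigma^2) ), closed form
    kl = np.sum(np.log(prior_sigma / sigma)
                + (sigma**2 + mu**2) / (2 * prior_sigma**2) - 0.5)
    # error term: expected negative log-likelihood, estimated by sampling weights
    err = np.mean([nll(mu + sigma * rng.normal(size=mu.shape))
                   for _ in range(n_samples)])
    return err + kl

# toy usage: the "data misfit" is just a quadratic in the weights
mu, log_sigma = np.zeros(3), np.zeros(3)
print(neg_elbo_mc(mu, log_sigma, nll=lambda w: 0.5 * np.sum((w - 1.0)**2)))
```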



05. Ensemble Learning in Bayesian Neural Networks

Barber, D., & Bishop, C. M. (1998)

( download paper here : Download )

  • Bayesian for NN : 3 approaches

    • 1) Gaussian Approximation
      • known as Laplace’s method
        • centered at the mode of \(p(w\mid D)\)
    • 2) MCMC
      • generate samples from the posterior
      • computationally expensive
      • ex) HMC
    • 3) Ensemble Learning
      • unlike Laplace’s method, the approximation is fitted globally ( not just around a mode )
  • Ensemble Learning

    • use the ELBO / Variational Free Energy ( written out after this list )

    • minimize the KL divergence between \(Q\) and the true posterior

    • choice of \(Q\) ( approximating distribution )

      • should be close to true posterior
      • analytically tractable integration
    • original) diagonal-covariance Gaussian ( Hinton and van Camp, 1993 )
    • proposed) full-covariance Gaussian

  • summary : Download
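
For reference, the variational free energy being minimized, in standard form (notation mine):

\[ \mathcal{F}(Q) = \int Q(w)\,\ln\frac{Q(w)}{p(D \mid w)\,p(w)}\,dw = \mathrm{KL}\bigl( Q(w) \,\|\, p(w \mid D) \bigr) - \ln p(D) \]

Since \(\ln p(D)\) does not depend on \(Q\), minimizing \(\mathcal{F}\) over the chosen family is the same as minimizing the KL divergence to the true posterior.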



06. Weight Uncertainty in Neural Networks

Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D. (2015)

( download paper here : Download )

  • Bayes by Backprop

    • regularizes the weights by minimizing a compression cost
    • comparable performance to dropout
    • uncertainty can be used to improve generalization
    • exploration-exploitation in RL
  • BBB : instead of a single network, train an “ensemble of networks” ( a minimal sketch follows this list )

    ( each network has its weights drawn from a shared distribution )

  • previous works

    • early stopping, weight decay, dropout ….
  • summary : Download
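
A minimal Bayes-by-Backprop layer in the reparameterised style (a sketch under simplifying assumptions: a single Gaussian prior instead of the paper’s scale-mixture prior, one MC sample per step, biases omitted, and a squared error standing in for the negative log-likelihood):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesLinear(nn.Module):
    """Linear layer whose weights are sampled from a diagonal Gaussian posterior."""
    def __init__(self, n_in, n_out, prior_sigma=1.0):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n_out, n_in))
        self.rho = nn.Parameter(torch.full((n_out, n_in), -3.0))  # sigma = softplus(rho)
        self.prior_sigma = prior_sigma

    def forward(self, x):
        sigma = F.softplus(self.rho)
        w = self.mu + sigma * torch.randn_like(sigma)   # reparameterised weight sample
        # KL( N(mu, sigma^2) || N(0, prior_sigma^2) ), summed over weights
        self.kl = (torch.log(self.prior_sigma / sigma)
                   + (sigma**2 + self.mu**2) / (2 * self.prior_sigma**2) - 0.5).sum()
        return x @ w.t()

# usage: compression cost = data misfit + KL, minimized by ordinary backprop
layer = BayesLinear(2, 1)
x, y = torch.randn(16, 2), torch.randn(16, 1)
opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    loss = F.mse_loss(layer(x), y, reduction='sum') + layer.kl
    loss.backward()
    opt.step()
```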



07. Expectation Propagation for Approximate Bayesian Inference

Minka, T. P. (2013)

( download paper here : Download )

  • Expectation Propagation

    • EP = ADF + Loopy belief propagation

      ( ADF (assumed-density filtering), also known as online Bayesian learning, moment matching, weak marginalization… )

    • ADF : a one-pass, sequential method for computing an approximate posterior ; EP refines it by revisiting each term iteratively

  • novel interpretation of ADF

    • original ADF) sequentially absorb each exact observation term \(t_i\) into the approximate posterior
    • new interpretation) exact updates with approximate terms \(\tilde{t}_i\) ( = ratio of new & old posterior ) ( the update is written out after this list )
  • summary : Download
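
The resulting EP refinement loop for each approximate term \(\tilde{t}_i\), in standard notation where \(\operatorname{proj}[\cdot]\) denotes moment matching within the chosen approximating family:

\[ q^{\setminus i}(w) \propto \frac{q(w)}{\tilde{t}_i(w)}, \qquad q^{\mathrm{new}}(w) = \operatorname{proj}\!\bigl[ q^{\setminus i}(w)\, t_i(w) \bigr], \qquad \tilde{t}_i(w) \propto \frac{q^{\mathrm{new}}(w)}{q^{\setminus i}(w)} \]

Unlike one-pass ADF, EP keeps cycling through the terms until the \(\tilde{t}_i\) converge.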



08. Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks

Hernández-Lobato, J. M., & Adams, R. (2015, June)

( download paper here : Download )

  • Probabilistic NN

  • disadvantages of backpropagation

    • problem 1) have to tune large numbers of hyperparameters

    • problem 2) lack of calibrated probabilistic predictions

    • problem 3) tendency to overfit

      \(\rightarrow\) solved by a Bayesian approach, with PBP!

  • PBP ( Probabilistic Backpropagation )

    • scalable method for learning BNN

    • [step 1] forward propagation of probabilities ( sketched after this list )

      [step 2] backward computation of gradients

    • provides accurate estimates of the posterior variance

  • PBP solves problem 1)~3) by

    • 1) hyperparameters are inferred automatically ( as part of the posterior, rather than tuned by hand )
    • 2) predictions account for uncertainty, giving calibrated probabilistic outputs
    • 3) predictions average over parameter values instead of committing to a single overfit estimate
  • summary : Download
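
A sketch of what “forward propagation of probabilities” means for a single linear layer (my own toy version; PBP additionally handles the ReLU nonlinearity and the layer scaling analytically, and this ignores covariances between units):

```python
import numpy as np

def linear_gaussian_forward(m_x, v_x, m_w, v_w):
    """Propagate an independent Gaussian input through a layer with Gaussian weights.

    m_x, v_x : mean / variance of the input activations
    m_w, v_w : mean / variance of the weights, shape (n_out, n_in)
    Returns the mean and variance of the pre-activations.
    """
    m_z = m_w @ m_x
    # Var(sum_i w_i x_i) for independent w_i, x_i:
    #   sum_i [ v_w*v_x + v_w*m_x^2 + m_w^2*v_x ]
    v_z = v_w @ v_x + v_w @ (m_x**2) + (m_w**2) @ v_x
    return m_z, v_z

m_x, v_x = np.array([1.0, -0.5]), np.array([0.1, 0.2])
m_w, v_w = np.ones((3, 2)), 0.05 * np.ones((3, 2))
print(linear_gaussian_forward(m_x, v_x, m_w, v_w))
```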



09. Priors For Infinite Networks

Neal, R. M. (1994)

( part of 02. Bayesian Learning for Neural Networks (2) )

( download paper here : Download )

  • infinite network = non-parametric model
  • priors over functions reach reasonable limits as the number of hidden units in the network goes to infinity!
  • summary : Download



10. Computing with Infinite Networks

Williams, C. K. (1997)

( download paper here : Download )

  • when the number of hidden units \(H \rightarrow \infty\), the network prior is the same as a GP (Neal, 1994)

  • Neal (1994) : infinite NN = GP, but does not give the covariance function

    this paper :

    • for certain weight priors (Gaussian) and transfer functions (sigmoidal, Gaussian) in the NN

      \(\rightarrow\) the covariance function of the GP can be calculated analytically! ( the sigmoidal case is reproduced after this list )

  • summary : Download
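
For example, for the sigmoidal (erf) transfer function with a zero-mean Gaussian prior of covariance \(\Sigma\) on the input-to-hidden weights and biases, writing the augmented input as \(\tilde{x} = (1, x_1, \dots, x_d)^{\top}\), the resulting covariance function is (a sketch of the erf case, with the overall scale from the output-weight variance omitted):

\[ C_{\mathrm{erf}}(x, x') = \frac{2}{\pi}\,\sin^{-1}\!\left( \frac{2\,\tilde{x}^{\top} \Sigma\, \tilde{x}'}{\sqrt{\bigl(1 + 2\,\tilde{x}^{\top} \Sigma\, \tilde{x}\bigr)\bigl(1 + 2\,\tilde{x}'^{\top} \Sigma\, \tilde{x}'\bigr)}} \right) \]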