Liberty or Depth: Deep Bayesian Neural Nets Do Not Need Complex Weight Posterior Approximations ( NeurIPS 2020 )


Abstract

MFVI ( Mean-Field Variational Inference )

  • severely restrictive!
  • BUT NOT in the case of DEEP networks!

Proves that “deep mean-field variational weight posteriors” can induce distributions in function-space similar to those induced by shallower networks with complex weight posteriors


1. Introduction

VI in BNNs…ex) MFVI

MFVI : often considered a severe limitation ( \(\because\) it cannot capture correlations between weights )


But what if we don't use MFVI…? the computation becomes too heavy

  • ex) Structured covariance methods : bad time complexity


mean-field BNNs ( Wu et al. 2019 )


Argues that for larger, deeper networks, the mean-field approximation matters less!

  • 1) simple parametric functions need complicated weight-distns to induce rich distns in function-space
  • 2) complicated parametric functions can induce the same function-space distns with simple weight-distns


Two ways to have an expressive approximate posterior predictive

  • 1) simple likelihood & rich approximate posterior over weights \(q(\theta)\)

  • 2) simple \(q(\theta)\) & rich likelihood

    ( ex. deeper model mapping \(x\) to \(y\) )


2 hypotheses

  • 1) Weight Distribution hypothesis

    • for BNN with full-cov weight distn,

      there exists a deeper BNN with a “mean-field” weight distn,

      which induces the same distn over function values

  • 2) True Posterior hypothesis

    • for sufficiently deep & wide BNN

      there exists mean-field distn over the weights of that BNN,

      which induces the same distn over function values as that induced by the posterior predictive

  • hypothesis 1) suggests that “shallow complex-covariance” = “deeper mean-field”

    ( shows that the Matrix Variate Gaussian distn is a special case of a 3-layer product matrix distn )

    ( thus, allowing MFVI to model rich covariances )

  • hypothesis 2) states that mean-field weight distns can approximate the true predictive posterior


2. Related Work

(1) MacKay (1992)

  • argued that using the MF approximation for Bayesian inference in NNs is a severe limitation
  • “diagonal approximation” is no good, because of strong posterior correlations!

(2) Barber and Bishop (1998)

  • “full-covariance VI”
  • poor time complexity

(3) etc

  • structured-covariance approximations…

    ( still have unattractive time complexity )


But all of this analysis was for shallow NNs; there has been no such work on DNNs!


3. Emergence of Complex Covariance in Deep Linear Mean-Field Networks

1) Weight Distribution hypothesis is TRUE in linear networks


(1) Defining a Product matrix

  • activation function of the NN : \(\phi(\cdot)\) ( the identity, since the model here is linear )
  • the \(L\) weight matrices of a deep LINEAR model can be collapsed into a single matrix, called the product matrix \(M^{(L)}\) ( see below )
  • since the model is linear…
    • there is a one-to-one mapping between the product matrix and the induced linear function
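
Concretely, writing \(W^{(l)}\) for the \(l\)-th weight matrix (notation assumed here, following the usual convention), the whole linear network collapses into a single linear map:

\[
f(x) \;=\; W^{(L)} W^{(L-1)} \cdots W^{(1)} x \;=\; M^{(L)} x,
\qquad
M^{(L)} := W^{(L)} W^{(L-1)} \cdots W^{(1)}
\]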


(2) Covariance of the Product matrix

  • derives an analytic form for the covariance of the product matrix \(M^{(L)}\) ( a worked two-layer case is sketched below )

    ( this holds for ANY factorized weight distn with finite first & second moments )
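
As a quick worked example (a two-layer case in the notation above, not the paper's general formula): let \(M^{(2)} = W^{(2)} W^{(1)}\) with all weight entries mutually independent, with means \(\mu^{(l)}_{ij}\) and variances \(\sigma^{(l)\,2}_{ij}\). Two entries in the same row of the product matrix are then already correlated:

\[
\operatorname{Cov}\!\left(m_{ab},\, m_{ad}\right)
= \sum_{k} \sigma^{(2)\,2}_{ak}\, \mu^{(1)}_{kb}\, \mu^{(1)}_{kd},
\qquad b \neq d,
\]

which is generally non-zero even though every individual weight is fully factorized.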


Figure 2

  • shows that a deep mean-field linear model is able to induce function-space distns which would require covariance between weights in a shallower model


(3) Numerical Simulation

  • visualize the covariance between entries of the product matrix of a deep mean-field VI linear model ( a minimal version of this check is sketched below )

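A minimal sketch of this kind of Monte-Carlo check (not the paper's code; the layer size, means, and variances below are arbitrary choices):

```python
# Estimate the covariance between entries of the product matrix M = W3 @ W2 @ W1
# when every individual weight is an independent ("mean-field") Gaussian.
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 4, 50_000

# One mean and one std-dev per weight entry, per layer (fully factorized posterior).
means = [rng.normal(0.0, 1.0, size=(d, d)) for _ in range(3)]
stds = [rng.uniform(0.1, 0.5, size=(d, d)) for _ in range(3)]

samples = np.empty((n_samples, d * d))
for s in range(n_samples):
    W1, W2, W3 = [m + sd * rng.standard_normal((d, d)) for m, sd in zip(means, stds)]
    samples[s] = (W3 @ W2 @ W1).ravel()          # flatten the sampled product matrix

cov = np.cov(samples, rowvar=False)              # (d*d, d*d) covariance of product-matrix entries
off_diag = cov - np.diag(np.diag(cov))
print("largest off-diagonal covariance:", np.abs(off_diag).max())   # clearly non-zero
```

Despite the fully factorized weights, the empirical covariance of the product-matrix entries shows substantial off-diagonal structure.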


(4) How Expressive is the Product Matrix?

  • the MVG (Matrix Variate Gaussian) distn is a special case of the mean-field product matrix distn ( construction sketched below )

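The construction behind this claim, sketched in standard matrix-normal notation (this is the textbook property, not a reproduction of the paper's proof): if \(E\) has i.i.d. \(\mathcal{N}(0,1)\) entries and \(A\), \(B\) are fixed matrices, then

\[
A E B \;\sim\; \mathcal{MN}\!\left(0,\; A A^{\top},\; B^{\top} B\right).
\]

So a 3-layer product matrix \(M^{(3)} = W^{(3)} W^{(2)} W^{(1)}\) whose outer factors are deterministic ( a point mass is the zero-variance limit of a factorized Gaussian ) and whose middle factor has i.i.d. Gaussian entries induces a Matrix Variate Gaussian distribution over \(M^{(3)}\).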


4. Weight Distribution Hypothesis in Deep Piecewise-Linear Mean-Field BNNs

NNs use non-linear activations

( these non-linearities make it impossible to collapse the weights into a single product matrix :( )


Instead, show how to define local product matrix ( extension of product matrix )

  • with piecewise-linear activation funcs
  • ex) ReLUs, Leaky ReLUs


(1) Defining a Local Product Matrix

  • NNs with piecewise-linear activations induce piecewise-linear functions
  • each linear region can be identified by a sign vector ( = which units switch on & off; see the sketch below )
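
A minimal sketch of the idea (not the paper's code; a bias-free ReLU network with arbitrary layer sizes is assumed):

```python
# Within one linear region, a ReLU network is exactly a linear map: the
# "local product matrix" interleaves the weight matrices with 0/1 diagonal
# matrices given by the region's sign vector.
import numpy as np

rng = np.random.default_rng(0)
widths = [3, 8, 8, 2]                                    # input 3, two hidden layers, output 2
Ws = [rng.standard_normal((widths[i + 1], widths[i])) for i in range(3)]

def forward_and_local_product(x):
    h, Ds = x, []
    for W in Ws[:-1]:
        pre = W @ h
        Ds.append(np.diag((pre > 0).astype(float)))      # sign vector -> 0/1 diagonal matrix
        h = np.maximum(pre, 0.0)                         # ReLU
    out = Ws[-1] @ h
    P = Ws[-1]                                           # build W3 D2 W2 D1 W1
    for W, D in zip(reversed(Ws[:-1]), reversed(Ds)):
        P = P @ D @ W
    return out, P

x_star = rng.standard_normal(widths[0])
out, P = forward_and_local_product(x_star)
print(np.allclose(out, P @ x_star))                      # True: f(x*) = P_{x*} x* inside the region
```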


(2) Covariance of Local Product Matrix

  • given a mean-field distn over the weights of a NN \(f\) with piecewise-linear activations,

    \(f\) can be written in terms of the local product matrix \(P_{x^{*}}\) within the linear region \(A\) containing \(x^{*}\) ( see below )
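
In symbols (ignoring biases, and writing \(D^{(l)}\) for the 0/1 diagonal matrix given by the region's sign vector; this notation is assumed here, not taken from the paper):

\[
f(x) = P_{x^{*}}\, x \quad \text{for all } x \in A,
\qquad
P_{x^{*}} = W^{(L)} D^{(L-1)} W^{(L-1)} \cdots D^{(1)} W^{(1)}
\]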



5. True Posterior Hypothesis in 2-hidden layer Mean-Field Networks

proves hypothesis 2), the True Posterior hypothesis, using the Universal Approximation Theorem (UAT)

  • shows that a BNN with a mean-field approximate posterior and at least 2 layers of hidden units can induce a function-space distribution that matches any true posterior distribution over function values arbitrarily closely
