Liberty or Depth: Deep BNNs Do Not Need Complex Weight Posterior Approximations (NeurIPS 2020)
Abstract
MFVI
- severely restrictive!
- BUT NOT in the case of DEEP networks!
prove that “deep mean-field variational weight posteriors” can induce similar distributions in function-space to those induced by shallower networks with complex weight posteriors
1. Introduction
VI in BNNs…ex) MFVI
MFVI : severe limitation ( \(\because\) it cannot capture correlations between weights )
But not using MFVI…? → too heavy computation
- ex) structured-covariance methods : poor time complexity
mean-field BNNs ( Wu et al. 2019 )
Argues that in larger, deeper networks, the mean-field approximation matters less!
- 1) simple parametric functions need complicated weight-distns to induce rich distns in function-space
- 2) complicated parametric functions can induce the same function-space distns with simple weight-distns
Two ways to get an expressive approximate posterior predictive
- 1) simple likelihood & rich approximate posterior over weights \(q(\theta)\)
- 2) simple \(q(\theta)\) & rich likelihood
    ( ex) deeper model mapping \(x\) to \(y\) )
2 hypotheses
- 1) Weight Distribution Hypothesis
    - for a BNN with a full-covariance weight distn,
      there exists a deeper BNN with a “mean-field” weight distn
      that induces a similar distn over function values
- 2) True Posterior Hypothesis
    - for a sufficiently deep & wide BNN,
      there exists a mean-field distn over the weights of that BNN,
      which induces the same distn over function values as that induced by the posterior predictive
hypothesis 1) suggests that “shallow complex-covariance” = “deeper mean-field”
( shows that the Matrix Variate Gaussian distn is a special case of the 3-layer product matrix distn )
( thus allowing MFVI to model rich covariances )
hypothesis 2) states that mean-field weight distns can approximate the true posterior predictive
2. Related Work
(1) MacKay (1992)
- using the MF approximation for Bayesian inference in NNs is a severe limitation
- “diagonal approximation” is no good, because of strong posterior correlations!
(2) Barber and Bishop (1998)
- “full-covariance VI”
- poor time complexity
(3) etc
- structured-covariance approximations…
    ( still have unattractive time complexity )
But there has been no such analysis for DNNs! ( only shallow NNs )
3. Emergence of Complex Covariance in Deep Linear Mean-Field Networks
1) Weight Distribution hypothesis is TRUE in linear networks
(1) Defining a Product matrix
- activation function of NN : \(\phi(\cdot)\)
- \(L\) weight matrices of a deep LINEAR model can be collapsed into a single matrix ( called the product matrix, \(M^{(L)} = W^{(L)} \cdots W^{(1)}\) )
- since the model is linear…
    - there is a one-to-one mapping between the product matrix and the induced function: \(f(x) = M^{(L)} x\)
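A minimal sketch of this idea (shapes and code are mine, not the paper's): multiplying out the per-layer weight matrices of a deep linear model gives a single product matrix that computes the same function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer widths for a hypothetical deep linear model: x -> h1 -> h2 -> y
widths = [4, 8, 8, 2]

# One draw of mean-field (fully factorized Gaussian) weights:
# every entry is sampled independently, so there is no covariance within a layer.
weights = [rng.normal(loc=0.0, scale=0.1, size=(widths[l + 1], widths[l]))
           for l in range(len(widths) - 1)]

# Product matrix M^(L) = W^(L) ... W^(1): collapses the deep linear model
# into a single linear map with the same input/output behaviour.
M = weights[-1]
for W in reversed(weights[:-1]):
    M = M @ W

# The deep linear network and the product matrix define the same function.
x = rng.normal(size=widths[0])
deep_output = x
for W in weights:
    deep_output = W @ deep_output

assert np.allclose(M @ x, deep_output)
```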
(2) Covariance of the Product matrix
- derive an analytic form of the covariance of the product matrix \(M^{(L)}\)
    ( this holds for ANY factorized weight distn with finite first & second moments )
- shows that a deep MF linear model is able to induce function-space distns which would require covariance between weights in a shallower model
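A small worked case (my own illustration, simpler than the paper's general lemma): even a product of just two mean-field layers has correlated entries. With \(M = AB\), \(m_{ij} = \sum_r a_{ir} b_{rj}\), and all entries of \(A\) and \(B\) mutually independent, for \(j \neq q\)
\[ \operatorname{Cov}(m_{ij}, m_{iq}) = \sum_r \operatorname{Var}(a_{ir}) \, \mathbb{E}[b_{rj}] \, \mathbb{E}[b_{rq}], \]
which is generally non-zero whenever \(B\) has non-zero means: entries in the same row of \(M\) share the random row \(a_{i\cdot}\), so fully independent weights still induce covariance between product-matrix entries.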
(3) Numerical Simulation
- visualize the covariance between entries of the product matrix from a deep mean-field VI linear model
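A rough sketch of such a simulation (widths, means, and sample count are arbitrary choices of mine): sample many independent mean-field weight draws, collapse each into a product matrix, and inspect the empirical covariance between its entries.

```python
import numpy as np

rng = np.random.default_rng(1)

widths = [3, 16, 16, 3]      # hypothetical deep linear model
n_samples = 20000

def sample_product_matrix():
    """Draw independent (mean-field) Gaussian weights and collapse them."""
    # Non-zero means are used so that off-diagonal covariance terms survive.
    Ws = [rng.normal(loc=0.05, scale=0.1, size=(widths[l + 1], widths[l]))
          for l in range(len(widths) - 1)]
    M = Ws[-1]
    for W in reversed(Ws[:-1]):
        M = M @ W
    return M.ravel()         # flatten the entries of M^(L)

samples = np.stack([sample_product_matrix() for _ in range(n_samples)])

# Empirical covariance between entries of the product matrix.
cov = np.cov(samples, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))
print("max |off-diagonal covariance|:", np.abs(off_diag).max())
# The off-diagonal terms are clearly non-zero, even though every individual
# weight was sampled independently (mean-field).
```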
(4) How Expressive is the Product Matrix?
- MVG (Matrix Variate Gaussian) distn is a special case of the mean-field product matrix distn
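For reference, the standard MVG fact behind this claim (general definition, not notation from this paper): an MVG has Kronecker-structured covariance once vectorized,
\[ W \sim \mathcal{MN}(M, U, V) \iff \operatorname{vec}(W) \sim \mathcal{N}\big(\operatorname{vec}(M),\; V \otimes U\big), \]
so showing that a 3-layer mean-field product matrix can realize this family means depth lets MFVI cover at least these Kronecker-structured covariances.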
4. Weight Distribution Hypothesis in Deep Piecewise-Linear Mean-Field BNNs
NNs use non-linear activations
( these non-linearities make it impossible to define a single product matrix :( )
Instead, shows how to define a local product matrix ( an extension of the product matrix )
- with piecewise-linear activation funcs
- ex) ReLUs, Leaky ReLUs
(1) Defining a Local Product Matrix
- NNs with piecewise-linear activations induce piecewise-linear functions
- each linear region can be identified by a sign vector ( = which activations switch on & off )
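A minimal sketch of the construction for a bias-free ReLU net (my own illustration of the piecewise-linear view, not the paper's code): inside the linear region containing \(x^{*}\), each ReLU acts as a fixed 0/1 diagonal mask, so the layers again collapse into a single matrix \(P_{x^{*}}\).

```python
import numpy as np

rng = np.random.default_rng(2)

widths = [4, 8, 8, 2]
Ws = [rng.normal(scale=0.5, size=(widths[l + 1], widths[l]))
      for l in range(len(widths) - 1)]

def local_product_matrix(Ws, x_star):
    """Collapse a bias-free ReLU net into P_{x*}, valid in the region around x_star.

    The sign pattern of the pre-activations identifies the linear region;
    inside that region every ReLU is just multiplication by a 0/1 diagonal matrix.
    """
    P = np.eye(len(x_star))
    h = x_star
    for W in Ws[:-1]:
        pre = W @ h                              # pre-activations at this layer
        D = np.diag((pre > 0).astype(float))     # sign vector -> diagonal mask
        P = D @ W @ P                            # fold W and the local mask into P
        h = np.maximum(pre, 0.0)                 # continue the forward pass (ReLU)
    return Ws[-1] @ P                            # last layer is linear

x_star = rng.normal(size=widths[0])
P = local_product_matrix(Ws, x_star)

# Within this region the network is exactly the linear map P_{x*}:
forward = x_star
for W in Ws[:-1]:
    forward = np.maximum(W @ forward, 0.0)
forward = Ws[-1] @ forward

assert np.allclose(P @ x_star, forward)
```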
(2) Covariance of Local Product Matrix
- given a mean-field distn over the weights of a NN \(f\) with piecewise-linear activations,
  \(f\) can be written in terms of the local product matrix \(P_{x^{*}}\) within a region \(A\) containing \(x^{*}\)
5. True Posterior Hypothesis in 2-hidden layer Mean-Field Networks
prove the 2) True Posterior Hypothesis using the Universal Approximation Theorem (UAT)
- shows that a BNN with a mean-field approximate posterior and at least 2 layers of hidden units can induce a function-space distribution that matches any true posterior distribution over function values arbitrarily closely
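A hedged restatement in symbols (notation mine; the paper makes the notion of closeness precise): for any finite set of inputs \(x_{1:n}\) and any \(\epsilon > 0\), there exists a mean-field \(q(\theta)\) over a network with at least two hidden layers such that
\[ d\Big( q\big(f(x_{1:n})\big),\; p\big(f(x_{1:n}) \mid \mathcal{D}\big) \Big) < \epsilon, \]
where \(d\) is the distance used in the paper's theorem and \(q(f(x_{1:n}))\) denotes the function-space distribution induced by \(q(\theta)\).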