Overdispersed Black-Box Variational Inference (2016)


Abstract

A method to reduce the variance of the Monte Carlo (MC) estimator of the gradient in BBVI

  • does not sample from the variational distn (X)
  • instead, uses importance sampling with an overdispersed proposal (O)


Proposed method : readily applies to any exponential-family variational distn

The overdispersed importance sampling scheme yields lower-variance gradient estimates than standard BBVI


1. Introduction

(1) Generative Probabilistic Modeling

  • describes the data-generating process through a joint distn over “observed data” & “latent variables”
  • use an inference algorithm to calculate/approximate the posterior


(2) Variational inference

  • ( traditional VI ) use coordinate-ascent to optimize its objective

    ( recent innovations ) use stochastic optimization

  • must address the HIGH VARIANCE of the MC estimates of the gradient


Several strategies to overcome this

  • 1) Rao-Blackwellization
  • 2) Reparameterization
  • 3) Local expectations


This paper proposes O-BBVI (Overdispersed BBVI)

  • new method for reducing the variance of MC gradients

  • main idea = “use IMPORTANCE SAMPLING” to estimate the gradient

    ( the key is a good proposal distn that is matched to the problem )


Technical summary

  • probabilistic model : \(p(x,z)\)

  • VI : use parameterized distn of latent variables \(q(z ; \lambda)\)


build on BBVI

  • \(\mathcal{L}(\lambda)\) : variational objective ( = negative KL-div + constant )

    gradient of \(L(\lambda)\) : \(\nabla_{\lambda} \mathcal{L}=\mathbb{E}_{q(\mathbf{z} ; \lambda)}[f(\mathbf{z})]\)

    where \(f(\mathbf{z})=\nabla_{\lambda} \log q(\mathbf{z} ; \lambda)(\log p(\mathbf{x}, \mathbf{z})-\log q(\mathbf{z} ; \lambda))\) ( derivation sketched below this list )

  • Approximate the gradient with importance sampling!
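
Aside : the gradient identity \(\nabla_{\lambda} \mathcal{L}=\mathbb{E}_{q(\mathbf{z} ; \lambda)}[f(\mathbf{z})]\) follows from the log-derivative trick, using \(\nabla_{\lambda} q=q \nabla_{\lambda} \log q\) and \(\mathbb{E}_{q}\left[\nabla_{\lambda} \log q\right]=0\):

\[\nabla_{\lambda} \mathcal{L}=\int \nabla_{\lambda} q(\mathbf{z} ; \lambda)\left(\log p(\mathbf{x}, \mathbf{z})-\log q(\mathbf{z} ; \lambda)\right) d \mathbf{z}-\underbrace{\mathbb{E}_{q(\mathbf{z} ; \lambda)}\left[\nabla_{\lambda} \log q(\mathbf{z} ; \lambda)\right]}_{=0}=\mathbb{E}_{q(\mathbf{z} ; \lambda)}[f(\mathbf{z})]\]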


Introduce a proposal distn : \(r(\mathbf{z} ; \lambda, \tau)\)

  • depends on both (1) the variational params \(\lambda\) & (2) an additional dispersion param \(\tau\)

  • result : \(\nabla_{\lambda} \mathcal{L}=\mathbb{E}_{r(\mathbf{z} ; \lambda, \tau)}\left[f(\mathbf{z}) \frac{q(\mathbf{z} ; \lambda)}{r(\mathbf{z} ; \lambda, \tau)}\right]\).
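
This is the standard importance-sampling identity, valid for any \(r\) that is positive wherever \(q(\mathbf{z} ; \lambda) f(\mathbf{z}) \neq 0\):

\[\mathbb{E}_{r(\mathbf{z} ; \lambda, \tau)}\left[f(\mathbf{z}) \frac{q(\mathbf{z} ; \lambda)}{r(\mathbf{z} ; \lambda, \tau)}\right]=\int f(\mathbf{z}) \frac{q(\mathbf{z} ; \lambda)}{r(\mathbf{z} ; \lambda, \tau)}\, r(\mathbf{z} ; \lambda, \tau)\, d \mathbf{z}=\mathbb{E}_{q(\mathbf{z} ; \lambda)}[f(\mathbf{z})]=\nabla_{\lambda} \mathcal{L}\]

so the estimator stays unbiased for any such proposal.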


2. Black-Box Variational Inference (BBVI)

variational family \(q(z ; \lambda)\)

  • \(q(\mathbf{z} ; \lambda)=g(\mathbf{z}) \exp \left\{\lambda^{\top} t(\mathbf{z})-A(\lambda)\right\}\).

    • \(g(\mathbf{z})\) : base measure
    • \(\lambda\) : natural parameters
    • \(t(\mathbf{z})\) : sufficient statistics
    • \(A(\lambda)\) : log normalizer
  • goal : minimize \(D_{\mathrm{KL}}(q(\mathbf{z} ; \lambda) \,\|\, p(\mathbf{z} \mid \mathbf{x}))\)

    ( = maximize ELBO, \(\mathcal{L}(\lambda)=\mathbb{E}_{q(\mathbf{z} ; \lambda)}[\log p(\mathbf{x}, \mathbf{z})-\log q(\mathbf{z} ; \lambda)]\) )
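
As a concrete instance of the exponential-family form above ( a standard example, not from the paper ) : a univariate Gaussian \(q(z ; \mu, \sigma^{2})\) has

\[\lambda=\left(\frac{\mu}{\sigma^{2}},-\frac{1}{2 \sigma^{2}}\right), \quad t(z)=\left(z, z^{2}\right), \quad g(z)=\frac{1}{\sqrt{2 \pi}}, \quad A(\lambda)=-\frac{\lambda_{1}^{2}}{4 \lambda_{2}}-\frac{1}{2} \log \left(-2 \lambda_{2}\right)\]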


with tractable variational family & conditionally conjugate model \(\rightarrow\) closed form!

But most real-world models are not conditionally conjugate…


BBVI :

  • uses “MC estimates of gradient” & requires few model-specific calculations

  • relies on log-derivative trick ( = REINFORCE or score function )

  • score-function identities ( used in the code sketch after this list ) :

    \(\begin{array}{l} \nabla_{\lambda} q(\mathbf{z} ; \lambda)=q(\mathbf{z} ; \lambda) \nabla_{\lambda} \log q(\mathbf{z} ; \lambda) \\ \mathbb{E}_{q(\mathbf{z} ; \lambda)}\left[\nabla_{\lambda} \log q(\mathbf{z} ; \lambda)\right]=0 \end{array}\).

  • BUT, the resulting estimator may have HIGH VARIANCE

  • uses 2 strategies :

    • 1) control variates
    • 2) Rao-Blackwellization
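
To make the score-function estimator concrete, here is a minimal sketch ( illustrative code, not the paper's ) for a toy model with prior \(z \sim \mathcal{N}(0,1)\), likelihood \(x \mid z \sim \mathcal{N}(z, 1)\), and a Gaussian variational distn parameterized by `m` and `log_s`; all names are assumptions of this sketch:

```python
import numpy as np

x_obs = 2.0  # single observation; the exact posterior is N(1, 0.5)

def log_p(z):
    # log joint log p(x, z), up to an additive constant
    return -0.5 * z**2 - 0.5 * (x_obs - z)**2

def log_q(z, m, log_s):
    s = np.exp(log_s)
    return -0.5 * np.log(2 * np.pi) - log_s - 0.5 * ((z - m) / s)**2

def grad_log_q(z, m, log_s):
    # score function: gradient of log q w.r.t. (m, log_s); returns shape (S, 2)
    s = np.exp(log_s)
    return np.stack([(z - m) / s**2, ((z - m) / s)**2 - 1.0], axis=-1)

def bbvi_grad(m, log_s, S=64, rng=np.random):
    z = m + np.exp(log_s) * rng.randn(S)  # z^(s) ~ q(z; lambda)
    f = grad_log_q(z, m, log_s) * (log_p(z) - log_q(z, m, log_s))[:, None]
    return f.mean(axis=0)  # MC estimate of the ELBO gradient
```

No gradient of the model itself is needed, only the ability to evaluate \(\log p(\mathbf{x}, \mathbf{z})\) : this is what makes the method “black-box”.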


(1) Control Variates

Properties

  • (1) a r.v. \(h(\mathbf{z})\) is included in the estimator
  • (2) the estimator keeps the same expectation but with reduced variance
  • (3) many possible choices of \(h(\mathbf{z})\)


Weighted Score function

  • not model dependent :)

  • each component \(n\) of the gradient \(\nabla_{\lambda} \mathcal{L}=\mathbb{E}_{q(\mathbf{z} ; \lambda)}[f(\mathbf{z})]\)

    can be rewritten, with the score function as the control variate \(h_n\), as

    \(\mathbb{E}_{q(\mathbf{z} ; \lambda)}\left[f_{n}(\mathbf{z})-a_{n} h_{n}(\mathbf{z})\right]\) ( same expectation, since \(\mathbb{E}_{q}\left[h_{n}(\mathbf{z})\right]=0\) )

  • optimal coefficient ( minimizes the variance of the estimator ) :

    \(a_{n}=\frac{\operatorname{Cov}\left(f_{n}(\mathbf{z}), h_{n}(\mathbf{z})\right)}{\operatorname{Var}\left(h_{n}(\mathbf{z})\right)}\).


In BBVI, a separate set of samples from \(q(\mathbf{z} ; \lambda)\) is used to estimate \(a_n\) ( keeping the estimator unbiased )
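
A sketch of the coefficient estimate ( hypothetical helper names; `f` and `h` hold per-sample values of \(f_n\) and \(h_n\) ):

```python
def cv_coefficients(f, h, eps=1e-12):
    """Estimate a_n = Cov(f_n, h_n) / Var(h_n) per gradient component n."""
    f_c = f - f.mean(axis=0)        # center f, shape (S, N)
    h_c = h - h.mean(axis=0)        # center h, shape (S, N)
    cov = (f_c * h_c).mean(axis=0)  # sample covariance per component
    var = (h_c ** 2).mean(axis=0)   # sample variance per component
    return cov / (var + eps)
```

The variance-reduced gradient then averages \(f_{n}(\mathbf{z})-a_{n} h_{n}(\mathbf{z})\) over a fresh batch of samples.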


(2) Rao-Blackwellization

  • reduce variance of r.v, by “replacing it with its conditional expectation”

  • In BBVI, each component of the gradient is Rao-Blackwellized with respect to variables outside of the Markov blanket of the involved hidden variable

  • MFVI ) \(q(\mathbf{z} ; \lambda)=\prod_{n} q\left(z_{n} ; \lambda_{n}\right)\)

    \(\nabla_{\lambda_{n}} \mathcal{L}=\mathbb{E}_{q\left(\mathbf{z}_{(n)} ; \lambda_{(n)}\right)}\left[\nabla_{\lambda_{n}} \log q\left(z_{n} ; \lambda_{n}\right)\left(\log p_{n}\left(\mathbf{x}, \mathbf{z}_{(n)}\right)-\log q\left(z_{n} ; \lambda_{n}\right)\right)\right]\)

    where \(\mathbf{z}_{(n)}\) collects \(z_{n}\) and the variables in its Markov blanket, and \(\log p_{n}\) keeps only the terms of the log-joint that depend on \(\mathbf{z}_{(n)}\)
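
A sketch for a factorized toy model in which each observation \(x_n\) depends only on \(z_n\) ( my construction, reusing `log_q` and `grad_log_q` from the sketch above ); the gradient for \(\lambda_{n}\) then touches only the terms of the log-joint containing \(z_n\):

```python
def rb_grad_n(m_n, log_s_n, x_n, S=64, rng=np.random):
    # Rao-Blackwellized gradient for one mean-field factor q(z_n; m_n, log_s_n)
    # in a model with z_n ~ N(0, 1) and x_n | z_n ~ N(z_n, 1):
    # only log p_n (the prior and likelihood terms containing z_n) enters.
    z = m_n + np.exp(log_s_n) * rng.randn(S)      # z_n ~ q(z_n)
    log_p_n = -0.5 * z**2 - 0.5 * (x_n - z)**2    # terms of log p involving z_n
    f = grad_log_q(z, m_n, log_s_n) * (log_p_n - log_q(z, m_n, log_s_n))[:, None]
    return f.mean(axis=0)
```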


3. Overdispersed Black-Box Variational Inference (O-BBVI)

O-BBVI : method for further reducing the variance

main idea : use IMPORTANCE SAMPLING

( does not sample from variational distribution \(q(\mathbf{z} ; \lambda)\) )


Takes samples from a proposal distn, \(r(\mathbf{z} ; \lambda, \tau)\)

  • importance weights = \(w(\mathbf{z})=q(\mathbf{z} ; \lambda) / r(\mathbf{z} ; \lambda, \tau)\)

    \(\rightarrow\) resulting estimator is unbiased


Optimal Proposal

The proposal that minimizes the variance of the estimator is not the variational distribution \(q(\mathbf{z} ; \lambda)\) !

Rather, it is….

\(r_{n}^{\star}(\mathbf{z}) \propto q(\mathbf{z} ; \lambda)\left|f_{n}(\mathbf{z})\right|\).

  • but not tractable
  • thus, in O-BBVI, build an alternative proposal based on overdispersed exponential families
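
( Why \(r_{n}^{\star}\) is optimal : the classical importance-sampling argument. By Cauchy–Schwarz,

\[\left(\int\left|f_{n}(\mathbf{z})\right| q(\mathbf{z} ; \lambda)\, d \mathbf{z}\right)^{2} \leq \int \frac{f_{n}^{2}(\mathbf{z})\, q^{2}(\mathbf{z} ; \lambda)}{r(\mathbf{z})}\, d \mathbf{z} \cdot \int r(\mathbf{z})\, d \mathbf{z}\]

with equality iff \(r(\mathbf{z}) \propto q(\mathbf{z} ; \lambda)\left|f_{n}(\mathbf{z})\right|\), so this choice minimizes the second moment of the estimator. It is intractable because its normalizing constant, \(\int q(\mathbf{z} ; \lambda)\left|f_{n}(\mathbf{z})\right| d \mathbf{z}\), is itself an unknown integral. )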


Overdispersed Proposal

motivation : the optimal distn \(r_{n}^{\star}(\mathbf{z}) \propto q(\mathbf{z} ; \lambda)\left|f_{n}(\mathbf{z})\right|\) assigns higher probability density to the tails of \(q(\mathbf{z} ; \lambda)\)

Thus, design a proposal distn \(r(\mathbf{z} ; \lambda, \tau)\) that assigns higher mass to the tails!

\[r(\mathbf{z} ; \lambda, \tau)=g(\mathbf{z}, \tau) \exp \left\{\frac{\lambda^{\top} t(\mathbf{z})-A(\lambda)}{\tau}\right\}\]
  • where \(\tau \geq 1\) is the dispersion coefficient of the overdispersed distn
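
For the Gaussian case ( where \(g\) does not depend on \(z\) ), dividing the natural parameters by \(\tau\) keeps the mean and inflates the variance:

\[\frac{\lambda}{\tau}=\left(\frac{\mu}{\tau \sigma^{2}},-\frac{1}{2 \tau \sigma^{2}}\right) \quad \Longrightarrow \quad r(z ; \lambda, \tau)=\mathcal{N}\left(z ; \mu, \tau \sigma^{2}\right)\]

i.e., the proposal shares the location of \(q\) but has heavier tails, as desired.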

Then, the estimator of the gradient is …

\(\widehat{\nabla}_{\lambda}^{\mathrm{O}-\mathrm{BB}} \mathcal{L}=\frac{1}{S} \sum_{s} f\left(\mathbf{z}^{(s)}\right) \frac{q\left(\mathbf{z}^{(s)} ; \lambda\right)}{r\left(\mathbf{z}^{(s)} ; \lambda, \tau\right)}, \quad \mathbf{z}^{(s)} \stackrel{\text { iid }}{\sim} r(\mathbf{z} ; \lambda, \tau)\).
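
Continuing the toy sketch from Section 2 ( illustrative code with an overdispersed Gaussian proposal; reuses `log_p`, `log_q`, `grad_log_q` ):

```python
def obbvi_grad(m, log_s, tau=2.0, S=64, rng=np.random):
    s = np.exp(log_s)
    # overdispersed Gaussian proposal r = N(m, tau * s^2): same mean, heavier tails
    z = m + np.sqrt(tau) * s * rng.randn(S)
    log_r = -0.5 * np.log(2 * np.pi * tau * s**2) - 0.5 * (z - m)**2 / (tau * s**2)
    w = np.exp(log_q(z, m, log_s) - log_r)    # importance weights q / r
    f = grad_log_q(z, m, log_s) * (log_p(z) - log_q(z, m, log_s))[:, None]
    return (f * w[:, None]).mean(axis=0)      # importance-weighted gradient estimate
```

( The paper additionally combines this with the control variates and Rao-Blackwellization of Section 2; the plain version above shows only the importance-sampling step. )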


Desired properties of \(r(\mathbf{z} ; \lambda, \tau)\)

  • 1) easy to sample from
  • 2) adaptive
  • 3) higher mass to the tails of \(q(\mathbf{z} ; \lambda)\)


The dispersion coefficient \(\tau\) can itself be adapted, to better match the optimal proposal at each iteration of the variational optimization procedure.


3-1. Variance Reduction

O-BBVI vs BBVI

(1) BBVI

  • \(\widehat{\nabla}_{\lambda}^{\mathrm{BB}} \mathcal{L}=\frac{1}{S} \sum_{s} f\left(\mathbf{z}^{(s)}\right), \quad \mathbf{z}^{(s)} \stackrel{\mathrm{iid}}{\sim} q(\mathbf{z} ; \lambda)\).
  • \(\mathbb{V}\left[\widehat{\nabla}_{\lambda}^{\mathrm{BB}} \mathcal{L}\right]=\frac{1}{S} \mathbb{E}_{q(\mathbf{z} ; \lambda)}\left[f^{2}(\mathbf{z})\right]-\frac{1}{S}\left(\nabla_{\lambda} \mathcal{L}\right)^{2}\).

(2) O-BBVI

  • \(\begin{aligned} \mathbb{V}\left[\widehat{\nabla}_{\lambda}^{\mathrm{O}-\mathrm{BB}} \mathcal{L}\right] &=\frac{1}{S} \mathbb{E}_{r(\mathbf{z} ; \lambda, \tau)}\left[f^{2}(\mathbf{z}) \frac{q^{2}(\mathbf{z} ; \lambda)}{r^{2}(\mathbf{z} ; \lambda, \tau)}\right]-\frac{1}{S}\left(\nabla_{\lambda} \mathcal{L}\right)^{2} \\ &=\frac{1}{S} \mathbb{E}_{q(\mathbf{z} ; \lambda)}\left[f^{2}(\mathbf{z}) \frac{q(\mathbf{z} ; \lambda)}{r(\mathbf{z} ; \lambda, \tau)}\right]-\frac{1}{S}\left(\nabla_{\lambda} \mathcal{L}\right)^{2} \end{aligned}\)

  • comparing the second moments : O-BBVI has lower variance whenever \(q / r<1\) in the regions where \(f^{2}(\mathbf{z}) q(\mathbf{z} ; \lambda)\) is large, i.e., when the proposal puts extra mass where the integrand is heaviest ( the tails )


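A quick, illustrative empirical check with the two sketches above : draw many small-batch gradient estimates and compare their per-component variances ( the size of the improvement depends on the problem and on \(\tau\), which motivates adapting \(\tau\) during optimization ):

```python
rng = np.random.RandomState(0)
g_bb = np.array([bbvi_grad(0.0, 0.0, S=10, rng=rng) for _ in range(2000)])
g_ob = np.array([obbvi_grad(0.0, 0.0, tau=2.0, S=10, rng=rng) for _ in range(2000)])
print("BBVI   gradient variance per component:", g_bb.var(axis=0))
print("O-BBVI gradient variance per component:", g_ob.var(axis=0))
```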
