Using Large Ensembles of Control Variates for Variational Inference (NeurIPS 2018)

Abstract

VI : used with “stochastic optimization”

  • “gradient’s variance” is important ( high \(\rightarrow\) poor convergence )
  • thus, use CONTROL VARIATES


Proposal

  • use a **large number of control variates**!
  • present a Bayesian risk minimization framework


1. Introduction

VI = “approximate probabilistic inference”

  • nowadays, address a wider range of problems, by adopting a black-box view, based on

    only evaluating the value or gradient of the target distribution

    ( optimized via stochastic gradient descent (SGD) )

  • use CVs ( control variates ) to reduce variance of gradient estimates


Proposes “how to use many CVs”


Contributions

  • 1) propose a systematic view of “how to generate many existing CVs”
  • 2) propose an algorithm to use “multiple CVs simultaneously”


2. Preliminaries

2-1. ELBO

VI = transform inference problem into “optimization”, by…

  • decomposing the “marginal likelihood” as below :

    \(\log p(x)=\underbrace{\underset{Z \sim q_{w}(Z)}{\mathbb{E}}\left[\log \frac{p(Z, x)}{q_{w}(Z)}\right]}_{\operatorname{ELBO}(w)}+\underbrace{\operatorname{KL}\left(q_{w}(Z) \,\|\, p(Z \mid x)\right)}_{\text{KL divergence}}\).

  • (target) maximize ELBO

  • target’s gradient :

    \(g(w)=\nabla_{w} \operatorname{ELBO}(w)=\nabla_{w} \underset{Z \sim q_{w}(Z)}{\mathbb{E}}\left[\log p(Z, x)-\log q_{w}(Z)\right]\).
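
A minimal sketch of how this gradient is estimated in black-box VI, using a score-function (REINFORCE) Monte Carlo estimate for a one-dimensional Gaussian \(q_w\); the toy target \(\log p(z,x)\), the parameterization \(w=(\mu, \log\sigma)\), and the sample count are illustrative assumptions, not from the paper.

```python
import numpy as np

# Minimal sketch (not from the paper): score-function ("black box") Monte Carlo
# estimate of grad_w ELBO(w) for q_w(z) = N(z; mu, sigma^2), w = (mu, log_sigma).
# The toy target log p(z, x) = log N(z; 1, 1) is an illustrative assumption.

def log_q(z, mu, log_sigma):
    sigma = np.exp(log_sigma)
    return -0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * ((z - mu) / sigma) ** 2

def grad_log_q(z, mu, log_sigma):
    # Closed-form score of the Gaussian: d log q_w(z) / d(mu, log_sigma).
    sigma = np.exp(log_sigma)
    return np.stack([(z - mu) / sigma ** 2,
                     ((z - mu) / sigma) ** 2 - 1.0], axis=-1)

def log_p(z):
    return -0.5 * np.log(2 * np.pi) - 0.5 * (z - 1.0) ** 2

def elbo_gradient_estimate(mu, log_sigma, n_samples=1000, rng=None):
    # Unbiased estimate: (1/S) sum_s grad_w log q_w(z_s) * (log p(z_s, x) - log q_w(z_s)).
    rng = rng or np.random.default_rng(0)
    z = mu + np.exp(log_sigma) * rng.standard_normal(n_samples)
    weights = log_p(z) - log_q(z, mu, log_sigma)
    return (grad_log_q(z, mu, log_sigma) * weights[:, None]).mean(axis=0)

print(elbo_gradient_estimate(mu=0.0, log_sigma=0.0))  # roughly [1, 0] for this toy target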


2-2. Control Variates

CV = “r.v with expectation 0”

  • ex) let \(C\) be CV

    let \(Y=X+aC\) then…

    • same mean ) \(E[Y] = E[X + aC] = E[X]\) ( same expectation as \(X\) )
    • smaller variance ) \(\operatorname{Var}(Y)=\operatorname{Var}(X)\left(1-\operatorname{Corr}(X, C)^{2}\right)\).
      • optimal : \(a=-\operatorname{Cov}(X, C) / \operatorname{Var}(C)\).
      • good CV for \(X\) = \(C\) that is highly correlated with \(X\)
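
A tiny numerical check of this identity; the particular \(X\) and \(C\) (a log-normal and its underlying standard normal) are illustrative choices, not from the paper.

```python
import numpy as np

# Tiny numerical check of the control-variate identity above.
rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)
X = np.exp(z)                           # quantity of interest, E[X] = exp(0.5)
C = z                                   # control variate: E[C] = 0, correlated with X

a = -np.cov(X, C)[0, 1] / np.var(C)     # optimal weight a = -Cov(X, C) / Var(C)
Y = X + a * C                           # same mean as X, but lower variance

print(X.mean(), Y.mean())               # both ≈ exp(0.5) ≈ 1.6487
print(X.var(), Y.var())                 # Var(Y) ≈ Var(X) * (1 - Corr(X, C)^2)
```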


3. Systematic Generation of Control Variates

recipe for creating control variates!

split ELBO gradient into 4 terms :

$$g(w)=\underbrace{\nabla_{w} \underset{q_{w}}{\mathbb{E}} \log p(x \mid Z)}_{g_{1}(w): \text { Data term }}+\underbrace{\nabla_{w} \underset{q_{w}}{\mathbb{E}} \log p(Z)}_{g_{2}(w): \text { Prior term }}-\underbrace{\left.\nabla_{w} \underset{q_{w}}{\mathbb{E}} \log q_{v}(Z)\right|_{v=w}}_{g_{3}(w): \text { Variational term }}-\underbrace{\left.\nabla_{w} \underset{q_{v}}{\mathbb{E}} \log q_{w}(Z)\right|_{v=w}}_{g_{4}(w): \text { Score term }}$$
  • [ term 1~3 ] influence of \(w\) on the expectation of some function “independent of \(w\)“
    • section 3-1 & 3-2
  • [ term 4 ] score term ( function “inside” the expectation depends on \(w\) )
    • section 3-3


3-1. CV from pairs of estimators

technique for deriving CVs = take the difference between a pair of unbiased estimators of a general term \(t(w)\) ( = \(g_1\), \(g_2\), \(g_3\), or a combination of them ) ….. the difference has expectation zero, so it is a valid CV

general term : \(t(w)=\nabla_{w} \underset{q_{w}(Z)}{\mathbb{E}}[f(Z)] \quad \text { or } \quad \left.\left(\nabla_{w} \underset{q_{w}(Z)}{\mathbb{E}}\left[f_{v}(Z)\right]\right)\right|_{v=w}\)

example) 5 types of estimators for \(t(w)\) ( see Figure 2 of the paper )
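
As a concrete sketch of a "paired estimator" CV: the score-function and reparameterization estimators of the same term \(t(w)\), evaluated on the same samples, differ by a zero-mean quantity. The toy \(f(z)=z^2\) and Gaussian \(q_w\) below are illustrative assumptions.

```python
import numpy as np

# Sketch: two unbiased estimators of the same term t(w) = grad_w E_{q_w}[f(Z)],
# evaluated on the SAME samples; their difference has expectation zero, so it
# is a valid CV. f(z) = z^2 and q_w = N(mu, sigma^2) are illustrative.

def paired_estimators(mu, log_sigma, n_samples=10_000, rng=None):
    rng = rng or np.random.default_rng(0)
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(n_samples)
    z = mu + sigma * eps                            # common random numbers

    f, f_prime = z ** 2, 2 * z                      # f(z) and f'(z) at the samples

    # Estimator A: score-function form, grad_w log q_w(z) * f(z).
    score = np.stack([(z - mu) / sigma ** 2,
                      ((z - mu) / sigma) ** 2 - 1.0], axis=-1)
    est_a = score * f[:, None]

    # Estimator B: reparameterization form, f'(z) * dz/dw with z = mu + sigma*eps
    # (dz/dmu = 1, dz/dlog_sigma = sigma * eps).
    est_b = f_prime[:, None] * np.stack([np.ones_like(eps), sigma * eps], axis=-1)

    cv = est_a - est_b                              # expectation zero -> control variate
    return est_a.mean(0), est_b.mean(0), cv.mean(0)

print(paired_estimators(mu=0.5, log_sigma=0.0))     # A ≈ B ≈ [2*mu, 2*sigma^2], CV ≈ 0
```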


3-2. CV from approximations

in 3-1… use the fact that “difference of 2 unbiased estimators has expectation zero”

in 3-2… use the fact that “if general term \(t(w)\) is replaced with APPROXIMATION, difference between 2 estimators of the general term still produces a valid CV”


Randomness in the estimators above : due to….

  • 1) “distributional sampling error”
    • expectations over \(q_w\) are approximated by sampling
  • 2) “data subsampling error”
    • by drawing a mini-batch


1) Correcting for distributional sampling

  • goal : approximate \(f\) with \(\tilde{f}\) to make \(\mathbb{E}[\tilde{f}(Z)]\) easier to estimate
  • ex) Paisley et al : approximate the data term with a “Taylor approximation in \(z\)“ ( see the sketch after this list )

2) Correcting for data subsampling

  • random subsets of data

  • ex) Wang et al : propose to approximate \(f_d(z)\) with “Taylor expansion in \(x\)“

    thus leading to an approximate data term, \(\tilde{g}_{1}(w)=\nabla_{w} \mathbb{E}_{q_{w}} \mathbb{E}_{D} \tilde{f}_{D}(Z)\)
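
The sketch referenced in 1) above: a "CV from approximations" for distributional sampling error, where \(f\) is replaced by a quadratic Taylor approximation whose expectation under a Gaussian \(q_w\) is closed form. The choices \(f = \exp\), the Gaussian \(q_w\), and the fixed expansion point are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Sketch: f is replaced by a quadratic Taylor approximation f~ around a fixed
# point z0, whose expectation under q_w = N(mu, sigma^2) has a closed form.
# The difference between a Monte Carlo estimate and the closed form of the SAME
# approximate term has expectation zero, so it is a valid CV.

rng = np.random.default_rng(0)
mu, sigma, z0 = 0.3, 0.7, 0.3            # z0: fixed Taylor expansion point

def one_run(n_samples=10):
    z = mu + sigma * rng.standard_normal(n_samples)
    base = np.exp(z).mean()                               # MC estimate of d/dmu E[f(Z)]  (f' = exp)
    mc_approx = (np.exp(z0) * (1.0 + z - z0)).mean()      # MC estimate of d/dmu E[f~(Z)]
    exact_approx = np.exp(z0) * (1.0 + mu - z0)           # closed form of the same term
    cv = mc_approx - exact_approx                         # expectation zero
    return base, base - cv                                # plain vs variance-reduced

runs = np.array([one_run() for _ in range(2000)])
print(runs.mean(axis=0))    # both estimators agree in expectation (≈ exp(mu + sigma^2/2))
print(runs.std(axis=0))     # the CV-corrected estimator has a visibly smaller std
```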


3-3. CV from the score term \(g_4\)

score term = always zero! ( \(g_4(w)=0\) )….thus no need to be estimated

naive control variate : \(T_4 = \nabla_w \log q_w(Z)\) ( its expectation under \(q_w\) is exactly \(g_4(w)=0\), so any single-sample estimate of the score is itself a valid CV )
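
A quick numerical check that the score has zero expectation, so \(T_4\) is a valid CV; the diagonal Gaussian \(q_w\) with \(w=(\mu, \log\sigma)\) is an illustrative assumption.

```python
import numpy as np

# Check that grad_w log q_w(Z) has expectation zero under Z ~ q_w.
rng = np.random.default_rng(0)
mu, log_sigma = 0.5, -0.2
sigma = np.exp(log_sigma)

z = mu + sigma * rng.standard_normal(1_000_000)
score = np.stack([(z - mu) / sigma ** 2,               # d log q_w(z) / d mu
                  ((z - mu) / sigma) ** 2 - 1.0],      # d log q_w(z) / d log_sigma
                 axis=-1)
print(score.mean(axis=0))                              # ≈ [0, 0]
```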


4. Combining multiple CVs

to use CV, need to define…

  • a base gradient estimator \(h(w) \in \mathbb{R}^{D}\)
  • set of CVs \(\{c_1,...,c_L\}\) where \(c_i \in \mathbb{R}^{D}\)


Multiply each \(c_i\) with weight \(a_i\)… to get the estimator :

  • \(\hat{g}(w)=h(w)+\sum_{i=1}^{L} a_{i} c_{i}(w)\).

  • in matrix form : \(\hat{g}(w)=h(w)+C(w) a\)
    • \(a \in \mathbb{R}^{L}\)
    • \(C \in \mathbb{R}^{D \times L}\) : the matrix with \(c_{i}\) as the i-th column


Goal : find \(a\) such that final gradient has low variance!


[Lemma 4.1]

Let \(h(w) \in \mathbb{R}^{D}\) be a random variable, \(C(w) \in \mathbb{R}^{D \times L}\) a matrix of random variables such that each element has mean zero. For \(a \in \mathbb{R}^{L}\), define \(\hat{g}(w)=h(w)+C(w) a\).

\(\rightarrow\) The value of \(a\) that minimizes \(\mathbb{E}\|\hat{g}(w)\|^{2}\) for a given \(w\) is

$a^{*}(w)=-\underset{p(C, h \mid w)}{\mathbb{E}}\left[C^{T} C\right]^{-1} \mathbb{E}\left[C^{T} h\right]$.

  • requires \(\mathbb{E}\left[C^{T} C\right]\) and \(\mathbb{E}\left[C^{T} h\right]\) ( usually not available in closed form )
  • solution ) use \(M\) observed gradients \(h_1,...,h_M\) & CVs \(C_1,...,C_M\) to estimate \(a^{*}\) ( see the sketch below )
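
A minimal sketch of that solution: replace the two expectations in Lemma 4.1 with empirical means over the \(M\) observed evaluations and solve the resulting linear system. The dimensions and the random \(h_m, C_m\) below are purely illustrative.

```python
import numpy as np

# Sketch: estimate a*(w) = -E[C^T C]^{-1} E[C^T h] with empirical means over
# M observed evaluations (C_1, h_1), ..., (C_M, h_M).

rng = np.random.default_rng(0)
D, L, M = 20, 5, 8                          # gradient dim, number of CVs, evaluations
Cs = rng.standard_normal((M, D, L))         # observed CV matrices C_m
hs = rng.standard_normal((M, D))            # observed base gradient estimates h_m

CtC = np.mean([C.T @ C for C in Cs], axis=0)                 # empirical E[C^T C], (L, L)
Cth = np.mean([C.T @ h for C, h in zip(Cs, hs)], axis=0)     # empirical E[C^T h], (L,)

a_hat = -np.linalg.solve(CtC, Cth)               # estimated weights a*
g_hat = hs + np.einsum("mdl,l->md", Cs, a_hat)   # combined estimates  g^_m = h_m + C_m a*
print(a_hat.shape, g_hat.shape)                  # (5,), (8, 20)
```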


4-1. Bayesian Regularization

“risk minimization” perspective

[Loss Function] for selecting the vector of weights \(a\), when true parameter is \(\theta\) :

  • loss function : \(L(a, \theta)=\underset{C, h \mid \theta}{\mathbb{E}}\|h+C a\|^{2}\)
  • decision rule : \(\alpha\left(C_{1}, h_{1}, \ldots, C_{M}, h_{M}\right)\)
    • which takes as input a “mini-batch” of \(M\) evaluations of \(h\) and \(C\)


Then, for a pre-specified probabilistic model \(p(C, h, \theta)\)….

  • Bayesian regret : \(\text { BayesRegret }(\alpha)=\underset{\theta}{\mathbb{E}} \underset{C_{1}, h_{1}, \ldots, C_{M}, h_{M} \mid \theta}{\mathbb{E}}\left[L\left(\alpha\left(C_{1}, h_{1}, \ldots, C_{M}, h_{M}\right), \theta\right)\right]\).
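
For intuition only: a ridge-style regularized weight estimate. This is an illustrative stand-in, not the paper's derived Bayes decision rule; it only shows the kind of shrinkage a regularized/Bayesian estimate of \(a\) provides when \(M\) is small relative to \(L\).

```python
import numpy as np

# Illustration only: NOT the paper's Bayes decision rule, just ridge-style
# shrinkage of the empirical solution for a.

def regularized_weights(Cs, hs, reg=1.0):
    # Cs: (M, D, L) observed CV matrices, hs: (M, D) observed base gradients.
    L = Cs.shape[-1]
    CtC = np.mean([C.T @ C for C in Cs], axis=0)
    Cth = np.mean([C.T @ h for C, h in zip(Cs, hs)], axis=0)
    return -np.linalg.solve(CtC + reg * np.eye(L), Cth)

rng = np.random.default_rng(0)
Cs = rng.standard_normal((4, 20, 10))       # only M = 4 evaluations for L = 10 CVs
hs = rng.standard_normal((4, 20))
print(regularized_weights(Cs, hs, reg=1.0))
```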


5. Conclusion

Focus on how to obtain low variance gradients given a fixed set of control variates!

Combination algorithm to use multiple control variates
