Using Large Ensembles of Control Variates for Variational Inference (NeurIPS 2018)
Abstract
VI : used with “stochastic optimization”
- “gradient’s variance” is important ( high \(\rightarrow\) poor convergence )
- thus, use CONTROL VARIATES
Proposal
- use a **large number of control variates**!
- present a Bayesian risk minimization framework
1. Introduction
VI = “approximate probabilistic inference”
- nowadays, addresses a wider range of problems by adopting a black-box view, based on only evaluating the value or gradient of the target distribution
( optimized via stochastic gradient descent (SGD) )
- use CVs ( control variates ) to reduce the variance of gradient estimates
Proposes “how to use many CVs”
Contributions
- 1) propose a systematic view of “how to generate many existing CVs”
- 2) propose an algorithm to use “multiple CVs simultaneously”
2. Preliminaries
2-1. ELBO
VI = transform inference problem into “optimization”, by…
- decomposing the “marginal likelihood” as below :
\(\log p(x)=\underbrace{\underset{Z \sim q_{w}(Z)}{\mathbb{E}}\left[\log \frac{p(Z, x)}{q_{w}(Z)}\right]}_{\operatorname{ELBO}(w)}+\underbrace{\operatorname{KL}\left(q_{w}(Z) \mid \mid p(Z \mid x)\right)}_{\mathrm{KL}-\text { divergence }}\).
- (target) maximize the ELBO
- target’s gradient :
\(g(w)=\nabla_{w} \operatorname{ELBO}(w)=\nabla_{w} \underset{Z \sim q_{w}(Z)}{\mathbb{E}}\left[\log p(Z, x)-\log q_{w}(Z)\right]\).
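To make this concrete, here is a minimal sketch ( not from the paper; the toy model \(p(z)=\mathcal{N}(0,1)\), \(p(x \mid z)=\mathcal{N}(z,1)\), \(q_w(z)=\mathcal{N}(\mu,\sigma^2)\) and all names are illustrative ) of a reparameterization-trick Monte Carlo estimator of this gradient :

```python
# Hedged sketch: reparameterization estimator of g(w) = grad_w ELBO(w) for a toy
# conjugate model (illustrative, not the paper's setup):
#   p(z) = N(0,1),  p(x|z) = N(z,1),  q_w(z) = N(mu, sigma^2),  w = (mu, log_sigma)
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, size=20)            # toy observed data

def elbo_grad_reparam(mu, log_sigma, n_samples=10):
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=n_samples)
    z = mu + sigma * eps                    # reparameterized samples from q_w
    # d/dz [ log p(z) + sum_i log p(x_i | z) ] = -z + sum_i (x_i - z)
    dlogp_dz = -z + np.sum(x) - len(x) * z
    grad_mu = dlogp_dz                      # chain rule: dz/dmu = 1; entropy has no mu-dependence
    grad_log_sigma = dlogp_dz * sigma * eps + 1.0  # dz/dlog_sigma = sigma*eps; dH[q]/dlog_sigma = 1
    return np.array([grad_mu.mean(), grad_log_sigma.mean()])

print(elbo_grad_reparam(mu=0.0, log_sigma=0.0))   # noisy estimate of the ELBO gradient
```

The variance of such Monte Carlo estimates is exactly what the control variates below are meant to reduce.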
2-2. Control Variates
CV = “r.v. with expectation 0”
- ex) let \(C\) be a CV, and let \(Y=X+aC\), then…
- same mean ) \(E[Y] = E[X + aC] = E[X]\) … \(Y\) has the same expectation as \(X\)
- smaller variance ) with the optimal \(a\), \(\operatorname{Var}(Y)=\operatorname{Var}(X)\left(1-\operatorname{Corr}(X, C)^{2}\right)\)
- optimal : \(a=-\operatorname{Cov}(X, C) / \operatorname{Var}(C)\) ( for \(Y=X+aC\) )
- good CV for \(X\) = \(C\) that is highly correlated with \(X\)
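As a minimal numpy sketch of the mechanics above ( the target \(E[\exp(Z)]\), the choice \(C = Z\), and all names are illustrative assumptions, not from the paper ) :

```python
# Minimal control-variate demo: same mean, lower variance.
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)
x = np.exp(z)                          # estimate E[exp(Z)] for Z ~ N(0,1); true value exp(0.5)
c = z                                  # E[Z] = 0 and Z correlates with exp(Z) -> a valid CV

a = -np.cov(x, c)[0, 1] / np.var(c)    # optimal weight a = -Cov(X,C)/Var(C)
y = x + a * c                          # unchanged mean, reduced variance

print("plain   :", x.mean(), " var:", x.var())
print("with CV :", y.mean(), " var:", y.var())
```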
3. Systematic Generation of Control Variates
recipe for creating control variates!
split ELBO gradient into 4 terms :
$$g(w)=\underbrace{\nabla_{w} \underset{q_{w}}{\mathbb{E}} \log p(x \mid Z)}_{g_{1}(w): \text{ Data term }}+\underbrace{\nabla_{w} \underset{q_{w}}{\mathbb{E}} \log p(Z)}_{g_{2}(w): \text{ Prior term }}-\underbrace{\left.\nabla_{w} \underset{q_{w}}{\mathbb{E}} \log q_{v}(Z)\right|_{v=w}}_{g_{3}(w): \text{ Variational term }}-\underbrace{\left.\nabla_{w} \underset{q_{v}}{\mathbb{E}} \log q_{w}(Z)\right|_{v=w}}_{g_{4}(w): \text{ Score term }}$$
- [ term 1~3 ] influence of \(w\) on the expectation of some function “independent of \(w\)“
- section 3-1 & 3-2
- [ term 4 ] score term ( function “inside” the expectation depends on \(w\) )
- section 3-3
3-1. CV from pairs of estimators
technique for deriving CVs = take the difference between a pair of unbiased estimators of a general term \(t(w)\) ( = \(g_1, g_2, g_3\), or a combination of them ) \(\rightarrow\) the difference has expectation zero, so it is a valid CV
general term : \(t(w)=\nabla_{w} \underset{q_{w}(Z)}{\mathbb{E}}[f(Z)] \quad \text{or} \quad \left.\left(\nabla_{w} \underset{q_{w}(Z)}{\mathbb{E}}\left[f_{v}(Z)\right]\right)\right|_{v=w}\)
example) 5 types of estimator for \(t(w)\)
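As a hedged illustration of this recipe ( the Gaussian \(q_w\), test function \(f(z)=z^2\), and parameter values are assumptions for the example ) : the reparameterization and score-function estimators of the same term \(t(w)=\nabla_{\mu} \mathbb{E}_{q_w}[f(Z)]\) are both unbiased, so their difference has expectation zero and can serve as a CV.

```python
# Two unbiased estimators of the same gradient term; their difference is a CV.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 1.0
f  = lambda z: z ** 2                     # illustrative test function
df = lambda z: 2 * z                      # its derivative

eps = rng.normal(size=100_000)
z = mu + sigma * eps                      # samples from q_w = N(mu, sigma^2)

est_reparam = df(z)                            # reparameterization estimator of d/dmu E[f(Z)]
est_score   = f(z) * (z - mu) / sigma ** 2     # score-function estimator of the same term
cv = est_reparam - est_score                   # control variate: expectation is zero

print("reparam :", est_reparam.mean())         # both approx. d/dmu (mu^2 + sigma^2) = 2*mu
print("score   :", est_score.mean())
print("CV mean :", cv.mean())                  # approx. 0
```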
3-2. CV from approximations
in 3-1… use the fact that “difference of 2 unbiased estimators has expectation zero”
in 3-2… use the fact that “if general term \(t(w)\) is replaced with APPROXIMATION, difference between 2 estimators of the general term still produces a valid CV”
Randomness in the estimators above : due to….
- 1) “distributional sampling error”
- expectations over \(q_w\) are approximated by sampling
- 2) “data subsampling error”
- by drawing a mini-batch
1) Correcting for distributional sampling
- goal : approximate \(f\) with \(\tilde{f}\) to make \(\mathbb{E}[\tilde{f}(Z)]\) easier to estimate
- ex) Paisley et al : approximate the data term with “Taylor approximation in \(z\)“
2) Correcting for data subsampling
- randomness comes from drawing random subsets ( mini-batches ) of the data
- ex) Wang et al : propose to approximate \(f_d(z)\) with a “Taylor expansion in \(x\)”,
thus leading to an approximate data term, \(\tilde{g}_{1}(w)=\nabla_{w} \mathbb{E}_{q_{w}} \mathbb{E}_{D} \tilde{f}_{D}(Z)\)
3-3. CV from the score term \(g_4\)
score term = always zero! ( \(g_4(w)=0\) ) … thus it does not need to be estimated
naive control variate : \(T_4 = \nabla_w \log q_w(Z)\) ( valid CV since \(\underset{q_w}{\mathbb{E}}[\nabla_w \log q_w(Z)] = 0\) )
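A small illustrative sketch ( the integrand \(f\), the Gaussian \(q_w\), and all names are assumptions, not the paper's setup ) : since the score has mean zero under \(q_w\), a weighted multiple of it can be added to a score-function gradient estimator without changing its expectation.

```python
# Using the score T4 = grad_mu log q_w(Z) (mean zero) as a CV for a score-function estimator.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 1.0
z = rng.normal(mu, sigma, size=100_000)

f = np.sin(z)                          # stand-in for log p(x | z)
score = (z - mu) / sigma ** 2          # T4 = grad_mu log q_w(z); E[score] = 0

h = f * score                          # score-function estimator of grad_mu E[f(Z)]
a = -np.cov(h, score)[0, 1] / np.var(score)
h_cv = h + a * score                   # same expectation, lower variance

print("means :", h.mean(), h_cv.mean())
print("vars  :", h.var(), h_cv.var())
```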
4. Combining multiple CVs
to use CV, need to define…
- a base gradient estimator \(h(w) \in \mathbb{R}^D\)
- a set of CVs \(\{c_1,...,c_L\}\) where \(c_i(w) \in \mathbb{R}^D\)
Multiply each \(c_i\) by a weight \(a_i\)… to get the estimator :
- \(\hat{g}(w)=h(w)+\sum_{i=1}^{L} a_{i} c_{i}(w)\)
- in matrix form : \(\hat{g}(w)=h(w)+C(w) a\)
- \(a \in \mathbb{R}^{L}\)
- \(C(w) \in \mathbb{R}^{D \times L}\) : the matrix with \(c_{i}(w)\) as the i-th column
Goal : find \(a\) such that final gradient has low variance!
[Lemma 4.1]
Let \(h(w) \in \mathbb{R}^{D}\) be a random variable and \(C(w) \in \mathbb{R}^{D \times L}\) a matrix of random variables such that each element has mean zero. For \(a \in \mathbb{R}^{L}\), define \(\hat{g}(w)=h(w)+C(w)a\).
\(\rightarrow\) The value of \(a\) that minimizes \(\mathbb{E}\|\hat{g}(w)\|^{2}\) for a given \(w\) is
\(a^{*}(w)=-\underset{p(C, h \mid w)}{\mathbb{E}}\left[C^{T} C\right]^{-1} \mathbb{E}\left[C^{T} h\right]\).
- requires \(\mathbb{E}\left[C^{T} C\right]\) and \(\mathbb{E}\left[C^{T} h\right]\) ( usually not available in closed form )
- solution ) use \(M\) observed gradients \(h_1,...,h_M\) & CVs \(C_1,...,C_M\) to estimate \(a^{*}\) ( see the sketch below )
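A minimal sketch of this plug-in estimate ( function and variable names are illustrative; the random placeholders stand in for real gradient / CV evaluations ) :

```python
# Empirical plug-in estimate of a* = -E[C^T C]^{-1} E[C^T h] from M observed pairs.
import numpy as np

def estimate_weights(C_list, h_list):
    """C_list: M matrices of shape (D, L); h_list: M vectors of shape (D,)."""
    CtC = np.mean([C.T @ C for C in C_list], axis=0)                   # (L, L)
    Cth = np.mean([C.T @ h for C, h in zip(C_list, h_list)], axis=0)   # (L,)
    return -np.linalg.solve(CtC, Cth)

# illustrative usage with random placeholders for the observed gradients / CVs
rng = np.random.default_rng(0)
D, L, M = 10, 4, 8
C_list = [rng.normal(size=(D, L)) for _ in range(M)]
h_list = [rng.normal(size=D) for _ in range(M)]

a_hat = estimate_weights(C_list, h_list)
g_hat = h_list[0] + C_list[0] @ a_hat      # combined estimator  h + C a
```

With few evaluations \(M\) and many CVs \(L\), this plug-in estimate can overfit, which motivates the regularization in 4-1.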
4-1. Bayesian Regularization
“risk minimization” perspective
[Loss Function] for selecting the vector of weights \(a\), when the true parameter is \(\theta\) :
- loss function : \(L(a, \theta)=\underset{C, h \mid \theta}{\mathbb{E}}\|h+C a\|^{2}\)
- decision rule : \(\alpha\left(C_{1}, h_{1}, \ldots, C_{M}, h_{M}\right)\)
- which takes as input a “mini-batch” of \(M\) evaluations of \(h\) and \(C\)
Then, for a pre-specified probabilistic model \(p(C, h, \theta)\)….
- Bayesian regret : \(\text { BayesRegret }(\alpha)=\underset{\theta}{\mathbb{E}} \underset{C_{1}, h_{1}, \ldots, C_{M}, h_{M} \mid \theta}{\mathbb{E}}\left[L\left(\alpha\left(C_{1}, h_{1}, \ldots, C_{M}, h_{M}\right), \theta\right)\right]\).
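The paper derives a closed-form decision rule by minimizing this Bayes regret under a specific probabilistic model; that exact formula is not reproduced here. As an illustrative assumption only, a ridge-style shrinkage of the plug-in estimate conveys the effect of regularizing \(a\) when \(M\) is small relative to \(L\) :

```python
# NOT the paper's formula: an assumed ridge-style regularization of the plug-in weights,
# shrinking a toward 0 when only a few (h_m, C_m) evaluations are available.
import numpy as np

def estimate_weights_regularized(C_list, h_list, lam=1.0):
    L = C_list[0].shape[1]
    CtC = np.mean([C.T @ C for C in C_list], axis=0)
    Cth = np.mean([C.T @ h for C, h in zip(C_list, h_list)], axis=0)
    return -np.linalg.solve(CtC + lam * np.eye(L), Cth)   # lam > 0 => shrink toward a = 0
```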
5. Conclusion
Focus on how to obtain low variance gradients given a fixed set of control variates!
Combination algorithm to use multiple control variates