Sparse Variational Inference : Bayesian Coresets from Scratch (NeurIPS 2019)


Abstract

Recent work on Bayesian coresets :

  • compress the data before running the inference algorithm!
  • goal : both (1) scalability & (2) guarantees on posterior approximation error


Proposition :

  • formulate coreset construction as “sparsity-constrained VI within an exponential family”


1. Introduction

Bayesian Inference is typically intractable : use MCMC or VI

Both methods were originally developed on a per-model basis.

But developments in automated tools have made them accessible to practitioners

  • ex) automatic differentiation, black-box gradient estimates, Hamiltonian transition kernels

  • but that’s not enough! inference should also be computationally scalable!


Bayesian Coreset ( = core of a dataset )

  • alternative approach
  • (assumption) large datasets contain redundant data
  • computationally inexpensive & simple


Proposition :

provide a new formulation of coreset construction as exponential family VI with sparsity constraint


2. Background

Target density ( the posterior we want to approximate ) : \(\pi(\theta)\)

  • decompose into \(N\) potentials : \(\left(f_{n}(\theta)\right)_{n=1}^{N}\)
  • base density : \(\pi_{0}(\theta)\)

Decomposition : \(\pi(\theta):=\frac{1}{Z} \exp \left(\sum_{n=1}^{N} f_{n}(\theta)\right) \pi_{0}(\theta)\)
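To make the decomposition concrete, here is a minimal numpy/scipy sketch for a toy model that is not from the paper ( i.i.d. Gaussian observations with unknown mean ) : each potential \(f_n\) is one observation’s log-likelihood, and the base density \(\pi_0\) is the prior.

```python
import numpy as np
from scipy.stats import norm

# Toy setup (assumed, not from the paper): observations x_n ~ N(theta, 1), prior theta ~ N(0, 1)
np.random.seed(0)
N = 1000
x = np.random.randn(N) + 2.0                     # data with true mean 2

def f_n(n, theta):
    # the n-th potential: log-likelihood of the n-th observation
    return norm.logpdf(x[n], loc=theta, scale=1.0)

def log_pi_unnormalized(theta):
    # log pi(theta) up to log Z : sum of all N potentials plus the log base density pi_0
    return np.sum(norm.logpdf(x, loc=theta, scale=1.0)) + norm.logpdf(theta, 0.0, 1.0)

print(log_pi_unnormalized(2.0))
```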


Often intractable to compute expectations under \(\pi\) exactly…

To reduce the cost of MCMC… run it on a small, weighted subset of the data instead = a “Bayesian Coreset”


WEIGHTS : sparse vector \(w \in \mathbb{R}_{\geq 0}^{N}\) ( only \(M \ll N\) entries are nonzero )

Approximate the full log-density with a \(w\)-reweighted sum

  • \(\pi_{w}(\theta):=\frac{1}{Z(w)} \exp \left(\sum_{n=1}^{N} w_{n} f_{n}(\theta)\right) \pi_{0}(\theta)\),

    where only \(M \ll N\) of the weights are nonzero ( \(\|w\|_{0}:=\sum_{n=1}^{N} \mathbb{1}\left[w_{n}>0\right] \leq M\) )

  • \(\pi_{1}=\pi\) : full density

  • If \(M \ll N\), evaluating a function proportional to \(\pi_{w}\) is much less expensive than evaluating the original \(\pi\) ( see the sketch after this list )

  • Goal : find a set of weights \(w\) that

    • 1) makes \(\pi_w\) as close as possible to \(\pi\) ( for accuracy )
    • 2) maintains sparsity ( for speed )
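
A small sketch of the cost saving referenced above ( same assumed toy Gaussian model ) : the unnormalized coreset log-density only touches the \(M\) potentials with nonzero weight.

```python
import numpy as np
from scipy.stats import norm

# Assumed toy model again: x_n ~ N(theta, 1), prior theta ~ N(0, 1)
np.random.seed(0)
N = 1000
x = np.random.randn(N) + 2.0

def log_pi_w_unnormalized(theta, w):
    # log pi_w(theta) up to log Z(w): only the M nonzero-weight potentials are evaluated,
    # so each density evaluation costs O(M) instead of O(N)
    idx = np.flatnonzero(w)                                        # active set, |idx| = M << N
    return np.sum(w[idx] * norm.logpdf(x[idx], loc=theta, scale=1.0)) \
        + norm.logpdf(theta, 0.0, 1.0)                             # base density pi_0

w = np.zeros(N)
w[[3, 17, 256]] = N / 3.0       # a 3-point coreset whose weights sum to N
print(log_pi_w_unnormalized(2.0, w))
```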


Earlier approach : turn this into a Sparse Regression Problem :

  • \(w^{\star}=\underset{w \in \mathbb{R}^{N}}{\arg \min } \; \mathbb{E}_{\hat{\pi}}\left[\left(\sum_{n=1}^{N} g_{n}-\sum_{n=1}^{N} w_{n} g_{n}\right)^{2}\right] \quad \text { s.t. } \quad w \geq 0, \; \|w\|_{0} \leq M\)

    where \(g_{n}:=\left(f_{n}-\mathbb{E}_{\hat{\pi}}\left[f_{n}\right]\right)\)


Using a Monte Carlo ( MC ) approximation ( a numpy sketch follows this list ) :

  • \(w^{\star}=\underset{w \in \mathbb{R}^{N}}{\arg \min } \left\| \sum_{n=1}^{N} \hat{g}_{n}-\sum_{n=1}^{N} w_{n} \hat{g}_{n} \right\|_{2}^{2} \quad \text { s.t. } \quad w \geq 0, \; \|w\|_{0} \leq M\)
  • where
    • \(\left(\theta_{s}\right)_{s=1}^{S} \stackrel{\text { i.i.d. }}{\sim} \hat{\pi}\).
    • \(\hat{g}_{n}=\sqrt{S}^{-1}\left[g_{n}\left(\theta_{1}\right)-\bar{g}_{n}, \ldots, g_{n}\left(\theta_{S}\right)-\bar{g}_{n}\right]^{T} \in \mathbb{R}^{S}\), with \(\bar{g}_{n}:=\frac{1}{S} \sum_{s=1}^{S} g_{n}\left(\theta_{s}\right)\)
  • requires the selection of a weighting distribution \(\hat{\pi}\) ( the quality of the coreset depends on how well \(\hat{\pi}\) approximates \(\pi\) )
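
A numpy sketch of this MC construction, assuming ( for illustration only ) that the weighting distribution \(\hat{\pi}\) is simply the prior of the toy Gaussian model :

```python
import numpy as np
from scipy.stats import norm

# Assumed toy setup: x_n ~ N(theta, 1); hat_pi taken to be the N(0, 1) prior for illustration
np.random.seed(0)
N, S = 1000, 100
x = np.random.randn(N) + 2.0
theta_s = np.random.randn(S)                                   # (theta_s) ~ hat_pi, i.i.d.

# F[n, s] = f_n(theta_s); center each row and scale by 1/sqrt(S) to get the g_hat_n vectors
F = norm.logpdf(x[:, None], loc=theta_s[None, :], scale=1.0)   # shape (N, S)
g_hat = (F - F.mean(axis=1, keepdims=True)) / np.sqrt(S)

def regression_objective(w):
    # || sum_n g_hat_n - sum_n w_n g_hat_n ||_2^2
    residual = g_hat.sum(axis=0) - w @ g_hat
    return residual @ residual

print(regression_objective(np.zeros(N)))   # objective value for an empty coreset
```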


3. Bayesian Coresets from scratch

new formulation of Bayesian coreset construction!


3-1. Sparse exponential family VI

treat coreset construction as a sparse VI problem!

  • \(w^{\star}=\underset{w \in \mathbb{R}^{N}}{\arg \min } \quad \mathrm{D}_{\mathrm{KL}}\left(\pi_{w} \mid \pi_{1}\right) \quad \text { s.t. } \quad w \geq 0, \; \|w\|_{0} \leq M\)
  • \(\mathrm{D}_{\mathrm{KL}}\left(\pi_{w} \mid \pi_{1}\right)=\log Z(1)-\log Z(w)-\sum_{n=1}^{N}\left(1-w_{n}\right) \mathbb{E}_{w}\left[f_{n}(\theta)\right]\) ……… (a)
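
For completeness, (a) follows in one line from the definitions of \(\pi_w\) and \(\pi_{1}=\pi\) ( the base density \(\pi_0\) cancels ) :

\[
\begin{aligned}
\mathrm{D}_{\mathrm{KL}}\left(\pi_{w} \mid \pi_{1}\right)
&=\mathbb{E}_{w}\left[\log \pi_{w}(\theta)-\log \pi_{1}(\theta)\right] \\
&=\mathbb{E}_{w}\left[\sum_{n=1}^{N} w_{n} f_{n}(\theta)-\log Z(w)-\sum_{n=1}^{N} f_{n}(\theta)+\log Z(1)\right] \\
&=\log Z(1)-\log Z(w)-\sum_{n=1}^{N}\left(1-w_{n}\right) \mathbb{E}_{w}\left[f_{n}(\theta)\right]
\end{aligned}
\]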


problems with (a) ?

  • (1) \(Z(w)\) is unknown
  • (2) computing the objective requires taking expectations under \(\pi_w\)


Propose New Algorithm!

  • coresets from a sparse subset of an exponential family
  • ( let \(f(\theta):=\left[f_{1}(\theta), \ldots, f_{N}(\theta)\right]^{T}\) )
  • \(\pi_{w}(\theta):=\exp \left(w^{T} f(\theta)-\log Z(w)\right) \pi_{0}(\theta)\).
    • properties of exponential family : \(\mathbb{E}_{w}[f(\theta)]=\nabla_{w} \log Z(w)\). …….. (b)
  • Using (b)…
    • before ) \(w^{\star}=\underset{w \in \mathbb{R}^{N}}{\arg \min } \quad \log Z(1)-\log Z(w)-\sum_{n=1}^{N}\left(1-w_{n}\right) \mathbb{E}_{w}\left[f_{n}(\theta)\right]\)
    • after ) \(w^{\star}=\underset{w \in \mathbb{R}^{N}}{\arg \min } \log Z(1)-\log Z(w)-(1-w)^{T} \nabla_{w} \log Z(w)\).


Derivative of objective function ( KL-div )

  • \(\nabla_{w} \mathrm{D}_{\mathrm{KL}}\left(\pi_{w} \mid \pi_{1}\right)=-\nabla_{w}^{2} \log Z(w)(1-w)=-\operatorname{Cov}_{w}\left[f, f^{T}(1-w)\right]\).

  • That is..

    • increasing \(w_n\) by a small amount decreases \(\mathrm{D}_{\mathrm{KL}}\left(\pi_{w} \mid \pi_{1}\right)\)

      ( by an amount proportional to the covariance of the n\(^{th}\) potential \(f_{n}(\theta)\) with the residual error \(\sum_{n=1}^{N} f_{n}(\theta)-\sum_{n=1}^{N} w_{n} f_{n}(\theta)\) under \(\pi_{w}\) )
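
The covariance form follows from the standard exponential family identities \(\nabla_{w} \log Z(w)=\mathbb{E}_{w}[f]\) and \(\nabla_{w}^{2} \log Z(w)=\operatorname{Cov}_{w}\left[f, f^{T}\right]\) :

\[
\nabla_{w} \mathrm{D}_{\mathrm{KL}}\left(\pi_{w} \mid \pi_{1}\right)
=-\nabla_{w} \log Z(w)-\nabla_{w}\left[(1-w)^{T} \nabla_{w} \log Z(w)\right]
=-\nabla_{w}^{2} \log Z(w)(1-w),
\]

since \(\nabla_{w}\left[(1-w)^{T} \nabla_{w} \log Z(w)\right]=-\nabla_{w} \log Z(w)+\nabla_{w}^{2} \log Z(w)(1-w)\).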


3-2. Greedy Selection

Naive approach : select the potential that provides the largest local decrease in KL-div

( = the one with the largest covariance — normalized to a correlation — with the residual error )

\(n^{\star}=\underset{n \in[N]}{\arg \max } \begin{cases}\left|\operatorname{Corr}_{w}\left[f_{n}, f^{T}(1-w)\right]\right| & w_{n}>0 \\ \operatorname{Corr}_{w}\left[f_{n}, f^{T}(1-w)\right] & w_{n}=0\end{cases}\)

( when \(w_{n}=0\) the weight can only increase, so only a positive correlation helps )


use MC estimates via sampling from \(\pi_w\) ( cheap, since \(\pi_w\) is a coreset posterior over only \(M\) points )

\(\widehat{\mathrm{Corr}}=\operatorname{diag}\left[\frac{1}{S} \sum_{s=1}^{S} \hat{g}_{s} \hat{g}_{s}^{T}\right]^{-\frac{1}{2}}\left(\frac{1}{S} \sum_{s=1}^{S} \hat{g}_{s} \hat{g}_{s}^{T}(1-w)\right)\),

where \(\hat{g}_{s}:=\left[f_{1}\left(\theta_{s}\right), \ldots, f_{N}\left(\theta_{s}\right)\right]^{T}-\frac{1}{S} \sum_{r=1}^{S}\left[f_{1}\left(\theta_{r}\right), \ldots, f_{N}\left(\theta_{r}\right)\right]^{T}\)
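
A minimal numpy sketch of the selection rule, assuming the samples \(\theta_{s} \sim \pi_{w}\) have already been drawn and the potential evaluations collected into a matrix \(F\) ( a hypothetical helper, not the paper’s code ) :

```python
import numpy as np

def greedy_select(F, w):
    """Select the next potential: largest (signed) correlation with the residual error.

    F : (N, S) array with F[n, s] = f_n(theta_s), where theta_s ~ pi_w
    w : (N,) current coreset weights
    """
    S = F.shape[1]
    g_hat = F - F.mean(axis=1, keepdims=True)              # centered potential evaluations
    var = np.mean(g_hat ** 2, axis=1) + 1e-12              # diag of the MC covariance estimate
    cov_resid = g_hat @ (g_hat.T @ (1.0 - w)) / S          # MC estimate of Cov_w[f_n, f^T (1-w)]
    corr = cov_resid / np.sqrt(var)                        # correlation up to a factor common to all n
    score = np.where(w > 0, np.abs(corr), corr)            # |.| if w_n > 0, signed if w_n = 0
    return int(np.argmax(score))

# toy usage: 5 potentials, 100 samples, empty coreset
rng = np.random.default_rng(0)
print(greedy_select(rng.normal(size=(5, 100)), np.zeros(5)))
```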


3-3. Weight Update

after selecting the new potential function \(n^{\star}\)…

  • add it to the active set of indices \(\mathcal{I} \subseteq[N]\)

  • Update the weights, by optimizing

    \(w^{\star}=\underset{v \in \mathbb{R}^{N}}{\arg \min } \mathrm{D}_{\mathrm{KL}}\left(\pi_{v} \mid \pi\right) \quad \text { s.t. } \quad v \geq 0, \quad\left(1-1_{\mathcal{I}}\right)^{T} v=0\).

  • MC estimate (\(D\)) of gradient \(\nabla_{w} \mathrm{D}_{\mathrm{KL}}\left(\pi_{w} \mid \pi_{1}\right)\)

    \(D:=-\frac{1}{S} \sum_{s=1}^{S} \hat{g}_{s} \hat{g}_{s}^{T}(1-w) \in \mathbb{R}^{N}\).


Stochastic Gradient step : \(w_{n} \leftarrow w_{n}-\gamma_{t} D_{n}\) for \(n \in \mathcal{I}\) ( the inactive weights stay at zero )
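
A matching numpy sketch of one stochastic gradient step restricted to the active set ( the final projection onto \(w \geq 0\) is an added assumption to respect the nonnegativity constraint ) :

```python
import numpy as np

def weight_update_step(F, w, active, gamma_t):
    """One stochastic gradient step on the coreset weights, restricted to the active set.

    F       : (N, S) array, F[n, s] = f_n(theta_s) with fresh samples theta_s ~ pi_w
    w       : (N,) current weights
    active  : indices in the active set I; all other weights stay at zero
    gamma_t : learning rate
    """
    S = F.shape[1]
    g_hat = F - F.mean(axis=1, keepdims=True)
    D = -(g_hat @ (g_hat.T @ (1.0 - w))) / S        # MC estimate of grad_w KL(pi_w | pi_1)
    w_new = w.copy()
    w_new[active] -= gamma_t * D[active]            # move only the active weights
    return np.maximum(w_new, 0.0)                   # keep w >= 0 (assumed projection)

# toy usage: update a 3-point active set
rng = np.random.default_rng(1)
N, S = 50, 200
F = rng.normal(size=(N, S))
w = np.zeros(N)
active = [4, 10, 33]
w[active] = 1.0
print(weight_update_step(F, w, active, gamma_t=0.1)[active])
```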


Algorithm.

( figure 2 : the full construction algorithm — alternate greedy selection and weight update until the coreset reaches \(M\) points )


4. Conclusion

Exploit the fact that “coreset posteriors form an exponential family”

Introduce Sparse VI for Bayesian coreset construction!
