Variational Bayesian Monte Carlo (NeurIPS 2018)

Abstract

Modern models often have expensive, black-box likelihoods \(\rightarrow\) standard Bayesian inference becomes impractical

Thus, the paper proposes VBMC (Variational Bayesian Monte Carlo):

  • novel sample-efficient inference framework
  • (1) Variational Inference + (2) GP-based active-sampling Bayesian quadrature


1. Introduction

Bayesian inference : quantifies parameter & model uncertainty by computing the posterior distn over params & the model evidence

  • standard approaches : MCMC & VI

  • posterior : \(p(\boldsymbol{x} \mid \mathcal{D})=\frac{p(\mathcal{D} \mid \boldsymbol{x}) p(\boldsymbol{x})}{p(\mathcal{D})}\).
  • evidence : \(p(\mathcal{D})=\int p(\mathcal{D} \mid \boldsymbol{x}) p(\boldsymbol{x}) d \boldsymbol{x}\).
  • \(p(\mathcal{D} \mid \boldsymbol{x})\) : likelihood of the model ( black-box & expensive function )
  • \(p(\boldsymbol{x})\) : prior


Introduce VBMC (Variational Bayesian Monte Carlo)

  • (1) Variational Inference + (2) GP-based active-sampling Bayesian quadrature
  • simultaneous approximation of the posterior & of the model evidence in a sample-efficient manner


2. Theoretical Background

2-1. Variational Inference

KL-divergence : \(\mathrm{KL}\left[q_{\phi}(x) \mid \mid p(x \mid \mathcal{D})\right]=\mathbb{E}_{\phi}\left[\log \frac{q_{\phi}(x)}{p(x \mid \mathcal{D})}\right]\)

Log model evidence : \(\log p(\mathcal{D})=\mathcal{F}\left[q_{\phi}\right]+\operatorname{KL}\left[q_{\phi}(x) \mid \mid p(x \mid \mathcal{D})\right]\)

ELBO : \(\mathcal{F}\left[q_{\phi}\right]=\mathbb{E}_{\phi}\left[\log \frac{p(\mathcal{D} \mid x) p(x)}{q_{\phi}(x)}\right]=\mathbb{E}_{\phi}[f(x)]+\mathcal{H}\left[q_{\phi}(x)\right]\)

  • \(q\) : chosen from a tractable family ( ex. the factorized “mean-field” posterior )

    which provides closed-form update equations for a coordinate-ascent algorithm

  • \(f(x)\) : expensive black-box function
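
A one-line derivation of the decomposition above ( using Bayes' rule \(p(x \mid \mathcal{D}) = p(\mathcal{D} \mid x)\,p(x)/p(\mathcal{D})\) ):

\[\log p(\mathcal{D})=\mathbb{E}_{\phi}\left[\log \frac{p(\mathcal{D} \mid x)\, p(x)}{p(x \mid \mathcal{D})}\right]=\underbrace{\mathbb{E}_{\phi}\left[\log \frac{p(\mathcal{D} \mid x)\, p(x)}{q_{\phi}(x)}\right]}_{\mathcal{F}\left[q_{\phi}\right]}+\underbrace{\mathbb{E}_{\phi}\left[\log \frac{q_{\phi}(x)}{p(x \mid \mathcal{D})}\right]}_{\mathrm{KL}\left[q_{\phi}(x) \mid\mid p(x \mid \mathcal{D})\right]}\]

since \(\mathrm{KL} \geq 0\), the ELBO \(\mathcal{F}[q_{\phi}]\) is a lower bound on the log model evidence, tight when \(q_{\phi}\) equals the true posterior.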


2-2. Bayesian Quadrature

Also known as Bayesian Monte Carlo

  • means to obtain Bayesian estimates of mean & var of non-analytical integrals

    of the form \(\langle f\rangle=\int f(\boldsymbol{x}) \pi(\boldsymbol{x}) d \boldsymbol{x}\)

  • \(f\) : function of interest ( modeled with a GP prior )

  • \(\pi\) : known pdf


GP (Gaussian Process)

  • prior distn over unknown functions

  • defined by (1) mean & (2) kernel function

  • common choice : ( Gaussian kernel )

    \[\kappa\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)=\sigma_{f}^{2} \mathcal{N}\left(\boldsymbol{x} ; \boldsymbol{x}^{\prime}, \boldsymbol{\Sigma}_{\ell}\right),\]

    with \(\boldsymbol{\Sigma}_{\ell}=\operatorname{diag}\left[\ell^{(1)^{2}}, \ldots, \ell^{(D)^{2}}\right]\)
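
A minimal numpy sketch of this kernel in its “normalized” Gaussian form ( function name and array shapes are my own choices, not from the paper ):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma_f, ell):
    """k(x, x') = sigma_f^2 * N(x; x', diag(ell^2)) for row-wise inputs.

    X1: (n1, D), X2: (n2, D), ell: (D,) length scales, sigma_f: output scale.
    """
    D = X1.shape[1]
    diff = (X1[:, None, :] - X2[None, :, :]) / ell          # scaled differences
    sq_dist = np.sum(diff ** 2, axis=-1)                     # (n1, n2)
    norm = sigma_f ** 2 / ((2 * np.pi) ** (D / 2) * np.prod(ell))
    return norm * np.exp(-0.5 * sq_dist)
```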


conditioned on training data \(\boldsymbol{\Xi}=\{\mathbf{X}, \boldsymbol{y}\}\) , where \(\mathbf{X}=\left\{x_{1}, \ldots, x_{n}\right\}\) are the inputs and \(\boldsymbol{y}\) the observed function values, the GP posterior has latent posterior conditional mean & covariance :

  • mean ) \(\bar{f}_{\Xi}(\boldsymbol{x}) \equiv \bar{f}(\boldsymbol{x} ; \boldsymbol{\Xi}, \psi)\)
  • cov ) \(C_{\Xi}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) \equiv C\left(\boldsymbol{x}, \boldsymbol{x}^{\prime} ; \boldsymbol{\Xi}, \psi\right)\)


Bayesian Integration

Posterior mean & Variance of \(\int f(x) \pi(x) d x\) :

  • mean ) \(\mathbb{E}_{f \mid \Xi}[\langle f\rangle]=\int \bar{f}_{\Xi}(x) \pi(x) d x\)
  • variance ) \(\mathbb{V}_{f \mid \Xi}[\langle f\rangle]=\iint C_{\Xi}\left(x, x^{\prime}\right) \pi(x) d x \pi\left(x^{\prime}\right) d x^{\prime}\)

if the GP has a Gaussian kernel & \(\pi\) is a Gaussian or mixture of Gaussians \(\rightarrow\) both integrals can be computed analytically
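
As an illustration, a hedged numpy/scipy sketch of the closed-form case with a single Gaussian \(\pi(x)=\mathcal{N}(x ; \mu_\pi, \Sigma_\pi)\) and a zero-mean GP ( the mixture case just sums these terms over components; names and the jitter value are my own ):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def bq_posterior_moments(X, y, sigma_f, ell, mu_pi, Sigma_pi, jitter=1e-8):
    """Posterior mean & variance of int f(x) pi(x) dx under a zero-mean GP
    with kernel k(x, x') = sigma_f^2 N(x; x', diag(ell^2))."""
    n, D = X.shape
    Sigma_l = np.diag(ell ** 2)
    # Gram matrix of the training inputs (plus jitter for numerical stability)
    K = np.array([[sigma_f ** 2 * mvn.pdf(X[i], mean=X[j], cov=Sigma_l)
                   for j in range(n)] for i in range(n)]) + jitter * np.eye(n)
    # z_i = int k(x, x_i) pi(x) dx  -- a Gaussian convolution, hence analytic
    z = sigma_f ** 2 * np.array(
        [mvn.pdf(X[i], mean=mu_pi, cov=Sigma_l + Sigma_pi) for i in range(n)])
    mean = z @ np.linalg.solve(K, y)                    # E_{f|Xi}[<f>]
    # int int k(x, x') pi(x) pi(x') dx dx'  (also analytic)
    prior_term = sigma_f ** 2 * mvn.pdf(np.zeros(D), mean=np.zeros(D),
                                        cov=Sigma_l + 2 * Sigma_pi)
    var = prior_term - z @ np.linalg.solve(K, z)        # V_{f|Xi}[<f>]
    return mean, var
```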


Active sampling

given a budget of \(n_{\max }\) function evaluations…

  • fixed GP hyperparams \(\psi\) : the optimal variance-minimizing design does not depend on the observed function values, so it can be precomputed
  • if \(\psi\) is updated online : the posterior variance does depend on the function values, so an active sampling strategy is desirable

acquisition function \(a(\boldsymbol{x})\) : determines which point to evaluate next

  • \(x_{\text {next }}=\operatorname{argmax}_{x} a(x)\).


3. Variational Bayesian Monte Carlo ( VBMC )

Simple Algorithm

  • 1) actively sample new evaluations of the log joint
  • 2) retrain the GP surrogate model
  • 3) update the variational posterior approximation ( \(\phi_t\) ) by maximizing the ELBO
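
A minimal sketch of this loop ( helper names like `active_sample`, `fit_gp`, `optimize_elbo` are hypothetical placeholders, not the reference implementation's API ):

```python
def vbmc(log_joint, x0, plb, pub, n_max):
    # log_joint: expensive black-box f(x) = log p(D|x) + log p(x)
    X, y = initialize(log_joint, x0, plb, pub)   # n_init points in [PLB, PUB]
    phi = init_variational_posterior(x0)         # K = 2 components near x0
    gp = fit_gp(X, y)                            # GP surrogate of the log joint
    for t in range(n_max):
        # 1) actively sample a new evaluation of the (expensive) log joint
        x_next = active_sample(gp, phi)
        X, y = X + [x_next], y + [log_joint(x_next)]
        # 2) retrain the GP surrogate (hyperparameters via slice sampling)
        gp = fit_gp(X, y)
        # 3) re-optimize the variational posterior by maximizing the ELBO
        phi, elbo, elbo_sd = optimize_elbo(gp, phi)
        if converged(elbo, elbo_sd, phi):
            break
    return phi, elbo, elbo_sd
```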


3-1. Variational Posterior

choose the variational posterior \(q(x)\) from a flexible “nonparametric” family : a mixture of \(K\) Gaussians

  • \(q(\boldsymbol{x}) \equiv q_{\boldsymbol{\phi}}(\boldsymbol{x})=\sum_{k=1}^{K} w_{k} \mathcal{N}\left(\boldsymbol{x} ; \boldsymbol{\mu}_{k}, \sigma_{k}^{2} \boldsymbol{\Sigma}\right)\)

    with shared diagonal covariance \(\boldsymbol{\Sigma} \equiv \operatorname{diag}\left[\lambda^{(1)^{2}}, \ldots, \lambda^{(D)^{2}}\right]\)

  • variational parameters : \(\phi \equiv\left(w_{1}, \ldots, w_{K}, \mu_{1}, \ldots, \mu_{K}, \sigma_{1}, \ldots, \sigma_{K}, \lambda\right)\)

    ( = \(K(D+2)+D\) parameters in total )
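
A small numpy/scipy sketch of evaluating and sampling this mixture ( the reparameterized sampler is what later allows propagating gradients through the entropy estimate; assumed shapes: `w` is `(K,)`, `mu` is `(K, D)`, `sigma` is `(K,)`, `lam` is `(D,)` ):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def q_pdf(x, w, mu, sigma, lam):
    """q(x) = sum_k w_k N(x; mu_k, sigma_k^2 * diag(lam^2))."""
    Sigma = np.diag(lam ** 2)
    return sum(w_k * mvn.pdf(x, mean=mu_k, cov=(s_k ** 2) * Sigma)
               for w_k, mu_k, s_k in zip(w, mu, sigma))

def q_sample(n, w, mu, sigma, lam, rng=np.random.default_rng(0)):
    """Reparameterized samples: x = mu_k + sigma_k * lam * eps, eps ~ N(0, I)."""
    ks = rng.choice(len(w), size=n, p=w)              # pick mixture components
    eps = rng.standard_normal((n, len(lam)))
    return mu[ks] + sigma[ks, None] * lam[None, :] * eps
```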


3-2. ELBO

\(\mathcal{F}\left[q_{\phi}\right]=\mathbb{E}_{\phi}\left[\log \frac{p(\mathcal{D} \mid x) p(x)}{q_{\phi}(x)}\right]=\mathbb{E}_{\phi}[f(x)]+\mathcal{H}\left[q_{\phi}(x)\right]\).

[ The two terms of the ELBO are approximated separately ]

(1) approximate the log joint probability \(f\) ( so that \(\mathbb{E}_{\phi}[f]\) can be computed via Bayesian quadrature )

  • with a GP(Gaussian Process)

    • kernel ) with a Gaussian Kernel

    • mean ) negative quadratic mean function

      \(m(\boldsymbol{x})=m_{0}-\frac{1}{2} \sum_{i=1}^{D} \frac{\left(x^{(i)}-x_{\mathrm{m}}^{(i)}\right)^{2}}{\omega^{(i)^{2}}}\).

      ( unlike a constant mean function, this ensures that the posterior GP predictive mean \(\bar{f}\) corresponds to a proper log probability distribution )

(2) approximate Entropy \(\mathcal{H}\left[q_{\phi}(x)\right]\)

  • via simple Monte Carlo sampling ( gradients propagated through the samples via the reparameterization trick )

with (1) & (2), optimize the (negative mean) ELBO via stochastic gradient descent
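
A rough sketch of this stochastic (negative) ELBO estimate, reusing `q_pdf` / `q_sample` from the sketch in 3-1; note that in VBMC itself \(\mathbb{E}_{\phi}[f]\) is computed analytically via Bayesian quadrature rather than by Monte Carlo as done here, and `gp_posterior_mean` is an assumed surrogate callable:

```python
import numpy as np

def neg_elbo(phi, gp_posterior_mean, n_mc=100, rng=np.random.default_rng(1)):
    """Monte Carlo estimate of the negative ELBO using the GP surrogate."""
    w, mu, sigma, lam = phi
    xs = q_sample(n_mc, w, mu, sigma, lam, rng)           # samples from q_phi
    # (1) expected log joint under q, with the GP posterior mean standing in for f
    expected_log_joint = np.mean(gp_posterior_mean(xs))
    # (2) entropy of q via Monte Carlo: H[q] = -E_q[log q(x)]
    entropy = -np.mean([np.log(q_pdf(x, w, mu, sigma, lam)) for x in xs])
    return -(expected_log_joint + entropy)
```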


ELCBO ( Evidence Lower Confidence Bound )

\(\operatorname{ELCBO}(\phi, f)=\mathbb{E}_{f \mid \Xi}\left[\mathbb{E}_{\phi}[f]\right]+\mathcal{H}\left[q_{\phi}\right]-\beta_{\mathrm{LCB}} \sqrt{\mathbb{V}_{f \mid \Xi}\left[\mathbb{E}_{\phi}[f]\right]}\).

  • last term : the uncertainty (posterior standard deviation) in the computation of the expected log joint, multiplied by a risk-sensitivity parameter \(\beta_{\mathrm{LCB}}\)
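
In code, this is just a lower-confidence-bound penalty on the ELBO estimate ( a sketch; `beta_lcb` is the risk-sensitivity parameter ):

```python
import numpy as np

def elcbo(exp_log_joint_mean, exp_log_joint_var, entropy, beta_lcb):
    """Evidence lower confidence bound: penalize uncertain ELBO estimates."""
    return exp_log_joint_mean + entropy - beta_lcb * np.sqrt(exp_log_joint_var)
```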


3-3. Active Sampling

Why? active sampling is used to compute a sequence of integrals \(\mathbb{E}_{\phi_{1}}[f], \mathbb{E}_{\phi_{2}}[f], \ldots, \mathbb{E}_{\phi_{T}}[f]\) across iterations \(1, \ldots, T\), such that

  • 1) the sequence of variational parameters \(\phi_t\) converges to the variational posterior that minimizes the KL divergence from the true posterior
  • 2) we have minimum variance on our final estimate of the ELBO


2 acquisition functions for VBMC

  • 1) Vanilla uncertainty sampling : \(a_{\text {us }}(x)=V_{\Xi}(x) q_{\phi}(x)^{2}\)
    • maximizes the variance of the integrand under the current variational parameters
    • lacks exploration
  • 2) Prospective uncertainty sampling : \(a_{\mathrm{pro}}(\boldsymbol{x})=V_{\Xi}(\boldsymbol{x}) q_{\phi}(\boldsymbol{x}) \exp \left(\bar{f}_{\Xi}(\boldsymbol{x})\right)\)
    • promotes exploration
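
A hedged numpy sketch of both acquisition functions, maximized here by simple search over a candidate set ( in practice any global/local optimizer could be used; argument names are my own ):

```python
import numpy as np

def a_us(gp_var, q_vals):
    """Vanilla uncertainty sampling: GP posterior variance x q(x)^2."""
    return gp_var * q_vals ** 2

def a_pro(gp_var, q_vals, gp_mean):
    """Prospective uncertainty sampling: also up-weights regions where the
    surrogate expects high posterior density (promotes exploration)."""
    return gp_var * q_vals * np.exp(gp_mean)

def choose_next(candidates, gp_var, q_vals, gp_mean, prospective=True):
    acq = a_pro(gp_var, q_vals, gp_mean) if prospective else a_us(gp_var, q_vals)
    return candidates[np.argmax(acq)]
```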


3-4. Adaptive treatment of GP hyperparameters

GP in VBMC has \(3 D+3\) hyperparams, \(\psi=\left(\ell, \sigma_{f}, \sigma_{\text {obs }}, m_{0}, x_{\mathrm{m}}, \omega\right)\)

  • impose empirical Bayes prior on GP hyperparams, based on current training set

  • sample from the posterior over hyperparams via slice sampling


Given (hyperparameter) samples \(\{\psi\} \equiv\left\{\psi_{1}, \ldots, \psi_{n_{\mathrm{gp}}}\right\}\) , the expected mean & variance of a quantity of interest \(\chi\) ( e.g. the expected log joint ) are :

  • mean ) \(\mathbb{E}[\chi \mid\{\psi\}]=\frac{1}{n_{\mathrm{gp}}} \sum_{j=1}^{n_{\mathrm{gp}}} \mathbb{E}\left[\chi \mid \psi_{j}\right]\)
  • variance ) \(\mathbb{V}[\chi \mid\{\psi\}]=\frac{1}{n_{\mathrm{gp}}} \sum_{j=1}^{n_{\mathrm{gp}}} \mathbb{V}\left[\chi \mid \psi_{j}\right]+\operatorname{Var}\left[\left\{\mathbb{E}\left[\chi \mid \psi_{j}\right]\right\}_{j=1}^{n_{\mathrm{gp}}}\right]\)
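
Equivalently, in code ( a small sketch; `means[j]` and `variances[j]` hold \(\mathbb{E}\left[\chi \mid \psi_{j}\right]\) and \(\mathbb{V}\left[\chi \mid \psi_{j}\right]\) ):

```python
import numpy as np

def combine_over_gp_samples(means, variances):
    """Law-of-total-variance combination over GP hyperparameter samples."""
    means, variances = np.asarray(means), np.asarray(variances)
    mean = means.mean()                      # average of per-sample means
    var = variances.mean() + means.var()     # avg variance + variance of means
    return mean, var
```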


3-5. Initialization and Warm-up

initialized with a starting point \(x_0\) & plausible lower / upper bounds ( PLB, PUB )

  • additional points are generated uniformly at random inside the plausible box \([\mathrm{PLB}, \mathrm{PUB}]\) ( \(n_{init}=10\) )


Warm-up

  • initialize variational posterior with \(K=2\) components in the vicinity of \(x_0\)

    ( with small values of \(\sigma_1\), \(\sigma_2\), \(\lambda\) )

  • in warm-up mode…

    • VBMC quickly improves the ELBO by moving to regions of higher posterior probability
    • collect at most \(n_{gp}=8\) hyperparam samples
  • at the end of warm-up…

    trim the training set by removing points whose observed log joint is more than \(10 \cdot D\) below the maximum value \(y_{max}\)
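
The trimming step is straightforward; a small sketch ( array shapes are my own choice ):

```python
import numpy as np

def trim_training_set(X, y, D, drop_threshold=10.0):
    """Keep only points whose observed log joint is within 10*D of the best."""
    keep = y > y.max() - drop_threshold * D
    return X[keep], y[keep]
```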


3-6. Adaptive number of variational mixture components

after warm-up, VBMC adaptively adds / removes variational mixture components

ADDING components

  • at every iteration, \(K +=1\)
  • maximum number of components : \(K_{max} = n^{2/3}\)


REMOVING components

  • consider as a pruning candidate any component \(k\) with mixture weight \(w_k < w_{min}\)
  • recompute the ELCBO without component \(k\) ; if the ELCBO is (almost) unaffected, the component is pruned
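
A rough sketch of this schedule ( the actual pruning test compares the ELCBO with and without the candidate component; thresholds such as `w_min` are placeholders here, not the paper's values ):

```python
import numpy as np

def adapt_num_components(K, n_train, weights, w_min=1e-2):
    """Grow K up to n^(2/3) and flag low-weight components for a pruning check."""
    K_max = int(np.ceil(n_train ** (2.0 / 3.0)))        # cap on mixture size
    K_new = min(K + 1, K_max)                           # try adding one component
    prune_candidates = np.flatnonzero(weights < w_min)  # checked via the ELCBO
    return K_new, prune_candidates
```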

