Neural Adaptive Sequential Monte Carlo
Abstract
SMC (Sequential Monte Carlo) = particle filtering
- sampling from an intractable target
- use sequence of simpler intermediate distn
- dependent on proposal distn
Presents a new method for “automatically adapting the proposal”,
using an approximation of KL-div between “true posterior” & “posterior distn”
\(\rightarrow\) Neural Adaptive SMC ( NASMC )
1. Introduction
SMC : the sequence constructs a proposal for importance sampling
Importance Sampling : dependent on the choice of proposal
- bad proposal \(\rightarrow\) low effective sample size & high variance
SMC have deeveloped approaches to solve this!
- ex) resampling to improve particle diversity, when the effective sample size is low
- ex) applying MCMC transition kernels
This paper suggests…“new gradient-based black-box adaptive SMC method”
- that automatically tunes flexible proposal distn
- can be assessed using KL-div between (1) target & (2) parameterized proposal
- very general & tractably handles complex proposal distn
2. Sequential Monte Carlo
2 fundamental SMC algorithms :
- **Sequential Importance Sampling (SIS) **
-
Sequential Importance Resampling (SIR)
-
probabilistic model containing hidden state (\(\boldsymbol{z}_{1: T}\)) & observed state(\(\boldsymbol{x}_{1: T}\))
\(\rightarrow\) joint distn : \(p\left(z_{1: T}, x_{1: T}\right)=p\left(z_{1}\right) p\left(x_{1} \mid z_{1}\right) \prod_{t=2}^{T} p\left(z_{t} \mid z_{1: t-1}\right) p\left(x_{t} \mid z_{1: t}, x_{1: t-1}\right)\)
(1) Sequential Importance Sampling (SIS)
Goal : approximate posterior, over hidden state sequence through a weighted set of \(N\) sampled trajectories drawn from simpler proposal distribution \(\left\{z_{1: T}^{(n)}\right\}_{n=1: N} \sim q\left(z_{1: T} \mid x_{1: T}\right)\)
- \(p\left(\boldsymbol{z}_{1: T} \mid \boldsymbol{x}_{1: T}\right) \approx \sum_{n=1}^{N} \tilde{w}_{t}^{(n)} \delta\left(\boldsymbol{z}_{1: T}-\boldsymbol{z}_{1: T}^{(n)}\right)\).
Factorize proposal distribution :
- \(q\left(\boldsymbol{z}_{1: T} \mid \boldsymbol{x}_{1: T}\right)=q\left(\boldsymbol{z}_{1} \mid \boldsymbol{x}_{1}\right) \prod_{t=2}^{T} q\left(\boldsymbol{z}_{t} \mid \boldsymbol{z}_{1: t-1}, \boldsymbol{x}_{1: t}\right)\).
Normalized Importance Weights ( defined by recursion )
- \(w\left(\boldsymbol{z}_{1: T}^{(n)}\right)=\frac{p\left(\boldsymbol{z}_{1: T}^{(n)}, \boldsymbol{x}_{1: T}\right)}{q\left(\boldsymbol{z}_{1: T}^{(n)} \mid \boldsymbol{x}_{1: T}\right)}\).
- \(\tilde{w}\left(\boldsymbol{z}_{1: T}^{(n)}\right)=\frac{w\left(\boldsymbol{z}_{1: T}^{(n)}\right)}{\sum_{n} w\left(\boldsymbol{z}_{1: T}^{(n)}\right)} \propto \tilde{w}\left(\boldsymbol{z}_{1: T-1}^{(n)}\right) \frac{p\left(\boldsymbol{z}_{T}^{(n)} \mid \boldsymbol{z}_{1: T-1}^{(n)}\right) p\left(\boldsymbol{x}_{T} \mid \boldsymbol{z}_{1: T}^{(n)}, \boldsymbol{x}_{1: T-1}\right)}{q\left(\boldsymbol{z}_{T}^{(n)} \mid \boldsymbol{z}_{1: T-1}^{(n)}, \boldsymbol{x}_{1: T}\right)}\).
Problem?
- highly skewed as \(t\) increases
\(\rightarrow\) use Sequential Importance Resampling (SIR)
(2) Sequential Importance Resampling (SIR)
-
add “additional step that resamples \(\boldsymbol{z}_{t}^{(n)}\) at time \(t\),
from a Multinomial distribution given by \(\tilde{w}\left(\boldsymbol{z}_{1: t}^{(n)}\right)\)
-
requires knowledge of full trajectory of previous samples
\(\rightarrow\) each new particle needs to update its ancestry information
( \(a_{\tau, t}^{(n)}\) : ancestral index of particle \(n\) at time \(t\) for state \(z_{\tau}\) )
2-1. The Critical Role of Proposal Distributions in SMC
Optimal choice : “intractable posterior” \(q_{\phi}\left(z_{1: T} \mid x_{1: T}\right)=p_{\theta}\left(z_{1: T} \mid x_{1: T}\right)\)
- ( factorization ) \(q\left(z_{t} \mid z_{1: t-1}, x_{1: t}\right)=p\left(z_{t} \mid z_{1: t-1}, x_{1: t}\right)\)
Bootstrap filter
- use prior as proposal distn!
- \(q\left(z_{t} \mid z_{1: t-1}, x_{1: t}\right)=p\left(z_{t} \mid z_{1: t-1}, x_{1: t-1}\right)\).
3. Adapting Proposals by Descending the Inclusive KL-divergence
“Proposal distn will be optimized using inclusive KL-div”
- \(\operatorname{KL}\left[p_{\theta}\left(\boldsymbol{z}_{1: T} \mid \boldsymbol{x}_{1: T}\right) \| q_{\phi}\left(\boldsymbol{z}_{1: T} \mid \boldsymbol{x}_{1: T}\right)\right]\).
Why this objective function?
-
1) direct measure of quality of the proposal
-
2) ( if true posterior lies in proposal family )
\(\rightarrow\) it has a global optimum at this point
-
3) ( if true posterior does not lie in proposal family )
\(\rightarrow\) tends to find proposal distn that have higher entropy than the original
-
4) derivative can be approximated efficiently
Gradient of negative KL-div ( + approximated using samples from SMC ) :
\(\begin{aligned}-\frac{\partial}{\partial \phi} \operatorname{KL}\left[p_{\theta}\left(z_{1: T} \mid \boldsymbol{x}_{1: T}\right) \| q_{\phi}\left(\boldsymbol{z}_{1: T} \mid \boldsymbol{x}_{1: T}\right)\right]&=\int p_{\theta}\left(\boldsymbol{z}_{1: T} \mid \boldsymbol{x}_{1: T}\right) \frac{\partial}{\partial \phi} \log q_{\phi}\left(\boldsymbol{z}_{1: T} \mid \boldsymbol{x}_{1: T}\right) d \boldsymbol{z}_{1: T}\\& \approx \sum_{t} \sum_{n} \tilde{w}_{t}^{(n)} \frac{\partial}{\partial \phi} \log q_{\phi}\left(\boldsymbol{z}_{t}^{(n)} \mid \boldsymbol{x}_{1: t}, \boldsymbol{z}_{1: t-1}^{A_{t-1}^{(n)}}\right) \end{aligned}\).
Online & Batch variants
- gradient update for the model params \(\theta\)
- \(\frac{\partial}{\partial \theta} \log \left[p_{\theta}\left(\boldsymbol{x}_{1: T}\right)\right] \approx \sum_{t} \sum_{n} \tilde{w}_{t}^{(n)} \frac{\partial}{\partial \theta} \log p_{\theta}\left(\boldsymbol{x}_{t}, \boldsymbol{z}_{t}^{(n)} \mid \boldsymbol{x}_{1: t-1}, \boldsymbol{z}_{1: t-1}^{A_{t-1}^{(n)}}\right)\).