Sequential Monte Carlo


1. Motivation

Goal : “estimating unknown quantities”

Bayesian Models : 1) + 2)

  • 1) prior distn for the unknown quantities
  • 2) likelihood functions relating these quantities to the observations

Often, observations arrive sequentially in time

\(\rightarrow\) interested in inference “on-line”

\(\therefore\) Update the posterior, as data become available

  • key point? Computational Simplicity! ( not having to store all the data )


Examples )

  • Linear Gaussian state-space model : exact analytical expression for the evolving posterior

    ( this recursion is known as “Kalman Filter” )

  • Partially observed, finite state-space Markov chain

    ( known as “HMM filter” )

both rely on strong model assumptions….. but real-world data are complex!


Sequential Monte Carlo

  • simulation-based methods
  • a convenient approach to computing posterior distributions
  • very flexible, easy to implement & parallelisable


2. Problem Statement

Notation

  • observation : \(\left\{\mathbf{y}_{t} ; t \in \mathbb{N}^{*}\right\}\)

  • unobserved signal : \(\left\{\mathbf{x}_{t} ; t \in \mathbb{N}\right\}, \mathbf{x}_{t} \in \mathcal{X}\)
  • initial distribution : \(p\left(\mathbf{x}_{0}\right)\)
  • transition equation : \(p\left(\mathbf{x}_{t} \mid \mathbf{x}_{t-1}\right) .\)

  • observations : assumed conditionally independent given the process \(\left\{\mathbf{x}_{t} ; t \in \mathbb{N}\right\}\)
  • marginal distribution : \(p\left(\mathbf{y}_{t} \mid \mathbf{x}_{t}\right)\)
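
To make the notation concrete, here is a minimal simulation sketch of one possible state-space model (a scalar linear-Gaussian model chosen purely for illustration; the specific densities are my assumption, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ssm(T):
    """Simulate a toy state-space model:
       x_0 ~ N(0, 1)                              (initial distribution p(x_0))
       x_t = 0.5 * x_{t-1} + v_t, v_t ~ N(0, 1)   (transition p(x_t | x_{t-1}))
       y_t = x_t + e_t,           e_t ~ N(0, 1)   (observation p(y_t | x_t))
    """
    x = np.empty(T + 1)
    y = np.empty(T + 1)          # y[0] is unused; observations start at t = 1
    x[0] = rng.normal()
    for t in range(1, T + 1):
        x[t] = 0.5 * x[t - 1] + rng.normal()
        y[t] = x[t] + rng.normal()
    return x, y
```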


Goal : “estimate recursively in time”…

  • 1) posterior distribution : \(p\left(\mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)\)
  • 2) its associated features ( ex. marginal distribution \(p\left(\mathbf{x}_{t} \mid \mathbf{y}_{1: t}\right)\) )
  • 3) expectations : \(I\left(f_{t}\right)=\mathbb{E}_{p\left(\mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)}\left[f_{t}\left(\mathbf{x}_{0: t}\right)\right] \triangleq \int f_{t}\left(\mathbf{x}_{0: t}\right) p\left(\mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right) d \mathbf{x}_{0: t}\).


(1) Posterior distribution

  • by Bayes theorem
  • \(p\left(\mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)=\frac{p\left(\mathbf{y}_{1: t} \mid \mathbf{x}_{0: t}\right) p\left(\mathbf{x}_{0: t}\right)}{\int p\left(\mathbf{y}_{1: t} \mid \mathbf{x}_{0: t}\right) p\left(\mathbf{x}_{0: t}\right) d \mathbf{x}_{0: t}}\).


(2) Joint distribution ( + recursive formula )

  • \(p\left(\mathbf{x}_{0: t+1} \mid \mathbf{y}_{1: t+1}\right)=p\left(\mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right) \frac{p\left(\mathbf{y}_{t+1} \mid \mathbf{x}_{t+1}\right) p\left(\mathbf{x}_{t+1} \mid \mathbf{x}_{t}\right)}{p\left(\mathbf{y}_{t+1} \mid \mathbf{y}_{1: t}\right)}\).


(3) Marginal distribution

  • 1) prediction

    \(p\left(\mathbf{x}_{t} \mid \mathbf{y}_{1: t-1}\right)=\int p\left(\mathbf{x}_{t} \mid \mathbf{x}_{t-1}\right) p\left(\mathbf{x}_{t-1} \mid \mathbf{y}_{1: t-1}\right) d \mathbf{x}_{t-1}\).

  • 2) Updating

    \(p\left(\mathbf{x}_{t} \mid \mathbf{y}_{1: t}\right)=\frac{p\left(\mathbf{y}_{t} \mid \mathbf{x}_{t}\right) p\left(\mathbf{x}_{t} \mid \mathbf{y}_{1: t-1}\right)}{\int p\left(\mathbf{y}_{t} \mid \mathbf{x}_{t}\right) p\left(\mathbf{x}_{t} \mid \mathbf{y}_{1: t-1}\right) d \mathbf{x}_{t}}\).
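
For a finite state space \(\mathcal{X}=\{1,\dots,K\}\), this prediction/update recursion can be carried out exactly (the “HMM filter” mentioned in Section 1). A minimal sketch, assuming the transition matrix and likelihood vector are given as NumPy arrays:

```python
import numpy as np

def hmm_filter_step(prev_post, trans, lik_t):
    """One exact prediction/update step for a finite state space.
    prev_post : p(x_{t-1} | y_{1:t-1}), shape (K,)
    trans     : trans[i, j] = p(x_t = j | x_{t-1} = i), shape (K, K)
    lik_t     : lik_t[j] = p(y_t | x_t = j), shape (K,)
    """
    pred = prev_post @ trans      # prediction : p(x_t | y_{1:t-1})
    post = lik_t * pred           # updating   : p(y_t | x_t) * p(x_t | y_{1:t-1})
    return post / post.sum()      # normalise by p(y_t | y_{1:t-1})
```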


3. Monte Carlo Methods

when one has a large number of samples drawn from the required posterior distn…

it is not difficult to approximate intractable integrals!

( but, sampling is HARD! )


Thus, need “alternative MC methods”, such as “Importance Sampling (IS)”

By making this recursive, one can obtain “Sequential Importance Sampling (SIS)”


3-1. Perfect MC sampling

simulate \(N\) independent samples ( particles ) \(\left\{\mathbf{x}_{0: t}^{(i)} ; i=1, \ldots, N\right\}\) , according to \(p\left(\mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)\)

Estimate of \(p\left(\mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)\) :

  • \(P_{N}\left(d \mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)=\frac{1}{N} \sum_{i=1}^{N} \delta_{\mathbf{x}_{0: t}^{(i)}}\left(d \mathbf{x}_{0: t}\right)\).

    where \(\delta_{\mathbf{x}_{0: t}^{(i)}}\left(d \mathbf{x}_{0: t}\right)\) denotes the Dirac delta mass located at \(\mathbf{x}_{0: t}^{(i)}\).


Following Estimate \(I\left(f_{t}\right)\) :

  • \(I_{N}\left(f_{t}\right)=\int f_{t}\left(\mathbf{x}_{0: t}\right) P_{N}\left(d \mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)=\frac{1}{N} \sum_{i=1}^{N} f_{t}\left(\mathbf{x}_{0: t}^{(i)}\right)\).


By LLN…\(I_{N}\left(f_{t}\right) \underset{N \rightarrow+\infty}{\stackrel{a . s}{\longrightarrow}} I\left(f_{t}\right)\)

  • moreover, if the variance \(\sigma_{f_{t}}^{2} \triangleq \mathbb{E}_{p\left(\mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)}\left[f_{t}^{2}\left(\mathbf{x}_{0: t}\right)\right]-I^{2}\left(f_{t}\right)\) is finite,

    then \(\operatorname{var}\left(I_{N}\left(f_{t}\right)\right)=\frac{\sigma_{f_{t}}^{2}}{N}\)

  • and a CLT also holds :

    \(\sqrt{N}\left[I_{N}\left(f_{t}\right)-I\left(f_{t}\right)\right] \underset{N \rightarrow+\infty}{\Longrightarrow} \mathcal{N}\left(0, \sigma_{f_{t}}^{2}\right)\).
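
A minimal sketch of perfect MC sampling on a toy target where exact sampling is possible (standard normal target and \(f(x)=x^2\), so \(I(f)=1\); this toy choice is my assumption, not the notes' example):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2          # test function f_t
I_true = 1.0                  # E[X^2] = 1 for X ~ N(0, 1)

for N in (10 ** 2, 10 ** 4, 10 ** 6):
    x = rng.standard_normal(N)        # "perfect" i.i.d. samples from the target
    I_N = f(x).mean()                 # I_N(f) = (1/N) * sum_i f(x^(i))
    print(N, abs(I_N - I_true))       # error shrinks roughly at rate 1 / sqrt(N)
```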


BUT, it is usually impossible to sample EFFICIENTLY from \(p\left(\mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)\)

\(\therefore\) one could use MCMC! ( but MCMC methods are iterative algorithms, unsuited to recursive estimation problems )

Then..how?


3-2. Importance Sampling

Importance Sampling distribution

  • ( = proposal distn, importance function )
  • \(\pi\left(\mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)\).


\(I\left(f_{t}\right)=\frac{\int f_{t}\left(\mathbf{x}_{0: t}\right) w\left(\mathbf{x}_{0: t}\right) \pi\left(\mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right) d \mathbf{x}_{0: t}}{\int w\left(\mathbf{x}_{0: t}\right) \pi\left(\mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right) d \mathbf{x}_{0: t}}\).

  • where the importance weight is \(w\left(\mathbf{x}_{0: t}\right)=\frac{p\left(\mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)}{\pi\left(\mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)}\) ( it only needs to be known up to a normalizing constant )


if we are able to simulate \(N\) particles …. according to “\(\pi\left(\mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)\)”,

\(\widehat{I}_{N}\left(f_{t}\right)=\frac{\frac{1}{N} \sum_{i=1}^{N} f_{t}\left(\mathbf{x}_{0: t}^{(i)}\right) w\left(\mathbf{x}_{0: t}^{(i)}\right)}{\frac{1}{N} \sum_{j=1}^{N} w\left(\mathbf{x}_{0: t}^{(j)}\right)}=\sum_{i=1}^{N} f_{t}\left(\mathbf{x}_{0: t}^{(i)}\right) \widetilde{w}_{t}^{(i)}\).

  • where normalized importance weights are \(\widetilde{w}_{t}^{(i)}=\frac{w\left(\mathbf{x}_{0: t}^{(i)}\right)}{\sum_{j=1}^{N} w\left(\mathbf{x}_{0: t}^{(j)}\right)}\).


Therefore, the posterior distribution can be approximated as…

  • (using SAMPLING instead of INTEGRATION)
  • \(\widehat{P}_{N}\left(d \mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)=\sum_{i=1}^{N} \widetilde{w}_{t}^{(i)} \delta_{\mathbf{x}_{0: t}^{(i)}}\left(d \mathbf{x}_{0: t}\right)\).
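
A minimal sketch of this self-normalised IS estimator on a toy example (target \(\mathcal{N}(0,1)\), proposal \(\pi=\mathcal{N}(0,2^2)\), \(f(x)=x^2\); all choices here are my assumptions for illustration). Only unnormalised densities are needed, since constants cancel in the normalised weights:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

x = rng.normal(0.0, 2.0, size=N)               # particles from the proposal pi = N(0, 2^2)
log_p  = -0.5 * x ** 2                         # log target  N(0, 1), up to a constant
log_pi = -0.5 * (x / 2.0) ** 2 - np.log(2.0)   # log proposal, up to the same constant
w = np.exp(log_p - log_pi)                     # importance weights w(x) = p(x) / pi(x)
w_tilde = w / w.sum()                          # normalised importance weights

I_hat = np.sum(w_tilde * x ** 2)               # self-normalised estimate of E_p[x^2] = 1
print(I_hat)
```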


Conclusion

  • Importance Sampling = general MC integration method

  • but NOT ADEQUATE for recursive estimation

    ( one needs to get all \(y_{1:t}\) before estimating \(p(x_{0:t} \mid y_{1:t})\) )

    ( each time new data \(y_{t+1}\) comes in, need to recompute the importance weights over entire state )


3-3. Sequential Importance Sampling (SIS)

IS can be modified, so that…

  • “no need to modify past simulated trajectories” \(\left\{\mathbf{x}_{0: t-1}^{(i)} ; i=1, \ldots, N\right\}\)


Importance function \(\pi\left(\mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)\) :

\(\begin{aligned}\pi\left(\mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)&=\pi\left(\mathbf{x}_{0: t-1} \mid \mathbf{y}_{1: t-1}\right) \pi\left(\mathbf{x}_{t} \mid \mathbf{x}_{0: t-1}, \mathbf{y}_{1: t}\right)\\&=\pi\left(\mathbf{x}_{0}\right) \prod_{k=1}^{t} \pi\left(\mathbf{x}_{k} \mid \mathbf{x}_{0: k-1}, \mathbf{y}_{1: k}\right)\end{aligned}\).


This form of the importance function allows us to evaluate the importance weights recursively!

  • \(\widetilde{w}_{t}^{(i)} \propto \widetilde{w}_{t-1}^{(i)} \frac{p\left(\mathbf{y}_{t} \mid \mathbf{x}_{t}^{(i)}\right) p\left(\mathbf{x}_{t}^{(i)} \mid \mathbf{x}_{t-1}^{(i)}\right)}{\pi\left(\mathbf{x}_{t}^{(i)} \mid \mathbf{x}_{0: t-1}^{(i)}, \mathbf{y}_{1: t}\right)}\).


A simple solution : adopt the “prior distn” as the “importance distn”

  • \(\pi\left(\mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)=p\left(\mathbf{x}_{0: t}\right)=p\left(\mathbf{x}_{0}\right) \prod_{k=1}^{t} p\left(\mathbf{x}_{k} \mid \mathbf{x}_{k-1}\right)\).

  • then, importance weight satisfies…

    \(\widetilde{w}_{t}^{(i)} \propto \widetilde{w}_{t-1}^{(i)} p\left(\mathbf{y}_{t} \mid \mathbf{x}_{t}^{(i)}\right)\).
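
A minimal SIS sketch under these choices, using the same toy Gaussian model as before (my assumption, not the notes' example); only the most recent state of each trajectory is kept, for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def sis(y, N=1000):
    """SIS with the prior as importance distribution, for the toy model
       x_t = 0.5 * x_{t-1} + N(0, 1),  y_t = x_t + N(0, 1)."""
    x = rng.normal(size=N)                      # x_0^(i) ~ p(x_0)
    w = np.full(N, 1.0 / N)                     # initial (uniform) weights
    for y_t in y:
        x = 0.5 * x + rng.normal(size=N)        # propagate : x_t^(i) ~ p(x_t | x_{t-1}^(i))
        w = w * np.exp(-0.5 * (y_t - x) ** 2)   # w_t^(i) ∝ w_{t-1}^(i) * p(y_t | x_t^(i))
        w = w / w.sum()                         # renormalise
    return x, w
```

Running this for growing \(t\) tends to exhibit the weight degeneracy discussed in the next section.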


Conclusion

  • SIS is attractive… but it is just a constrained version of importance sampling
  • thus it is inefficient in high-dimensional spaces ( the weights degenerate as \(t\) grows )


3-4. The Bootstrap Filter

as \(t\) increases… the distribution of the weights \(\widetilde{w}_{t}^{(i)}\) becomes more and more skewed ( degeneracy )…

to avoid this, need to introduce an additional selection step


Key Idea :

  • 1) eliminate the particles having low importance weights \(\widetilde{w}_{t}^{(i)}\)
  • 2) multiply particles having high importance weights


(before) \(\widehat{P}_{N}\left(d \mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)=\sum_{i=1}^{N} \widetilde{w}_{t}^{(i)} \delta_{\mathbf{x}_{0: t}^{(i)}}\left(d \mathbf{x}_{0: t}\right)\)

(after) \(P_{N}\left(d \mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)=\frac{1}{N} \sum_{i=1}^{N} N_{t}^{(i)} \delta_{\mathbf{x}_{0: t}^{(i)}}\left(d \mathbf{x}_{0: t}\right)\)

  • where \(N_{t}^{(i)}\) is the number of offspring associated to \(\mathbf{x}_{0: t}^{(i)}\)

    ( such that \(\sum_{i=1}^{N} N_{t}^{(i)}=N\) )

  • How to choose \(N_{t}^{(i)}\) ?

    so that \(\int f_{t}\left(\mathbf{x}_{0: t}\right) P_{N}\left(d \mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right) \approx \int f_{t}\left(\mathbf{x}_{0: t}\right) \widehat{P}_{N}\left(d \mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)\)


After the selection step, the surviving particles \(\mathbf{x}_{0: t}^{(i)}\) are those with \(N_{t}^{(i)}>0\)

Many ways to select \(N_{t}^{(i)}\)

  • ex) sampling \(N\) times from (discrete) distn \(\widehat{P}_{N}\left(d \mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right)\)

    ( equivalent to sampling the \(N_{t}^{(i)}\) according to a multinomial distn with parameters \(\widetilde{w}_{t}^{(i)}\) ; a sketch follows )
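
A minimal sketch of that multinomial selection step (assuming `particles` is a NumPy array and `w_tilde` the normalised weights):

```python
import numpy as np

rng = np.random.default_rng(0)

def multinomial_resample(particles, w_tilde):
    """Sample N indices from the discrete distribution {w_tilde}."""
    N = len(w_tilde)
    idx = rng.choice(N, size=N, p=w_tilde)    # N draws from P_hat_N
    counts = np.bincount(idx, minlength=N)    # N_t^(i), with sum_i N_t^(i) = N
    return particles[idx], counts
```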


Algorithm

( Figure 2 : the Bootstrap Filter algorithm )
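
Since the original figure is not reproduced here, the following is a minimal bootstrap-filter sketch for the same toy Gaussian model used above (the model is my assumption); it alternates the importance-sampling step (propagate from the prior, weight by the likelihood) with the multinomial selection step:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_filter(y, N=1000):
    """Bootstrap filter for the toy model x_t = 0.5*x_{t-1} + N(0,1), y_t = x_t + N(0,1).
    Returns the estimated filtering means E[x_t | y_{1:t}]."""
    x = rng.normal(size=N)                        # x_0^(i) ~ p(x_0)
    means = []
    for y_t in y:
        # 1) importance sampling step : propagate from the prior ...
        x = 0.5 * x + rng.normal(size=N)          # x~_t^(i) ~ p(x_t | x_{t-1}^(i))
        # ... and weight by the likelihood only (previous weights are uniform)
        w = np.exp(-0.5 * (y_t - x) ** 2)         # w_t^(i) ∝ p(y_t | x~_t^(i))
        w_tilde = w / w.sum()
        means.append(np.sum(w_tilde * x))         # estimate of E[x_t | y_{1:t}]
        # 2) selection step : multinomial resampling (N_t^(i) offspring per particle)
        x = x[rng.choice(N, size=N, p=w_tilde)]
    return np.array(means)
```

With the simulator sketched in Section 2, `bootstrap_filter(y[1:])` would track the hidden states.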


\(\widetilde{w}_{t-1}^{(i)}\) does not appear in the weight update!

  • because the resampled particles \(\mathbf{x}_{0: t-1}^{(i)}\) have uniform weights after the resampling step at time \(t-1\)


Attractive Properties

  • 1) quick & easy to implement

  • 2) modular to a large extent

    ( = when changing the problem, one only needs to change the expression for the importance distn & weights )

  • 3) can be implemented on a parallel computer
