ScoreGrad: Multivariate Probabilistic Time Series Forecasting with Continuous Energy-based Generative Models
Contents
- Abstract
- Introduction
- Related Work
- Score-based generative models
- Method
- Experiment
Abstract
Generative models in TS
- Many existing works cannot be widely used because of…
- (1) constraints of functional form of generative models
- (2) sensitivity to hyperparameters
ScoreGrad
- MTS probabilistic forecasting
- Continuous energy-based generative models
  - Composed of ..
    - (1) TS feature extraction module
    - (2) Conditional stochastic differential equation based score matching module
- Prediction: by iteratively solving reverse-time SDE
  ( any numerical solvers for SDEs can be used for sampling )
- First continuous energy-based generative model used for TS forecasting
1. Introduction
Limitations of previous works
- [3], [4] : cannot model stochastic information in TS
- [5], [6] : cannot model long-range time dependencies
Generative models
- proved effective for sequential modeling
- WaveNet [7] : proposes TCN with dilation
- [8] : Combines transformer + masked autoregressive flow together
- however, the functional forms of models based on VAEs & flow-based models are constrained
- TimeGrad [9] : uses an energy-based generative model (EBM) which transforms data distribution to target distribution by slowly injecting noise
TimeGrad
- less restrictive on functional forms ( compared with VAE and flow based models )
- Limitations
- (1) DDPM in TimeGrad is sensitive to the noise scales
- (2) Number of steps used for noise injection needs to be carefully designed
- (3) Sampling methods for generation process in DDPM can be further extended
ScoreGrad
- General framework for MTS forecasting based on continuous energy-based generative models
- DDPM = Discrete form of a stochastic differential equation (SDE)
  \(\rightarrow\) number of steps can be replaced by the interval of integration
  \(\rightarrow\) noise scales can be easily tuned by the diffusion term in the SDE
  \(\rightarrow\) conditional continuous energy-based generative model + sequential modeling for forecasting!
2. Related Work
(1) MTS forecasting
pass
(2) Energy-based generative models
Energy-based models (EBMs)
- Un-normalized probabilistic models
- Compared with VAE & flow-based generative models…
  - EBMs directly estimate the unnormalized negative log-probability
  - Do not place a restriction on the tractability of the normalizing constant
  \(\rightarrow\) Much less restrictive in functional form
- Limitation: the unknown normalizing constant of EBMs makes training particularly difficult
Several methods for training EBMs
- (1) Maximum likelihood estimation based on MCMC
- [26] : estimate the gradient of the log-likelihood with MCMC sampling methods, instead of directly computing the likelihood.
- (2) Score matching based methods
- [28] : minimize a discrepancy between the gradient of the loglikelihood of data distribution and estimated distribution with Fisher divergence
- optimization is computationally expensive!
- (3) Noise contrastive estimation
- [29] : EBM can be learned by contrasting it with known density.
3. Score-based generative models
(1) Score matching models
\(\mathbf{x} \sim p_{\mathcal{X}}(\mathbf{x})\) : samples from the distribution \(p_{\mathcal{X}}\) of a \(D\)-dimensional dataset
Score matching
- EBM which is proposed for learning “non-normalized” statistical models
- Aims to minimize the distance of the derivatives of the log-density function (= score) btw data and model distributions.
Although the density function of the data distribution is unknown, the objective below is still tractable thanks to a simple trick of partial integration (integration by parts):
\(\begin{aligned} L(\theta) &=\frac{1}{2} \mathbb{E}_{p_{\mathcal{X}}(\mathbf{x})} \mid \mid \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x})-\nabla_{\mathbf{x}} \log p_{\mathcal{X}}(\mathbf{x}) \mid \mid _2^2 \\ & \quad=\mathbb{E}_{p_{\mathcal{X}}(\mathbf{x})}\left[\operatorname{tr}\left(\nabla_{\mathbf{x}}^2 \log p_\theta(\mathbf{x})\right)+\frac{1}{2} \mid \mid \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}) \mid \mid _2^2\right]+\mathrm{const} \end{aligned}\).
- \(\nabla_{\mathbf{x}} \log p_\theta(\mathbf{x})\) : score function
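As a minimal sketch of this objective (a hypothetical 2-D toy energy network, not the paper's model), the Hessian trace term can be computed exactly with autograd for small \(D\); this trace is precisely the part that makes the optimization expensive, as noted in the related-work section.

```python
import torch

# Hypothetical toy energy model: log p_theta(x) = -E_theta(x) + const
energy = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Softplus(), torch.nn.Linear(64, 1)
)

def score(x):
    """Model score: grad_x log p_theta(x) = -grad_x E_theta(x)."""
    e = energy(x).sum()
    return -torch.autograd.grad(e, x, create_graph=True)[0]

def explicit_score_matching_loss(x):
    """E[ tr(grad^2_x log p_theta(x)) + 0.5 * ||grad_x log p_theta(x)||^2 ]."""
    x = x.clone().requires_grad_(True)
    s = score(x)                                    # shape (B, D)
    trace = 0.0
    for d in range(x.shape[1]):                     # exact Hessian trace; feasible only for small D
        trace = trace + torch.autograd.grad(
            s[:, d].sum(), x, create_graph=True)[0][:, d]
    return (trace + 0.5 * (s ** 2).sum(dim=1)).mean()

x_batch = torch.randn(128, 2)                       # stand-in for samples from p_X(x)
loss = explicit_score_matching_loss(x_batch)
```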
(2) Discrete score matching models
Two classes of EBMs that use various levels of noise to estimate score network
(1) Score matching with Langevin dynamics (SMLD)
- To improve score-based generative modeling by perturbing data with various levels of noise
- Trains a Noise Conditioned Score Network (NCSN) : \(s_\theta(\mathbf{x}, \sigma)\)
- to estimate scores corresponding to all noise levels.
- Perturbation kernel : \(p_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}):=\mathcal{N}\left(\tilde{\mathbf{x}} ; \mathbf{x}, \sigma^2 \mathbf{I}\right)\).
- noise sequence with ascending order \(\left\{\sigma_1, \sigma_2, \cdots, \sigma_N\right\}\), where…
- \(\sigma_1\) is small enough that \(p_{\sigma_1}(\mathbf{x}) \approx p_{\mathcal{X}}(\mathbf{x})\)
- \(\sigma_N\) is large enough that \(p_{\sigma_N}(\mathbf{x}) \approx \mathcal{N}\left(\mathbf{0}, \sigma_N^2 \mathbf{I}\right)\).
[ Training ]
Loss function: weighted sum of Denoising score matching objective
- \(L(\theta)=\operatorname{argmin}_\theta \sum_{i=1}^N \sigma_i^2\, \mathbb{E}_{p_{\mathcal{X}}(\mathbf{x})} \mathbb{E}_{p_{\sigma_i}(\tilde{\mathbf{x}} \mid \mathbf{x})}\left[ \mid \mid s_\theta\left(\tilde{\mathbf{x}}, \sigma_i\right)-\nabla_{\tilde{\mathbf{x}}} \log p_{\sigma_i}(\tilde{\mathbf{x}} \mid \mathbf{x}) \mid \mid _2^2\right]\), with weights \(\lambda(\sigma_i)=\sigma_i^2\).
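A hedged sketch of this loss (a generic `score_net(x, sigma)` is assumed); it uses the closed-form score of the Gaussian perturbation kernel, \(\nabla_{\tilde{\mathbf{x}}} \log p_{\sigma}(\tilde{\mathbf{x}} \mid \mathbf{x}) = -(\tilde{\mathbf{x}}-\mathbf{x})/\sigma^2\), and the \(\sigma_i^2\) weighting:

```python
import torch

def smld_loss(score_net, x, sigmas):
    """Weighted denoising score matching, summed over all noise levels."""
    loss = 0.0
    for sigma in sigmas:                     # noise levels sigma_1 < ... < sigma_N
        noise = torch.randn_like(x)
        x_tilde = x + sigma * noise          # sample from p_sigma(x_tilde | x)
        target = -noise / sigma              # grad_{x_tilde} log p_sigma(x_tilde | x)
        s = score_net(x_tilde, sigma)
        loss = loss + sigma ** 2 * ((s - target) ** 2).sum(dim=-1).mean()
    return loss
```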
[ Generation / Sampling ]
- Langevin MCMC is used for iterative sampling
- Number of iteration steps : \(\mathbf{M}\)
- Sampling process for \(p_{\sigma_i}(\mathbf{x})\) :
- \(\mathbf{x}_i^m=\mathbf{x}_i^{m-1}+\epsilon_i s_\theta\left(\mathbf{x}_i^{m-1}, \sigma_i\right)+\sqrt{2 \epsilon_i} \mathbf{z}_i^m, m=1,2, \cdots, M\).
- where \(\mathbf{z}_i^m \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \epsilon_i\) is the step size.
- This iterative sampling is repeated over the \(N\) noise levels
  - \(N\) = number of different noise scales
- \(\mathbf{x}_N^0 \sim\) \(\mathcal{N}\left(\mathbf{x} \mid \mathbf{0}, \sigma_N^2 \mathbf{I}\right)\) and \(\mathbf{x}_i^0=\mathbf{x}_{i+1}^M\) when \(i<N\).
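A sketch of the annealed Langevin sampling loop described above (a trained `score_net` is assumed; the step-size schedule \(\epsilon_i = \epsilon \cdot \sigma_i^2/\sigma_1^2\) is a common choice and an assumption, not stated in these notes):

```python
import torch

@torch.no_grad()
def annealed_langevin_sample(score_net, sigmas, shape, M=100, eps=2e-5):
    """Anneal from sigma_N (largest) down to sigma_1 (smallest)."""
    sigmas = sorted(sigmas)                          # ascending: sigma_1 < ... < sigma_N
    x = sigmas[-1] * torch.randn(shape)              # x_N^0 ~ N(0, sigma_N^2 I)
    for sigma in reversed(sigmas):                   # i = N, N-1, ..., 1
        step = eps * (sigma / sigmas[0]) ** 2        # assumed step-size schedule
        for _ in range(M):                           # M Langevin updates per noise level
            z = torch.randn(shape)
            x = x + step * score_net(x, sigma) + (2 * step) ** 0.5 * z
    return x
```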
(2) Denoising diffusion probabilistic models (DDPM)
[ Training ]
Forward process: \(p\left(\mathbf{x}_i \mid \mathbf{x}_{i-1}\right) \sim \mathcal{N}\left(\mathbf{x}_i ; \sqrt{1-\beta_i} \mathbf{x}_{i-1}, \beta_i \mathbf{I}\right)\)
\(\rightarrow\) \(p\left(\mathbf{x}_i \mid \mathbf{x}_0\right) \sim\) \(\mathcal{N}\left(\mathbf{x}_i ; \sqrt{\alpha_i} \mathbf{x}_0,\left(1-\alpha_i\right) \mathbf{I}\right)\),
- where \(\alpha_i=\prod_{k=1}^i\left(1-\beta_k\right)\).
Reverse process: \(q\left(\mathbf{x}_{i-1} \mid \mathbf{x}_i\right) \sim \mathcal{N}\left(\mathbf{x}_{i-1} ; \frac{1}{\sqrt{1-\beta_i}}\left(\mathbf{x}_i+\beta_i s_\theta\left(\mathbf{x}_i, i\right)\right), \beta_i \mathbf{I}\right)\)
Loss function: ELBO
- \(L(\theta)=\operatorname{argmin}_\theta \sum_{i=1}^N\left(1-\alpha_i\right) \mathbb{E}_{p_{\mathcal{X}}(\mathbf{x})} \mathbb{E}_{p_{\alpha_i}(\tilde{\mathbf{x}} \mid \mathbf{x})}\left[ \mid \mid s_\theta(\tilde{\mathbf{x}}, i)-\nabla_{\tilde{\mathbf{x}}} \log p_{\alpha_i}(\tilde{\mathbf{x}} \mid \mathbf{x}) \mid \mid _2^2\right]\).
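A sketch of this objective (assuming a score network `score_net(x, i)` that takes the step index); instead of summing over all \(N\) steps it Monte-Carlo samples one step \(i\) per example, as is common in practice:

```python
import torch

def ddpm_score_loss(score_net, x0, betas):
    """Weighted denoising score matching with kernel N(sqrt(alpha_i) x0, (1 - alpha_i) I)."""
    alphas = torch.cumprod(1.0 - betas, dim=0)        # alpha_i = prod_{k<=i} (1 - beta_k)
    i = torch.randint(0, len(betas), (x0.shape[0],))  # sample a diffusion step per example
    a = alphas[i].unsqueeze(-1)
    noise = torch.randn_like(x0)
    x_i = a.sqrt() * x0 + (1 - a).sqrt() * noise      # sample from p(x_i | x_0)
    target = -noise / (1 - a).sqrt()                  # grad log p_{alpha_i}(x_i | x_0)
    s = score_net(x_i, i)
    return ((1 - a) * (s - target) ** 2).sum(dim=-1).mean()
```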
[ Generation / Sampling ]
Based on the reverse Markov chain ( called ancestral sampling )
\(\mathbf{x}_{i-1}=\frac{1}{\sqrt{1-\beta_i}}\left(\mathbf{x}_i+\beta_i s_\theta\left(\mathbf{x}_i, i\right)\right)+\sqrt{\beta_i} \mathbf{z}_i\).
- where \(i=N, N-1, \cdots, 1\)
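A sketch of the ancestral sampling recursion above (a trained `score_net(x, i)` is assumed; skipping the noise at the final step is a common implementation detail, not spelled out in the formula):

```python
import torch

@torch.no_grad()
def ddpm_ancestral_sample(score_net, betas, shape):
    x = torch.randn(shape)                            # x_N ~ N(0, I)
    for i in range(len(betas) - 1, -1, -1):           # i = N, N-1, ..., 1 (0-indexed here)
        beta = betas[i]
        x = (x + beta * score_net(x, i)) / (1.0 - beta) ** 0.5
        if i > 0:                                     # no noise added at the final step
            x = x + beta ** 0.5 * torch.randn(shape)
    return x
```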
(3) Score matching with SDEs
The noise perturbation processes of the above two methods can be modeled as discretized (numerical) forms of stochastic differential equations (SDEs)
SDE: \(d \mathbf{x}=f\left(\mathbf{x}, t_s\right) d t_s+g\left(t_s\right) d \mathbf{w}\) ….. Eq (a)
- where \(\mathbf{w}\) represents a standard Wiener process
- \(f\left(\mathbf{x}, t_s\right)\) : drift coefficient
- \(g\left(t_s\right)\) : scalar function called the diffusion coefficient
( [34] indicates that the reverse process of Eq (a) also satisfies an SDE, as shown in Eq (b) )
\(d \mathbf{x}=\left[f\left(\mathbf{x}, t_s\right)-g\left(t_s\right)^2 \nabla_x \log p_{t_s}(\mathbf{x})\right] d t_s+g\left(t_s\right) d \mathbf{w}\) …… Eq (b)
\(\rightarrow\) SDE can be reversed if \(\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\) at each intermediate time step is known!!!
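A minimal Euler-Maruyama sketch of integrating the reverse-time SDE in Eq (b) from \(t_s=1\) back to \(t_s=0\) (functions `f(x, t)`, `g(t)` and a score estimate `score_net(x, t)` are assumed inputs):

```python
import torch

@torch.no_grad()
def reverse_sde_sample(score_net, f, g, x_T, n_steps=1000):
    """Euler-Maruyama integration of Eq (b), stepping from t_s = 1 down to t_s = 0."""
    dt = 1.0 / n_steps
    x = x_T
    for k in range(n_steps, 0, -1):
        t = k * dt
        drift = f(x, t) - g(t) ** 2 * score_net(x, t)     # reverse-time drift
        x = x - drift * dt + g(t) * (dt ** 0.5) * torch.randn_like(x)
    return x
```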
Also, the forward processes of the above two models
- (1) Discrete score matching models (SMLD)
- (2) DDPMs
can be treated as DISCRETE forms of continuous-time SDEs:
(1) Discrete score matching models: \(p_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}):=\mathcal{N}\left(\tilde{\mathbf{x}} ; \mathbf{x}, \sigma^2 \mathbf{I}\right)\)
- can be seen as a DISCRETE version of the process \(\left\{\mathbf{x}\left(t_s\right)\right\}_{t_s=0}^1\)
  such that \(d \mathbf{x}=\sqrt{\frac{d\left[\sigma^2\left(t_s\right)\right]}{d t_s}} d \mathbf{w}\)
- \(\rightarrow\) gives a process with exploding variance when \(t \rightarrow \infty\)
- called the Variance Exploding (VE) SDE.
(2) DDPMs: \(p\left(\mathbf{x}_i \mid \mathbf{x}_{i-1}\right) \sim \mathcal{N}\left(\mathbf{x}_i ; \sqrt{1-\beta_i} \mathbf{x}_{i-1}, \beta_i \mathbf{I}\right)\)
- when \(N \rightarrow \infty\), the forward process above converges to
  \[d \mathbf{x}=-\frac{1}{2} \beta\left(t_s\right) \mathbf{x} d t_s+\sqrt{\beta\left(t_s\right)} d \mathbf{w}\]
- called the Variance Preserving (VP) SDE
  - because the variance \(\boldsymbol{\Sigma}(t)\) is always bounded given \(\boldsymbol{\Sigma}(0)\).
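For concreteness, the drift and diffusion coefficients of the two SDEs can be written as small helper functions; the concrete \(\sigma(t)\) and \(\beta(t)\) schedules below (geometric and linear, respectively) are assumptions, not stated in these notes:

```python
import math
import torch

# VE SDE: dx = sqrt(d[sigma^2(t)]/dt) dw, with geometric sigma(t) = s_min * (s_max/s_min)^t
def ve_drift(x, t):                                   # no drift term
    return torch.zeros_like(x)

def ve_diffusion(t, s_min=0.01, s_max=50.0):
    sigma = s_min * (s_max / s_min) ** t
    return sigma * math.sqrt(2.0 * math.log(s_max / s_min))   # = sqrt(d sigma^2(t) / dt)

# VP SDE: dx = -0.5 beta(t) x dt + sqrt(beta(t)) dw, with linear beta(t)
def vp_drift(x, t, b_min=0.1, b_max=20.0):
    beta = b_min + t * (b_max - b_min)
    return -0.5 * beta * x

def vp_diffusion(t, b_min=0.1, b_max=20.0):
    return math.sqrt(b_min + t * (b_max - b_min))
```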
4. Method
(1) Symbol and Problem formulation
MTS: \(\mathcal{X}=\left\{\mathbf{x}_1^0, \mathbf{x}_2^0, \cdots, \mathbf{x}_T^0\right\}\),
Prediction tasks for MTS: estimate \(q_{\mathcal{X}}\left(\mathbf{x}_{t_0: T}^0 \mid \mathbf{x}_{1: t_0-1}^0, \mathbf{c}_{1: T}\right)\),
- where \(\mathbf{c}_{1: T}\) represents covariates ( known for all time points )
Iterative forecasting strategy: \(q_{\mathcal{X}}\left(\mathbf{x}_{t_0: T}^0 \mid \mathbf{x}_{1: t_0-1}^0, \mathbf{c}_{1: T}\right)=\prod_{t=t_0}^T q_{\mathcal{X}}\left(\mathbf{x}_t^0 \mid \mathbf{x}_{1: t-1}^0, \mathbf{c}_{1: T}\right)\)
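This factorization implies an autoregressive rollout at prediction time; a sketch of that loop is below. The names `update_feature` and `sample_one_step` are hypothetical stand-ins for the two modules introduced in the next subsection.

```python
def rollout_forecast(update_feature, sample_one_step, x_hist, covariates, T):
    """Autoregressively sample x_t^0 for t = t0..T, feeding each draw back into the model."""
    t0 = len(x_hist) + 1                     # first prediction time step
    xs, F = list(x_hist), None
    for t in range(2, T + 1):
        F = update_feature(F, xs[t - 2], covariates[t - 2])   # F_t from F_{t-1}, x_{t-1}, c_{t-1}
        if t >= t0:
            xs.append(sample_one_step(F))    # draw x_t^0 ~ p_theta(. | F_t)
    return xs[t0 - 1:]                       # predictions for t = t0 .. T
```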
(2) Model Architecture
Two modules at each time step
- (1) Time series feature extraction module
- (2) Conditional SDE based score matching module
a) TS feature extraction module
Goal: get a feature \(\mathbf{F}_t\) of historical time series data until \(t-1\)
Update function : \(\mathbf{F}_t=R\left(\mathbf{F}_{t-1}, \mathbf{x}_{t-1}, \mathbf{c}_{t-1}\right)\)
- ex) RNN: \(F_t\) corresponds to hidden state \(\mathbf{h}_{t-1}\)
- ex) TCN, attention: \(\mathbf{F}_t\) is a vector that represents the features learned by historical data and covariates
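A sketch of the update function above with a GRU cell as \(R\) (a plausible choice given the paper uses an RNN; the layer sizes are assumptions):

```python
import torch

class RNNFeatureExtractor(torch.nn.Module):
    """F_t = R(F_{t-1}, x_{t-1}, c_{t-1}) realised with a GRU cell."""
    def __init__(self, x_dim, c_dim, hidden_dim=128):
        super().__init__()
        self.cell = torch.nn.GRUCell(x_dim + c_dim, hidden_dim)

    def forward(self, F_prev, x_prev, c_prev):
        inp = torch.cat([x_prev, c_prev], dim=-1)   # concatenate value and covariates
        return self.cell(inp, F_prev)               # new feature F_t (= hidden state)
```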
Iterative forecasting strategy = conditional prediction problem
- \(\prod_{t=t_0}^T q_{\mathcal{X}}\left(\mathbf{x}_t^0 \mid \mathbf{x}_{1: t-1}^0, \mathbf{c}_{1: T}\right)=\prod_{t=t_0}^T p_\theta\left(\mathbf{x}_t^0 \mid \mathbf{F}_t\right)\)
- \(\theta\) : learnable parameters of \(R\)
  ( ScoreGrad uses an RNN for \(R\) )
b) Conditional SDE based score matching module
\(\mathbf{F}_t\) : used as a conditioner for SDE based score matching models at each time step.
Initial state distribution : \(p\left(\mathbf{x}_t^0 \mid \mathbf{F}_t\right)\)
Forward evolve process :
- \(d \mathbf{x}=f\left(\mathbf{x}, t_s\right) d t_s+g\left(t_s\right) d \mathbf{w}\).
Reverse-time SDE:
- \(d \mathbf{x}_t=\left[f\left(\mathbf{x}_t, t_s\right)-g\left(t_s\right)^2 \nabla_x \log p_{t_s}\left(\mathbf{x}_t \mid \mathbf{F}_t\right)\right] d t_s+g\left(t_s\right) d \mathbf{w}\).
c) Conditional score network
( Inspired by WaveNet & DiffWave )
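These notes do not detail the network, so the sketch below is only a generic conditional score network: it takes the perturbed sample, the condition \(\mathbf{F}_t\), and the diffusion time \(t_s\). The WaveNet/DiffWave-style dilated convolutions of the actual paper are omitted, and all layer sizes are assumptions.

```python
import torch

class ConditionalScoreNet(torch.nn.Module):
    """s_theta(x, F_t, t_s): score conditioned on the TS feature and diffusion time."""
    def __init__(self, x_dim, cond_dim, hidden=256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(x_dim + cond_dim + 1, hidden), torch.nn.SiLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.SiLU(),
            torch.nn.Linear(hidden, x_dim),
        )

    def forward(self, x, F_t, t_s):
        # t_s may be a scalar or a per-example (B, 1) tensor
        t = torch.as_tensor(t_s, dtype=x.dtype).reshape(-1, 1).expand(x.shape[0], 1)
        return self.net(torch.cat([x, F_t, t], dim=-1))
```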
d) Training
Loss at each time step \(t\):
\[\begin{aligned} & L_t(\theta)=\operatorname{argmin}_\theta \mathbb{E}_{t_s}\left(\lambda\left(t_s\right) \mathbb{E}_{\mathbf{x}_t^0} \mathbb{E}_{\mathbf{x}_t^{t_s} \mid \mathbf{x}_t^0}\right. \\ & \left.\quad\left[ \mid \mid s_\theta\left(\mathbf{x}_t^{t_s}, \mathbf{F}_t, t_s\right)-\nabla_{\mathbf{x}_t^{t_s}} \log p_{0 t_s}\left(\mathbf{x}_t^{t_s} \mid \mathbf{x}_t^0\right) \mid \mid _2^2\right]\right) \end{aligned}\]
Total loss: \(L(\theta)=\frac{1}{T} \sum_{t=1}^T L_t(\theta)\).
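A hedged end-to-end sketch of one training pass: it reuses the VP-SDE perturbation kernel with a linear \(\beta(t)\) schedule (an assumption, taken from the DDPM limit above), sets \(\lambda(t_s)\) to the kernel variance, and averages over the predicted steps. `feature_extractor` and `score_net` refer to the hypothetical modules sketched earlier.

```python
import torch

def scoregrad_train_loss(feature_extractor, score_net, series, covariates,
                         b_min=0.1, b_max=20.0):
    """Conditional denoising score matching, summed over time steps of one series batch."""
    total, F = 0.0, None
    T = series.shape[0]                                   # series: (T, B, D)
    for t in range(1, T):
        F = feature_extractor(F, series[t - 1], covariates[t - 1])  # F_t from x_{t-1}, c_{t-1}
        x0 = series[t]                                    # target x_t^0
        ts = torch.rand(x0.shape[0], 1) * (1 - 1e-3) + 1e-3          # diffusion time in (0, 1]
        # VP-SDE perturbation kernel: mean = x0 * exp(log_mean), var = 1 - exp(2 * log_mean)
        log_mean = -0.25 * ts ** 2 * (b_max - b_min) - 0.5 * ts * b_min
        mean, std = x0 * log_mean.exp(), (1 - (2.0 * log_mean).exp()).sqrt()
        noise = torch.randn_like(x0)
        x_ts = mean + std * noise                         # sample from p_{0 t_s}(x_t^{t_s} | x_t^0)
        target = -noise / std                             # score of the perturbation kernel
        s = score_net(x_ts, F, ts)
        total = total + (std ** 2 * (s - target) ** 2).sum(dim=-1).mean()   # lambda(t_s) = std^2
    return total / (T - 1)
```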