Retrieval-Augmented Diffusion Models for Time Series Forecasting
Contents
- Abstract
- Introduction
- Related Work
- Preliminary
- Method
- Experiments
0. Abstract
Factors limiting TS Diffusion
- (1) Insufficient TS datasets
- (2) Absence of guidance
RATD (Retrieval-Augmented Time series Diffusion model)
Consists of two parts
- (1) Embedding-based retrieval process
- (2) Reference-guided diffusion model
(1) Embedding-based retrieval process
- Retrieves the TS that are most relevant to historical TS from the database as references.
(2) Reference-guided diffusion model
- Uses the references to guide the denoising process
1. Introduction
TS forecasting tasks = Conditional generation tasks
Conditional generative models: learn \(P\left(\boldsymbol{x}^P \mid \boldsymbol{x}^H\right)\)
- Current SOTA: Diffusion models
Limitation of TS Diffusion
- Remains unstable in certain scenarios (Figure 1.c)
Why?
- (1) Most TS lack direct semantic or label correspondences
  \(\rightarrow\) Lack meaningful guidance during the generation (reverse) process
- (2) Two shortcomings of the TS datasets
  - 2-1) Insufficient
  - 2-2) Imbalanced
Propose RATD (Retrieval-Augmented Time series Diffusion model)
Contributions
- (1) Propose Retrieval Augmented Time series Diffusion (RATD)
- Better utilizes the dataset and provides meaningful guidance in the denoising process
- (2) Propose Reference Modulated Attention (RMA) module
- To provide reasonable guidance from the reference
- w/o introducing excessive additional computational costs
- (3) Experiments on five real-world datasets
2. Related Work
(1) TS Diffusion
TimeGrad [28]
- First to employ a conditional diffusion model for TS forecasting
- Autoregressive approach
- Denoising process guided by the hidden state
CSDI [36]
- Non-autoregressive generation strategy
SSSD [1]
- Replaced the noise-matching network with a structured state space model (SSSM)
TimeDiff [30]
- Incorporated future mix-up and autoregressive initialization into a non-autoregressive framework
MG-TSD [9]
- Utilized a multi-scale generation strategy
- To sequentially predict the main components and details of the TS.
mr-diff [31]
- Utilized diffusion models to separately predict the trend and seasonal components of TS
\(\rightarrow\) Often perform poorly in challenging prediction tasks.
(2) RAG (Retrieval Augmented Generation)
RAG:
- One of the classic mechanisms for generative models.
- Incorporating explicit retrieval steps into NN
NLP & CV domain
[NLP] Leverage retrieval augmentation mechanisms to enhance the quality of language generation
[CV] Focus on utilizing samples from the database to generate more realistic images
TS domain
a) [3]
- Employed memorized similarity information from training data for retrieval
b) MQ-ReTCNN [40]
- Specifically designed for complex TS forecasting tasks involving multiple entities and variables.
c) ReTime [13]
- Creates a relation graph based on the temporal closeness between sequences
- Employs relational retrieval instead of content-based retrieval.
The proposed method still holds significant advantages compared to a)–c).
= References can repeatedly influence the generation
\(\rightarrow\) Allowing references to exert a stronger influence on the entire conditional generation process.
3. Preliminary
Notation
- "\(s\)" = time step
- "\(t\)" = step in the diffusion process
Generative TS Forecasting
Notation
- Historical TS: \(\boldsymbol{x}^H=\) \(\left\{s_1, s_2, \cdots, s_l \mid s_i \in \mathbb{R}^d\right\}\),
- Target: \(\boldsymbol{x}^P = \left\{s_{l+1}, s_{l+2}, \cdots, s_{l+h} \mid s_{l+i} \in \mathbb{R}^{d^{\prime}}\right\}\left(d^{\prime} \leq d\right)\)
Learn a density \(p_\theta\left(\boldsymbol{x}^P \mid \boldsymbol{x}^H\right)\) that best approximates \(p\left(\boldsymbol{x}^P \mid \boldsymbol{x}^H\right)\),
- \(\min_{p_\theta} D\left(p_\theta\left(\boldsymbol{x}^P \mid \boldsymbol{x}^H\right) \,\|\, p\left(\boldsymbol{x}^P \mid \boldsymbol{x}^H\right)\right)\).
Full TS: \(\left\{s_1, s_2, \cdots, s_{l+h}\right\}=\left[\boldsymbol{x}^H, \boldsymbol{x}^P\right]\).
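To make the notation concrete, here is a tiny NumPy sketch of slicing a full series into \(\boldsymbol{x}^H\) and \(\boldsymbol{x}^P\) (the lengths and variable names are illustrative, not from the paper):

```python
import numpy as np

# Toy multivariate series with l history steps, h future steps, d channels.
l, h, d = 96, 24, 7
series = np.random.randn(l + h, d)        # full TS {s_1, ..., s_{l+h}}

x_hist = series[:l]                       # x^H, shape (l, d)
x_pred = series[l:]                       # x^P, shape (h, d)

# d' <= d: the target may cover only a subset of the d channels.
d_prime = 1
x_pred_subset = x_pred[:, :d_prime]       # shape (h, d')
```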
Conditional TS Diffusion Models
Forward process:
\(\begin{aligned} q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_{t-1}, \boldsymbol{x}^H\right) & :=\mathcal{N}\left(\boldsymbol{x}_t ; \sqrt{1-\beta_t}\, \boldsymbol{x}_{t-1}, \beta_t \boldsymbol{I}\right) \\ q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_0, \boldsymbol{x}^H\right) & :=\mathcal{N}\left(\boldsymbol{x}_t ; \sqrt{\bar{\alpha}_t}\, \boldsymbol{x}_0,\left(1-\bar{\alpha}_t\right) \boldsymbol{I}\right) \end{aligned}\).
Reverse process:
\(p_\theta\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{x}^H\right):=\mathcal{N}\left(\boldsymbol{x}_{t-1} ; \mu_\theta\left(\boldsymbol{x}_t, t \mid \boldsymbol{x}^H\right), \Sigma_\theta\left(\boldsymbol{x}_t, t \mid \boldsymbol{x}^H\right)\right)\).
Loss function:
\(\mathcal{L}\left(\boldsymbol{x}_0\right)=\sum_{t=1}^T \underset{q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_0, \boldsymbol{x}^H\right)}{\mathbb{E}}\left\|\mu_\theta\left(\boldsymbol{x}_t, t \mid \boldsymbol{x}^H\right)-\hat{\mu}\left(\boldsymbol{x}_t, \boldsymbol{x}_0 \mid \boldsymbol{x}^H\right)\right\|^2\).
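A minimal PyTorch sketch of the forward corruption \(q(\boldsymbol{x}_t \mid \boldsymbol{x}_0)\) and one Monte-Carlo term of the mean-matching loss above; the denoiser `mu_theta`, the linear noise schedule, and all shapes are assumptions for illustration, not the paper's implementation:

```python
import torch

T = 100
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule beta_t
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t

def q_sample(x0, t):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x0, (1 - abar_t) I)."""
    noise = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * noise

def loss_term(mu_theta, x0, x_hist):
    """|| mu_theta(x_t, t | x^H) - mu_hat(x_t, x_0) ||^2 for a random t."""
    t = torch.randint(1, T, (1,)).item()
    xt = q_sample(x0, t)
    # Closed-form posterior mean of q(x_{t-1} | x_t, x_0) (DDPM), used as mu_hat.
    abar_t, abar_prev, beta_t = alphas_bar[t], alphas_bar[t - 1], betas[t]
    mu_hat = (abar_prev.sqrt() * beta_t / (1 - abar_t)) * x0 \
           + ((1 - beta_t).sqrt() * (1 - abar_prev) / (1 - abar_t)) * xt
    return ((mu_theta(xt, t, x_hist) - mu_hat) ** 2).mean()
```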
4. Method
(1) Overview
(2) Constructing Retrieval DB for TS
Strategy for constructing DB
Use two different definitions of DB
Definition (1) Entire training set
- \(\mathcal{D}^{\mathcal{R}}:=\left\{\boldsymbol{x}_i \mid \forall \boldsymbol{x}_i \in \mathcal{D}^{\text {train }}\right\}\).
- where \(\boldsymbol{x}_i=\left\{s_i, \cdots, s_{i+l+h}\right\}\)
Definition (2) Subset containing samples from all categories
- \(\mathcal{D}^{R^{\prime}}=\left\{\boldsymbol{x}_1^c, \cdots, \boldsymbol{x}_q^c \mid \forall c \in \mathcal{C}\right\}\).
- where \(\boldsymbol{x}_i^c\) is the \(i\)-th sample in the \(c\)-th class of the training set
- \(\mathcal{C}\) is the category set of the original dataset
For brevity, we represent both databases as \(\mathcal{D}^R\).
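A rough sketch of how the database of definition (1) could be materialized as sliding windows over the training series; the window lengths `l`, `h` and the function name are placeholders, not the paper's implementation:

```python
import torch

def build_database(train_series, l, h):
    """D^R (definition (1)): every length-(l + h) window
    x_i = {s_i, ..., s_{i+l+h}} taken from the training series."""
    windows = []
    for i in range(train_series.shape[0] - (l + h) + 1):
        windows.append(train_series[i : i + l + h])   # shape (l + h, d)
    return torch.stack(windows)                        # shape (|D^R|, l + h, d)
```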
(3) Retrieval-Augmented Time Series Diffusion
a) Embedding-Based Retrieval Mechanism
Ideal references \(\left\{s_i, \cdots, s_{i+h}\right\}\)
= Samples whose preceding \(n\) points \(\left\{s_{i-n}, \cdots, s_{i-1}\right\}\) are most relevant to the historical TS \(\left\{s_j, \cdots, s_{j+n}\right\}\) in \(\mathcal{D}^R\).
Quantify the relevance between TS using the distance between their embeddings
- Pre-trained encoder \(E_\phi\)
- Trained on representation learning tasks
- Parameter set \(\phi\) is frozen
- \(\mathcal{D}_{\mathrm{emb}}^R=\left\{\left\{i, E_\phi\left(\boldsymbol{x}_{[0: n]}^i\right), \boldsymbol{x}_{[n: n+h]}^i\right\} \mid \forall \boldsymbol{x}^i \in \mathcal{D}^R\right\}\).
- where \([p: q]\) refers to the subsequence formed by the \(p\)-th point to the \(q\)-th point in the TS
- \(\boldsymbol{v}^H=E_\phi\left(\boldsymbol{x}^H\right)\).
Distance between
- (1) \(\boldsymbol{v}^H\)
- (2) all embeddings in \(\mathcal{D}_{\text {emb }}^R\)
Retrieve the references corresponding to the \(k\) smallest distances.
\(\begin{aligned} & \operatorname{index}\left(\boldsymbol{v}^H\right)=\underset{\boldsymbol{x}^i \in \mathcal{D}_{\text{emb}}^R}{\arg \min }\left\|\boldsymbol{v}^H-E_\phi\left(\boldsymbol{x}_{[0: n]}^i\right)\right\|^2 \\ & \boldsymbol{x}^R=\left\{\boldsymbol{x}_{[n: n+h]}^j \mid \forall j \in \operatorname{index}\left(\boldsymbol{v}^H\right)\right\} \end{aligned}\).
We obtain a subset \(\boldsymbol{x}^R\) of \(\mathcal{D}^R\) based on a query \(\boldsymbol{x}^H\),
\(\zeta_k: \boldsymbol{x}^H, \mathcal{D}^R \rightarrow \boldsymbol{x}^R\), where \(\left|\boldsymbol{x}^R\right|=k\).
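A corresponding sketch of the embedding store \(\mathcal{D}_{\mathrm{emb}}^R\) and the retrieval map \(\zeta_k\) as brute-force squared-L2 search; `E_phi` stands for the frozen pre-trained encoder, `n` plays the role of the history length, and all names are assumptions for illustration:

```python
import torch

@torch.no_grad()
def build_emb_db(db, E_phi, n, h):
    """D^R_emb: (E_phi(x_[0:n]), x_[n:n+h]) for every window in D^R."""
    embs = E_phi(db[:, :n])                  # embeddings of the history parts
    refs = db[:, n : n + h]                  # future parts kept as candidate references
    return embs, refs

@torch.no_grad()
def retrieve(x_hist, E_phi, embs, refs, k=3):
    """zeta_k: (x^H, D^R) -> x^R, the k references whose stored history
    embeddings are closest to v^H = E_phi(x^H) in squared L2 distance."""
    v_h = E_phi(x_hist.unsqueeze(0))                       # v^H, shape (1, e)
    dists = ((embs - v_h) ** 2).sum(dim=-1)                # shape (|D^R|,)
    idx = torch.topk(dists, k, largest=False).indices      # k smallest distances
    return refs[idx]                                       # x^R, shape (k, h, d)
```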
b) Reference-Guided TS Diffusion Model
Reverse process
- Infer the posterior distribution \(p\left(\boldsymbol{x} \mid \boldsymbol{x}^H\right)\)
\(p\left(\boldsymbol{x} \mid \boldsymbol{x}^H\right)=\int p\left(\boldsymbol{x}_T \mid \boldsymbol{x}^H\right) \prod_{t=1}^T p_\theta\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{x}^H, \boldsymbol{x}^R\right) \mathrm{d} \boldsymbol{x}_{1: T}\).
- where \(p\left(\boldsymbol{x}_T \mid \boldsymbol{x}^H\right) \approx \mathcal{N}\left(\boldsymbol{x}_T \mid \boldsymbol{x}^H, \boldsymbol{I}\right)\) and \(p_\theta\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{x}^H, \boldsymbol{x}^R\right)\) follows the Gaussian assumption below
- Assumption: \(p_\theta\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{x}^H, \boldsymbol{x}^R\right)=\mathcal{N}\left(\boldsymbol{x}_{t-1} ; \mu_\theta\left(\boldsymbol{x}_t, \boldsymbol{x}^H, \boldsymbol{x}^R, t\right), \Sigma_\theta\left(\boldsymbol{x}_t, \boldsymbol{x}^H, \boldsymbol{x}^R, t\right)\right)\).
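A schematic ancestral-sampling loop under the Gaussian assumption above; `model` is assumed to output \(\mu_\theta(\boldsymbol{x}_t, \boldsymbol{x}^H, \boldsymbol{x}^R, t)\) directly and the variance is fixed to the DDPM posterior variance \(\tilde{\beta}_t\), both simplifications rather than the paper's exact parameterization:

```python
import torch

@torch.no_grad()
def reference_guided_sample(model, x_hist, x_ref, shape, betas):
    """Sample x^P by iterating p_theta(x_{t-1} | x_t, x^H, x^R) from t = T down to 1."""
    T = betas.shape[0]
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    x_t = torch.randn(shape)                       # initial x_T (standard Gaussian here)
    for t in reversed(range(T)):
        mu = model(x_t, t, x_hist, x_ref)          # mu_theta(x_t, x^H, x^R, t)
        if t > 0:
            beta_tilde = (1 - alphas_bar[t - 1]) / (1 - alphas_bar[t]) * betas[t]
            x_t = mu + beta_tilde.sqrt() * torch.randn_like(x_t)
        else:
            x_t = mu
    return x_t
```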
c) Training Procedure
Loss at time step \(t-1\) :
\(\begin{aligned} L_{t-1}^{(x)} & =\frac{1}{2 \tilde{\beta}_t^2}\left\|\mu_\theta\left(\boldsymbol{x}_t, \hat{\boldsymbol{x}}_0\right)-\hat{\mu}\left(\boldsymbol{x}_t, \boldsymbol{x}_0\right)\right\|^2 \\ & =\gamma_t\left\|\boldsymbol{x}_0-\hat{\boldsymbol{x}}_0\right\|^2 \end{aligned}\).
- where \(\hat{\boldsymbol{x}}_0\) is predicted from \(\boldsymbol{x}_t\),
- and \(\gamma_t=\frac{\bar{\alpha}_{t-1} \beta_t^2}{2 \tilde{\beta}_t^2\left(1-\bar{\alpha}_t\right)^2}\)
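An illustrative training step for the reconstruction form of the loss above, with the \(\gamma_t\) weighting; here the network is assumed to predict \(\hat{\boldsymbol{x}}_0\) directly from \(\boldsymbol{x}_t\), \(\boldsymbol{x}^H\), and \(\boldsymbol{x}^R\), which is one common parameterization rather than the paper's exact one:

```python
import torch

def training_loss(model, x0, x_hist, x_ref, betas, alphas_bar):
    """Single stochastic term gamma_t * ||x_0 - x0_hat||^2 (cf. L_{t-1}^{(x)})."""
    T = betas.shape[0]
    t = torch.randint(1, T, (1,)).item()
    noise = torch.randn_like(x0)
    x_t = alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * noise
    x0_hat = model(x_t, t, x_hist, x_ref)                   # predicted \hat{x}_0
    beta_tilde = (1 - alphas_bar[t - 1]) / (1 - alphas_bar[t]) * betas[t]
    gamma_t = alphas_bar[t - 1] * betas[t] ** 2 / (2 * beta_tilde ** 2 * (1 - alphas_bar[t]) ** 2)
    return gamma_t * ((x0 - x0_hat) ** 2).mean()
```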