ROSE: Register Assisted General Time Series Forecasting with Decomposed Frequency Learning


Contents

  1. Abstract
  2. Introduction
  3. Related Works
    1. TS Forecasting
    2. TS Pre-training
  4. Methodology
    1. Architecture
    2. Decomposed Frequency Learning
    3. Time series register
    4. Training
  5. Experiments


0. Abstract

General TS forecasting models

  • pre-trained on a large number of TS datasets


Two challenges:

  • (1) How to obtain unified representations from multi-domain TS data
  • (2) How to capture domain-specific features from TS data


ROSE

Register Assisted General Time Series Forecasting Model with Decomposed Frequency Learning

  • Novel pre-trained model for TS forecasting

  • Pretraining task: Decomposed Frequency Learning

    • decomposes coupled semantic and periodic information in TS

      ( with frequency-based masking and reconstruction )

  • Time Series Register

    • learns to generate a register codebook

      • to capture “domain-specific” representations during pre-training
    • enhances “domain-adaptive” transfer

      • by selecting related register tokens on downstream tasks

  • SOTA forecasting performance on 8 real-world benchmarks

    ( + few-shot scenarios )


1. Introduction

Challenges of applying a general TS model to heterogeneous TS data

  • (1) Obtaining a unified representation
  • (2) Adaptive transfer of information from multi-domain TS to specific downstream scenarios


(1) Obtaining a UNIFIED representation

  • TS from each domain = multiple frequency components ( = frequency superposition )
  • Frequency superposition
    • leads to coupled semantic and periodic information in the time domain.
  • Large-scale TS data from different domains
    • complex temporal patterns and frequency diversity


(2) ADAPTIVE transfer of information from multi-domain TS

  • Multi-source TS data originate from various domains

    • exhibit domain-specific information
  • Existing time series pre-training frameworks

    • focus mainly on capturing generalized features during pre-training
    • overlook domain-specific features

    \(\rightarrow\) Necessary to learn domain-specific information during pre-training and adaptively transfer the specific representations to each target domain ( = adaptive transfer )


Two difficulties of adaptive transfer

  • (a) Capturing domain-specific information in pre-training
  • (b) Adaptive use of domain-specific information in various downstream tasks

( figure )


ROSE

(1) Decomposed Frequency Learning

  • Learns generalized representations to solve the issue with coupled semantic and periodic information.
  • Step 1) Decomposition with Fourier transform
    • With a novel frequency-based mask method
  • Step 2) Convert it back to the time domain

    • Obtain decoupled TS for reconstruction.

    \(\rightarrow\) Disentanglement … benefits the model in learning generalized representations


(2) Time Series Register (TS-Register)

  • Learns domain-specific information in multi-domain data
  • Register codebook
    • [Pretraining] Generate register tokens to learn each domain-specific information
    • [Downstream] Adaptively selects vectors from the register codebook that are close to the target domain of interest.
  • During fine-tuning, we incorporate learnable vectors into the selected register tokens to complement target-specific information, enabling more flexible adaptive transfer.


Contribution

  1. ROSE = Novel general TS forecasting model
    • using multi-domain datasets for pre-training
    • improving downstream fine-tuning performance and efficiency
  2. Propose a novel Decomposed Frequency Learning

    • employs multi-frequency masking to decouple complex temporal patterns

      \(\rightarrow\) empowers the model’s generalization capability

  3. Propose TS-Register
    • (pre-training) to capture domain-specific information
    • (fine-tuning) enables adaptive transfer of target-oriented specific information for downstream tasks
  4. Our experiments with 8 real-world benchmarks


2. Related Works

(1) TS forecasting

pass


(2) TS pre-training

Multi-source TS pre-training

  • MOMENT [31] and MOIRAI [8]
    • adopt a BERT-style pre-training approach
  • Timer [2] and PreDcT [32]
    • use a GPT-style pre-training approach, giving rise to improved performance in TS prediction.

\(\rightarrow\) The above methods overlook domain-specific information from multi-source data


ROSE

  • Pre-trains on large-scale data from various domains

  • Considers both generalized representations and domain-specific information

    \(\rightarrow\) Facilitates flexible adaptive transfer in downstream tasks


3. Methodology

Problem Definition

Multivariate TS: \(\mathbf{X}_t=\left\{\mathbf{x}_{t-L: t}^i\right\}_{i=1}^C\), where \(\mathbf{x}_{t-L: t}^i \in \mathbb{R}^L\)

  • \(L\) : look-back window
  • \(C\) : number of channels


Forecasting task

  • Predict the future values \(\hat{\mathbf{Y}}_t=\left\{\hat{\mathbf{x}}_{t: t+F}^i\right\}_{i=1}^C\),
    • \(F\): forecast horizon
  • \(\mathbf{Y}_t=\left\{\mathbf{x}_{t: t+F}^i\right\}_{i=1}^C\) : ground truth


General TS forecasting model

  • [Pre-train] with multi-source datasets \(\mathbf{D}_{\text {pre-train }}=\) \(\left\{\left(\mathbf{X}_t^j, \mathbf{Y}_t^j\right)\right\}_{j=1}^N\),
    • where \(N\) is the number of datasets.
  • [Downstream task]
    • fine-tuned with a training dataset \(\mathbf{D}_{\text {train }}=\left\{\left(\mathbf{X}_t^{\text {train }}, \mathbf{Y}_t^{\text {train }}\right)\right\}\)
    • tested with \(\mathbf{D}_{\text {test }}=\left\{\left(\mathbf{X}_t^{\text {test }}, \mathbf{Y}_t^{\text {test }}\right)\right\}\) to predict \(\hat{\mathbf{Y}}_t^{\text {test }}\),
  • where \(\mathbf{D}_{\text {pre-train }}, \mathbf{D}_{\text {train }}\) and \(\mathbf{D}_{\text {test }}\) are pairwise disjoint.


( + Alternatively, the model could be directly tested using \(\mathbf{D}_{\text {test }}\) without fine-tuning with \(\mathbf{D}_{\text {train }}\) to predict \(\hat{\mathbf{Y}}_t^{\text {test }}\) )
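
To make the problem setup concrete, here is a minimal sketch (not from the paper; `make_windows` is a hypothetical helper) of how \((\mathbf{X}, \mathbf{Y})\) pairs can be sliced from one channel of a raw series:

```python
import numpy as np

def make_windows(series: np.ndarray, L: int, F: int):
    """Slice one channel into (look-back, horizon) training pairs.

    series: shape (T,). Returns X: (n, L) inputs and Y: (n, F) targets,
    where each row pairs a look-back window with the horizon after it.
    """
    starts = range(L, len(series) - F + 1)
    X = np.stack([series[t - L:t] for t in starts])
    Y = np.stack([series[t:t + F] for t in starts])
    return X, Y

# e.g. X, Y = make_windows(np.arange(1000, dtype=float), L=512, F=96)
```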


(1) Architecture

TS forecasting paradigm of ROSE contains two steps

  • (1) Pre-training
    • pre-trained on large-scale datasets from various domains, with
      • 1-1) Reconstruction
      • 1-2) Prediction
  • (2) Fine-tuning
    • fine-tunes with a target dataset in the downstream scenario


( figure )


  • Encoder-Decoder architecture
  • Backbone = Transformer
    • effectively capture temporal dependencies
  • Channel-independent (CI)


a) Input Representations

To enhance the generalization of ROSE for adaptive transfer …

\(\rightarrow\) model the inputs \(\mathbf{x}\) with patch tokens and register tokens


Two tokens

  • (1) Patch tokens
    • obtained by partitioning the TS using patching layers
  • (2) Register tokens
    • obtained by linearly mapping the entire input TS and clustering it into a discrete embedding
    • to capture domain-specific information


(2) Decomposed frequency learning

TS = composed of multiple combined frequency components

  • Resulting in the overlap of different temporal variations.
  • Low-frequency
    • overall trends and longer-scale variations
  • High-frequency
    • information about short-term fluctuations and shorter-scale variations


\(\rightarrow\) Propose decomposed frequency learning

  • based on multi-freq masking
  • to understand the TS from multiple-frequency perspectives
    • which enhances the model’s ability to learn generalized representations


( figure )


b) Multi-freq masking

\[\mathbf{x}_{\text {freq }}=\operatorname{rFFT}(\mathbf{x})\]
  • Input: \(\mathbf{x} \in \mathbb{R}^L\),

  • Real Fast Fourier Transform \((\mathrm{rFFT})\)

  • Output: \(\mathbf{x}_{\mathrm{freq}} \in \mathbb{R}^{L / 2}\).


To separately model high-frequency and low-frequency information …

Sample …

  • \(K_{\mathrm{f}}\) thresholds: \(\tau_1, \tau_2, \tau_3, \ldots, \tau_{K_{\mathrm{f}}}\)
    • where \(\tau_i \sim \operatorname{Uniform}(0, a)\), \(a<L / 2\),
  • \(K_{\mathrm{f}}\) random numbers: \(\mu_1, \mu_2, \mu_3, \ldots, \mu_{K_{\mathrm{f}}}\)
    • where \(\mu_i \sim \operatorname{Bernoulli}(p)\).

for multi-frequency masks


Each pair of \(\tau_i\) and \(\mu_i\) = \(i\)-th frequency mask

\(\rightarrow\) Generates a mask matrix \(\mathbf{M} \in\{0,1\}^{K_{\mathrm{f}} \times L / 2}\),

  • Row = \(i\)-th frequency mask
  • Column = \(j\)-th frequency
    • mask=0: masked
    • mask=1: unmasked


\[m_{i j}= \begin{cases}\mu_i, & \text {if } \mathbf{x}_{\mathrm{freq}_j}<\tau_i \\ 1-\mu_i, & \text {if } \mathbf{x}_{\mathrm{freq}_j} \geq \tau_i\end{cases}\]

  • \(\tau_i\) and \(\mu_i\) : threshold and random number for the \(i\)-th frequency domain mask
  • \(\mathbf{x}_{\text {freq }_j}\) : \(j\)-th frequency of \(\mathbf{x}_{\text {freq }}\).
    • If \(\mu_i=1\) … mask HIGH frequency
    • If \(\mu_i=0\) … mask LOW frequency


\(\mathbf{X}_{\text {mask }}=\operatorname{irFFT}\left(\mathbf{X}_{\text {freq }} \odot \mathbf{M}\right)\).

  • Step 1) Obtain the mask matrix \(\mathbf{M}\) via multi-freq masking above

  • Step 2) Replicate \(\mathbf{x}_{\text {freq }}\) \(K_{\mathrm{f}}\) times

    \(\rightarrow\) Get \(\mathbf{X}_{\text {freq }} \in \mathbb{R}^{K_{\mathrm{f}} \times L / 2}\)

  • Step 3) Element-wise Hadamard product with the mask matrix \(\mathbf{M}\)

    \(\rightarrow\) Get masked frequency of TS

  • Step 4) inverse Real Fast Fourier Transform (irFFT) to convert back to the time domain

    \(\rightarrow\) Get \(K_{\mathrm{f}}\) masked sequences \(\mathbf{X}_{\text {mask }}=\left\{\mathbf{x}_{\text {mask }}^i\right\}_{i=1}^{K_{\mathrm{f}}}\),

    • where each \(\mathbf{x}_{\text {mask }}^i \in \mathbb{R}^L\) corresponds to masking with a different threshold \(\tau_i\).
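
A minimal PyTorch sketch of the multi-frequency masking above. Since \(\tau \in (0, a)\) with \(a < L/2\), I read the threshold comparison as being on the frequency index \(j\); `K_f`, `a`, and `p` are hyper-parameters:

```python
import torch

def multi_freq_mask(x: torch.Tensor, K_f: int, a: int, p: float = 0.5) -> torch.Tensor:
    """Multi-frequency masking: return K_f copies of x, each with either
    its low- or high-frequency band zeroed out. x has shape (L,)."""
    L = x.shape[-1]
    x_freq = torch.fft.rfft(x)                       # complex spectrum
    n_freq = x_freq.shape[-1]

    tau = torch.randint(0, a, (K_f, 1))              # thresholds tau_i
    mu = torch.bernoulli(torch.full((K_f, 1), p))    # band choices mu_i
    j = torch.arange(n_freq).unsqueeze(0)            # frequency index j

    # m_ij = mu_i if j < tau_i else 1 - mu_i   (0 = masked)
    M = torch.where(j < tau, mu, 1.0 - mu)

    X_freq = x_freq.unsqueeze(0).repeat(K_f, 1)      # replicate K_f times
    return torch.fft.irfft(X_freq * M, n=L)          # (K_f, L) masked series
```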


c) Representation learning

Step 1) Divide each masked sequence \(\mathbf{x}_{\text {mask }}^i \in \mathbb{R}^L\) into \(P\) non-overlapping patches


Step 2) Patch Embedding … \(P\) patch tokens ( with linear layer)

\(\rightarrow\) Get \(\mathcal{X}_{\mathrm{mp}}=\left\{\mathbf{X}_{\mathrm{mp}}^i\right\}_{i=1}^{K_{\mathrm{f}}}\) to capture general information

  • where \(\mathbf{X}_{\mathrm{mp}}^i \in \mathbb{R}^{P \times D}\)


Step 3) Replicate the register tokens \(\mathbf{X}_{\mathrm{u}}\) \(K_{\mathrm{f}}\) times

\(\rightarrow\) Get \(\mathcal{X}_{\mathrm{u}} \in \mathbb{R}^{K_{\mathrm{f}} \times N_{\mathrm{r}} \times D}\)

  • where \(\mathbf{X}_{\mathrm{u}} \in \mathbb{R}^{N_{\mathrm{r}} \times D}\) is obtained by inputting the original sequence into the TS-Register,


Step 4) Concatenate (a) & (b)

  • (a) Patch tokens \(\mathcal{X}_{\mathrm{mp}}\)
  • (b) Register tokens \(\mathcal{X}_{\mathrm{u}}\)


Step 5) Transformer encoder

  • Representation of each masked series.
  • Aggregated to yield a unified representation \(\mathbf{S} \in \mathbb{R}^{\left(N_{\mathrm{r}}+P\right) \times D}\) :

\(\mathbf{S}=\operatorname{Aggregator}(\operatorname{Encoder}(\operatorname{Concatenate}(\mathcal{X}_{\mathrm{mp}}, \mathcal{X}_{\mathrm{u}})))\).
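
A sketch of Steps 1-5 in PyTorch. The Aggregator is not specified in these notes, so mean-pooling over the \(K_{\mathrm{f}}\) masked views is an assumption, as are the layer sizes:

```python
import torch
import torch.nn as nn

class MaskedRepresentation(nn.Module):
    """Patchify each masked series, prefix register tokens, encode,
    then mean-pool over the K_f masked views (assumed Aggregator)."""

    def __init__(self, L: int, P: int, D: int):
        super().__init__()
        assert L % P == 0
        self.patch_len = L // P
        self.patch_embed = nn.Linear(self.patch_len, D)
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)

    def forward(self, X_mask: torch.Tensor, X_u: torch.Tensor) -> torch.Tensor:
        # X_mask: (K_f, L) masked series; X_u: (N_r, D) register tokens
        K_f = X_mask.shape[0]
        patches = X_mask.reshape(K_f, -1, self.patch_len)  # (K_f, P, len)
        X_mp = self.patch_embed(patches)                   # (K_f, P, D)
        X_u = X_u.unsqueeze(0).expand(K_f, -1, -1)         # replicate K_f times
        tokens = torch.cat([X_u, X_mp], dim=1)             # (K_f, N_r+P, D)
        return self.encoder(tokens).mean(dim=0)            # S: (N_r+P, D)
```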


d) Reconstruction task

Feed \(\mathbf{S}\) into the reconstruction decoder

Reconstruct the original sequence \(\hat{\mathbf{x}} \in \mathbb{R}^L\)

Note that frequency-domain masking affects the overall TS, so we compute the Mean Squared Error (MSE) reconstruction loss over the entire TS:

  • \(\mathcal{L}_{\text {reconstruction }}=\left\|\mathbf{x}-\hat{\mathbf{x}}\right\|_2^2\).


(3) Time series register

By decomposed frequency learning…

\(\rightarrow\) we can obtain the domain-GENERAL representations


Additionally, propose the TS-Register for domain-SPECIFIC information


TS-Register

  • Step 1) Clusters domain-specific information into register tokens
  • Step 2) (Pre-training) Stores such domain-specific information in the register codebook
  • Step 3) (Fine-tuning) Adaptively selects similar information from the register codebook
    • to enhance the target domain


Details

  • Randomly initialized register codebook \(\mathbf{E} \in \mathbb{R}^{H \times D_{\mathrm{r}}}\)
    • with \(H\) cluster center vectors \(\mathbf{e}_i \in \mathbb{R}^{D_{\mathrm{r}}}, i \in\{1,2, \ldots, H\}\).
  • Projection: input TS \(\mathbf{x} \in \mathbb{R}^L\) \(\rightarrow\) embedding \(\mathbf{x}_{\mathrm{e}} \in \mathbb{R}^{D_{\mathrm{r}}}\)
    • through a linear layer.


a) Pre-training stage.

  • Use the register codebook to cluster these embeddings

    ( which generates domain-specific cluster centers )

  • Store them in the codebook during pre-training.
  • Find a cluster center vector \(\mathbf{e}_\delta\) from the register codebook \(\mathbf{E}\)
    • \(\delta=\underset{j=1: H}{\arg \min }\left\|\mathbf{x}_{\mathrm{e}}-\mathbf{e}_j\right\|_2\).
  • To update the cluster center vectors … loss function:
    • \(\mathcal{L}_{\text {register }}=\left\|\operatorname{sg}\left(\mathbf{x}_{\mathrm{e}}\right)-\mathbf{e}_\delta\right\|_2^2+\left\|\mathbf{x}_{\mathrm{e}}-\operatorname{sg}\left(\mathbf{e}_\delta\right)\right\|_2^2\), where \(\operatorname{sg}(\cdot)\) is the stop-gradient operator.
      • First term = update the register codebook \(\mathbf{E}\),
      • Second term = update the parameters of the linear layer that learns \(\mathbf{x}_{\mathrm{e}}\).
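
The lookup and loss above follow the VQ-VAE recipe; a minimal sketch (the straight-through trick is a standard assumption, not stated in these notes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TSRegister(nn.Module):
    """VQ-style register codebook sketch."""

    def __init__(self, L: int, H: int, D_r: int):
        super().__init__()
        self.proj = nn.Linear(L, D_r)                      # x -> x_e
        self.codebook = nn.Parameter(torch.randn(H, D_r))  # E

    def forward(self, x: torch.Tensor):
        x_e = self.proj(x)                                 # (D_r,)
        dists = torch.norm(self.codebook - x_e, dim=-1)    # (H,)
        e = self.codebook[torch.argmin(dists)]             # nearest center
        # first term updates the codebook, second the projection layer
        loss = F.mse_loss(x_e.detach(), e) + F.mse_loss(x_e, e.detach())
        e_st = x_e + (e - x_e).detach()   # straight-through for downstream use
        return e_st, loss
```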


Summary: Vectors in the register codebook \(\mathbf{E}\)

  • cluster the embeddings of different data

  • learn the domain-specific centers for pre-trained datasets

    ( = represent domain-specific information )


\(\mathbf{X}_{\mathrm{u}} \in \mathbb{R}^{N_{\mathrm{r}} \times D}\) = register tokens

  • Cluster center vector \(\mathbf{e}_\delta\) is then patched into register tokens
    • \(N_{\mathrm{r}}\): # of the register tokens
  • Used as the prefix of the patch tokens \(\mathbf{X}_{\mathrm{p}} \in \mathbb{R}^{P \times D}\)
    • provide domain-specific information.


b) Fine-tuning stage

Freeze the register parameters

  • to adaptively use this domain-specific information

top-k strategy

  • Selects top-k similar vectors in the register codebook
  • Uses their average as \(\overline{\mathbf{e}}_k\)
    • \(\rightarrow\) also patched into \(\mathbf{X}_{\mathrm{d}} \in \mathbb{R}^{N_{\mathrm{r}} \times D}\)

\(\overline{\mathbf{e}}_k=\frac{1}{k} \sum_{i=1}^k \mathbf{e}_{\delta_i}, \quad\left\{\delta_1, \cdots, \delta_k\right\}=\underset{j=1: H}{\operatorname{argTopk}}\left(\frac{1}{\left\|\mathbf{x}_{\mathrm{e}}-\mathbf{e}_j\right\|_2}\right)\).


Additionally set a matrix \(\mathbf{A} \in \mathbb{R}^{N_{\mathrm{r}} \times D}\) to adjust \(\mathbf{X}_{\mathrm{d}}\)

\(\rightarrow\) to complement the specific information of downstream data.


\(\mathbf{A}\) is set as a low-rank matrix:

  • \(\mathbf{A}=\mathbf{u} \times \mathbf{v}^{\mathrm{T}}\).
    • where \(\mathbf{u} \in \mathbb{R}^{N_{\mathrm{r}}}\) and \(\mathbf{v} \in \mathbb{R}^D\),
  • only the vectors \(\mathbf{u}\) and \(\mathbf{v}\) need to be retrained in the fine-tuning step


Register token \(\mathbf{X}_{\mathrm{r}}\) of the downstream scenario

  • \(\mathbf{X}_{\mathrm{r}}=\mathbf{X}_{\mathrm{d}} \odot \mathbf{A}\).
    • \(\mathbf{X}_{\mathrm{d}}\) = domain-specific information obtained at the pre-train stage
    • \(\mathbf{A}\) = downstream dataset-specific information.
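
A sketch of the whole fine-tuning register path. How \(\overline{\mathbf{e}}_k\) is “patched” into \(N_{\mathrm{r}}\) tokens is not spelled out in these notes, so the reshape below assumes \(D_{\mathrm{r}} = N_{\mathrm{r}} \times D\):

```python
import torch

def downstream_register_tokens(x_e, codebook, u, v, N_r, D, k=3):
    """Fine-tuning register path sketch. x_e: (D_r,) target embedding;
    codebook: (H, D_r), frozen; u: (N_r,) and v: (D,) are the only
    trainable parts (A = u v^T)."""
    dists = torch.norm(codebook - x_e, dim=-1)      # (H,)
    idx = torch.topk(-dists, k).indices             # k nearest centers
    e_bar = codebook[idx].mean(dim=0)               # (D_r,)
    X_d = e_bar.reshape(N_r, D)                     # "patched" register tokens
    A = torch.outer(u, v)                           # low-rank (N_r, D)
    return X_d * A                                  # X_r = X_d ⊙ A
```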


(4) Training

Co-train ..

  • (1) Supervised prediction
  • (2) Self-supervised reconstruction


Why (1)?

  • Uses multi-frequency mask to learn unified features that are more applicable to the downstream prediction task.


a) Prediction task (SL)

Notation

  • Input TS: \(\mathbf{x} \in \mathbb{R}^L\)
  • Patched & mapped input TS: \(\mathbf{X}_{\mathrm{p}} \in \mathbb{R}^{P \times D}\)


Four prediction heads

  • mapping to prediction lengths of \(\{96,192,336,720\}\)


Concatenation

  • (1) Patch tokens \(\mathbf{X}_{\mathrm{p}}\)
  • (2) Register tokens \(\mathbf{X}_{\mathrm{u}}\)

\(\rightarrow\) [(1), (2)] Feed into the transformer encoder, prediction decoder, and prediction heads to obtain four prediction results \(\hat{\mathbf{Y}}_F\), where \(F \in\) \(\{96,192,336,720\}\).


\(\mathcal{L}_{\text {prediction }}=\sum_{F \in\{96,192,336,720\}}\left\|\mathbf{Y}_F-\hat{\mathbf{Y}}_F\right\|_2^2\).
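
A sketch of the four-head prediction path and its loss (how the decoder output is flattened into each head is an assumption):

```python
import torch
import torch.nn as nn

HORIZONS = (96, 192, 336, 720)

class PredictionHeads(nn.Module):
    """One linear head per forecast horizon."""

    def __init__(self, n_tokens: int, D: int):
        super().__init__()
        self.heads = nn.ModuleDict(
            {str(h): nn.Linear(n_tokens * D, h) for h in HORIZONS})

    def forward(self, z: torch.Tensor) -> dict:
        # z: (n_tokens, D) decoder output for one series
        flat = z.flatten()
        return {h: self.heads[str(h)](flat) for h in HORIZONS}

def prediction_loss(preds: dict, targets: dict) -> torch.Tensor:
    # sum of squared errors over the four horizons
    return sum(((targets[h] - preds[h]) ** 2).sum() for h in HORIZONS)
```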


b) Pre-training

Learns generalized features

\(\rightarrow\) Parameters of the reconstruction decoder are copied to the prediction decoder

( To avoid prediction training affecting the generalization performance of the model, gradients from the prediction heads are stopped at back-propagation )


Overall pre-training loss

\(\mathcal{L}_{\text {pre-train }}=\mathcal{L}_{\text {reconstruction }}+\mathcal{L}_{\text {prediction }}+\mathcal{L}_{\text {register }}\).
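
One plausible reading of the two tricks above as a toy snippet: the reconstruction decoder's weights are copied (not shared) into the prediction decoder, and a `detach` keeps prediction gradients from reaching the shared representation (shapes and module types are placeholders):

```python
import copy
import torch
import torch.nn as nn

D = 64
recon_decoder = nn.Linear(D, D)
pred_decoder = copy.deepcopy(recon_decoder)   # weights copied, not shared

head = nn.Linear(D, 96)                       # one of the four heads
z = torch.randn(8, D, requires_grad=True)     # encoder output (toy shape)

x_hat = recon_decoder(z)                      # reconstruction path trains z
y_hat = head(pred_decoder(z.detach()))        # .detach() stops prediction
                                              # gradients from reaching the
                                              # shared representation
```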


c) Fine-tuning

pass


4. Experiments

(1) Main Results

( figure )


(2) Ablation studies

( figure )


(3) Scalability

( figure )


(4) Model Analysis

a) Zero-shot & Inference efficiency

( figure )


b) Visualization of TS-Register

( figure )
