TimePFN: Effective Multivariate Time Series Forecasting with Synthetic Data
https://arxiv.org/pdf/2502.16294
Contents
- Abstract
- Introduction
- Related Works
- Proposed Methods
- PFN for MTS Forecasting
- Synthetic MTS Data Generation
- Architecture for TimePFN
Abstract
TimePFN
- Task: MTS Forecasting
- (1) Generating synthetic MTS data
- (2) Novel MTS architecture
- Capturing both TD & CD across all input patches
Experiments
- (1) **SoTA for MTS forecasting**
- zero-shot & few-shot
- (2) Fine-tuning TimePFN
- with 500 data points \(\rightarrow\) nearly matches full dataset training error
- (3) Strong UTS forecasting
1. Introduction
Existing setup: Train task == Test task
- [Good] “Large-scale” datasets
- [Bad] “Small-scale” datasets / “OOD” test set
TimePFN
MTS forecasting from a “data-centric” perspective
- [Dataset] Generate realistic and diverse large-scale MTS data
- [Architecture] Capable of extracting TS features from this large-scale synthetic dataset
- Allows for transfer learning to novel tasks with arbitrary number of channels
Contributions
1. [Dataset]
- New method to generate “synthetic MTS”
- via “Gaussian processes” with kernel compositions and a linear coregionalization model
2. [Architecture]
- Variation of PatchTST for MTS forecasting
- Incorporates channel mixing
- Employs a “convolutional embedding” for patch embeddings
\(\rightarrow\) Effectively extracts cross-channel relations
- First PFN for MTS forecasting
- Strong “few/zero-shot” performance
- (+ Strong UTS forecasting performance)
2. Related Works
(1) TS Forecasting (TSF)
(1) Informer
- ProbSparse attention
- Quadratic complexity \(\rightarrow\) “Log-linear” complexity
(2) Fedformer
- Uses sparsity of the TS in the “Fourier domain”
(3) PatchTST
- Tokenization: Patching with overlapping strides; each patch serves as a token
- CI: Each channel as univariate
- Joint learning across all channels through the same set of shared weights
(4) iTransformer
- 1 variate = 1 token
- Benefits of utilizing CD
(5) TimePFN (Proposed)
- Deviates from PatchTST!
- (1) Incorporate convolutional layers before patching
- (2) Use channel-mixing to capture interactions between tokens from different channels
(2) Zero-shot TSF
(1) Chronos
- Novel tokenization methods, employed quantization, and made TS data resemble language
- Enable the training of LLM architectures for probabilistic univariate forecasting
- Employs a data augmentation technique called KernelSynth
- Generates synthetic time series data using Gaussian processes
(2) ForecastPFN
- Trained entirely on a synthetic dataset
(3) JEPA + PFNs
- Integrates Joint-Embedding Predictive Architectures with PFNs for zero-shot forecasting
(4) Mamba4Cast
- Trained entirely on synthetic data using the Mamba architecture as its backbone
(5) TimePFN (Proposed)
- Introduces the first MTS PFN
- Architecture that enables strong zero-shot and few-shot performances
3. Proposed Methods
Two key aspects
- (1) Synthetic MTS generation (encapsulates TD & CD)
- (2) Architecture capable of generalization to real datasets when trained on such a dataset
(1) PFN for MTS Forecasting
\(\mathcal{D} := \{t, \mathbf{X}_t\}_{t=1}^{T}\): \(N\)-channel MTS
- \(\mathbf{X}_t := [x_{t,1}, \ldots, x_{t,N}]\).
Notation
- Hypothesis space: \(\Omega\)
- with a prior distribution \(p(\omega)\)
- Hypothesis: \(\omega \in \Omega\)
- Models an MTS-generating process (e.g., \(\mathbf{X}_t = \omega(t)\))
- Example)
- \(\Omega\): Space of hypotheses for VAR models
- Particular instance \(\omega \in \Omega\) corresponds to a specific VAR process (e.g., VAR(2))
- \(p(\cdot \mid T, \mathcal{D})\): Posterior predictive distribution (PPD) of \(\mathbf{x} \in \mathbb{R}^N\) at time \(T\)
\(\begin{equation} p(\mathbf{x} \mid T, \mathcal{D}) \propto \int_{\Omega} p(\mathbf{x} \mid T, \omega)\, p(\mathcal{D} \mid \omega)\, p(\omega)\, d\omega. \end{equation}\).
Posterior predictive distribution (PPD)
- Approximated using PFNs
Procedures
- Step 1) Iteratively sample a hypothesis \(\omega\)
- from the hypothesis space \(\Omega\)
- according to the probability \(p(\omega)\)
- Step 2) Generate a prior dataset \(\mathcal{D}\) from this hypothesis
- \(\mathcal{D} \sim p(\mathcal{D} \mid \omega)\).
- Step 3) Optimize the parameters of the PFN on these generated datasets (see the sketch at the end of this subsection)
- Datasets
- \(\mathcal{D}_{\text{input}} := \{t, \mathbf{X}_t\}_{t=1}^{\tilde{T}}\).
- \(\mathcal{D}_{\text{output}} := \{t, \mathbf{X}_t\}_{t=\tilde{T}+1}^{T}\).
- Train the PFN to forecast \(\mathcal{D}_{\text{output}}\) from \(\mathcal{D}_{\text{input}}\) using standard models
[TimePFN] Hypothesis space \(\Omega\) : Single-input, multi-output Gaussian processes
- represented by the linear model of coregionalization (LMC)
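A minimal sketch of this prior-fitting loop in PyTorch. Assumptions: `sample_series` bundles Steps 1-2 (e.g., drawing an LMC Gaussian process and sampling a series from it), and the context/horizon split and MSE loss are illustrative, not the paper's exact choices.

```python
import torch

def train_pfn(model, optimizer, sample_series, steps, T=512, T_ctx=416):
    """Prior-fitting sketch. `sample_series(T)` is assumed to perform Steps 1-2:
    draw a hypothesis omega ~ p(omega) and return a (T, N) series D ~ p(D | omega)."""
    loss_fn = torch.nn.MSELoss()
    for _ in range(steps):
        series = sample_series(T)                    # Steps 1-2: omega, then D | omega
        context, target = series[:T_ctx], series[T_ctx:]
        pred = model(context.unsqueeze(0))           # Step 3: forecast D_output from D_input
        loss = loss_fn(pred, target.unsqueeze(0))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```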
(2) Synthetic MTS Data Generation
Goal of MTS data generation
- Goal 1) “Realistic” variates
- Goal 2) “Correlated” variates
KernelSynth (feat. Chronos (2024))
- Addresses Goal 1)
- Enrich its training corpus by “randomly composing kernels” to generate diverse, synthetic UTS
- Composes kernels via binary operators (e.g., addition and multiplication); see the sketch below
- Aggregates kernels of various types with different parameters
- (various types) Linear, Periodic, Squared-Exponential, and Rational Quadratic
- (parameters) daily, weekly, and monthly periodic kernels
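A hedged sketch of KernelSynth-style generation with scikit-learn GP kernels. The kernel bank, composition depth, and periods below are assumptions, not the exact values used by Chronos.

```python
import numpy as np
from sklearn.gaussian_process.kernels import (
    RBF, DotProduct, ExpSineSquared, RationalQuadratic)

rng = np.random.default_rng(0)
T = 256
t = np.arange(T, dtype=float).reshape(-1, 1)

# Kernel bank: linear, periodic (e.g., daily/weekly-style periods),
# squared-exponential, rational quadratic.
bank = [DotProduct(), ExpSineSquared(periodicity=24.0),
        ExpSineSquared(periodicity=168.0), RBF(length_scale=12.0),
        RationalQuadratic(length_scale=24.0)]

# Randomly compose a few kernels with binary operators (+ or *).
kernel = bank[rng.integers(len(bank))]
for _ in range(rng.integers(1, 3)):
    other = bank[rng.integers(len(bank))]
    kernel = kernel + other if rng.random() < 0.5 else kernel * other

# Draw one univariate series from the zero-mean GP prior.
cov = kernel(t) + 1e-6 * np.eye(T)   # jitter for numerical stability
series = rng.multivariate_normal(np.zeros(T), cov)
```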
Correlated Variables (Goal 2)?
[TimePFN] Generative Gaussian modelling
- Linear model of coregionalization (LMC)
- Outputs are obtained as linear combinations of independent latent random functions
Given \(t \in \mathbb{R}^T\), each channel's output \(\{C_i(t)\}_{i=1}^{N}\) is a linear combination of \(L\) latent functions
- \(C_i(t) = \sum_{j=1}^{L} \alpha_{i,j} \, l_j(t)\).
- Latent functions are independent with zero-mean
- Resulting output: zero-mean, with a PSD (positive semi-definite) covariance function
Convex combinations (to avoid scaling issues)
- For each \(i\), \(\alpha_{i,1} + \cdots + \alpha_{i,L} = 1\) with \(\alpha_{i,j} \ge 0\)
LMC formulation: Encapsulates the cases where the correlations between different variates are small or nonexistent
- e.g., independent variables: \(L = N\) with \(C_i(t) = l_i(t)\).
\(\rightarrow\) Such a modelling is important, as …
- (1) Some MTS data have strong correlation
- (2) Some MTS data have weak correlation
LMC-Synth (see the sketch below)
- Sample the number of latent functions: from a Weibull distribution
- Sample \([\alpha_{i,1}, \ldots, \alpha_{i,L}]\): from a Dirichlet distribution
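A minimal sketch of the LMC-Synth idea. The Weibull/Dirichlet parameters are assumptions, and `draw_latent` stands in for a KernelSynth-style univariate GP draw like the one sketched earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

def lmc_synth(draw_latent, T=256, N=7):
    """Mix independent zero-mean latent GP series into N correlated channels."""
    L = max(1, int(np.ceil(rng.weibull(a=1.5))))              # number of latent functions
    latents = np.stack([draw_latent(T) for _ in range(L)])    # (L, T), independent draws
    alphas = rng.dirichlet(np.ones(L), size=N)                # (N, L): rows are convex weights
    return alphas @ latents                                   # (N, T): C_i(t) = sum_j alpha_ij l_j(t)
```

Setting \(L = N\) with one-hot weight rows recovers independent channels, matching the special case noted above.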
(3) Architecture for TimePFN
Goal: Architecture to achieve better generalization when applied to real-world datasets
Primary advantage of the PFN framework?
- Synthesizing large-scale synthetic MTS data is feasible with LMC-Synth
\(\rightarrow\) No longer constrained by data scarcity
- Previous MTS models: compelled to balance model complexity vs. limited data
\(\rightarrow\) Often restricted the use of certain components to avoid overfitting
- TimePFN: access to large-scale MTS data
\(\rightarrow\) Can expand the architecture!
TimePFN
- Resembles PatchTST
- Differs significantly in two areas
- (1) Convolutional filtering of the variates (prior to patching)
- (2) Channel-mixing.
a) Convolutional Filtering
Done before patching!
Procedures
- Step 1) Learnable 1D conv to each variate (shared weights)
- Step 2) 1D magnitude max pooling to each newly generated variate
\(\rightarrow\) Yields a new set of 1D series per variate (see the sketch at the end of this subsection)
Notation
- MTS dataset
- \(X = [x_1, \ldots, x_N]\): \(N\) variates, each of length \(L\)
- Convolutional filtering: \(x_i \in \mathbb{R}^L\) \(\rightarrow\) \(\bar{x}_i \in \mathbb{R}^{(C+1)\times L}\),
- \(C\) rows: 1D conv + magnitude max pooling ( Use \(C = 9\) in practice )
- \(1\) row: Original \(x_i\).
- Convolutions are a valuable tool!
- e.g., differencing to de-trend data can be effectively represented by convolutions
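A hedged PyTorch sketch of the convolutional-filtering step, with \(C = 9\) as above. The kernel size, pooling window, and same-length padding are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvFiltering(nn.Module):
    """Per-variate 1D convolutions with shared weights, followed by
    magnitude max pooling; the original series is kept as one extra row."""
    def __init__(self, C=9, kernel_size=5):
        super().__init__()
        # One input channel per variate, C learned filters shared across all variates.
        self.conv = nn.Conv1d(1, C, kernel_size, padding=kernel_size // 2)
        self.pool = kernel_size  # pooling window (assumption)

    def magnitude_max_pool(self, x):
        # Keep the value with the largest |.| in each window (stride 1, same length).
        pos = F.max_pool1d(x, self.pool, stride=1, padding=self.pool // 2)
        neg = F.max_pool1d(-x, self.pool, stride=1, padding=self.pool // 2)
        return torch.where(pos >= neg, pos, -neg)

    def forward(self, x):                                # x: (B, N, L) multivariate series
        B, N, L = x.shape
        z = self.conv(x.reshape(B * N, 1, L))            # (B*N, C, L)
        z = self.magnitude_max_pool(z)
        z = z.reshape(B, N, -1, L)                       # (B, N, C, L)
        return torch.cat([x.unsqueeze(2), z], dim=2)     # (B, N, C+1, L)

# Usage: (B, N, L) series -> (B, N, C+1, L) filtered variates
# feats = ConvFiltering()(torch.randn(2, 7, 96))
```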
b) Patch Embeddings
- Input: \(\bar{x}_i \in \mathbb{R}^{(C+1)\times L}\)
- Overlapping patches of size \(P\) with a stride of \(S\),
- Use \((P = 16, S = 8)\)
- Each patch is then flattened & fed into a 2-layer FFN (into \(D\) dim); see the sketch below
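A sketch of the patch-embedding step under these settings (\(P = 16\), \(S = 8\)); the FFN hidden width and GELU activation are assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Overlapping patches over the filtered variates, flattened and mapped to D dims."""
    def __init__(self, C_plus_1=10, P=16, S=8, D=128):
        super().__init__()
        self.P, self.S = P, S
        self.ffn = nn.Sequential(
            nn.Linear(C_plus_1 * P, D), nn.GELU(), nn.Linear(D, D))

    def forward(self, x):                                # x: (B, N, C+1, L)
        patches = x.unfold(3, self.P, self.S)            # (B, N, C+1, n_patches, P)
        patches = patches.permute(0, 1, 3, 2, 4)         # (B, N, n_patches, C+1, P)
        patches = patches.flatten(start_dim=3)           # flatten each patch
        return self.ffn(patches)                         # (B, N, n_patches, D)
```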
c) Channel-mixing
Different from PatchTST!
- PatchTST: CI
- TimePFN: CD
Input all tokens into the transformer encoder after applying the positional encodings
\(\rightarrow\) Tokens from different variates can attend to each other!
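A shape-level sketch of channel mixing: tokens from all variates are flattened into one sequence before self-attention, unlike channel-independent models. Layer sizes and the learned positional encodings are assumptions.

```python
import torch
import torch.nn as nn

B, N, n_patches, D = 4, 7, 13, 128
tokens = torch.randn(B, N, n_patches, D)               # output of the patch embedding
pos = nn.Parameter(torch.randn(1, N * n_patches, D))   # learned positional encodings
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True), num_layers=3)

seq = tokens.reshape(B, N * n_patches, D) + pos        # one joint token sequence
out = encoder(seq)                                     # (B, N*n_patches, D), channel-mixed
```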
d) Transformer Encoder
- Vanilla Transformer encoder
- With the Transformer output (see the sketch below):
- Step 1) Rearrange them into their respective channels
- Step 2) Channel-wise flattening operation.
- Step 3) 2-layer NN: Processes the flattened variate representations using shared weights
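A sketch of Steps 1-3 on the encoder output; the hidden width and forecast horizon \(H\) are assumptions.

```python
import torch
import torch.nn as nn

B, N, n_patches, D, H = 4, 7, 13, 128, 96
out = torch.randn(B, N * n_patches, D)                 # Transformer encoder output
head = nn.Sequential(                                  # shared 2-layer forecasting head
    nn.Linear(n_patches * D, 256), nn.GELU(), nn.Linear(256, H))

per_channel = out.reshape(B, N, n_patches, D)          # Step 1: back into channels
flat = per_channel.flatten(start_dim=2)                # Step 2: (B, N, n_patches*D)
forecast = head(flat)                                  # Step 3: shared weights -> (B, N, H)
```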
e) Normalization
Normalize each variate \(x_i\) to \(N(0,1)\) prior to all other processing described above (feat. RevIN); see the sketch below
Before forecasting, we revert the TS to its original scale (by de-normalizing)
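A minimal sketch of the normalize/de-normalize pair in the spirit of RevIN, without RevIN's learnable affine parameters (which may differ from the paper's exact setup).

```python
import torch

def normalize(x, eps=1e-5):                 # x: (B, N, L) raw variates
    mean = x.mean(dim=-1, keepdim=True)
    std = x.std(dim=-1, keepdim=True) + eps
    return (x - mean) / std, mean, std      # each variate ~ N(0, 1)

def denormalize(forecast, mean, std):       # forecast: (B, N, H)
    return forecast * std + mean            # revert to the original scale
```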
f) Architectural Details
- Fixed input sequence length
- Fixed forecasting lengths
- Arbitrary number of variates