TimePFN: Effective Multivariate Time Series Forecasting with Synthetic Data

https://arxiv.org/pdf/2502.16294


Contents

  1. Abstract
  2. Introduction
  3. Related Works
  4. Proposed Methods
    1. PFN for MTS Forecasting
    2. Synthetic MTS Data Generation
    3. Architecture for TimePFN


Abstract

TimePFN

  • Task: MTS Forecasting
  • (1) Generating synthetic MTS data
  • (2) Novel MTS architecture
    • Capturing both temporal dependency (TD) & channel dependency (CD) across all input patches


Experiments

  • (1) **SoTA for MTS forecasting**
    • zero-shot & few-shot
  • (2) Fine-tuning TimePFN
    • with 500 data points \(\rightarrow\) nearly matches full dataset training error
  • (3) Strong UTS forecasting


1. Introduction

Existing setup: Train task == Test task

  • [Good] “Large”-scaled datasets
  • [Bad] “Small”-scaled datasets / “OOD” test set


TimePFN

MTS forecasting from a “data-centric” perspective

  • [Dataset] Generate realistic and diverse large-scale MTS data
  • [Architecture] Capable of extracting TS features from this large-scale synthetic dataset
    • Allows for transfer learning to novel tasks with arbitrary number of channels


Contributions

**1. [Dataset]**

  • New method to generate “synthetic MTS”
    • via “Gaussian processes” with kernel compositions and a linear coregionalization model

2. [Architecture]

  • Variation of PatchTST for MTS forecasting

    • Incorporates channel mixing
    • Employs a “convolutional embedding” for patch embeddings

    \(\rightarrow\) Effectively extract cross-channel relations

  • First PFN for MTS forecasting

    • Strong “few/zero-shot” performance
    • (+ Strong UTS forecasting performance)


2. Related Works

(1) TS Forecasting (TSF)

(1) Informer

  • ProbSparse attention
  • Quadratic complexity \(\rightarrow\) “Log-linear” complexity

(2) Fedformer

  • Uses sparsity of the TS in the “Fourier domain”

(3) PatchTST

  • Tokenization: Patching with overlapping strides (each patch serves as a token)
  • CI (channel independence): Each channel treated as a univariate series
  • Joint learning across all channels through the same set of shared weights

(4) iTransformer

  • 1 variate = 1 token
  • Benefits of utilizing CD

(5) TimePFN (Proposed)

  • Deviates from PatchTST!
    • (1) Incorporate convolutional layers before patching
    • (2) Use channel-mixing to capture interactions between tokens from different channels


(2) Zero-shot TSF

(1) Chronos

  • Novel tokenization: quantizes TS values so the data resemble language tokens
  • Enables training LLM architectures for probabilistic univariate forecasting
  • Employs a data augmentation technique called KernelSynth
    • Generates synthetic time series data using Gaussian processes

(2) ForecastPFN

  • Trained entirely on a synthetic dataset

(3) JEPA + PFNs

  • Integrates Joint-Embedding Predictive Architectures with PFNs for zero-shot forecasting

(4) Mamba4Cast

  • Trained entirely on synthetic data using the Mamba architecture as its backbone

(5) TimePFN (Proposed)

  • Introduces the first MTS PFN
  • Architecture that enables strong zero-shot and few-shot performances


3. Proposed Methods

Two key aspects

  • (1) Synthetic MTS generation (encapsulates TD & CD)
  • (2) Architecture capable of generalization to real datasets when trained on such a dataset


(1) PFN for MTS Forecasting

\(\mathcal{D} := \{t, \mathbf{X}_t\}_{t=1}^{T}\): \(N\)-channel MTS

  • \(\mathbf{X}_t := [x_{t,1}, \ldots, x_{t,N}]\).


Notation

  • Hypothesis space: \(\Omega\)
    • with a prior distribution \(p(\omega)\)
  • Hypothesis: \(\omega \in \Omega\)
    • Models an MTS generating process (e.g., \(\mathbf{X}_t = \omega(t)\))
  • Example)
    • \(\Omega\): Space of hypotheses for VAR models
      • Particular instance \(\omega \in \Omega\) corresponds to a specific VAR process (e.g., VAR(2))
  • \(p(\cdot \mid T, \mathcal{D})\): Posterior predictive distribution (PPD) of \(\mathbf{x} \in \mathbb{R}^N\) at time \(T\)


\(\begin{equation} p(\mathbf{x} \mid T, \mathcal{D}) \propto \int_{\Omega} p(\mathbf{x} \mid T, \omega)\, p(\mathcal{D} \mid \omega)\, p(\omega)\, d\omega. \end{equation}\).


Posterior predictive distribution (PPD)

  • Approximated using PFNs


Procedures

  • Step 1) Iteratively sample a hypothesis \(\omega\)
    • from the hypothesis space \(\Omega\)
    • according to the probability \(p(\omega)\)
  • Step 2) Generate a prior dataset \(\mathcal{D}\) from this hypothesis
    • \(\mathcal{D} \sim p(\mathcal{D} \mid \omega)\).
  • Step 3) Optimize the parameters of the PFN on these generated datasets

    • Datasets
      • \(\mathcal{D}_{\text{input}} := \{t, \mathbf{X}_t\}_{t=1}^{\tilde{T}}\).
      • \(\mathcal{D}_{\text{output}} := \{t, \mathbf{X}_t\}_{t=\tilde{T}+1}^{T}\).
    • Train the PFN to forecast \(\mathcal{D}_{\text{output}}\) from \(\mathcal{D}_{\text{input}}\) using standard models
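A minimal PyTorch sketch of this prior-fitting loop; `sample_hypothesis`, `generate_series`, and `model` are hypothetical placeholders for the hypothesis sampler, the MTS generator, and the PFN itself (not the paper's exact code):

```python
import torch

def prior_fitting_loop(model, sample_hypothesis, generate_series, optimizer,
                       steps=10_000, T_in=96, T_out=96):
    """Sketch of PFN training: sample omega ~ p(omega), draw D ~ p(D | omega),
    then train the network to forecast D_output from D_input."""
    loss_fn = torch.nn.MSELoss()
    for _ in range(steps):
        omega = sample_hypothesis()                      # Step 1: omega ~ p(omega)
        series = generate_series(omega, T_in + T_out)    # Step 2: D ~ p(D | omega), shape (N, T_in + T_out)
        d_input, d_output = series[:, :T_in], series[:, T_in:]
        pred = model(d_input.unsqueeze(0))               # Step 3: forecast D_output from D_input
        loss = loss_fn(pred, d_output.unsqueeze(0))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```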


[TimePFN] Hypothesis space \(\Omega\) : Single-input, multi-output Gaussian processes

  • represented by the linear model of coregionalization (LMC)


(2) Synthetic MTS Data Generation

Goal of MTS data generation

  • Goal 1) “Realistic” variates
  • Goal 2) “Correlated” variates


KernelSynth (feat. Chronos (2024))

  • Addresses Goal 1)

  • Enriches its training corpus by “randomly composing kernels” to generate diverse, synthetic UTS
    • Kernels are combined via binary operators (addition and multiplication)
  • Aggregates kernels of various types with different parameters
    • (various types) Linear, Periodic, Squared-Exponential, and Rational Quadratic
    • (parameters) e.g., daily, weekly, and monthly periodic kernels
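A rough sketch of KernelSynth-style generation using scikit-learn's GP kernels; the kernel bank, parameter ranges, and composition depth here are assumptions, not Chronos's exact settings:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF, DotProduct, ExpSineSquared, RationalQuadratic)

def random_composed_kernel(rng, max_kernels=3):
    """Compose a random number of base kernels via + or * (KernelSynth-style)."""
    bank = [DotProduct(),                                     # linear
            RBF(length_scale=rng.uniform(1, 50)),             # squared-exponential
            RationalQuadratic(length_scale=rng.uniform(1, 50)),
            ExpSineSquared(periodicity=rng.choice([24, 24 * 7, 24 * 30]))]  # daily/weekly/monthly
    kernel = bank[rng.integers(len(bank))]
    for _ in range(rng.integers(max_kernels)):
        other = bank[rng.integers(len(bank))]
        kernel = kernel + other if rng.random() < 0.5 else kernel * other
    return kernel

def sample_univariate_series(length=512, seed=0):
    """Draw one synthetic UTS from a GP prior with a randomly composed kernel."""
    rng = np.random.default_rng(seed)
    t = np.arange(length, dtype=float)[:, None]
    gp = GaussianProcessRegressor(kernel=random_composed_kernel(rng))
    return gp.sample_y(t, random_state=int(rng.integers(1 << 31))).ravel()
```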


Correlated Variables (Goal 2)?

[TimePFN] Generative Gaussian modelling

  • Linear model of coregionalization (LMC)
  • Outputs are obtained as linear combinations of independent latent random functions


Given \(t \in \mathbb{R}^T\), the output in each channel \(\{C_i(t)\}_{i=1}^{N}\) is a linear combination of \(L\) latent functions

  • \(C_i(t) = \sum_{j=1}^{L} \alpha_{i,j} \, l_j(t)\).

  • Latent functions are independent with zero mean
  • Resulting output: a zero-mean process whose covariance is a valid PSD function


Convex combinations (to avoid scaling issues)

  • For each \(i\), \(\alpha_{i,1} + \cdots + \alpha_{i,L} = 1\) with \(\alpha_{i,j} \ge 0\)


LMC formulation: Encapsulates the cases where the correlations between different variates are small or nonexistent

  • e.g., independent variables: \(L = N\) with \(C_i(t) = l_i(t)\).

\(\rightarrow\) Such a modelling is important, as …

  • (1) Some MTS data have strong correlation
  • (2) Some MTS data have weak correlation


LMC-Synth

  • Sample the number of latent functions: from a Weibull distribution
  • Sample \([\alpha_{i,1}, \ldots, \alpha_{i,L}]\): from a Dirichlet distribution
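A minimal LMC-Synth sketch, reusing `sample_univariate_series` from the KernelSynth sketch above; the Weibull and Dirichlet parameters are assumptions:

```python
import numpy as np

def lmc_synth(num_channels=8, length=512, seed=0):
    """Each channel C_i(t) is a convex combination of L independent latent GP samples."""
    rng = np.random.default_rng(seed)
    # Number of latent functions L ~ Weibull (rounded up, at least 1); parameters are guesses.
    L = max(1, int(np.ceil(3 * rng.weibull(a=1.5))))
    # Independent zero-mean latent functions l_1(t), ..., l_L(t),
    # each drawn from a GP with a randomly composed kernel (see the sketch above).
    latents = np.stack([sample_univariate_series(length, seed=int(rng.integers(1 << 31)))
                        for _ in range(L)])                # (L, length)
    # Convex mixing weights per channel: [alpha_{i,1}, ..., alpha_{i,L}] ~ Dirichlet.
    alphas = rng.dirichlet(np.ones(L), size=num_channels)  # (num_channels, L)
    return alphas @ latents                                # C_i(t) = sum_j alpha_{i,j} l_j(t)
```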


(3) Architecture for TimePFN

Goal: Architecture to achieve better generalization when applied to real-world datasets


Primary advantage of the PFN framework?

  • Synthesizing large-scale MTS data is feasible with LMC-Synth

    \(\rightarrow\) No longer constrained by data scarcity

  • Previous MTS models

    • Compelled to balance model complexity vs. limited data

      \(\rightarrow\) Often restricting the use of certain components to avoid overfitting

  • TimePFN: Access to large-scale MTS data

    \(\rightarrow\) Expand the architecture!!!


TimePFN

  • Resembles PatchTST
  • Differs significantly in two areas
    • (1) Convolutional filtering of the variates (prior to patching)
    • (2) Channel-mixing.


a) Convolutional Filtering

Done before patching!


Procedures

  • Step 1) Learnable 1D conv to each variate (shared weights)
  • Step 2) 1D magnitude max pooling to each newly generated variate

\(\rightarrow\) Yields a new set of filtered 1D variates!


Notation

  • MTS dataset
    • \(X = [x_1, \ldots, x_N]\), with \(N\) variates, each of length \(L\)
  • Convolutional filtering: \(x_i \in \mathbb{R}^L\) \(\rightarrow\) \(\bar{x}_i \in \mathbb{R}^{(C+1)\times L}\),
    • \(C\) rows: 1D conv + magnitude max pooling (use \(C = 9\) in practice)
    • \(1\) row: Original \(x_i\).
  • Convolution is a valuable tool!
    • e.g., differencing to de-trend data can be effectively represented by convolutions
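A PyTorch sketch of the convolutional filtering step; apart from \(C = 9\), the kernel sizes and the sign-preserving pooling details are assumptions:

```python
import torch
import torch.nn as nn

class ConvFiltering(nn.Module):
    """Sketch: shared 1D convolutions + magnitude max pooling per variate.
    Maps x_i in R^L to x_bar_i in R^{(C+1) x L}."""
    def __init__(self, num_filters: int = 9, kernel_size: int = 5):
        super().__init__()
        # One conv shared across all variates; 'same' padding keeps length L.
        self.conv = nn.Conv1d(1, num_filters, kernel_size, padding=kernel_size // 2)
        self.pool = nn.MaxPool1d(kernel_size=3, stride=1, padding=1, return_indices=True)

    def forward(self, x):                                # x: (batch, N, L)
        b, n, length = x.shape
        z = self.conv(x.reshape(b * n, 1, length))       # (b*N, C, L)
        # Magnitude max pooling: keep the entry with the largest |value| in each window.
        _, idx = self.pool(z.abs())
        z = torch.gather(z, dim=-1, index=idx)           # sign-preserving
        z = z.reshape(b, n, -1, length)                  # (b, N, C, L)
        # Append the original variate as the (C+1)-th row.
        return torch.cat([z, x.unsqueeze(2)], dim=2)     # (b, N, C+1, L)
```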


b) Patch Embeddings

  • Input: \(\bar{x}_i \in \mathbb{R}^{(C+1)\times L}\),

  • Overlapping patches of size \(P\) with a stride of \(S\),
    • Use \((P = 16, S = 8)\)
  • Each patch is then flattened & fed into a 2-layer FFN (into \(D\) dimensions)
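A sketch of the patch-embedding step with \(P = 16\), \(S = 8\); the hidden width and activation of the 2-layer FFN are assumptions:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch: split x_bar_i in R^{(C+1) x L} into overlapping patches of size P
    with stride S, flatten each patch, and embed it via a 2-layer FFN."""
    def __init__(self, channels: int = 10, patch_len: int = 16,
                 stride: int = 8, d_model: int = 256):
        super().__init__()
        self.patch_len, self.stride = patch_len, stride
        self.ffn = nn.Sequential(
            nn.Linear(channels * patch_len, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, x_bar):                                     # x_bar: (batch, N, C+1, L)
        patches = x_bar.unfold(-1, self.patch_len, self.stride)   # (b, N, C+1, num_patches, P)
        patches = patches.permute(0, 1, 3, 2, 4).flatten(-2)      # (b, N, num_patches, (C+1)*P)
        return self.ffn(patches)                                  # (b, N, num_patches, D)
```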


c) Channel-mixing

Different from PatchTST!

  • PatchTST: CI
  • TimePFN: CD

Input all tokens into the transformer encoder after applying the positional encodings

\(\rightarrow\) Tokens from different variates can attend to each other!
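A toy illustration of channel-mixing: tokens from all variates form one joint sequence so attention can cross channel boundaries (the shapes and the learnable positional encoding are assumptions):

```python
import torch
import torch.nn as nn

b, n, p, d = 2, 4, 12, 256                        # batch, variates, patches per variate, model dim
tokens = torch.randn(b, n, p, d)                  # e.g., output of the patch-embedding sketch above
pos = nn.Parameter(torch.randn(1, n * p, d))      # positional encoding (learnable here; an assumption)
mixed = tokens.reshape(b, n * p, d) + pos         # single sequence of N * num_patches tokens
layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=3)
encoded = encoder(mixed)                          # tokens of different variates attend to each other
```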


d) Transformer Encoder

  • Vanilla Transformer encoder
  • With the Transformer output:
    • Step 1) Rearrange tokens into their respective channels
    • Step 2) Channel-wise flattening operation
    • Step 3) 2-layer NN: Processes the flattened variate representations using shared weights
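A sketch of this output stage, continuing from the encoder output in the previous sketch (the horizon length and head width are assumptions):

```python
import torch
import torch.nn as nn

b, n, p, d, horizon = 2, 4, 12, 256, 96
encoded = torch.randn(b, n * p, d)                # encoder output (see the channel-mixing sketch)
per_channel = encoded.reshape(b, n, p, d)         # Step 1: rearrange tokens into their channels
flat = per_channel.flatten(start_dim=2)           # Step 2: channel-wise flattening -> (b, N, p * D)
head = nn.Sequential(                             # Step 3: 2-layer NN shared across variates
    nn.Linear(p * d, d), nn.GELU(), nn.Linear(d, horizon))
forecast = head(flat)                             # (b, N, horizon)
```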


e) Normalization

Normalize each variate \(x_i\) to \(\mathcal{N}(0,1)\) prior to any other processing described above (cf. RevIN)

Before forecasting, we revert the TS to its original scale (by de-normalizing)
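A minimal RevIN-style sketch of the normalization and de-normalization steps (the epsilon and the per-instance statistics are assumptions):

```python
import torch

def normalize(x, eps=1e-5):
    """Per-variate instance normalization: map each x_i approximately to N(0, 1)."""
    mean = x.mean(dim=-1, keepdim=True)           # x: (batch, N, L)
    std = x.std(dim=-1, keepdim=True) + eps
    return (x - mean) / std, (mean, std)

def denormalize(forecast, stats):
    """Revert forecasts to each variate's original scale."""
    mean, std = stats
    return forecast * std + mean
```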


f) Architectural Details

  • Fixed input sequence
  • Fixed forecasting lengths
  • Arbitrary number of variates
