TimePFN: Effective Multivariate Time Series Forecasting with Synthetic Data
https://arxiv.org/pdf/2502.16294
Contents
- Abstract
- Introduction
- Related Works
- Proposed Methods
- PFN for MTS Forecasting
- Synthetic MTS Data Generation
- Architecture for TimePFN
Abstract
TimePFN
- Task: MTS Forecasting
- (1) Generating synthetic MTS data
- (2) Novel MTS architecture
- Capturing both TD & CD across all input patches
Experiments
- (1) **SoTA for MTS forecasting**
- zero-shot & few-shot
- (2) Fine-tuning TimePFN
- with 500 data points \(\rightarrow\) nearly matches full dataset training error
- (3) Strong UTS forecasting
1. Introduction
Existing setup: Train task == Test task
- [Good] “Large-scale” datasets
- [Bad] “Small-scale” datasets / “OOD” test set
TimePFN
MTS forecasting from a “data-centric” perspective
- [Dataset] Generate realistic and diverse large-scale MTS data
- [Architecture] Capable of extracting TS features from this large-scale synthetic dataset
- Allows for transfer learning to novel tasks with arbitrary number of channels
Contributions
1. [Dataset]
- New method to generate “synthetic MTS”
- via “Gaussian processes” with kernel compositions and a linear coregionalization model
2. [Architecture]
- Variation of PatchTST for MTS forecasting
- Incorporates channel mixing
- Employs a “convolutional embedding” for patch embeddings
\(\rightarrow\) Effectively extracts cross-channel relations
- First PFN for MTS forecasting
- Strong “few/zero-shot” performance
- (+ Strong UTS forecasting performance)
2. Related Works
(1) TS Forecasting (TSF)
(1) Informer
- ProbSparse attention
- Quadratic complexity \(\rightarrow\) “Log-linear” complexity
(2) Fedformer
- Uses sparsity of the TS in the “Fourier domain”
(3) PatchTST
- Tokenization: Patching with overlapping strides; each patch serves as a token
- CI: Each channel as univariate
- Joint learning across all channels through the same set of shared weights
(4) iTransformer
- 1 variate = 1 token
- Benefits of utilizing CD
(5) TimePFN (Proposed)
- Deviates from PatchTST!
- (1) Incorporate convolutional layers before patching
- (2) Use channel-mixing to capture interactions between tokens from different channels
(2) Zero-shot TSF
(1) Chronos
- Novel tokenization methods, employed quantization, and made TS data resemble language
- Enable the training of LLM architectures for probabilistic univariate forecasting
- Employs a data augmentation technique called KernelSynth
- Generates synthetic time series data using Gaussian processes
(2) ForecastPFN
- Trained entirely on a synthetic dataset
(3) JEPA + PFNs
- Integrates Joint-Embedding Predictive Architectures with PFNs for zero-shot forecasting
(4) Mamba4Cast
- Trained entirely on synthetic data using the Mamba architecture as its backbone
(5) TimePFN (Proposed)
- Introduces the first MTS PFN
- Architecture that enables strong zero-shot and few-shot performances
3. Proposed Methods
Two key aspects
- (1) Synthetic MTS generation (encapsulates TD & CD)
- (2) Architecture capable of generalization to real datasets when trained on such a dataset
(1) PFN for MTS Forecasting
\(\mathcal{D} := \{t, \mathbf{X}_t\}_{t=1}^{T}\): \(N\)-channel MTS
- \(\mathbf{X}_t := [x_{t,1}, \ldots, x_{t,N}]\).
Notation
- Hypothesis space: \(\Omega\)
- with a prior distribution \(p(\omega)\)
- Hypothesis: \(\omega \in \Omega\)
- Models an MTS-generating process (e.g., \(\mathbf{X}_t = \omega(t)\))
- Example)
- \(\Omega\): Space of hypotheses for VAR models
- Particular instance \(\omega \in \Omega\) corresponds to a specific VAR process (e.g., VAR(2))
- \(p(\cdot \mid T, \mathcal{D})\): Posterior predictive distribution (PPD) of \(\mathbf{x} \in \mathbb{R}^N\) at time \(T\)
\(\begin{equation} p(\mathbf{x} \mid T, \mathcal{D}) \propto \int_{\Omega} p(\mathbf{x} \mid T, \omega)\, p(\mathcal{D} \mid \omega)\, p(\omega)\, d\omega. \end{equation}\).
Posterior predictive distribution (PPD)
- Approximated using PFNs
Procedures
- Step 1) Iteratively sample a hypothesis \(\omega\)
- from the hypothesis space \(\Omega\)
- according to the probability \(p(\omega)\)
- Step 2) Generate a prior dataset \(\mathcal{D}\) from this hypothesis
- \(\mathcal{D} \sim p(\mathcal{D} \mid \omega)\).
- Step 3) Optimize the parameters of the PFN on these generated datasets (see the sketch at the end of this subsection)
- Datasets
- \(\mathcal{D}_{\text{input}} := \{t, \mathbf{X}_t\}_{t=1}^{\tilde{T}}\).
- \(\mathcal{D}_{\text{output}} := \{t, \mathbf{X}_t\}_{t=\tilde{T}+1}^{T}\).
- Train the PFN to forecast \(\mathcal{D}_{\text{output}}\) from \(\mathcal{D}_{\text{input}}\) using standard models
[TimePFN] Hypothesis space \(\Omega\) : Single-input, multi-output Gaussian processes
- represented by the linear model of coregionalization (LMC)
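A minimal sketch of this prior-fitting loop in PyTorch. Assumptions: `sample_series` bundles Steps 1-2 (e.g., drawing an LMC Gaussian process and sampling a series from it), and the context/horizon split and MSE loss are illustrative, not the paper's exact choices.

```python
import torch

def train_pfn(model, optimizer, sample_series, steps, T=512, T_ctx=416):
    """Prior-fitting sketch. `sample_series(T)` is assumed to perform Steps 1-2:
    draw a hypothesis omega ~ p(omega) and return a (T, N) series D ~ p(D | omega)."""
    loss_fn = torch.nn.MSELoss()
    for _ in range(steps):
        series = sample_series(T)                    # Steps 1-2: omega, then D | omega
        context, target = series[:T_ctx], series[T_ctx:]
        pred = model(context.unsqueeze(0))           # Step 3: forecast D_output from D_input
        loss = loss_fn(pred, target.unsqueeze(0))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```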
(2) Synthetic MTS Data Generation
Goal of MTS data generation
- Goal 1) “Realistic” variates
- Goal 2) “Correlated” variates
KernelSynth (feat. Chronos (2024))
- Addresses Goal 1)
- Enrich its training corpus by “randomly composing kernels” to generate diverse, synthetic UTS
- Composes kernels via binary operators (e.g., addition and multiplication); see the sketch below
- Aggregates kernels of various types with different parameters
- (various types) Linear, Periodic, Squared-Exponential, and Rational Quadratic
- (parameters) daily, weekly, and monthly periodic kernels
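A hedged sketch of KernelSynth-style generation with scikit-learn GP kernels. The kernel bank, composition depth, and periods below are assumptions, not the exact values used by Chronos.

```python
import numpy as np
from sklearn.gaussian_process.kernels import (
    RBF, DotProduct, ExpSineSquared, RationalQuadratic)

rng = np.random.default_rng(0)
T = 256
t = np.arange(T, dtype=float).reshape(-1, 1)

# Kernel bank: linear, periodic (e.g., daily/weekly-style periods),
# squared-exponential, rational quadratic.
bank = [DotProduct(), ExpSineSquared(periodicity=24.0),
        ExpSineSquared(periodicity=168.0), RBF(length_scale=12.0),
        RationalQuadratic(length_scale=24.0)]

# Randomly compose a few kernels with binary operators (+ or *).
kernel = bank[rng.integers(len(bank))]
for _ in range(rng.integers(1, 3)):
    other = bank[rng.integers(len(bank))]
    kernel = kernel + other if rng.random() < 0.5 else kernel * other

# Draw one univariate series from the zero-mean GP prior.
cov = kernel(t) + 1e-6 * np.eye(T)   # jitter for numerical stability
series = rng.multivariate_normal(np.zeros(T), cov)
```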
Correlated Variables (Goal 2)?
[TimePFN] Generative Gaussian modelling
- Linear model of coregionalization (LMC)
- Outputs are obtained as linear combinations of independent latent random functions
Given \(t \in \mathbb{R}^T\), each channel's output \(\{C_i(t)\}_{i=1}^{N}\) is a linear combination of \(L\) latent functions
- \(C_i(t) = \sum_{j=1}^{L} \alpha_{i,j} \, l_j(t)\).
- Latent functions are independent with zero-mean
- Resulting output: zero-mean, with a PSD (positive semi-definite) covariance function
Convex combinations (to avoid scaling issues)
- For each \(i\), \(\alpha_{i,1} + \cdots + \alpha_{i,L} = 1\) with \(\alpha_{i,j} \ge 0\)
LMC formulation: Encapsulates the cases where the correlations between different variates are small or nonexistent
- e.g., independent variables: \(L = N\) with \(C_i(t) = l_i(t)\).
\(\rightarrow\) Such a modelling is important, as …
- (1) Some MTS data have strong correlation
- (2) Some MTS data have weak correlation
LMC-Synth (see the sketch below)
- Sample the number of latent functions: from a Weibull distribution
- Sample \([\alpha_{i,1}, \ldots, \alpha_{i,L}]\): from a Dirichlet distribution
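A minimal sketch of the LMC-Synth idea. The Weibull/Dirichlet parameters are assumptions, and `draw_latent` stands in for a KernelSynth-style univariate GP draw like the one sketched earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

def lmc_synth(draw_latent, T=256, N=7):
    """Mix independent zero-mean latent GP series into N correlated channels."""
    L = max(1, int(np.ceil(rng.weibull(a=1.5))))              # number of latent functions
    latents = np.stack([draw_latent(T) for _ in range(L)])    # (L, T), independent draws
    alphas = rng.dirichlet(np.ones(L), size=N)                # (N, L): rows are convex weights
    return alphas @ latents                                   # (N, T): C_i(t) = sum_j alpha_ij l_j(t)
```

Setting \(L = N\) with one-hot weight rows recovers independent channels, matching the special case noted above.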
(3) Architecture for TimePFN
Goal: Architecture to achieve better generalization when applied to real-world datasets
Primary advantage of the PFN framework?
- Synthesizing large-scale synthetic MTS data is feasible with LMC-Synth
\(\rightarrow\) No longer constrained by data scarcity
- Previous MTS models: compelled to balance model complexity vs. limited data
\(\rightarrow\) Often restricted the use of certain components to avoid overfitting
- TimePFN: access to large-scale MTS data
\(\rightarrow\) Can expand the architecture!
TimePFN
- Resembles PatchTST
- Differs significantly in two areas
- (1) Convolutional filtering of the variates (prior to patching)
- (2) Channel-mixing.
a) Convolutional Filtering
Done before patching!
Procedures
- Step 1) Learnable 1D conv to each variate (shared weights)
- Step 2) 1D magnitude max pooling to each newly generated variate
\(\rightarrow\) Yields a new set of 1D series per variate (see the sketch at the end of this subsection)
Notation
- MTS dataset
- \(X = [x_1, \ldots, x_N]\): \(N\) variates, each of length \(L\)
- Convolutional filtering: \(x_i \in \mathbb{R}^L\) \(\rightarrow\) \(\bar{x}_i \in \mathbb{R}^{(C+1)\times L}\),
- \(C\) rows: 1D conv + magnitude max pooling ( Use \(C = 9\) in practice )
- \(1\) row: Original \(x_i\).
- Convolutions are a valuable tool!
- e.g., differencing to de-trend data can be effectively represented by convolutions
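A hedged PyTorch sketch of the convolutional-filtering step, with \(C = 9\) as above. The kernel size, pooling window, and same-length padding are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvFiltering(nn.Module):
    """Per-variate 1D convolutions with shared weights, followed by
    magnitude max pooling; the original series is kept as one extra row."""
    def __init__(self, C=9, kernel_size=5):
        super().__init__()
        # One input channel per variate, C learned filters shared across all variates.
        self.conv = nn.Conv1d(1, C, kernel_size, padding=kernel_size // 2)
        self.pool = kernel_size  # pooling window (assumption)

    def magnitude_max_pool(self, x):
        # Keep the value with the largest |.| in each window (stride 1, same length).
        pos = F.max_pool1d(x, self.pool, stride=1, padding=self.pool // 2)
        neg = F.max_pool1d(-x, self.pool, stride=1, padding=self.pool // 2)
        return torch.where(pos >= neg, pos, -neg)

    def forward(self, x):                                # x: (B, N, L) multivariate series
        B, N, L = x.shape
        z = self.conv(x.reshape(B * N, 1, L))            # (B*N, C, L)
        z = self.magnitude_max_pool(z)
        z = z.reshape(B, N, -1, L)                       # (B, N, C, L)
        return torch.cat([x.unsqueeze(2), z], dim=2)     # (B, N, C+1, L)

# Usage: (B, N, L) series -> (B, N, C+1, L) filtered variates
# feats = ConvFiltering()(torch.randn(2, 7, 96))
```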
b) Patch Embeddings
- Input: \(\bar{x}_i \in \mathbb{R}^{(C+1)\times L}\)
- Overlapping patches of size \(P\) with a stride of \(S\),
- Use \((P = 16, S = 8)\)
- Each patch is then flattened & fed into a 2-layer FFN (into \(D\) dim); see the sketch below
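A sketch of the patch-embedding step under these settings (\(P = 16\), \(S = 8\)); the FFN hidden width and GELU activation are assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Overlapping patches over the filtered variates, flattened and mapped to D dims."""
    def __init__(self, C_plus_1=10, P=16, S=8, D=128):
        super().__init__()
        self.P, self.S = P, S
        self.ffn = nn.Sequential(
            nn.Linear(C_plus_1 * P, D), nn.GELU(), nn.Linear(D, D))

    def forward(self, x):                                # x: (B, N, C+1, L)
        patches = x.unfold(3, self.P, self.S)            # (B, N, C+1, n_patches, P)
        patches = patches.permute(0, 1, 3, 2, 4)         # (B, N, n_patches, C+1, P)
        patches = patches.flatten(start_dim=3)           # flatten each patch
        return self.ffn(patches)                         # (B, N, n_patches, D)
```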
c) Channel-mixing
Different from PatchTST!
- PatchTST: CI
- TimePFN: CD
Input all tokens into the transformer encoder after applying the positional encodings
\(\rightarrow\) Tokens from different variates can attend to each other!
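A shape-level sketch of channel mixing: tokens from all variates are flattened into one sequence before self-attention, unlike channel-independent models. Layer sizes and the learned positional encodings are assumptions.

```python
import torch
import torch.nn as nn

B, N, n_patches, D = 4, 7, 13, 128
tokens = torch.randn(B, N, n_patches, D)               # output of the patch embedding
pos = nn.Parameter(torch.randn(1, N * n_patches, D))   # learned positional encodings
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True), num_layers=3)

seq = tokens.reshape(B, N * n_patches, D) + pos        # one joint token sequence
out = encoder(seq)                                     # (B, N*n_patches, D), channel-mixed
```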
d) Transformer Encoder
- Vanilla Transformer encoder
- With the Transformer output (see the sketch below):
- Step 1) Rearrange them into their respective channels
- Step 2) Channel-wise flattening operation.
- Step 3) 2-layer NN: Processes the flattened variate representations using shared weights
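A sketch of Steps 1-3 on the encoder output; the hidden width and forecast horizon \(H\) are assumptions.

```python
import torch
import torch.nn as nn

B, N, n_patches, D, H = 4, 7, 13, 128, 96
out = torch.randn(B, N * n_patches, D)                 # Transformer encoder output
head = nn.Sequential(                                  # shared 2-layer forecasting head
    nn.Linear(n_patches * D, 256), nn.GELU(), nn.Linear(256, H))

per_channel = out.reshape(B, N, n_patches, D)          # Step 1: back into channels
flat = per_channel.flatten(start_dim=2)                # Step 2: (B, N, n_patches*D)
forecast = head(flat)                                  # Step 3: shared weights -> (B, N, H)
```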
e) Normalization
Normalize each variate \(x_i\) to \(N(0,1)\) prior to all other processing described above (feat. RevIN); see the sketch below
Before forecasting, we revert the TS to its original scale (by de-normalizing)
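A minimal sketch of the normalize/de-normalize pair in the spirit of RevIN, without RevIN's learnable affine parameters (which may differ from the paper's exact setup).

```python
import torch

def normalize(x, eps=1e-5):                 # x: (B, N, L) raw variates
    mean = x.mean(dim=-1, keepdim=True)
    std = x.std(dim=-1, keepdim=True) + eps
    return (x - mean) / std, mean, std      # each variate ~ N(0, 1)

def denormalize(forecast, mean, std):       # forecast: (B, N, H)
    return forecast * std + mean            # revert to the original scale
```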
f) Architectural Details
- Fixed input sequence length
- Fixed forecasting lengths
- Arbitrary number of variates