Foundation Models for Time Series: A Survey
https://arxiv.org/pdf/2504.04011
Kottapalli, Siva Rama Krishna, et al. “Foundation Models for Time Series: A Survey.” arXiv 2025
Abstract
Foundation models for TS
- Architecture design
- Patch-based vs. Directly on raw TS
- Probabilistic vs. Deterministic
- Univariate TS vs. Multivariate TS
- Lightweight vs. Large-scale
- Type of objective function
1. Introduction
(1) NN for TS Analysis
(2) Transformer Paradigm
(3) Transformer: Foundation Models for TS
2. Background
(1) Unique Characteristics of TS
- Sequential Nature
- Temporal Dependencies
- Multivariate Complexities
- Irregular Sampling & Missing Data
- Noise & Non-Stationarity
- High Dimensionality in Long Sequences
(2) Key Innovations of Transformers
a) Attention mechanism & its role in Sequential data
Attention mechanisms provide the following advantages:
- (1) Long-range dependency modeling
- (2) Dynamic weighting
- (3) Context-aware representations
b) Scalability & Parallelism
- (1) Non-sequential processing
- (2) Efficient handling of long sequences
- But $O(n^2)$ complexity $\rightarrow$ Sparse attention, linear Transformers, etc.
c) Implication for TS modeling
The attention mechanism enables models to …
- (1) Capture complex temporal dynamics
- e.g., seasonality and long-term dependencies
- (2) Remain practical for large-scale datasets, thanks to their scalability and parallelism
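To make the mechanism concrete, here is a minimal sketch (illustrative, not from the survey) of scaled dot-product self-attention over a window of embedded time steps; the shapes and random projections are assumptions standing in for learned weights.

```python
# Minimal sketch of scaled dot-product self-attention over a time-series window.
# Illustrative only: shapes and random projections stand in for learned weights.
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """x: (n_steps, d_model) embedded time steps -> context-aware representations."""
    d_model = x.shape[-1]
    rng = np.random.default_rng(0)
    w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(3))
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(d_model)             # (n_steps, n_steps): every step attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> dynamic weighting of the other steps
    return weights @ v                              # long-range, context-aware mixing; O(n^2) in sequence length

out = self_attention(np.random.randn(128, 32))      # 128 steps, 32-dim embeddings -> same shape
```

The $O(n^2)$ score matrix is exactly what sparse-attention and linear-Transformer variants approximate.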
(3) TS Applications
a) TS Forecasting
b) TS Imputation
- Transformer: Excel in learning contextual relationships to infer missing values
- e.g., bidirectional attention, and encoder-decoder frameworks
- TimeTransformer [80]
- Utilize self-attention mechanisms to predict missing data points in multidimensional datasets.
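A hedged sketch of the masked-reconstruction idea behind attention-based imputation (a generic encoder, not TimeTransformer's [80] actual implementation); the layer sizes are arbitrary assumptions.

```python
# Hedged sketch: impute missing values with a bidirectional Transformer encoder.
# Missing steps are replaced by a learned mask token and reconstructed from both past and future context.
import torch
import torch.nn as nn

class MaskedImputer(nn.Module):
    def __init__(self, n_channels: int = 3, d_model: int = 64):
        super().__init__()
        self.embed = nn.Linear(n_channels, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))        # stands in for missing steps
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)   # no causal mask -> bidirectional attention
        self.head = nn.Linear(d_model, n_channels)

    def forward(self, x, missing):                  # x: (B, T, C), missing: (B, T) bool
        h = self.embed(torch.nan_to_num(x))
        h[missing] = self.mask_token
        return self.head(self.encoder(h))           # reconstruction at every step; read it off where data was missing

x = torch.randn(8, 96, 3)
missing = torch.rand(8, 96) < 0.2
x[missing] = float("nan")
imputed = MaskedImputer()(x, missing)
```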
c) Anomaly Detection
- Transformer: Powerful framework for anomaly detection due to its capacity for learning contextual representations
- Pretrained models
- Fine-tuned for anomaly detection tasks
- By leveraging embeddings that capture normal behavior patterns
- Transformer + VAE [84]
- Transformer + GAN [85]
$\rightarrow$ Further enhance AD by enabling unsupervised or semi-supervised learning
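A hedged sketch of the reconstruction-error scoring idea: a model pretrained on mostly normal data reconstructs anomalous windows poorly. `pretrained_model.reconstruct` is an assumed placeholder method, not a specific library API.

```python
# Hedged sketch: anomaly scores from a pretrained reconstruction model.
import numpy as np

def anomaly_scores(windows: np.ndarray, pretrained_model) -> np.ndarray:
    """windows: (n_windows, window_len, n_channels) -> one score per window."""
    recon = pretrained_model.reconstruct(windows)        # placeholder method name (assumption)
    return ((windows - recon) ** 2).mean(axis=(1, 2))    # per-window reconstruction MSE

def flag_anomalies(scores: np.ndarray, quantile: float = 0.99) -> np.ndarray:
    # Simple thresholding; VAE [84] / GAN [85] hybrids replace this with likelihood or discriminator scores.
    return scores > np.quantile(scores, quantile)
```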
d) TS Classification
e) Change Point Detection
Task = Identifies moments when the statistical properties of a TS shift
- E.g., Detecting financial market shifts, climate pattern changes, and network traffic anomalies.
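For intuition, a simple classical baseline (not from the survey): flag a change point where the means of two adjacent windows differ by many standard errors.

```python
# Hedged sketch: rolling two-window mean-shift test for change point detection.
import numpy as np

def change_points(x: np.ndarray, w: int = 50, z_thresh: float = 4.0) -> np.ndarray:
    """x: (T,) univariate series -> indices where the local statistics appear to shift."""
    idx = []
    for t in range(w, len(x) - w):
        left, right = x[t - w:t], x[t:t + w]
        pooled_std = np.sqrt((left.var() + right.var()) / 2) + 1e-8
        z = abs(right.mean() - left.mean()) / (pooled_std / np.sqrt(w))   # two-sample z-like statistic
        if z > z_thresh:
            idx.append(t)
    return np.array(idx)

x = np.concatenate([np.random.randn(300), np.random.randn(300) + 3.0])   # mean shift at t = 300
print(change_points(x)[:5])                                              # indices cluster around the shift
```

Foundation-model approaches replace the hand-crafted statistic with learned representations, but the detection target is the same.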
f) TS Clustering
(4) FMs for TS
a) Characteristics of FMs
Universal backbone for diverse downstream tasks
[Two-stage process]
- (1) Pretraining
- (2) Fine-tuning
The ability of foundation models to generalize stems from several key properties:
- (1) Task-agnostic pretraining objectives (SSL); see the MTM sketch after this list
- NSP, MTM, CL ..
- (2) Scalability across domains
- Trained on heterogeneous datasets spanning multiple domains
- Enhances their robustness and transferability to unseen tasks
- (3) Adaptability through fine-tuning
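A hedged sketch of one task-agnostic SSL objective from the list above, masked time-series modeling (MTM); `encoder` is a placeholder for any sequence model mapping (B, T, C) to (B, T, C).

```python
# Hedged sketch of a masked-time-series (MTM) pretraining step.
import torch

def mtm_loss(encoder, x: torch.Tensor, mask_ratio: float = 0.3) -> torch.Tensor:
    """x: (B, T, C). Hide random steps, reconstruct them, and score only the hidden positions."""
    mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio   # (B, T) positions to hide
    x_in = x.clone()
    x_in[mask] = 0.0                                               # zero-out / mask-token substitution
    recon = encoder(x_in)
    return ((recon - x)[mask] ** 2).mean()                         # loss only where values were hidden

# Pretraining: for x in heterogeneous_batches -> mtm_loss(encoder, x).backward(); optimizer step.
# Fine-tuning later swaps the reconstruction head and objective for the downstream task.
```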
3. Taxonomy
(1) Challenges in analyzing the field
(2) Lack of Detailed Taxonomy
Key dimensions include …
- Model Architecture
- (Transformer) Encoder-only, Decoder-only, Encoder-decoder
- (Non-Transformer) e.g., Tiny Time Mixers (TTM)
- Patch vs. Non-Patch (see the patching sketch after this list)
- (Patch) Capture local temporal patterns before learning global dependencies
- (Non-patch) Capture both short- and long-term dependencies across the full sequence
- Objective Functions
- (MSE) Regression tasks
- (NLL) Probabilistic estimates that improve uncertainty modeling
- UTS vs. MTS
- Probabilistic vs. Non-probabilistic
- Model scale & complexity
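A hedged sketch of the patch vs. non-patch distinction referenced above: patching groups consecutive steps into tokens (local patterns first, shorter sequences for attention), while non-patch models keep one token per time step.

```python
# Hedged sketch: patch tokenization vs. point-wise (non-patch) tokenization.
import torch

def patchify(x: torch.Tensor, patch_len: int = 16, stride: int = 16) -> torch.Tensor:
    """x: (B, T, C) -> (B, n_patches, patch_len * C); non-overlapping patches when stride == patch_len."""
    patches = x.unfold(dimension=1, size=patch_len, step=stride)   # (B, n_patches, C, patch_len)
    return patches.flatten(start_dim=2)                            # each patch becomes one token

x = torch.randn(4, 512, 1)
patch_tokens = patchify(x)       # (4, 32, 16): 16x shorter sequence fed to the attention layers
pointwise_tokens = x             # (4, 512, 1): full-resolution tokens, dependencies across the whole sequence
```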
4. Methodology
(1) Model Architecture
a) Non-Transformer
Tiny Time Mixers (TTM)
- Based on TSMixer
- Details
- a) Adaptive patching: To handle multi-resolution (see the sketch after this block)
- Different layers of the backbone operate at varying patch lengths
- b) Diverse resolution sampling: To augment data to improve coverage across varying temporal resolutions
- c) Resolution prefix tuning: To handle pretraining on varied dataset resolutions with minimal model capacity
- d) Multi-level modeling: Capture channel correlations and infuse exogenous signals during fine-tuning.
- e) Supports channel correlations and exogenous signals
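A toy sketch of the adaptive-patching idea in a TSMixer-like, non-Transformer style (an assumed simplification, not TTM's actual implementation): each backbone level re-groups the series into a different patch length before MLP mixing.

```python
# Toy sketch: backbone levels that operate at varying patch lengths (multi-resolution).
import torch
import torch.nn as nn

class PatchMixerLevel(nn.Module):
    def __init__(self, patch_len: int, d_model: int = 64):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(patch_len, d_model)
        self.mix = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, patch_len))

    def forward(self, x):                                        # x: (B, T), one channel
        B, T = x.shape
        p = x.view(B, T // self.patch_len, self.patch_len)       # re-group into this level's patch length
        return (p + self.mix(self.proj(p))).reshape(B, T)        # residual MLP mixing inside each patch

class AdaptivePatchBackbone(nn.Module):
    def __init__(self, patch_lens=(8, 16, 32)):                  # different patch length per level
        super().__init__()
        self.levels = nn.ModuleList(PatchMixerLevel(p) for p in patch_lens)

    def forward(self, x):
        for level in self.levels:
            x = level(x)
        return x

y = AdaptivePatchBackbone()(torch.randn(4, 512))   # T must be divisible by every patch length used
```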
b) Encoder-decoder
TimeGPT
- For TS forecasting
- Components from LLMs + CNN
- Details
- [Transformer] Positional encoding & Multi-head attention
- + Residual connections + LN
- [CNN] For learning complex temporal patterns
- [Dataset] Large, diverse time-series datasets
- Fine-tuned for specific forecasting tasks
- Using zero-shot or few-shot learning methods
c) Encoder-only
MOMENT
- Details
- [Arch] Patching + Transformer + Relative PE + Instance norm
- [SSL] MTM
- [Dataset] Pretrained on a diverse collection of datasets (Time Series Pile)
- [Task] Forecasting, anomaly detection, and classification …
- Key features
- Handling variable-length TS
- Scalability through a simple encoder and minimal parameters
- Channel independence
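A hedged sketch of the channel-independence design: each variable of an MTS is treated as its own univariate series and run through the same shared encoder (the encoder signature (N, T, 1) -> (N, T, 1) is an assumption).

```python
# Hedged sketch: channel-independent processing of a multivariate series.
import torch

def channel_independent_forward(encoder, x: torch.Tensor) -> torch.Tensor:
    """x: (B, T, C); every channel is encoded separately with shared weights."""
    B, T, C = x.shape
    flat = x.permute(0, 2, 1).reshape(B * C, T, 1)   # each channel becomes its own "sample"
    out = encoder(flat)                              # same encoder weights for every channel
    return out.reshape(B, C, T).permute(0, 2, 1)     # back to (B, T, C)

identity = lambda z: z                               # stand-in encoder for a shape check
assert channel_independent_forward(identity, torch.randn(2, 96, 7)).shape == (2, 96, 7)
```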
MOIRAI
- Probabilistic MTS forecasting
- Handles data with varying frequencies and domains
- Details
- [Arch] Patching + Transformer
- Pre-normalization, RMSNorm, query-key normalization, SwiGLU, etc.
- [SSL] MTM
- Trained with a CE loss
- $\rightarrow$ Treating the forecasting task as a regression via classification
- Key features
- Output = Mixture distribution (see the sketch after this block)
- Capturing predictive uncertainty
- Including Student’s t-distribution, negative binomial, log-normal, and low-variance normal distributions
- Flexible patch size: To handle different frequencies (based on predefined sizes)
- Larger patches for high-frequency data
- Smaller ones for low-frequency data
- Any-variate Attention mechanism
- Flattens MTS into a single sequence
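A hedged sketch of a mixture-distribution output head of the kind listed above, simplified to Student's t components only and trained with the mixture NLL (MOIRAI's actual mixture also includes negative-binomial, log-normal, and low-variance normal components; the parameterization here is an assumption).

```python
# Hedged sketch: mixture output head with per-token mixture parameters and NLL loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import StudentT

class MixtureHead(nn.Module):
    def __init__(self, d_model: int = 64, k: int = 4):
        super().__init__()
        self.params = nn.Linear(d_model, 4 * k)   # per component: weight logit, df, loc, scale
        self.k = k

    def nll(self, h: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        """h: (B, d_model) token representation, y: (B,) observed target."""
        logit_w, df, loc, scale = self.params(h).chunk(4, dim=-1)
        comp = StudentT(F.softplus(df) + 2.0, loc, F.softplus(scale) + 1e-3)   # heavy-tailed components
        log_probs = comp.log_prob(y.unsqueeze(-1))                             # (B, k)
        log_mix = torch.logsumexp(F.log_softmax(logit_w, -1) + log_probs, dim=-1)
        return -log_mix.mean()                                                 # minimize NLL -> calibrated uncertainty

loss = MixtureHead().nll(torch.randn(32, 64), torch.randn(32))
```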
d) Decoder-only
Timer-XL
- Key innovation: TimeAttention mechanism
- Capture complex dependencies within and across TS
- Incorporates both TD & CD (temporal & channel dependencies) via a Kronecker product approach (see the mask sketch after this block)
- Details
- [SSL] NTP
- [UTS & MTS]
- For UTS: next-token prediction over the temporal tokens of a single series
- For MTS: extends this approach by defining tokens for each variable and learning dependencies between them
- Rotary Position Embeddings (RoPE)
- Capable of handling additional covariates
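A hedged sketch of building an attention mask over the flattened (variable, time) token sequence with a Kronecker product, in the spirit of TimeAttention (a simplification, not Timer-XL's exact formulation).

```python
# Hedged sketch: Kronecker-product attention mask combining channel and temporal dependencies.
import torch

def time_attention_mask(n_vars: int, n_time: int) -> torch.Tensor:
    """Boolean (n_vars*n_time, n_vars*n_time) mask over variable-major flattened tokens; True = may attend."""
    causal = torch.tril(torch.ones(n_time, n_time))   # TD: token at time t sees only t' <= t
    channel = torch.ones(n_vars, n_vars)              # CD: all-to-all across variables (simplification)
    return torch.kron(channel, causal).bool()         # entry [(v, t), (v', t')] = channel[v, v'] * causal[t, t']

mask = time_attention_mask(n_vars=3, n_time=4)        # 12x12; pass as attn_mask so disallowed scores are masked out
```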
Time-MOE
- MoE + Decoder-only
- MoE: Replace FFN with MoE layer (see the sketch after this block)
- Details
- Point-wise tokenization: For efficient handling of variable-length sequences
- + SwiGLU gating to embed time series points.
- Multi-resolution forecasting
- Allowing predictions at multiple time scales (different forecasting horizons)
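A hedged sketch of the general mechanism named above, swapping the dense FFN for a sparsely gated mixture-of-experts layer with top-k routing (expert count and sizes are arbitrary assumptions, not Time-MOE's configuration).

```python
# Hedged sketch: MoE feed-forward layer with top-k token routing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    def __init__(self, d_model: int = 64, d_ff: int = 256, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (n_tokens, d_model)
        weights, idx = self.gate(x).topk(self.top_k, dim=-1)  # each token picks its k best experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                sel = idx[:, slot] == e                        # tokens whose slot-th choice is expert e
                if sel.any():
                    out[sel] += weights[sel, slot, None] * expert(x[sel])
        return out                                             # drop-in replacement for the dense FFN

y = MoEFFN()(torch.randn(10, 64))
```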
Toto
- For MTS forecasting
- Key innovations: Handle both TD & CD
- Details
- [SSL] NTP
- Probabilistic prediction head: Student-T Mixture Model (SMM)
- Handle heavy-tailed distributions and outliers
- Quantify uncertainty through Monte Carlo sampling
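A hedged sketch of Monte Carlo uncertainty quantification from a Student-T mixture head (parameter names and the sampling recipe are assumptions; Toto's head may differ in detail).

```python
# Hedged sketch: draw Monte Carlo samples from a Student-T mixture and report predictive quantiles.
import torch
from torch.distributions import Categorical, MixtureSameFamily, StudentT

def predictive_quantiles(logit_w, df, loc, scale, n_samples: int = 1000):
    """Each argument: (k,) mixture parameters for one forecast step -> 10th / 50th / 90th percentiles."""
    mixture = MixtureSameFamily(Categorical(logits=logit_w), StudentT(df, loc, scale))
    samples = mixture.sample((n_samples,))                # MC draws; robust to heavy tails and outliers
    return torch.quantile(samples, torch.tensor([0.1, 0.5, 0.9]), dim=0)

q10, q50, q90 = predictive_quantiles(
    logit_w=torch.zeros(3), df=torch.full((3,), 5.0),
    loc=torch.tensor([0.0, 1.0, -1.0]), scale=torch.ones(3),
)
```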
Timer
- Applies the LLM recipe to TS forecasting, leveraging the shared sequential nature of TS data and language
- Details
- [Dataset] Extensive, high-quality data: Unified Time Series Dataset (UTSD) [129] with up to 1 billion time points across seven domains; integrates LOTSA [117] for zero-shot forecasting
- [Format] Single-series sequence (S3): a unified format for diverse TS data, allowing easier preprocessing and normalization without alignment across domains
- [SSL] Generative pretraining in an autoregressive manner: predict future values from historical context
- [Arch] Decoder-only Transformer, preserving the sequential dependencies inherent in TS data
- $\rightarrow$ Scalable, adaptable, and effective forecasting across varied datasets
TimesFM
- Transformer-based architecture designed for efficient long-horizon TS forecasting
- Details
- [Tokenization] Breaks input TS into non-overlapping patches $\rightarrow$ reduces computational cost and speeds up inference
- [Arch] Decoder-only: predicts the next patch from prior patches, enabling parallel prediction and avoiding the inefficiencies of multi-step autoregressive decoding
- Stacked Transformer layers with causal masking, so future timesteps cannot influence past predictions
- [Training] Random masking strategy to handle variable context lengths $\rightarrow$ generalizes across input sizes
- [Loss] MSE, optimizing point-forecast accuracy
- [Inference] Autoregressive: each forecast patch is concatenated with the input to predict further (see the sketch after this block)
- $\rightarrow$ Flexible forecast horizons and context lengths; adaptable to varied TS datasets and suitable for zero-shot forecasting
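A hedged sketch of the patch-level autoregressive inference loop described above (`model` is a placeholder that returns the next output patch; this is not TimesFM's real API).

```python
# Hedged sketch: decoder-only inference, concatenating each forecast patch back into the context.
import torch

@torch.no_grad()
def forecast(model, context: torch.Tensor, horizon: int) -> torch.Tensor:
    """context: (B, T). Repeatedly predict the next output patch and feed it back as extra context."""
    generated, total = [], 0
    while total < horizon:
        x = torch.cat([context] + generated, dim=1)   # history + forecasts produced so far
        next_patch = model(x)                         # assumed to return (B, output_patch_len)
        generated.append(next_patch)
        total += next_patch.shape[1]
    return torch.cat(generated, dim=1)[:, :horizon]   # trim to the requested horizon

# A longer output patch length means fewer autoregressive steps for the same horizon.
```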
Lag-LLaMA
- Transformer-based model for univariate probabilistic TS forecasting, built on the LLaMA [57] architecture
- Details
- [Tokenization] Lagged features (past values at specified lags) + temporal covariates (e.g., day-of-week, hour-of-day) $\rightarrow$ handles varying frequencies of TS data (see the sketch after this block)
- [Arch] Decoder-only Transformer with causal masking and Rotary Positional Encoding (RoPE)
- [Output] Distribution head predicts the parameters of a probability distribution $\rightarrow$ point forecasts plus uncertainty quantification
- [Loss] Negative log-likelihood of the predicted distributions
- [Inference] Generates multiple forecast trajectories via autoregressive decoding
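A hedged sketch of lag-feature tokenization with temporal covariates, in the spirit of Lag-LLaMA's input construction (the lag set and covariates shown are illustrative choices, not the model's exact configuration).

```python
# Hedged sketch: build per-step feature vectors from lagged values plus calendar covariates.
import numpy as np
import pandas as pd

def build_tokens(series: pd.Series, lags=(1, 2, 3, 7, 14, 28)) -> np.ndarray:
    """series: values on a DatetimeIndex -> (T - max_lag, n_lags + 2) feature vectors."""
    lagged = np.stack([series.shift(l).to_numpy() for l in lags], axis=1)    # past values at each lag
    covariates = np.stack([series.index.dayofweek.to_numpy(),                # day-of-week
                           series.index.hour.to_numpy()], axis=1)            # hour-of-day
    tokens = np.concatenate([lagged, covariates], axis=1)
    return tokens[max(lags):]                      # drop warm-up rows that lack some lags

idx = pd.date_range("2024-01-01", periods=200, freq="h")
tokens = build_tokens(pd.Series(np.random.randn(200), index=idx))
# The decoder-only model consumes these tokens; its distribution head outputs the parameters
# of a probability distribution for the next value (trained by minimizing NLL).
```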