Are Transformers Effective for Time Series Forecasting?


Contents

  0. Abstract
  1. Introduction
  2. Preliminaries
  3. Transformer-Based LTSF Solutions
  4. An Embarrassingly Simple Baseline
  5. Experiments
  6. Implementation Details
  7. LTSF & STSF
  8. Distribution Shift


0. Abstract

surge of Transformer-based solutions for the long-term time series forecasting (LTSF) task

\(\rightarrow\) this paper : question the validity of this line of research

Transformers : most successful solution to extract the semantic correlations among the elements in a long sequence.

However, in time series modeling, the goal is to extract the temporal relations in an ordered set of continuous points


[ Transformer ] positional encoding & tokens to embed sub-series

  • facilitate preserving some ordering information….

\(\rightarrow\) BUT the nature of the permutation-invariant self-attention mechanism inevitably results in temporal information loss

\(\rightarrow\) introduce a set of embarrassingly simple one-layer linear models named LTSF-Linear


https://github.com/cure-lab/LTSFLinear.


1. Introduction

The main working power of Transformers: multi-head self-attention mechanism

\(\rightarrow\) capability of extracting semantic correlations among elements in a long sequence


Problems of self-attention in TS :

permutation invariant & “anti-order” to some extent ( see the toy sketch after this list )

  • even with various types of positional encoding, some temporal information loss is still inevitable

    ( NLP : not a serious concern for semantic rich applications )

    ( TS : usually a lack of semantics in the numerical data itself )

    \(\rightarrow\) order itself plays the most crucial role
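
A toy sketch of the permutation issue above ( uses PyTorch's nn.MultiheadAttention as a stand-in; an illustration, not code from the paper ) : without positional encoding, shuffling the input tokens merely shuffles the outputs, so the attention output carries no information about the original ordering.

```python
# Toy demonstration: self-attention (no positional encoding) is permutation-equivariant.
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)

x = torch.randn(1, 10, 16)        # a toy "time series" of 10 tokens
perm = torch.randperm(10)         # an arbitrary reordering of the time steps

out, _ = attn(x, x, x)                                     # original order
out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])     # shuffled order

# The shuffled outputs are just the original outputs, shuffled the same way:
print(torch.allclose(out[:, perm], out_perm, atol=1e-5))   # True
```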

Q. Are Transformers really effective for long-term time series forecasting?


(non-Transformer) Baselines ( used in Transformer-based papers )

  • perform autoregressive or iterated multi-step (IMS) forecasting
    • suffer from significant error accumulation effects for the LTSF problem

\(\rightarrow\) We challenge Transformer-based LTSF solutions with direct multi-step (DMS) forecasting strategies to validate their real performance


Hypothesize that long-term forecasting is only feasible for those time series with a relatively clear trend and periodicity.

\(\rightarrow\) linear models can already extract such information!

\(\rightarrow\) introduce a set of embarrassingly simple models, LTSF-Linear


LTSF-Linear

  • regresses historical time series with a one-layer linear model to forecast future time series directly.
  • conduct extensive experiments on nine widely-used benchmark datasets
  • show that LTSF-Linear outperforms existing complex Transformer-based models in all cases, and often by a large margin (20% ∼ 50%).
  • existing Transformers : most of them fail to extract temporal relations from long sequences
    • the forecasting errors are not reduced (sometimes even increased) with the increase of look-back window sizes.
  • conduct various ablation studies on existing Transformer-based TSF solutions


2. Preliminaries: TSF Problem Formulation

Notation

  • number of variates : \(C\)

  • historical data : \(\mathcal{X}=\left\{X_1^t, \ldots, X_C^t\right\}_{t=1}^L\)

    • lookback window size : \(L\)
    • \(i\)-th variate at the \(t\)-th time step : \(X_i^t\)


TSF task: predict \(\hat{\mathcal{X}}=\left\{\hat{X}_1^t, \ldots, \hat{X}_C^t\right\}_{t=L+1}^{L+T}\)

  • iterated multi-step (IMS) forecasting : learns a single-step forecaster & iteratively applies it
  • direct multi-step (DMS) forecasting : directly optimizes the multi-step forecasting objective ( both strategies are sketched in code below )


IMS vs DMS

  • ( IMS ) smaller variance, thanks to the autoregressive estimation procedure
  • ( DMS ) less error accumulation effects


\(\rightarrow\) IMS forecasting is preferable when ….

  • (1) highly-accurate single-step forecaster
  • (2) \(T\) is relatively small


\(\rightarrow\) DMS forecasting is preferable when ….

  • (1) hard to obtain an unbiased single-step forecasting model
  • (2) \(T\) is large.
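
A minimal sketch contrasting the two strategies ( `single_step_model` and `multi_step_model` are hypothetical callables, not interfaces from the paper's repo ) :

```python
import numpy as np

def ims_forecast(single_step_model, history, T):
    """Iterated multi-step: predict one step, feed it back, repeat T times."""
    window = list(history)
    preds = []
    for _ in range(T):
        next_val = single_step_model(np.asarray(window))   # one-step-ahead forecast
        preds.append(next_val)
        window = window[1:] + [next_val]                    # errors can accumulate here
    return np.asarray(preds)

def dms_forecast(multi_step_model, history, T):
    """Direct multi-step: a single call emits all T future steps at once."""
    return multi_step_model(np.asarray(history), T)
```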


3. Transformer-Based LTSF Solutions

Transformer-based models to LTSF problems?

Limitations

  • (1) quadratic time/memory complexity
  • (2) error accumulation by autoregressive decoder
    • Informer : reduce complexity & DMS forecasting
    • etc.) many other efficient Transformer variants ( “x-formers” )



(1) TS decomposition

Common preprocessing in TSF : zero-mean normalization


Autoformer : applies seasonal-trend decomposition behind each neural block

  • TREND : extracted by applying a moving-average (MA) kernel on the input sequence
  • SEASONALITY : original - TREND


FEDformer : ( on top of Autoformer )

  • proposes a mixture-of-experts strategy to mix the TREND components extracted by MA kernels with various kernel sizes ( a minimal decomposition sketch follows )
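
A minimal sketch of the MA-based seasonal-trend split described above ( written in the spirit of Autoformer's decomposition block, with replication padding at both ends so the trend keeps the input length; not copied from any repo ) :

```python
import torch
import torch.nn.functional as F

def seasonal_trend_decomp(x: torch.Tensor, kernel_size: int = 25):
    """x: (batch, length, channels) -> (seasonal, trend), each the same shape as x."""
    pad = kernel_size - 1
    xt = x.permute(0, 2, 1)                                       # (batch, channels, length)
    xt = F.pad(xt, (pad // 2, pad - pad // 2), mode="replicate")  # keep the length after pooling
    trend = F.avg_pool1d(xt, kernel_size, stride=1).permute(0, 2, 1)  # TREND = moving average
    return x - trend, trend                                       # SEASONALITY = original - TREND
```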


(2) Input Embedding

self-attention layer : cannot preserve the positional information of the time series.


Local positional information ( i.e. the ordering of time series ) is important

Global temporal information ( such as hierarchical timestamps (week, month, year) and agnostic timestamps (holidays and events) ) is also informative


SOTA Transformers : inject several types of embeddings ( the first is sketched after this list )

  • fixed positional encoding
  • channel projection embedding
  • learnable temporal embeddings
  • temporal embeddings with a temporal convolution layer
  • learnable timestamps
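
As an illustration of the fixed positional encoding above, a minimal sinusoidal sketch in the classic Transformer style ( each x-former uses its own variant; assumes an even `d_model` ) :

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Returns a (seq_len, d_model) matrix of fixed positional encodings."""
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)          # (L, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                         * (-math.log(10000.0) / d_model))                    # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)    # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)    # odd dimensions: cosine
    return pe
```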


(3) Self-attention

Vanilla Transformer : \(O\left(L^2\right)\) ( too large )

Recent works propose two strategies for efficiency, plus one outright replacement of self-attention

  • (1) LogTrans, Pyraformer
    • explicitly introduce a sparsity bias into the self-attention scheme.
    • ( LogTrans ) uses a Logsparse mask to reduce the computational complexity to \(O(L \log L)\)
    • ( Pyraformer ) adopts pyramidal attention that captures hierarchically multi-scale temporal dependencies with an \(O(L)\) time and memory complexity
  • (2) Informer, FEDformer
    • use the low-rank property in the self-attention matrix.
    • ( Informer ) proposes a ProbSparse self-attention mechanism and a self-attention distilling operation to decrease the complexity to \(O(L \log L)\),
    • ( FEDformer ) designs a Fourier enhanced block and a wavelet enhanced block with random selection to obtain \(O(L)\) complexity.
  • (3) Autoformer
    • designs a series-wise auto-correlation mechanism to replace the original self-attention layer.


(4) Decoders

Vanilla Transformer decoder

  • outputs sequences in an autoregressive manner

  • resulting in a slow inference speed and error accumulation effects

    ( especially for long-term predictions )


Use DMS strategies

  • Informer : designs a generative-style decoder for DMS forecasting.

  • Pyraformer : uses a FC layer concatenating Spatio-temporal axes as the decoder.
  • Autoformer : sums up two refined decomposed features from trend-cyclical components and the stacked auto-correlation mechanism for seasonal components to get the final prediction.
  • FEDformer : uses a decomposition scheme with the proposed frequency attention block to decode the final results.


The premise of Transformer models : semantic correlations between paired elements

  • self-attention mechanism itself is permutation-invariant

    \(\rightarrow\) capability of modeling temporal relations largely depends on positional encodings

  • in numerical time series data, there are hardly any point-wise semantic correlations between individual elements


TS modeling

  • mainly interested in the temporal relations among a continuous set of points

    & order of these elements ( instead of the paired relationship ) plays the most crucial role

  • positional encoding and using tokens :

    • not sufficient! TEMPORAL INFORMATION LOSS!

\(\rightarrow\) Revisit the effectiveness of Transformer-based LTSF solutions.


4. An Embarrassingly Simple Baseline



(1) Linear

LTSF-Linear: \(\hat{X}_i=W X_i\),

  • \(W \in \mathbb{R}^{T \times L}\) : linear layer along the temporal axis

  • \(\hat{X}_i\) and \(X_i\) : prediction and input for the \(i\)-th variate ( see the sketch after this block )

    ( LTSF-Linear shares weights across different variates & does not model any spatial correlations )
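
A minimal sketch of the equation above ( the `(batch, length, channels)` layout is an assumption of this sketch, and `nn.Linear` adds a bias term that the equation omits ) :

```python
import torch
import torch.nn as nn

class Linear(nn.Module):
    def __init__(self, seq_len: int, pred_len: int):
        super().__init__()
        self.proj = nn.Linear(seq_len, pred_len)        # W: maps L past steps to T future steps

    def forward(self, x):                               # x: (batch, L, channels)
        # the same temporal linear map is applied to every variate (no cross-variate mixing)
        return self.proj(x.permute(0, 2, 1)).permute(0, 2, 1)   # (batch, T, channels)
```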


[Linear] Vanilla Linear : 1-layer Linear model

Two variants :

  • [DLinear] = Linear + Decomposition
  • [NLinear] = Linear + Normalization


DLinear

( enhances the performance of a vanilla linear when there is a clear trend in the data. )

  • step 1) decomposes the raw input into a TREND & a REMAINDER component
    • using a MA kernel
  • step 2) applies two one-layer linear layers
    • one for TREND
    • one for REMAINDER
  • step 3) sums the two outputs to get the final prediction ( sketched below )
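
A minimal DLinear sketch of these three steps ( shapes, layout, and the inlined MA decomposition are assumptions of this sketch, not the repo's exact code ) :

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DLinear(nn.Module):
    def __init__(self, seq_len: int, pred_len: int, kernel_size: int = 25):
        super().__init__()
        self.kernel_size = kernel_size
        self.linear_trend = nn.Linear(seq_len, pred_len)       # one layer for TREND
        self.linear_remainder = nn.Linear(seq_len, pred_len)   # one layer for REMAINDER

    def forward(self, x):                                      # x: (batch, L, channels)
        # step 1: decompose into TREND (moving average) and REMAINDER
        pad = self.kernel_size - 1
        xt = F.pad(x.permute(0, 2, 1), (pad // 2, pad - pad // 2), mode="replicate")
        trend = F.avg_pool1d(xt, self.kernel_size, stride=1)   # (batch, channels, L)
        remainder = x.permute(0, 2, 1) - trend
        # step 2: a linear layer per component; step 3: sum the two outputs
        out = self.linear_trend(trend) + self.linear_remainder(remainder)
        return out.permute(0, 2, 1)                            # (batch, T, channels)
```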


NLinear

( when there is a distribution shift )

  • step 1) subtracts the last value of the sequence from the input
  • step 2) applies one one-layer linear layer
  • step 3) adds the subtracted value back to the output ( sketched below )
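
A minimal NLinear sketch of these three steps ( same shape assumptions as the Linear sketch above ) :

```python
import torch
import torch.nn as nn

class NLinear(nn.Module):
    def __init__(self, seq_len: int, pred_len: int):
        super().__init__()
        self.proj = nn.Linear(seq_len, pred_len)

    def forward(self, x):                       # x: (batch, L, channels)
        last = x[:, -1:, :].detach()            # step 1: last value of the look-back window
        out = self.proj((x - last).permute(0, 2, 1)).permute(0, 2, 1)   # step 2: linear map
        return out + last                       # step 3: add the subtracted value back
```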


5. Experiments

(1) Experimental Settings

a) Dataset

ETT (Electricity Transformer Temperature)

  • ETTh1, ETTh2, ETTm1, ETTm2

Traffic, Electricity, Weather, ILI, Exchange-Rate

\(\rightarrow\) all of them are multivariate time series (MTS)


b) Compared Methods

Five Transformer-based methods

  • FEDformer, Autoformer, Informer, Pyraformer, LogTrans

naive DMS method:

  • Closest Repeat (Repeat) : repeats the last value in the look-back window ( sketched below )
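
A tiny sketch of the Repeat baseline ( assuming "closest repeat" simply copies the last observed value across the whole horizon ) :

```python
import numpy as np

def repeat_forecast(history: np.ndarray, pred_len: int) -> np.ndarray:
    """history: (L, channels) -> forecast: (pred_len, channels)"""
    return np.repeat(history[-1:, :], pred_len, axis=0)
```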


( two variants of FEDformer )

  • the comparison uses the more accurate variant, FEDformer-f ( based on Fourier transform )


(2) Comparison with Transformers

a) Quantitative results

MTS forecasting

( note: LTSF-Linear does not even model correlations among variates )



UTS forecasting (appendix)



FEDformer

  • achieves competitive forecasting accuracy on ETTh1.
  • reason) FEDformer employs classical time series analysis techniques such as frequency processing
    • which brings in a time-series inductive bias & benefits temporal feature extraction.


Summary

  • (1) existing complex Transformer-based LTSF solutions are seemingly not effective

  • (2) surprising result : naive Repeat outperforms all Transformer-based methods on Exchange-Rate

    \(\rightarrow\) due to wrong prediction of trends in Transformer-based solutions

    • they overfit to sudden-change noise in the training data

b) Qualitative results


the prediction results on 3 TS datasets

  • input length \(L\) = 96
  • output length \(T\) = 336


[ Electricity and ETTh2 ] Transformers fail to capture the scale and bias of the future data

[ Exchange-Rate ] hardly predict a proper trend on aperiodic data


(3) More Analyses on LTSF-Transformers

Q1. Can existing LTSF-Transformers extract temporal relations well from longer input sequences?

Size of the look-back window \(L\)

  • greatly impacts forecasting accuracy


Powerful TSF model with a strong temporal relation extraction capability :

  • larger \(L\), better results!


To study the impact of \(L\)…

  • conduct experiments with \(L \in\) \(\{24,48,72,96,120,144,168,192,336,504,672,720\}\)
  • where \(T\) = 720



  • existing Transformer-based models’ performance deteriorates or stays flat as \(L\) grows
  • ( \(\leftrightarrow\) LTSF-Linear : boosted with larger \(L\) )

\(\rightarrow\) Transformers : tend to overfit temporal noise

( thus an input size of 96 is exactly suitable for most Transformers )


Q2. What can be learned for long-term forecasting?

Hypothesize that long-term forecasting depends only on whether models can capture the trend and periodicity well.

( That is, the farther the forecasting horizon, the less impact the look-back window itself has. )



Experiment

  • \(T\) = 720 time steps
  • Lookback \(L\) = 96
    • ver 1) original input \(L=96\) setting (called Close)
    • ver 2) far input \(L=96\) setting (called Far) : a window of the same length but farther away from ( i.e., earlier than ) the forecast horizon


6. Implementation Details

For existing Transformer-based TSF solutions:

  • Autoformer, Informer, and the vanilla Transformer : from Autoformer [28]
  • FEDformer and Pyraformer : from their respective code

( + adopt their default hyper-parameters to train the models )


DLinear

  • MA kernel size of 25 ( same as Autoformer )
  • # of params ( worked example after this list )
    • Linear : \(T\times L\)
    • NLinear : \(T\times L\)
    • DLinear : \(2\times T\times L\)
  • LTSF-Linear tends to underfit when \(L\) is small
  • LTSF-Transformers tend to overfit when \(L\) is large
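
A worked example of the parameter counts above, for the illustrative setting \(L=336\) ( the look-back later used for LTSF-Linear ) and \(T=720\) ( the longest horizon ), ignoring bias terms :

\[
\text{Linear / NLinear : } T \times L = 720 \times 336 = 241{,}920
\qquad
\text{DLinear : } 2 \times T \times L = 483{,}840
\]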


To compare the best performance of existing LTSF-Transformers with LTSF-Linear

  • use \(L=336\) for LTSF-Linear
  • use \(L=96\) for Transformers


7. LTSF & STSF



8. Distribution shift

Train vs Test
