CrossFormer: Transformer Utilizing Cross-dimension Dependency for MTS Forecasting
Contents
- Abstract
- Introduction
- Related Works
- MTS forecasting
- Transformers for MTS forecasting
- Vision Transformers
- Methodology
- Dimension-Segment-Wise Embedding
- Two-Stage Attention Layer
- Hierarchical Encoder-Decoder
- Experiments
- Main Results
- Ablation Study
0. Abstract
MTS forecasting
- Transformer-based models : can capture long-term dependency
\(\rightarrow\) however … mainly focus on modeling the temporal (cross-time) dependency
( omit the dependency among different variables ( = cross-dimension dependency ) )
Propose Crossformer
- Transformer-based model utilizing cross-dimension dependency for MTS forecasting.
Details:
- (step 1) Input MTS is embedded into a 2D vector array
  - via Dimension-Segment-Wise (DSW) embedding
    ( to preserve time and dimension information )
- (step 2) Two-Stage Attention (TSA) layer
  - efficiently captures the cross-time & cross-dimension dependency
Using both DSW & TSA, Crossformer establishes a Hierarchical Encoder-Decoder (HED) to use the information at different scales for the final forecasting.
1. Introduction
Transformer for MTS forecasting:
- ability to capture long-term temporal dependency (cross-time dependency).
Cross-dimension dependency is also critical for MTS forecasting !
- ex) previous neural models : explicitly capture the cross-dimension dependency, using CNN/GNN
- However, recent Transformer-based models …
  \(\rightarrow\) only implicitly utilize this dependency via embedding
  - generally embed data points in all dimensions at the same time step into a feature vector
    & try to capture dependency among different time steps (Fig. 1 (b))
    \(\rightarrow\) cross-time dependency is well captured, but cross-dimension dependency is not
Crossformer
- a Transformer-based model that explicitly utilizes cross-dimension dependency for MTS forecasting.
(1) Dimension-Segment-Wise (DSW) embedding
- to process the historical time series
- step 1) the series in each dimension is partitioned into segments
- step 2) each segment is embedded into a feature vector
  ( output = 2D vector array … two axes = time & dimension )
(2) Two-Stage Attention (TSA) layer
- efficiently captures the cross-time & cross-dimension dependency
\(\rightarrow\) using (1) & (2) … Crossformer establishes a Hierarchical Encoder-Decoder (HED)
- each layer = each scale
- [Encoder] upper layer merges adjacent segments output by the lower layer
  ( to capture a coarser scale )
- [Decoder] generates predictions at different scales & adds them up as the final prediction.
Contribution
- existing Transformer-based models : cross-dimension dependency is not well utilized
  \(\rightarrow\) without adequate and explicit mining and utilization of cross-dimension dependency, their forecasting capability is empirically shown to be limited.
- develop Crossformer
- extensive experimental results
2. Related Works
(1) MTS forecasting
Divided into (1) statistical & (2) neural models
a) Statistical
- Vector auto-regressive (VAR) model
- Vector auto-regressive moving average (VARMA)
\(\rightarrow\) assume linear cross-dimension & cross-time dependency.
b) Neural models
- TCN (Lea et al., 2017) & DeepAR (Flunkert et al., 2017)
- treat the MTS data as a sequence of vectors and use CNN/RNN to capture the temporal dependency
- LSTnet (Lai et al., 2018)
- CNN : for cross-dimension dependency
- RNN : for cross-time dependency
- GNNs
- to capture the cross-dimension dependency explicitly for forecasting
- ex) MTGNN (Wu et al., 2020) :
- temporal convolution : cross-time
- graph convolution : cross-dimension
\(\rightarrow\) these models capture the cross-time dependency through CNN or RNN, which has difficulty modeling long-term dependency
(2) Transformers for MTS forecasting
LogTrans (Li et al., 2019b)
- proposes the LogSparse attention ( reduces the computation complexity )
- from \(O\left(L^2\right)\) to \(O\left(L(\log L)^2\right)\)
Informer (Zhou et al., 2021)
- utilizes the sparsity of attention score through KL divergence estimation
- proposes ProbSparse self-attention
- achieves \(O(L \log L)\) complexity.
Autoformer (Wu et al., 2021a)
- introduces a decomposition architecture with an Auto-Correlation mechanism
- achieves the \(O(L \log L)\) complexity.
Pyraformer (Liu et al., 2021a)
- pyramidal attention module
- summarizes features at different resolutions & models the temporal dependencies of different ranges
- complexity of \(O(L)\).
FEDformer (Zhou et al., 2022)
- have a sparse representation in frequency domain
- develop a frequency enhanced Transformer
- \(O(L)\) complexity.
Preformer (Du et al., 2022)
- divides the embedded feature vector sequence into segments
- utilizes segment-wise correlation-based attention
\(\rightarrow\) These models mainly focus on reducing the complexity of cross-time dependency modeling,
but omit the cross-dimension dependency
(3) Vision Transformers
ViT (Dosovitskiy et al., 2021)
- one of the pioneers of vision transformers
- Basic idea of ViT
  - split an image into non-overlapping medium-sized patches
  - rearrange these patches into a sequence ( to be input to the Transformer )
- Idea of partitioning images into patches
  \(\rightarrow\) inspires the DSW embedding, where the MTS is split into dimension-wise segments
Swin Transformer (Liu et al., 2021b)
- performs local attention within a window ( to reduce the complexity )
- builds hierarchical feature maps by merging image patches
3. Methodology
Notation
- future value : \(\mathbf{x}_{T+1: T+\tau} \in\) \(\mathbb{R}^{\tau \times D}\)
- input value : \(\mathbf{x}_{1: T} \in \mathbb{R}^{T \times D}\)
- number of dimensions : \(D>1\)
Section 3.1
- embed the MTS using Dimension-Segment-Wise (DSW) embedding
- To utilize the cross-dimension dependency
Section 3.2
- propose a Two-Stage Attention (TSA) layer
- to efficiently capture the dependency among the embedded segments
Section 3.3
- construct a hierarchical encoder-decoder (HED), using DSW embedding and TSA layer
- to utilize information at different scales
(1) Dimension-Segment-Wise Embedding
Embedding of the previous Transformer-based models for MTS forecasting ( Fig. 1 (b) )
- step 1) Embed data points at the same time step into a vector: \(\mathbf{x}_t \rightarrow \mathbf{h}_t, \mathbf{x}_t \in \mathbb{R}^D, \mathbf{h}_t \in \mathbb{R}^{d_{\text {model }}}\),
- \(\mathbf{x}_t\) : all the data points in \(D\) dimensions at step \(t\).
- input \(\mathbf{x}_{1: T}\) is embedded into \(T\) vectors \(\left\{\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_T\right\}\).
- step 2) Dependency among the \(T\) vectors is captured for forecasting.
\(\rightarrow\) previous methods mainly capture cross-time dependency
( cross-dimension dependency is not explicitly captured during embedding )
Fig. 1 (a)
- typical attention score map of original Transformer
- attention values have a tendency to segment, i.e. close data points have similar attention weights.
\(\rightarrow\) Argue that an embedded vector should represent a series segment of single dimension (Fig. 1 (c)), rather than the values of all dimensions at single step (Fig. 1 (b)).
propose Dimension-Segment-Wise (DSW) embedding
[ Step 1 ] Points in each dimension are divided into segments of length \(L_{s e g}\)
\(\begin{aligned} \mathbf{x}_{1: T} & =\left\{\mathbf{x}_{i, d}^{(s)} \mid 1 \leq i \leq \frac{T}{L_{\text {seg }}}, 1 \leq d \leq D\right\} \\ \mathbf{x}_{i, d}^{(s)} & =\left\{x_{t, d} \mid(i-1) \times L_{\text {seg }}<t \leq i \times L_{\text {seg }}\right\} \end{aligned}\).
- where \(\mathbf{x}_{i, d}^{(s)} \in \mathbb{R}^{L_{\text {seg }}}\) is the \(i\)-th segment in dimension \(d\) with length \(L_{\text {seg }}\).
[ Step 2 ] Each segment is embedded into a vector
- using a linear projection plus a position embedding
- \(\mathbf{h}_{i, d}=\mathbf{E} \mathbf{x}_{i, d}^{(s)}+\mathbf{E}_{i, d}^{(pos)}\)
- \(\mathbf{E} \in \mathbb{R}^{d_{\text {model }} \times L_{\text {seg }}}\) : the learnable projection matrix
- \(\mathbf{E}_{i, d}^{(\text {pos })} \in \mathbb{R}^{d_{\text {model }}}\) : the learnable position embedding for position \((i, d)\).
[ Output ] obtain a \(2 \mathrm{D}\) vector array \(\mathbf{H}=\left\{\mathbf{h}_{i, d} \mid 1 \leq i \leq \frac{T}{L_{\text {seg }}}, 1 \leq d \leq D\right\}\),
- each \(\mathbf{h}_{i, d}\) represents a univariate time series segment.
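A minimal PyTorch sketch of DSW embedding following the formulas above (the class name `DSWEmbedding`, the use of `nn.Linear` to realize \(\mathbf{E}\), and the tensor layout are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class DSWEmbedding(nn.Module):
    """Dimension-Segment-Wise embedding: each length-L_seg segment of each
    dimension is linearly projected to d_model and given a position embedding."""
    def __init__(self, seg_len: int, d_model: int, n_segments: int, n_dims: int):
        super().__init__()
        self.seg_len = seg_len
        self.proj = nn.Linear(seg_len, d_model)                             # E
        self.pos = nn.Parameter(torch.zeros(n_segments, n_dims, d_model))   # E^(pos)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, D) with T divisible by seg_len
        b, T, D = x.shape
        L = T // self.seg_len
        # (batch, T, D) -> (batch, L, seg_len, D) -> (batch, L, D, seg_len)
        segs = x.reshape(b, L, self.seg_len, D).permute(0, 1, 3, 2)
        return self.proj(segs) + self.pos                                   # (batch, L, D, d_model)

# usage: D = 7 dimensions, T = 96 steps, L_seg = 24 -> a 4 x 7 array of 64-d vectors
emb = DSWEmbedding(seg_len=24, d_model=64, n_segments=4, n_dims=7)
print(emb(torch.randn(2, 96, 7)).shape)  # torch.Size([2, 4, 7, 64])
```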
(2) Two-Stage Attention Layer
The 2D array \(\mathbf{H}\) could be flattened into a 1D sequence
\(\rightarrow\) to be input to a canonical Transformer architecture
Specific considerations:
- (1) Different from images, where the axes of height and width are interchangeable …
  the axes of time and dimension for MTS have different meanings and thus should be treated differently
- (2) Directly applying self-attention to the 2D array causes a complexity of \(O\left(D^2 \frac{T^2}{L_{\text{seg}}^2}\right)\) …
  which is unaffordable for large \(D\).
\(\rightarrow\) propose the Two-Stage Attention (TSA) Layer
- to capture cross-time and cross-dimension dependency
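As a rough sense of scale (hypothetical numbers, not taken from the paper): with \(D=100\), \(T=672\) and \(L_{seg}=24\), each dimension has \(L = T/L_{seg} = 28\) segments, i.e. \(D \cdot L = 2800\) embedded vectors in total. Full self-attention over the flattened array scores \((DL)^2 \approx 7.8 \times 10^6\) pairs, whereas the TSA layer below costs on the order of \(DL^2 + 2cDL = 78{,}400 + 56{,}000 \approx 1.3 \times 10^5\) with \(c = 10\) routers, roughly a 60x reduction.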
a) Cross-Time stage
Notation
- Input : 2D array \(\mathbf{Z} \in \mathbb{R}^{L \times D \times d_{\text{model}}}\)
  ( = output of DSW embedding or lower TSA layers )
  - \(L\) : number of segments
  - \(D\) : number of dimensions
- \(\mathbf{Z}_{i,:}\) : vectors of all dimensions at time step \(i\)
- \(\mathbf{Z}_{:, d}\) : vectors of all time steps in dimension \(d\)
Apply multi-head self-attention (MSA) to each dimension:
\(\begin{aligned} \hat{\mathbf{Z}}_{:, d}^{\text {time }} & =\text { LayerNorm }\left(\mathbf{Z}_{:, d}+\operatorname{MSA}^{\text {time }}\left(\mathbf{Z}_{:, d}, \mathbf{Z}_{:, d}, \mathbf{Z}_{:, d}\right)\right) \\ \mathbf{Z}^{\text {time }} & =\operatorname{LayerNorm}\left(\hat{\mathbf{Z}}^{\text {time }}+\operatorname{MLP}\left(\hat{\mathbf{Z}}^{\text {time }}\right)\right) \end{aligned}\).
- where \(1 \leq d \leq D\)
- all dimensions \((1 \leq d \leq D)\) share the same MSA layer
Computation complexity of cross-time stage = \(O\left(D L^2\right)\).
Dependency among time segments in the same dimension is captured in \(\mathbf{Z}^{\text {time }}\).
\(\rightarrow\) \(\mathbf{Z}^{\text {time }}\) becomes the input of Cross-Dimension Stage
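A minimal PyTorch sketch of the Cross-Time stage under the equations above (module names, head count, and the use of `nn.MultiheadAttention` are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class CrossTimeStage(nn.Module):
    """Cross-Time stage: the same MSA + MLP block is applied along the time axis
    of every dimension (complexity O(D * L^2))."""
    def __init__(self, d_model: int, n_heads: int = 4, d_ff: int = 128):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, L, D, d_model); fold the dimension axis into the batch so that
        # attention runs over the L segments of each dimension independently
        b, L, D, d = z.shape
        z_d = z.permute(0, 2, 1, 3).reshape(b * D, L, d)     # (b*D, L, d_model)
        attn_out, _ = self.attn(z_d, z_d, z_d)
        z_d = self.norm1(z_d + attn_out)
        z_d = self.norm2(z_d + self.mlp(z_d))
        return z_d.reshape(b, D, L, d).permute(0, 2, 1, 3)   # back to (batch, L, D, d_model)

# usage with the 2 x 4 x 7 x 64 array from the DSW embedding sketch
stage = CrossTimeStage(d_model=64)
print(stage(torch.randn(2, 4, 7, 64)).shape)  # torch.Size([2, 4, 7, 64])
```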
b) Cross-Dimension stage
Cross-Time stage
- we can use a large \(L_{\text{seg}}\) for long sequences in DSW embedding
  ( to reduce the number of segments \(L\) in the cross-time stage )
Cross-Dimension stage
- we cannot partition the dimensions and directly apply MSA
- Instead, we propose the router mechanism for potentially large \(D\) ( see the sketch below )
  - set a small fixed number \((c \ll D)\) of learnable vectors for each time step \(i\) as routers
  - complexity : \(O(D^2 L) \rightarrow O(DL)\)
Routers
- step 1) aggregate messages from all dimensions
- step 2) distribute the received messages among dimensions
\(\rightarrow\) an all-to-all connection among the \(D\) dimensions is built:
\(\begin{aligned} \mathbf{B}_{i,:} & =\operatorname{MSA}_1^{d i m}\left(\mathbf{R}_{i,:}, \mathbf{Z}_{i,:}^{\text {time }}, \mathbf{Z}_{i,:}^{\text {time }}\right), 1 \leq i \leq L \\ \overline{\mathbf{Z}}_{i,:}^{d i m} & =\operatorname{MSA}_2^{d i m}\left(\mathbf{Z}_{i,:}^{\text {time }}, \mathbf{B}_{i,:}, \mathbf{B}_{i,:}\right), 1 \leq i \leq L \\ \hat{\mathbf{Z}}^{\text {dim }} & =\text { LayerNorm }\left(\mathbf{Z}^{\text {time }}+\overline{\mathbf{Z}}^{\text {dim }}\right) \\ \mathbf{Z}^{\text {dim }} & =\text { LayerNorm }\left(\hat{\mathbf{Z}}^{\text {dim }}+\operatorname{MLP}\left(\hat{\mathbf{Z}}^{d i m}\right)\right) \end{aligned}\).
- \(\mathbf{R} \in \mathbb{R}^{L \times c \times d_{\text {model }}}\) : learnable vector array ( = routers )
- \(\mathbf{B} \in \mathbb{R}^{L \times c \times d_{\text {model }}}\) : aggregated messages from all dimensions
- \(\overline{\mathbf{Z}}^{\text {dim }}\) : output of the router mechanism.
- \(\hat{\mathbf{Z}}^{\text {dim }}\) : output of skip connection
- \(\mathbf{Z}^{\text {dim }}\) : output of MLP
All time steps \((1 \leq i \leq L)\) share the same \(\mathrm{MSA}_1^{d i m}, \mathrm{MSA}_2^{\text {dim }}\).
\(\mathbf{Y}=\mathbf{Z}^{\text {dim }}=\mathrm{TSA}(\mathbf{Z})\).
- \(\mathbf{Z}\) : input vector array of TSA layer
- \(\mathbf{Y} \in \mathbb{R}^{L \times D \times d_{\text {model }}}\) : output vector array of TSA layer
Overall computation complexity of TSA layer
- \(O\left(D L^2+D L\right)=O\left(D L^2\right)\).
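A corresponding sketch of the Cross-Dimension stage with the router mechanism (again a hedged illustration: the router count, head count, and module names are assumptions, not the official implementation):

```python
import torch
import torch.nn as nn

class CrossDimensionStage(nn.Module):
    """Cross-Dimension stage with the router mechanism: c learnable routers per time
    step first gather messages from all D dimensions (MSA_1), then redistribute them
    back to the dimensions (MSA_2), giving O(2cDL) = O(DL) instead of O(D^2 L)."""
    def __init__(self, d_model: int, n_segments: int, n_routers: int = 10,
                 n_heads: int = 4, d_ff: int = 128):
        super().__init__()
        self.routers = nn.Parameter(torch.randn(n_segments, n_routers, d_model))  # R
        self.msa1 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.msa2 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, z_time: torch.Tensor) -> torch.Tensor:
        # z_time: (batch, L, D, d_model); fold the time axis into the batch so that
        # each time step attends across its D dimensions independently
        b, L, D, d = z_time.shape
        z_i = z_time.reshape(b * L, D, d)
        r = self.routers.unsqueeze(0).expand(b, -1, -1, -1).reshape(b * L, -1, d)
        buf, _ = self.msa1(r, z_i, z_i)        # B: routers aggregate from all dimensions
        z_bar, _ = self.msa2(z_i, buf, buf)    # dimensions read the messages back
        z_hat = self.norm1(z_i + z_bar)
        out = self.norm2(z_hat + self.mlp(z_hat))
        return out.reshape(b, L, D, d)

# usage: same 2 x 4 x 7 x 64 array as before
stage = CrossDimensionStage(d_model=64, n_segments=4)
print(stage(torch.randn(2, 4, 7, 64)).shape)  # torch.Size([2, 4, 7, 64])
```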
Summary
After the Cross-Time and Cross-Dimension Stages …
every two segments (i.e. \(\mathbf{Z}_{i_1, d_1}, \mathbf{Z}_{i_2, d_2}\) ) in \(\mathbf{Z}\) are connected,
\(\rightarrow\) both cross-time and cross-dimension dependencies are captured in \(\mathbf{Y}\).
(3) Hierarchical Encoder-Decoder
to capture information at different scales
use (1) DSW embedding & (2) TSA layer & (3) segment merging
\(\rightarrow\) to construct a Hierarchical Encoder-Decoder (HED).
- Upper layer utilizes information at a coarser scale for forecasting.
- Final results = add forecasting values at different scales
a) Encoder
Coarser Level : every two adjacent vectors in time domain are merged
\(\rightarrow\) then, TSA layer is applied to capture dependency at this scale.
\(\mathbf{Z}^{e n c, l}=\operatorname{Encoder}\left(\mathbf{Z}^{e n c, l-1}\right)\).
Details :
\(\begin{aligned} & l=1: \hat{\mathbf{Z}}^{enc, l}=\mathbf{H} \\ & l>1: \hat{\mathbf{Z}}_{i, d}^{enc, l}=\mathbf{M}\left[\mathbf{Z}_{2i-1, d}^{enc, l-1} \cdot \mathbf{Z}_{2i, d}^{enc, l-1}\right], \quad 1 \leq i \leq \frac{L_{l-1}}{2}, 1 \leq d \leq D \\ & \mathbf{Z}^{enc, l}=\operatorname{TSA}\left(\hat{\mathbf{Z}}^{enc, l}\right) \end{aligned}\).
- \(\mathbf{H}\) : 2D array obtained by DSW embedding
- \(\mathbf{Z}^{enc, l}\) : the output of the \(l\)-th encoder layer
- \(\mathbf{M} \in \mathbb{R}^{d_{\text{model}} \times 2 d_{\text{model}}}\) : a learnable matrix for segment merging ( the two adjacent segment vectors are concatenated before projection )
- \(L_{l-1}\) : the number of segments in each dimension in layer \(l-1\)
- \(\hat{\mathbf{Z}}^{enc, l}\) : the array after segment merging in the \(l\)-th layer
Suppose there are \(N\) layers in the encoder
\(\rightarrow\) use \(\mathbf{Z}^{enc, 0}, \mathbf{Z}^{enc, 1}, \ldots, \mathbf{Z}^{enc, N}\) \(\left(\mathbf{Z}^{enc, 0}=\mathbf{H}\right)\) to represent the \(N+1\) outputs of the encoder.
- complexity of each encoder layer : \(O\left(D \frac{T^2}{L_{\text {seg }}^2}\right)\).
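A minimal sketch of the segment-merging step used between encoder layers (the class name `SegmentMerging` and the `nn.Linear` realization of \(\mathbf{M}\) are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SegmentMerging(nn.Module):
    """Merge every two adjacent segments along the time axis and project the
    concatenated 2*d_model features back to d_model with a learnable matrix M."""
    def __init__(self, d_model: int):
        super().__init__()
        self.merge = nn.Linear(2 * d_model, d_model)   # M

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, L, D, d_model) with L even
        b, L, D, d = z.shape
        # pair up segments 2i-1 and 2i, concatenate along the feature axis
        z = z.reshape(b, L // 2, 2, D, d).permute(0, 1, 3, 2, 4).reshape(b, L // 2, D, 2 * d)
        return self.merge(z)                           # (batch, L/2, D, d_model)

# an encoder layer at scale l > 1 would then be roughly: TSA(SegmentMerging(z_prev))
merge = SegmentMerging(d_model=64)
print(merge(torch.randn(2, 4, 7, 64)).shape)  # torch.Size([2, 2, 7, 64])
```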
b) Decoder
Output of encoder : \(N+1\) feature arrays
Layer \(l\) of decoder :
- Input : \(l\)-th encoded array
- Output : decoded \(2 \mathrm{D}\) array of layer \(l\).
Summary : \(\mathbf{Z}^{\text {dec }, l}=\operatorname{Decoder}\left(\mathbf{Z}^{\text {dec }, l-1}, \mathbf{Z}^{\text {enc }, l}\right)\) :
Linear projection is applied to each layer’s output
- to yield the prediction of each layer
Layer predictions
- summed to make the final prediction (for \(l=0, \ldots, N\) ):
\(\begin{gathered} \text{for } l=0, \ldots, N: \quad \mathbf{x}_{i, d}^{(s), l}=\mathbf{W}^l \mathbf{Z}_{i, d}^{\text{dec}, l}, \quad \mathbf{x}_{T+1: T+\tau}^{\text{pred}, l}=\left\{\mathbf{x}_{i, d}^{(s), l} \mid 1 \leq i \leq \frac{\tau}{L_{\text{seg}}}, 1 \leq d \leq D\right\} \\ \mathbf{x}_{T+1: T+\tau}^{\text{pred}}=\sum_{l=0}^N \mathbf{x}_{T+1: T+\tau}^{\text{pred}, l} \end{gathered}\).
- \(\mathbf{W}^l \in \mathbb{R}^{L_{s e g} \times d_{\text {model }}}\) : a learnable matrix to project a vector to a time series segment.
- \(\mathbf{x}_{i, d}^{(s), l} \in \mathbb{R}^{L_{s e g}}\) : the \(i\)-th segment in dimension \(d\) of the prediction
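A small sketch of how the per-layer decoder outputs could be projected back to segments and summed into the final forecast (the function and variable names are hypothetical; each \(\mathbf{W}^l\) is realized here as an `nn.Linear(d_model, L_seg)`):

```python
import torch
import torch.nn as nn

def sum_scale_predictions(dec_outputs, proj_layers, seg_len: int):
    """Project each decoder layer's output back to time-series segments and sum
    the per-scale forecasts into the final prediction.

    dec_outputs: list of N+1 tensors of shape (batch, L_out, D, d_model)
    proj_layers: list of N+1 nn.Linear(d_model, seg_len) modules (the W^l)
    returns:     (batch, tau, D) with tau = L_out * seg_len
    """
    preds = []
    for z, proj in zip(dec_outputs, proj_layers):
        b, L_out, D, _ = z.shape
        seg_pred = proj(z)                                   # (batch, L_out, D, seg_len)
        # stitch the segments back into a series of length tau = L_out * seg_len
        series = seg_pred.permute(0, 1, 3, 2).reshape(b, L_out * seg_len, D)
        preds.append(series)
    return torch.stack(preds).sum(dim=0)

# usage with hypothetical sizes: 2 scales, tau = 48, D = 7, seg_len = 24
outs = [torch.randn(2, 2, 7, 64) for _ in range(2)]
projs = [nn.Linear(64, 24) for _ in range(2)]
print(sum_scale_predictions(outs, projs, seg_len=24).shape)  # torch.Size([2, 48, 7])
```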