CrossFormer: Transformer Utilizing Cross-dimension Dependency for MTS Forecasting


Contents

  1. Abstract
  2. Introduction
  3. Related Works
    1. MTS forecasting
    2. Transformers for MTS forecasting
    3. Vision Transformers
  4. Methodology
    1. Dimension-Segment-Wise Embedding
    2. Two-Stage Attention Layer
    3. Hierarchical Encoder-Decoder
  5. Experiments
    1. Main Results
    2. Ablation Study


0. Abstract

MTS forecasting

  • Transformer-based models : can capture long-term dependency

however … these models mainly focus on modeling the temporal (cross-time) dependency

( and omit the dependency among different variables, i.e. the cross-dimension dependency )


Propose Crossformer

  • Transformer-based model utilizing cross-dimension dependency for MTS forecasting.


Details:

  • (step 1) Input MTS is embedded into a 2D vector array

    • via Dimension-Segment-Wise (DSW) embedding

      ( to preserve time and dimension information )

  • (step 2) Two-Stage Attention (TSA) layer

    • efficiently capture the cross-time & cross-dimension dependency

Using both DSW embedding & the TSA layer, Crossformer establishes a Hierarchical Encoder-Decoder (HED) to use information at different scales for the final forecast.


1. Introduction

Transformer for MTS forecasting:

  • ability to capture long-term temporal dependency (cross-time dependency).


Cross-dimension dependency is also critical for MTS forecasting !

  • ex) previous neural models : explicitly capture the cross-dimension dependency, using CNN/GNN

  • However, recent Transformer-based models …

    only implicitly utilize this dependency by embedding

    • generally embed data points in all dimensions at the same time step into a feature vector

      & try to capture dependency among different time steps (Fig. 1 (b))

      cross-time dependency is well captured, but cross-dimension dependency is not


Crossformer

a Transformer-based model that explicitly utilizes cross-dimension dependency for MTS forecasting.

  • (1) Dimension-Segment-Wise (DSW) embedding

    • to process the historical time series.

    • step 1) the series in each dimension is partitioned into segments

    • step 2) each segment is embedded into a feature vector

      ( output = a 2D vector array … the two axes = time & dimension )

  • (2) Two-Stage-Attention (TSA) layer

    • efficiently capture the cross-time & cross-dimension dependency

using (1) & (2) … establishes a Hierarchical Encoder-Decoder (HED)

  • each layer corresponds to a scale

  • [Encoder] each upper layer merges adjacent segments output by the lower layer

    ( to capture dependency at a coarser scale )

  • [Decoder] layers generate predictions at different scales & add them up as the final prediction.


Contribution

  • existing Transformer-based models : cross-dimension dependency is not well utilized

    Without adequate and explicit mining and utilization of cross-dimension dependency, their forecasting capability is empirically shown to be limited.

  • develop Crossformer

  • extensive experimental results


2. Related Works

(1) MTS forecasting

Divided into (1) statistical & (2) neural models


a) Statistical

  • Vector auto-regressive (VAR) model
  • Vector auto-regressive moving average (VARMA)

assume linear cross-dimension & cross-time dependency.


b) Neural models

  • TCN (Lea et al., 2017) & DeepAR (Flunkert et al., 2017)
    • treat the MTS data as a sequence of vectors and use CNN/RNN to capture the temporal dependency
  • LSTnet (Lai et al., 2018)
    • CNN : for cross-dimension dependency
    • RNN : for cross-time dependency
  • GNNs
    • to capture the cross-dimension dependency explicitly for forecasting
    • ex) MTGNN (Wu et al., 2020) :
      • temporal convolution : cross-time
      • graph convolution : cross-dimension

These models capture the cross-time dependency through CNN or RNN … and thus have difficulty in modeling long-term dependency


(2) Transformers for MTS forecasting

LogTrans (Li et al., 2019b)

  • proposes the LogSparse attention ( reduces the computation complexity )
    • from $O(L^2)$ to $O(L (\log L)^2)$

Informer (Zhou et al., 2021)

  • utilizes the sparsity of attention score through KL divergence estimation
  • proposes ProbSparse self-attention
    • achieves $O(L \log L)$ complexity.

Autoformer (Wu et al., 2021a)

  • introduces a decomposition architecture with an Auto-Correlation mechanism
    • achieves $O(L \log L)$ complexity.

Pyraformer (Liu et al., 2021a)

  • pyramidal attention module
  • summarizes features at different resolutions & models the temporal dependencies of different ranges
    • complexity of $O(L)$.

FEDformer (Zhou et al., 2022)

  • assumes that MTS have a sparse representation in the frequency domain
  • develops a frequency enhanced Transformer
    • $O(L)$ complexity.

Preformer (Du et al., 2022)

  • divides the embedded feature vector sequence into segments
  • utilizes segment-wise correlation-based attention


These models mainly focus on reducing the complexity of cross-time dependency modeling,

but omit the cross-dimension dependency.


(3) Vision Transformers

ViT (Dosovitskiy et al., 2021)

  • one of the pioneers of vision transformers.

  • Basic idea of ViT

    • split an image into non-overlapping medium-sized patches

    • rearranges these patches into a sequence ( to be input to the Transformer )


The idea of partitioning images into patches

inspires our DSW embedding where MTS is split into dimension-wise segments


Swin Transformer (Liu et al., 2021b)

  • performs local attention within a window ( to reduce the complexity )
  • builds hierarchical feature maps by merging image patches


3. Methodology

Notation

  • future value : $\mathbf{x}_{T+1:T+\tau} \in \mathbb{R}^{\tau \times D}$
  • input value : $\mathbf{x}_{1:T} \in \mathbb{R}^{T \times D}$
  • number of dimensions : $D > 1$


(figure)


Section 3.1

  • embed the MTS using Dimension-Segment-Wise (DSW) embedding
  • To utilize the cross-dimension dependency


Section 3.2

  • propose a Two-Stage Attention (TSA) layer
  • to efficiently capture the dependency among the embedded segments


Section 3.3

  • construct a hierarchical encoder-decoder (HED), using DSW embedding and TSA layer
  • to utilize information at different scales


(1) Dimension-Segment-Wise Embedding

Embedding of the previous Transformer-based models for MTS forecasting ( Fig. 1 (b) )

  • step 1) Embed the data points at the same time step into a vector: $\mathbf{x}_t \to \mathbf{h}_t$, with $\mathbf{x}_t \in \mathbb{R}^{D}$, $\mathbf{h}_t \in \mathbb{R}^{d_{model}}$
    • $\mathbf{x}_t$ : all the data points in $D$ dimensions at step $t$
    • the input $\mathbf{x}_{1:T}$ is embedded into $T$ vectors $\{\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_T\}$
  • step 2) Dependency among the $T$ vectors is captured for forecasting.

previous methods mainly capture cross-time dependency

( cross-dimension dependency is not explicitly captured during embedding )
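
For contrast, here is a minimal PyTorch sketch of this conventional point-wise embedding (all names and shapes are illustrative assumptions, not taken from any particular implementation): the $D$ values at one time step are projected into a single $d_{model}$-dimensional vector.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: batch B, input length T, number of dimensions D
B, T, D, d_model = 32, 168, 7, 256
x = torch.randn(B, T, D)              # input MTS x_{1:T}

point_embed = nn.Linear(D, d_model)   # mixes the D values of one time step into one vector
h = point_embed(x)                    # (B, T, d_model): the T vectors h_1, ..., h_T

# attention is then applied over the T time steps only,
# so cross-dimension dependency is never modeled explicitly
```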


Fig. 1 (a)

  • typical attention score map of original Transformer
  • attention values have a tendency to segment, i.e. close data points have similar attention weights.

Argue that an embedded vector should represent a series segment of a single dimension (Fig. 1 (c)), rather than the values of all dimensions at a single step (Fig. 1 (b)).


propose Dimension-Segment-Wise (DSW) embedding

[ Step 1 ] Points in each dimension are divided into segments of length $L_{seg}$

$$\mathbf{x}_{1:T} = \left\{ \mathbf{x}^{(s)}_{i,d} \,\middle|\, 1 \le i \le \frac{T}{L_{seg}},\ 1 \le d \le D \right\}, \quad \mathbf{x}^{(s)}_{i,d} = \left\{ x_{t,d} \,\middle|\, (i-1) \times L_{seg} < t \le i \times L_{seg} \right\}$$

  • where $\mathbf{x}^{(s)}_{i,d} \in \mathbb{R}^{L_{seg}}$ is the $i$-th segment in dimension $d$ with length $L_{seg}$.


[ Step 2 ] Each segment is embedded into a vector

  • using linear projection added with a position embedding

  • $\mathbf{h}_{i,d} = \mathbf{E}\, \mathbf{x}^{(s)}_{i,d} + \mathbf{E}^{(pos)}_{i,d}$

    • $\mathbf{E} \in \mathbb{R}^{d_{model} \times L_{seg}}$ : the learnable projection matrix
    • $\mathbf{E}^{(pos)}_{i,d} \in \mathbb{R}^{d_{model}}$ : the learnable position embedding for position $(i,d)$.


[ Output ] obtain a 2D vector array $\mathbf{H} = \left\{ \mathbf{h}_{i,d} \,\middle|\, 1 \le i \le \frac{T}{L_{seg}},\ 1 \le d \le D \right\}$

  • each $\mathbf{h}_{i,d}$ represents a univariate time series segment.
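
A minimal PyTorch sketch of DSW embedding under the notation above (a sketch only; tensor layouts, class and parameter names are our assumptions, not the official implementation):

```python
import torch
import torch.nn as nn

class DSWEmbedding(nn.Module):
    """Dimension-Segment-Wise embedding: one vector per (segment, dimension) pair."""
    def __init__(self, seg_len: int, d_model: int, num_seg: int, num_dim: int):
        super().__init__()
        self.seg_len = seg_len
        self.proj = nn.Linear(seg_len, d_model)                          # E
        self.pos = nn.Parameter(torch.zeros(num_seg, num_dim, d_model))  # E^{(pos)}

    def forward(self, x):                        # x: (B, T, D), T divisible by seg_len
        B, T, D = x.shape
        # step 1) split each dimension's series into segments of length L_seg
        x_seg = x.reshape(B, T // self.seg_len, self.seg_len, D)   # (B, L, L_seg, D)
        x_seg = x_seg.permute(0, 1, 3, 2)                          # (B, L, D, L_seg)
        # step 2) linear projection + position embedding
        return self.proj(x_seg) + self.pos                         # H: (B, L, D, d_model)

# usage: T = 168, L_seg = 24  ->  L = 7 segments per dimension
emb = DSWEmbedding(seg_len=24, d_model=256, num_seg=7, num_dim=7)
H = emb(torch.randn(32, 168, 7))                 # 2D vector array of embedded segments
```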


(2) Two-Stage Attention Layer

The 2D array $\mathbf{H}$ could be flattened into a 1D sequence

to be input to a canonical Transformer architecture …


Specific considerations:

  • (1) Different from images, where the axes of height and width are interchangeable,

    the axes of time and dimension for MTS have different meanings and thus should be treated differently

  • (2) Directly applying self-attention to the flattened 2D array causes a complexity of $O\!\left(\frac{D^2 T^2}{L_{seg}^2}\right)$,

    which is unaffordable for large $D$.
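
To spell out where this complexity comes from: flattening $\mathbf{H}$ yields $D \cdot \frac{T}{L_{seg}}$ tokens, and full self-attention is quadratic in the number of tokens:

$$\mathcal{O}\!\left( \left( D \cdot \frac{T}{L_{seg}} \right)^{2} \right) = \mathcal{O}\!\left( \frac{D^{2} T^{2}}{L_{seg}^{2}} \right)$$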

propose the Two-Stage Attention (TSA) Layer

  • to capture cross-time and cross-dimension dependency


a) Cross-Time stage

Notation

  • Input : 2D array $\mathbf{Z} \in \mathbb{R}^{L \times D \times d_{model}}$

    ( = output of DSW embedding or lower TSA layers )

    • $L$ : number of segments
    • $D$ : number of dimensions
  • $\mathbf{Z}_{i,:}$ : vectors of all dimensions at time step $i$

  • $\mathbf{Z}_{:,d}$ : vectors of all time steps in dimension $d$.


Multi-head self-attention (MSA) is applied to each dimension:

$$\hat{\mathbf{Z}}^{time}_{:,d} = \mathrm{LayerNorm}\left( \mathbf{Z}_{:,d} + \mathrm{MSA}^{time}(\mathbf{Z}_{:,d}, \mathbf{Z}_{:,d}, \mathbf{Z}_{:,d}) \right), \quad \mathbf{Z}^{time} = \mathrm{LayerNorm}\left( \hat{\mathbf{Z}}^{time} + \mathrm{MLP}(\hat{\mathbf{Z}}^{time}) \right)$$

  • where $1 \le d \le D$
  • all dimensions ($1 \le d \le D$) share the same MSA layer


Computation complexity of the cross-time stage = $O(D L^2)$.

Dependency among time segments in the same dimension is captured in $\mathbf{Z}^{time}$.

$\mathbf{Z}^{time}$ becomes the input of the Cross-Dimension Stage.
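
A minimal PyTorch sketch of the cross-time stage (a sketch under the assumption that a standard multi-head attention plus a two-layer MLP realizes the equations above; class and variable names are ours):

```python
import torch
import torch.nn as nn

class CrossTimeStage(nn.Module):
    """Cross-time stage: MSA over the L segments within each dimension.
    The same attention layer is shared by all D dimensions."""
    def __init__(self, d_model: int, n_heads: int = 4, d_ff: int = 512):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, z):                                    # z: (B, L, D, d_model)
        B, L, D, d = z.shape
        z_d = z.permute(0, 2, 1, 3).reshape(B * D, L, d)     # one length-L sequence per (batch, dim)
        attn_out, _ = self.attn(z_d, z_d, z_d)               # MSA^time(Z_{:,d}, Z_{:,d}, Z_{:,d})
        z_hat = self.norm1(z_d + attn_out)                   # \hat{Z}^time
        z_time = self.norm2(z_hat + self.mlp(z_hat))         # Z^time
        return z_time.reshape(B, D, L, d).permute(0, 2, 1, 3)  # back to (B, L, D, d_model)
```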


b) Cross-Dimension stage

Cross-Time stage

  • we can use a large $L_{seg}$ for long input sequences in the DSW embedding

    ( to reduce the number of segments $L$ in the cross-time stage )


Cross-Dimension Stage

  • we cannot partition the dimensions, so directly applying MSA over all $D$ dimensions is too expensive for large $D$

  • Instead, we propose the router mechanism for potentially large $D$.

    • set a small, fixed number ($c \ll D$) of learnable vectors for each time step $i$ as routers
    • complexity : $O(D^2 L) \to O(D L)$
  • Routers

    • step 1) aggregate messages from all dimensions
    • step 2) distribute the received messages among dimensions

    an all-to-all connection among the $D$ dimensions is built:


$$\begin{aligned}
\mathbf{B}_{i,:} &= \mathrm{MSA}^{dim}_1(\mathbf{R}_{i,:}, \mathbf{Z}^{time}_{i,:}, \mathbf{Z}^{time}_{i,:}), \quad 1 \le i \le L \\
\bar{\mathbf{Z}}^{dim}_{i,:} &= \mathrm{MSA}^{dim}_2(\mathbf{Z}^{time}_{i,:}, \mathbf{B}_{i,:}, \mathbf{B}_{i,:}), \quad 1 \le i \le L \\
\hat{\mathbf{Z}}^{dim} &= \mathrm{LayerNorm}\left( \mathbf{Z}^{time} + \bar{\mathbf{Z}}^{dim} \right) \\
\mathbf{Z}^{dim} &= \mathrm{LayerNorm}\left( \hat{\mathbf{Z}}^{dim} + \mathrm{MLP}(\hat{\mathbf{Z}}^{dim}) \right)
\end{aligned}$$

  • $\mathbf{R} \in \mathbb{R}^{L \times c \times d_{model}}$ : learnable vector array ( = routers )
  • $\mathbf{B} \in \mathbb{R}^{L \times c \times d_{model}}$ : aggregated messages from all dimensions
  • $\bar{\mathbf{Z}}^{dim}$ : output of the router mechanism
  • $\hat{\mathbf{Z}}^{dim}$ : output of the skip connection
  • $\mathbf{Z}^{dim}$ : output of the MLP

All time steps ($1 \le i \le L$) share the same $\mathrm{MSA}^{dim}_1$ and $\mathrm{MSA}^{dim}_2$.
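
A minimal PyTorch sketch of the cross-dimension stage with the router mechanism (again a sketch: a fixed number of segments is assumed so the routers $\mathbf{R}$ fit in one parameter, and the value of c is illustrative; names are ours):

```python
import torch
import torch.nn as nn

class CrossDimensionStage(nn.Module):
    """Cross-dimension stage with the router mechanism: c learnable routers per time step
    aggregate messages from all D dimensions and distribute them back,
    reducing the cost from O(D^2 L) to O(c D L) = O(D L) for fixed c."""
    def __init__(self, d_model: int, num_seg: int, c: int = 5, n_heads: int = 4, d_ff: int = 512):
        super().__init__()
        self.router = nn.Parameter(torch.randn(num_seg, c, d_model))            # R
        self.attn1 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # MSA^dim_1
        self.attn2 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # MSA^dim_2
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, z_time):                      # z_time: (B, L, D, d_model)
        B, L, D, d = z_time.shape
        z_i = z_time.reshape(B * L, D, d)           # one set of D vectors per (batch, time step)
        r = self.router.repeat(B, 1, 1)             # (B*L, c, d_model), routers shared over the batch
        buf, _ = self.attn1(r, z_i, z_i)            # B: routers aggregate messages from all dimensions
        z_bar, _ = self.attn2(z_i, buf, buf)        # routers distribute the messages back
        z_hat = self.norm1(z_i + z_bar)             # \hat{Z}^dim
        z_dim = self.norm2(z_hat + self.mlp(z_hat)) # Z^dim
        return z_dim.reshape(B, L, D, d)
```

Applying CrossTimeStage followed by CrossDimensionStage to $\mathbf{Z}$ gives the output of one TSA layer, as summarized next.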


$$\mathbf{Y} = \mathbf{Z}^{dim} = \mathrm{TSA}(\mathbf{Z})$$

  • $\mathbf{Z}$ : input vector array of the TSA layer
  • $\mathbf{Y} \in \mathbb{R}^{L \times D \times d_{model}}$ : output vector array of the TSA layer


Overall computation complexity of the TSA layer

  • $O(D L^2 + D L) = O(D L^2)$.


Summary

After the Cross-Time and Cross-Dimension Stages …

every two segments ( i.e. $\mathbf{Z}_{i_1,d_1}$, $\mathbf{Z}_{i_2,d_2}$ ) in $\mathbf{Z}$ are connected,

both cross-time and cross-dimension dependencies are captured in $\mathbf{Y}$.


(3) Hierarchical Encoder-Decoder

to capture information at different scales

use (1) DSW embedding & (2) TSA layer & (3) segment merging

to construct a Hierarchical Encoder-Decoder (HED).


(figure)

  • Upper layer utilizes information at a coarser scale for forecasting.
  • Final result = sum of the forecasting values at different scales


a) Encoder

Coarser Level : every two adjacent vectors in time domain are merged

then, TSA layer is applied to capture dependency at this scale.

$\mathbf{Z}^{enc,l} = \mathrm{Encoder}(\mathbf{Z}^{enc,l-1})$


Details :

$$\hat{\mathbf{Z}}^{enc,l} = \begin{cases} \mathbf{H}, & l = 1 \\ \left\{ \mathbf{M}\left[ \mathbf{Z}^{enc,l-1}_{2i-1,d} \,\Vert\, \mathbf{Z}^{enc,l-1}_{2i,d} \right] \,\middle|\, 1 \le i \le \frac{L_{l-1}}{2},\ 1 \le d \le D \right\}, & l > 1 \end{cases} \qquad \mathbf{Z}^{enc,l} = \mathrm{TSA}(\hat{\mathbf{Z}}^{enc,l})$$

  • $\mathbf{H}$ : the 2D array obtained by DSW embedding
  • $\mathbf{Z}^{enc,l}$ : the output of the $l$-th encoder layer
  • $\mathbf{M} \in \mathbb{R}^{d_{model} \times 2 d_{model}}$ : a learnable matrix for segment merging
  • $L_{l-1}$ : the number of segments in each dimension in layer $l-1$
  • $\hat{\mathbf{Z}}^{enc,l}$ : the array after segment merging in the $l$-th layer


Suppose there are $N$ layers in the encoder

use $\mathbf{Z}^{enc,0}, \mathbf{Z}^{enc,1}, \ldots, \mathbf{Z}^{enc,N}$ ( $\mathbf{Z}^{enc,0} = \mathbf{H}$ ) to represent the $N+1$ outputs of the encoder.

  • complexity of each encoder layer : $O\!\left(\frac{D T^2}{L_{seg}^2}\right)$.
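
A minimal PyTorch sketch of the segment-merging step used for $l > 1$ (the subsequent TSA call and the per-layer bookkeeping are omitted; names are ours):

```python
import torch
import torch.nn as nn

class SegmentMerging(nn.Module):
    """Merge every two adjacent segments along the time axis (encoder layers with l > 1)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.merge = nn.Linear(2 * d_model, d_model)        # learnable matrix M

    def forward(self, z):                                   # z: (B, L, D, d_model), L assumed even
        B, L, D, d = z.shape
        z = z.reshape(B, L // 2, 2, D, d)                   # pair up [Z_{2i-1,d}, Z_{2i,d}]
        z = z.permute(0, 1, 3, 2, 4).reshape(B, L // 2, D, 2 * d)  # concatenate each pair
        return self.merge(z)                                # (B, L/2, D, d_model)
```

An encoder layer with $l > 1$ would then apply a TSA layer to the merged array.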


b) Decoder

Output of the encoder : $N+1$ feature arrays

Layer $l$ of the decoder :

  • Input : the $l$-th encoded array
  • Output : the decoded 2D array of layer $l$

Summary : $\mathbf{Z}^{dec,l} = \mathrm{Decoder}(\mathbf{Z}^{dec,l-1}, \mathbf{Z}^{enc,l})$


Linear projection is applied to each layer’s output

  • to yield the prediction of each layer


Layer predictions

  • summed to make the final prediction ( for $l = 0, \ldots, N$ ):

$$\mathbf{x}^{(s),l}_{i,d} = \mathbf{W}^{l} \mathbf{Z}^{dec,l}_{i,d}, \qquad \mathbf{x}^{pred,l}_{T+1:T+\tau} = \left\{ \mathbf{x}^{(s),l}_{i,d} \,\middle|\, 1 \le i \le \frac{\tau}{L_{seg}},\ 1 \le d \le D \right\}, \qquad \mathbf{x}^{pred}_{T+1:T+\tau} = \sum_{l=0}^{N} \mathbf{x}^{pred,l}_{T+1:T+\tau}$$

  • $\mathbf{W}^{l} \in \mathbb{R}^{L_{seg} \times d_{model}}$ : a learnable matrix to project a vector to a time series segment
  • $\mathbf{x}^{(s),l}_{i,d} \in \mathbb{R}^{L_{seg}}$ : the $i$-th segment in dimension $d$ of the layer-$l$ prediction
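
A minimal sketch of the per-layer prediction heads and their summation (the decoder layers themselves are omitted; dec_outputs is a stand-in for the $N+1$ decoded arrays $\mathbf{Z}^{dec,l}$, and all shapes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Assumed shapes: seg_len = L_seg, tau = prediction length, D dimensions, batch B
seg_len, d_model, tau, D, B = 24, 256, 96, 7, 32
n_layers = 3                                   # N + 1 outputs, l = 0, ..., N
dec_outputs = [torch.randn(B, tau // seg_len, D, d_model) for _ in range(n_layers)]

heads = nn.ModuleList(nn.Linear(d_model, seg_len) for _ in range(n_layers))  # W^l

pred = 0
for W, z_dec in zip(heads, dec_outputs):
    seg_pred = W(z_dec)                                    # (B, tau/L_seg, D, L_seg)
    # stitch the segments of each dimension back into a series of length tau
    layer_pred = seg_pred.permute(0, 1, 3, 2).reshape(B, tau, D)
    pred = pred + layer_pred                               # sum over scales: x^pred_{T+1:T+tau}
print(pred.shape)                                          # torch.Size([32, 96, 7])
```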


4. Experiments

(1) Main Results

(figure)


(2) Ablation Study

(figure)
