CrossFormer: Transformer Utilizing Cross-dimension Dependency for MTS Forecasting
Contents
- Abstract
- Introduction
- Related Works
- MTS forecasting
- Transformers for MTS forecasting
- Vision Transformers
- Methodology
- Dimension-Segment-Wise Embedding
- Two-Stage Attention Layer
- Hierarchical Encoder-Decoder
- Experiments
- Main Results
- Ablation Study
0. Abstract
MTS forecasting
- Transformer-based models : can capture long-term dependency
→ however, they mainly focus on modeling the temporal (cross-time) dependency
( and omit the dependency among different variables = cross-dimension dependency )
Propose Crossformer
- Transformer-based model utilizing cross-dimension dependency for MTS forecasting.
Details:
- (step 1) Input MTS is embedded into a 2D vector array
  - via Dimension-Segment-Wise (DSW) embedding
  - ( to preserve time and dimension information )
- (step 2) Two-Stage Attention (TSA) layer
  - efficiently captures the cross-time & cross-dimension dependency
- Using both DSW & TSA, Crossformer establishes a Hierarchical Encoder-Decoder (HED) to use the information at different scales for the final forecasting
1. Introduction
Transformer for MTS forecasting:
- ability to capture long-term temporal dependency (cross-time dependency).
Cross-dimension dependency is also critical for MTS forecasting !
- ex) previous neural models explicitly capture the cross-dimension dependency, using CNN/GNN

However, recent Transformer-based models …
→ only implicitly utilize this dependency by embedding
- generally embed data points in all dimensions at the same time step into a feature vector
- & try to capture dependency among different time steps (Fig. 1 (b))
→ cross-time dependency is well captured, but cross-dimension dependency is not

Crossformer
a Transformer-based model that explicitly utilizes cross-dimension dependency for MTS forecasting.

(1) Dimension-Segment-Wise (DSW) embedding
- to process the historical time series
- step 1) the series in each dimension is partitioned into segments
- step 2) each segment is embedded into a feature vector
  ( output = 2D vector array … two axes = time & dimension )

(2) Two-Stage Attention (TSA) layer
- efficiently captures the cross-time & cross-dimension dependency

→ using (1) & (2) … establishes a Hierarchical Encoder-Decoder (HED)
- each layer corresponds to a scale
- [Encoder] upper layer merges adjacent segments output by the lower layer
  ( to capture dependency at a coarser scale )
- [Decoder] generates predictions at different scales & adds them up as the final prediction
Contribution

- existing Transformer-based models : cross-dimension dependency is not well utilized
  → without adequate and explicit mining and utilization of cross-dimension dependency, their forecasting capability is empirically shown to be limited
- develop Crossformer
- extensive experimental results
2. Related Works
(1) MTS forecasting
Divided into (1) statistical & (2) neural models
a) Statistical
- Vector auto-regressive (VAR) model
- Vector auto-regressive moving average (VARMA)
→ assume linear cross-dimension & cross-time dependency.
b) Neural models
- TCN (Lea et al., 2017) & DeepAR (Flunkert et al., 2017)
- treat the MTS data as a sequence of vectors and use CNN/RNN to capture the temporal dependency
- LSTnet (Lai et al., 2018)
- CNN : for cross-dimension dependency
- RNN : for cross-time dependency
- GNNs
- to capture the cross-dimension dependency explicitly for forecasting
- ex) MTGNN (Wu et al., 2020) :
- temporal convolution : cross-time
- graph convolution : cross-dimension
→ these models capture the cross-time dependency through CNN or RNN … which have difficulty in modeling long-term dependency
(2) Transformers for MTS forecasting
LogTrans (Li et al., 2019b)
- proposes the LogSparse attention ( reduces the computation complexity )
- from $O(L^2)$ to $O(L(\log L)^2)$
Informer (Zhou et al., 2021)
- utilizes the sparsity of attention score through KL divergence estimation
- proposes ProbSparse self-attention
- achieves $O(L \log L)$ complexity.
Autoformer (Wu et al., 2021a)
- introduces a decomposition architecture with an Auto-Correlation mechanism
- achieves $O(L \log L)$ complexity.
Pyraformer (Liu et al., 2021a)
- pyramidal attention module
- summarizes features at different resolutions & models the temporal dependencies of different ranges
- complexity of O(L).
FEDformer (Zhou et al., 2022)
- assumes that MTS has a sparse representation in the frequency domain
- develops a frequency enhanced Transformer
- O(L) complexity.
Preformer (Du et al., 2022)
- divides the embedded feature vector sequence into segments
- utilizes segment-wise correlation-based attention
→ These models mainly focus on reducing the complexity of cross-time dependency modeling,
but omit the cross-dimension dependency
(3) Vision Transformers
ViT (Dosovitskiy et al., 2021)
- one of the pioneers of vision transformers
- Basic idea of ViT:
  - split an image into non-overlapping medium-sized patches
  - rearrange these patches into a sequence ( to be input to the Transformer )
- Idea of partitioning images into patches
  → inspires our DSW embedding, where MTS is split into dimension-wise segments
Swin Transformer (Liu et al., 2021b)
- performs local attention within a window ( to reduce the complexity )
- builds hierarchical feature maps by merging image patches
3. Methodology
Notation
- future value : $\mathbf{x}_{T+1:T+\tau} \in \mathbb{R}^{\tau \times D}$
- input value : $\mathbf{x}_{1:T} \in \mathbb{R}^{T \times D}$
- number of dimensions : $D > 1$
Section 3.1
- embed the MTS using Dimension-Segment-Wise (DSW) embedding
- To utilize the cross-dimension dependency
Section 3.2
- propose a Two-Stage Attention (TSA) layer
- to efficiently capture the dependency among the embedded segments
Section 3.3
- construct a hierarchical encoder-decoder (HED), using DSW embedding and TSA layer
- to utilize information at different scales
(1) Dimension-Segment-Wise Embedding
Embedding of the previous Transformer-based models for MTS forecasting ( Fig. 1 (b) )
- step 1) Embed data points at the same time step into a vector: $\mathbf{x}_t \rightarrow \mathbf{h}_t$, where $\mathbf{x}_t \in \mathbb{R}^D$, $\mathbf{h}_t \in \mathbb{R}^{d_{model}}$
  - $\mathbf{x}_t$ : all the data points in $D$ dimensions at step $t$
  - input $\mathbf{x}_{1:T}$ is embedded into $T$ vectors $\{\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_T\}$
- step 2) Dependency among the $T$ vectors is captured for forecasting.

→ previous methods mainly capture cross-time dependency
( cross-dimension dependency is not explicitly captured during embedding )
Fig. 1 (a)
- typical attention score map of original Transformer
- attention values have a tendency to segment, i.e. close data points have similar attention weights.
→ Argue that an embedded vector should represent a series segment of a single dimension (Fig. 1 (c)), rather than the values of all dimensions at a single step (Fig. 1 (b)).
propose Dimension-Segment-Wise (DSW) embedding

[ Step 1 ] Points in each dimension are divided into segments of length $L_{seg}$:

$$
\mathbf{x}_{1:T} = \left\{ \mathbf{x}^{(s)}_{i,d} \,\middle|\, 1 \leq i \leq \frac{T}{L_{seg}},\ 1 \leq d \leq D \right\}, \quad
\mathbf{x}^{(s)}_{i,d} = \left\{ x_{t,d} \,\middle|\, (i-1) \times L_{seg} < t \leq i \times L_{seg} \right\}
$$

- where $\mathbf{x}^{(s)}_{i,d} \in \mathbb{R}^{L_{seg}}$ is the $i$-th segment in dimension $d$ with length $L_{seg}$

[ Step 2 ] Each segment is embedded into a vector
- using linear projection added with a position embedding:

$$
\mathbf{h}_{i,d} = \mathbf{E} \mathbf{x}^{(s)}_{i,d} + \mathbf{E}^{(pos)}_{i,d}
$$

- $\mathbf{E} \in \mathbb{R}^{d_{model} \times L_{seg}}$ : the learnable projection matrix
- $\mathbf{E}^{(pos)}_{i,d} \in \mathbb{R}^{d_{model}}$ : the learnable position embedding for position $(i,d)$

[ Output ] obtain a 2D vector array $\mathbf{H} = \left\{ \mathbf{h}_{i,d} \,\middle|\, 1 \leq i \leq \frac{T}{L_{seg}},\ 1 \leq d \leq D \right\}$
- each $\mathbf{h}_{i,d}$ represents a univariate time series segment
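A minimal PyTorch sketch of DSW embedding (a sketch under assumptions: $T$ divisible by $L_{seg}$, a bias in the projection, and a `(batch, L, D, d_model)` tensor layout; class and variable names are illustrative, not the authors' implementation):

```python
import torch
import torch.nn as nn

class DSWEmbedding(nn.Module):
    """Dimension-Segment-Wise embedding: split each univariate series into
    segments of length seg_len and project every segment to a d_model vector."""
    def __init__(self, seg_len: int, d_model: int, num_dims: int, num_segs: int):
        super().__init__()
        self.seg_len = seg_len
        # shared linear projection playing the role of E (d_model x L_seg)
        self.proj = nn.Linear(seg_len, d_model)
        # learnable position embedding E^(pos)_{i,d} for every (segment, dimension) pair
        self.pos = nn.Parameter(torch.zeros(num_segs, num_dims, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, D), with T divisible by seg_len
        B, T, D = x.shape
        L = T // self.seg_len
        # (batch, T, D) -> (batch, L, seg_len, D) -> (batch, L, D, seg_len)
        segs = x.reshape(B, L, self.seg_len, D).permute(0, 1, 3, 2)
        # h_{i,d} = E x^(s)_{i,d} + E^(pos)_{i,d}
        return self.proj(segs) + self.pos        # H: (batch, L, D, d_model)

# usage: T=24, D=7, seg_len=6 -> a 4 x 7 array of 64-dim segment embeddings
emb = DSWEmbedding(seg_len=6, d_model=64, num_dims=7, num_segs=4)
H = emb(torch.randn(32, 24, 7))
print(H.shape)   # torch.Size([32, 4, 7, 64])
```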
(2) Two-Stage Attention Layer
Flatten the 2D array $\mathbf{H}$ into a 1D sequence
→ it could be input to a vanilla Transformer architecture

Specific considerations:
- (1) Different from images, where the axes of height and width are interchangeable,
  the axes of time and dimension for MTS have different meanings and thus should be treated differently
- (2) Directly applying self-attention on the 2D array ( $\frac{DT}{L_{seg}}$ tokens, quadratic attention ) causes a complexity of $O\!\left(\frac{D^2 T^2}{L_{seg}^2}\right)$,
  which is unaffordable for large $D$
→ propose the Two-Stage Attention (TSA) Layer
- to capture cross-time and cross-dimension dependency
a) Cross-Time stage

Notation
- Input : 2D array $\mathbf{Z} \in \mathbb{R}^{L \times D \times d_{model}}$
  ( = output of DSW embedding or lower TSA layers )
  - $L$ : number of segments
  - $D$ : number of dimensions
- $\mathbf{Z}_{i,:}$ : vectors of all dimensions at time step $i$
- $\mathbf{Z}_{:,d}$ : vectors of all time steps in dimension $d$

Multi-head self-attention (MSA) is applied to each dimension:

$$
\begin{aligned}
\hat{\mathbf{Z}}^{time}_{:,d} &= \text{LayerNorm}\left( \mathbf{Z}_{:,d} + \text{MSA}^{time}\left(\mathbf{Z}_{:,d}, \mathbf{Z}_{:,d}, \mathbf{Z}_{:,d}\right) \right) \\
\mathbf{Z}^{time} &= \text{LayerNorm}\left( \hat{\mathbf{Z}}^{time} + \text{MLP}\left(\hat{\mathbf{Z}}^{time}\right) \right)
\end{aligned}
$$

- where $1 \leq d \leq D$
- all dimensions ($1 \leq d \leq D$) share the same MSA layer

Computation complexity of the cross-time stage : $O(DL^2)$.

Dependency among time segments in the same dimension is captured in $\mathbf{Z}^{time}$.
→ $\mathbf{Z}^{time}$ becomes the input of the Cross-Dimension Stage
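A minimal sketch of the cross-time stage (assumptions: a standard `nn.MultiheadAttention`, 4 heads, and a 4x feed-forward width). The dimension axis is folded into the batch so that a single shared MSA attends over the $L$ segments of each dimension:

```python
import torch
import torch.nn as nn

class CrossTimeStage(nn.Module):
    """One shared MSA layer applied along the time axis, independently per dimension."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.msa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, L, D, d_model)
        B, L, D, d = z.shape
        # fold D into the batch -> attention runs over the L segments of each dimension
        z_d = z.permute(0, 2, 1, 3).reshape(B * D, L, d)
        attn, _ = self.msa(z_d, z_d, z_d)                      # MSA^time(Z_{:,d}, Z_{:,d}, Z_{:,d})
        z_hat = self.norm1(z_d + attn)                         # skip connection + LayerNorm
        z_time = self.norm2(z_hat + self.mlp(z_hat))           # MLP block
        return z_time.reshape(B, D, L, d).permute(0, 2, 1, 3)  # back to (batch, L, D, d_model)
```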
b) Cross-Dimension stage

Cross-Time stage
- we can use a large $L_{seg}$ for long sequences in DSW embedding
  ( to reduce the number of segments $L$ in the cross-time stage )

Cross-Dimension stage
- we cannot partition dimensions and directly apply MSA
- instead, propose the router mechanism for potentially large $D$
  - set a small fixed number ($c \ll D$) of learnable vectors for each time step $i$ as routers
  - complexity : $O(D^2 L) \rightarrow O(DL)$
Routers
- step 1) aggregate messages from all dimensions
- step 2) distribute the received messages among dimensions
→ an all-to-all connection among the $D$ dimensions is built:

$$
\begin{aligned}
\mathbf{B}_{i,:} &= \text{MSA}_1^{dim}\left(\mathbf{R}_{i,:}, \mathbf{Z}^{time}_{i,:}, \mathbf{Z}^{time}_{i,:}\right), \quad 1 \leq i \leq L \\
\bar{\mathbf{Z}}^{dim}_{i,:} &= \text{MSA}_2^{dim}\left(\mathbf{Z}^{time}_{i,:}, \mathbf{B}_{i,:}, \mathbf{B}_{i,:}\right), \quad 1 \leq i \leq L \\
\hat{\mathbf{Z}}^{dim} &= \text{LayerNorm}\left(\mathbf{Z}^{time} + \bar{\mathbf{Z}}^{dim}\right) \\
\mathbf{Z}^{dim} &= \text{LayerNorm}\left(\hat{\mathbf{Z}}^{dim} + \text{MLP}\left(\hat{\mathbf{Z}}^{dim}\right)\right)
\end{aligned}
$$

- $\mathbf{R} \in \mathbb{R}^{L \times c \times d_{model}}$ : learnable vector array ( = routers )
- $\mathbf{B} \in \mathbb{R}^{L \times c \times d_{model}}$ : aggregated messages from all dimensions
- $\bar{\mathbf{Z}}^{dim}$ : output of the router mechanism
- $\hat{\mathbf{Z}}^{dim}$ : output of the skip connection
- $\mathbf{Z}^{dim}$ : output of the MLP
All time steps ($1 \leq i \leq L$) share the same $\text{MSA}_1^{dim}$, $\text{MSA}_2^{dim}$.

$$
\mathbf{Y} = \mathbf{Z}^{dim} = \text{TSA}(\mathbf{Z})
$$

- $\mathbf{Z}$ : input vector array of the TSA layer
- $\mathbf{Y} \in \mathbb{R}^{L \times D \times d_{model}}$ : output vector array of the TSA layer

Overall computation complexity of the TSA layer
- $O(DL^2 + DL) = O(DL^2)$
  ( the router mechanism costs $O(2cDL) = O(DL)$, since $c$ is a small constant )
Summary

After the Cross-Time and Cross-Dimension Stages …
- every two segments (i.e. $\mathbf{Z}_{i_1,d_1}$, $\mathbf{Z}_{i_2,d_2}$) in $\mathbf{Z}$ are connected
→ both cross-time and cross-dimension dependencies are captured in $\mathbf{Y}$
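A minimal sketch of the cross-dimension stage with the router mechanism, continuing the layout of the cross-time sketch above (the router count `c`, head count, and MLP width are illustrative assumptions):

```python
import torch
import torch.nn as nn

class CrossDimStage(nn.Module):
    """Router mechanism: c learnable routers per time step first aggregate messages
    from all D dimensions, then redistribute them (2 * c * D attention pairs per step)."""
    def __init__(self, d_model: int, num_segs: int, c: int = 10, n_heads: int = 4):
        super().__init__()
        self.routers = nn.Parameter(torch.randn(num_segs, c, d_model))   # R: (L, c, d_model)
        self.msa1 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.msa2 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, z_time: torch.Tensor) -> torch.Tensor:
        # z_time: (batch, L, D, d_model)
        B, L, D, d = z_time.shape
        # fold L into the batch -> attention runs across the D dimensions of each time step
        z = z_time.reshape(B * L, D, d)
        r = self.routers.repeat(B, 1, 1)          # (B*L, c, d_model), same routers reused per batch
        buf, _ = self.msa1(r, z, z)               # B_{i,:}: routers gather messages from all dimensions
        out, _ = self.msa2(z, buf, buf)           # Z-bar^dim_{i,:}: dimensions read messages back
        z_hat = self.norm1(z + out)               # skip connection + LayerNorm
        z_dim = self.norm2(z_hat + self.mlp(z_hat))
        return z_dim.reshape(B, L, D, d)          # Y: (batch, L, D, d_model)
```

Chaining the two sketches gives one TSA layer: `Y = CrossDimStage(...)(CrossTimeStage(...)(Z))`.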
(3) Hierarchical Encoder-Decoder
to capture information at different scales
use (1) DSW embedding & (2) TSA layer & (3) segment merging
→ to construct a Hierarchical Encoder-Decoder (HED).
- Upper layer utilizes information at a coarser scale for forecasting.
- Final results = add forecasting values at different scales
a) Encoder

Coarser level : every two adjacent vectors in the time domain are merged
→ then, a TSA layer is applied to capture dependency at this scale:

$$
\mathbf{Z}^{enc,l} = \text{Encoder}\left(\mathbf{Z}^{enc,l-1}\right)
$$

Details :

$$
\hat{\mathbf{Z}}^{enc,l} =
\begin{cases}
\mathbf{H}, & l = 1 \\
\left\{ \mathbf{M}\left[ \mathbf{Z}^{enc,l-1}_{2i-1,d} \cdot \mathbf{Z}^{enc,l-1}_{2i,d} \right] \,\middle|\, 1 \leq i \leq \frac{L_{l-1}}{2},\ 1 \leq d \leq D \right\}, & l > 1
\end{cases}
\qquad
\mathbf{Z}^{enc,l} = \text{TSA}\left(\hat{\mathbf{Z}}^{enc,l}\right)
$$

- $\mathbf{H}$ : 2D array obtained by DSW embedding
- $\mathbf{Z}^{enc,l}$ : the output of the $l$-th encoder layer
- $\mathbf{M} \in \mathbb{R}^{d_{model} \times 2 d_{model}}$ : a learnable matrix for segment merging
- $L_{l-1}$ : the number of segments in each dimension in layer $l-1$
- $\hat{\mathbf{Z}}^{enc,l}$ : the array after segment merging in the $l$-th layer

Suppose there are $N$ layers in the encoder
→ use $\mathbf{Z}^{enc,0}, \mathbf{Z}^{enc,1}, \ldots, \mathbf{Z}^{enc,N}$ ( $\mathbf{Z}^{enc,0} = \mathbf{H}$ ) to represent the $N+1$ outputs of the encoder.
- complexity of each encoder layer : $O\!\left(\frac{D T^2}{L_{seg}^2}\right)$
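A minimal sketch of the segment-merging step, reading $[\,\cdot\,]$ above as concatenation of two adjacent segment vectors; the bias-free linear layer stands in for the learnable matrix $\mathbf{M}$ (an assumption, not the authors' exact code), and a TSA layer as sketched earlier would then be applied to the merged array:

```python
import torch
import torch.nn as nn

class SegMerging(nn.Module):
    """Merge every two adjacent time segments: (batch, L, D, d_model) -> (batch, L/2, D, d_model)."""
    def __init__(self, d_model: int):
        super().__init__()
        # learnable M in R^{d_model x 2*d_model}, applied to concatenated segment pairs
        self.merge = nn.Linear(2 * d_model, d_model, bias=False)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, L, D, d_model), L assumed even
        B, L, D, d = z.shape
        # group adjacent segments (2i-1, 2i) and concatenate them along the feature axis
        pairs = z.reshape(B, L // 2, 2, D, d).permute(0, 1, 3, 2, 4).reshape(B, L // 2, D, 2 * d)
        return self.merge(pairs)      # Z-hat^{enc,l}: (batch, L/2, D, d_model)
```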
b) Decoder

Output of encoder : $N+1$ feature arrays

Layer $l$ of decoder :
- Input : the $l$-th encoded array ( and the output of decoder layer $l-1$ )
- Output : decoded 2D array of layer $l$

Summary : $\mathbf{Z}^{dec,l} = \text{Decoder}\left(\mathbf{Z}^{dec,l-1}, \mathbf{Z}^{enc,l}\right)$

Linear projection is applied to each layer's output
- to yield the prediction of each layer

Layer predictions
- summed to make the final prediction ( for $l = 0, \ldots, N$ ):

$$
\begin{aligned}
\mathbf{x}^{(s),l}_{i,d} &= \mathbf{W}^l \mathbf{Z}^{dec,l}_{i,d} \\
\mathbf{x}^{pred,l}_{T+1:T+\tau} &= \left\{ \mathbf{x}^{(s),l}_{i,d} \,\middle|\, 1 \leq i \leq \frac{\tau}{L_{seg}},\ 1 \leq d \leq D \right\} \\
\mathbf{x}^{pred}_{T+1:T+\tau} &= \sum_{l=0}^{N} \mathbf{x}^{pred,l}_{T+1:T+\tau}
\end{aligned}
$$

- $\mathbf{W}^l \in \mathbb{R}^{L_{seg} \times d_{model}}$ : a learnable matrix to project a vector to a time series segment
- $\mathbf{x}^{(s),l}_{i,d} \in \mathbb{R}^{L_{seg}}$ : the $i$-th segment in dimension $d$ of the layer-$l$ prediction
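A minimal sketch of the per-layer prediction head and the final summation (the decoder layers themselves are omitted; class and variable names are illustrative assumptions, shapes follow the formulas above):

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Project each decoded segment vector back to a length-L_seg series segment
    and flatten the segments along time; one head (W^l) per decoder layer."""
    def __init__(self, d_model: int, seg_len: int):
        super().__init__()
        self.seg_len = seg_len
        self.proj = nn.Linear(d_model, seg_len)   # W^l in R^{L_seg x d_model}

    def forward(self, z_dec: torch.Tensor) -> torch.Tensor:
        # z_dec: (batch, tau / L_seg, D, d_model)
        B, L_out, D, _ = z_dec.shape
        segs = self.proj(z_dec)                   # (batch, L_out, D, L_seg)
        # reorder so that segments concatenate along the time axis -> (batch, tau, D)
        return segs.permute(0, 1, 3, 2).reshape(B, L_out * self.seg_len, D)

# final forecast = sum of the layer-wise predictions over all N+1 decoder outputs, e.g.:
# heads = nn.ModuleList(PredictionHead(d_model, seg_len) for _ in range(N + 1))
# x_pred = sum(head(z_dec) for head, z_dec in zip(heads, decoder_outputs))
```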