VCformer: Variable Correlation Transformer with Inherent Lagged Correlation for Multivariate Time Series Forecasting
Contents
- Abstract
- Introduction
- Related Works
- Method
- Background
- Structure Overview
- Variable Correlation Attention
0. Abstract
Vanilla point-wise self-attention? NO! It fails to fully exploit the relationships among variables.
Variable Correlation Transformer (VCformer)
- Utilizes Variable Correlation Attention (VCA) module
- to mine the correlations among variables
- VCA calculates and integrates the cross-correlation scores corresponding to different lags between queries and keys
- Koopman Temporal Detector (KTD)
- to better address the non-stationarity in TS
\(\rightarrow\) Extract both multivariate correlations and temporal dependencies
Code: https://github.com/CSyyn/VCformer
1. Introduction
Addressing the limitations of vanilla variable point-wise attention
Variable Correlation Transformer (VCformer)
- Exploits the lagged correlation inherent in MTS
  - through the Variable Correlation Attention (VCA) module
- VCA module
  - calculates the global strength of correlations between each query and key across different features
  - not only computes autocorrelations akin to those in Autoformer,
  - but also extends this concept to determine lagged cross-correlations among various variates
  - ROLL operation + Hadamard products
    - to approximate these lagged correlations effectively
  - adaptively aggregates lagged correlations over various lag lengths
- Koopman Temporal Detector (KTD) module
- inspired by Koopman theory in dynamics
Contributions
- VCformer
  - learns both variable correlations and temporal dependencies of MTS
  - two key components:
    - VCA to fully exploit lagged correlations among different variates
    - KTD to effectively address non-stationarity
  - achieves SOTA performance
2. Related Works
CI (Channel Independence) vs. CD (Channel Dependence)
Pass
iTransformer [Liu et al., 2023a]
- revolutionizes the vanilla Transformer
- By inverting the duties of the
- (1) traditional attention mechanism
- (2) feed-forward network
- Roles
- (1) Capturing multivariate correlations
- (2) Learning nonlinear representations
- Limitation: still adopts the classical point-wise self-attention mechanism, which does not fully exploit the relationships among variable sequences.
3. Method
Input: \(\mathbf{X}=\left\{\mathbf{x}_1, \ldots, \mathbf{x}_T\right\} \in \mathbb{R}^{T \times N}\)
Target: \(\mathbf{Y}=\left\{\mathbf{x}_{T+1}, \ldots, \mathbf{x}_{T+H}\right\} \in \mathbb{R}^{H \times N}\)
( \(T\): lookback length, \(N\): number of variates, \(H\): prediction horizon )
(1) Background
- Limitation of vanilla variable attention
- in modelling feature-wise dependencies.
- Variable cross-correlation attention mechanism
- operates across the feature channels
- Koopman theory
- Treat TS as dynamics
- KTD module
- Combine it with the variable cross-correlation attention
- To learn both channels and time-steps dependencies
a) Limitation of Vanilla Variable Attention
Self-attention module
- employs the linear projections to get \(\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{T \times D}\),
- \(Q=\left[\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_T\right]^{\top}\) ,
- \(K=\left[\mathbf{k}_1, \mathbf{k}_2, \ldots, \mathbf{k}_T\right]^{\top}\),
- Pre-Softmax attention score
- \(\mathbf{A}_{i, j}=\) \(\left(\mathbf{Q K}^{\top} / \sqrt{D}\right)_{i, j} \propto \mathbf{q}_i^{\top} \mathbf{k}_j\).
Nevertheless, feature-wise information,
( where each of the \(D\) features corresponds to an entry of \(\mathbf{q}_i \in \mathbb{R}^{1 \times D}\) or \(\mathbf{k}_j \in \mathbb{R}^{1 \times D}\) )
\(\rightarrow\) Absorbed into such inner-product representation :(
iTransformer [Liu et al., 2023a]
- inverted Transformer, to capture cross-variable dependencies
  - instead computes \(K^{\top} Q \in \mathbb{R}^{D \times D}\)
- suitable for capturing instantaneous cross-correlation,
  - but insufficient for MTS data, which is coupled with intrinsic temporal dependencies
\(\rightarrow\) Variates of MTS data can be correlated with each other, yet with a lag interval!!
( = lagged cross-correlation in MTS analysis [John and Ferbinteanu, 2021; Chandereng and Gitter, 2020; Shen, 2015]. )
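To make the contrast concrete, a minimal shape-level sketch (PyTorch; illustration only, not the paper's implementation) of the two score maps:

```python
import torch

T, D = 96, 64                      # lookback length, hidden dimension
Q = torch.randn(T, D)              # queries: one row per time step
K = torch.randn(T, D)              # keys

# Vanilla temporal attention: T x T map; each score q_i^T k_j collapses
# all D feature entries into a single inner product.
A_time = Q @ K.T / D ** 0.5        # shape (T, T)

# Inverted attention (iTransformer-style): D x D feature-wise map,
# but each entry still reflects only the instantaneous (lag-0) alignment.
A_feat = K.T @ Q                   # shape (D, D)

print(A_time.shape, A_feat.shape)  # torch.Size([96, 96]) torch.Size([64, 64])
```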
b) Non-linear Dynamics Tackled by Koopman Theory
Koopman theory [Koopman, 1931; Brunton et al., 2022]
- a non-linear dynamical system can be represented by an infinite-dimensional linear Koopman operator \(\mathcal{K}\)
- which operates on a space of measurement functions \(g\), such that ..
\(\mathcal{K} \circ g\left(x_t\right)=g\left(\mathbf{F}\left(x_t\right)\right)=g\left(x_{t+1}\right)\).
Dynamic Mode Decomposition(DMD) [Schmid and Sesterhenn, 2008]
- seeks the best-fit matrix \(K\) approximating the infinite-dimensional operator \(\mathcal{K}\) by collecting the observed system states
- Limitation: highly nontrivial to find appropriate measurement functions \(g\) as well as the Koopman operator \(\mathcal{K}\).
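A minimal sketch of the DMD idea under standard assumptions (plain least-squares DMD with hypothetical variable names, not the paper's KTD):

```python
import torch

# Observed system states x_1, ..., x_T, each a vector in R^N.
T, N = 100, 8
states = torch.randn(T, N)

# Snapshot matrices: columns are consecutive state pairs.
X = states[:-1].T                  # (N, T-1): x_1 ... x_{T-1}
Y = states[1:].T                   # (N, T-1): x_2 ... x_T

# Best-fit linear operator with Y ≈ K X, via the Moore-Penrose pseudo-inverse.
K = Y @ torch.linalg.pinv(X)       # (N, N) finite approximation of the Koopman operator

# Roll the linear dynamics forward one step from the last observed state.
x_next = K @ states[-1]
```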
Koopman theory serves as a connection between ..
- finite-dimensional nonlinear dynamics
- infinite-dimensional linear dynamics
Proposal: KTD module (to tackle nonlinear dynamics)
- Consider TS data \(\mathbf{X}=\left\{\mathbf{x}_1, \ldots, \mathbf{x}_T\right\}\) as observations of a series of dynamic system states,
- where \(\mathbf{x}_i \in \mathbb{R}^N\) is the system state.
(2) Structure Overview
[1] Following the same Encoder-only structure as iTransformer
\(\rightarrow\) Adopt the Inverted Embedding : \(\mathbb{R}^T \mapsto \mathbb{R}^D\),
- which regards each univariate series (UTS) as an embedded token
[2] Stacking \(L\) layers with VCA and KTD modules
- [VCA] cross-variable relationships
- [KTD] temporal dependencies
[3] Final prediction (by the Projection) \(: \mathbb{R}^D \mapsto \mathbb{R}^H\).
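A shape-level sketch of this pipeline (the `vca`/`ktd` entries are placeholders for the modules described in the following sections, not their actual implementations):

```python
import torch
import torch.nn as nn

class VCformerSketch(nn.Module):
    """Skeleton only: inverted embedding -> L x (VCA + KTD) -> projection."""
    def __init__(self, T, H, D, L):
        super().__init__()
        self.embed = nn.Linear(T, D)        # inverted embedding: R^T -> R^D per variate
        self.layers = nn.ModuleList(
            nn.ModuleDict({
                "vca": nn.Identity(),       # placeholder: cross-variable attention
                "ktd": nn.Identity(),       # placeholder: temporal (Koopman) module
            }) for _ in range(L)
        )
        self.proj = nn.Linear(D, H)         # projection: R^D -> R^H per variate

    def forward(self, x):                   # x: (B, T, N)
        z = self.embed(x.transpose(1, 2))   # (B, N, D): one token per variate
        for layer in self.layers:
            z = layer["ktd"](layer["vca"](z))
        return self.proj(z).transpose(1, 2) # (B, H, N)

y = VCformerSketch(T=96, H=24, D=64, L=2)(torch.randn(32, 96, 8))
print(y.shape)                              # torch.Size([32, 24, 8])
```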
(3) Variable Correlation Attention
a) Lagged Cross-correlation Computing
Stochastic process theory [Chatfield and Xing, 2019]
- Real discrete-time process \(\left\{\mathcal{X}_t\right\}\),
- Autocorrelation \(R_{\mathcal{X}, \mathcal{X}}(\tau)\)
- \(R_{\mathcal{X}, \mathcal{X}}(\tau)=\lim _{L \rightarrow \infty} \frac{1}{L} \sum_{t=1}^L \mathcal{X}_t \mathcal{X}_{t-\tau}\).
Approximation for the autocorrelation of variate \(i\):
- \(R_{\mathbf{q}_i, \mathbf{k}_i}(\tau)=\sum_{t=1}^T\left(\mathbf{q}_i\right)_t \cdot\left(\mathbf{k}_i\right)_{t-\tau}=\mathbf{q}_i \odot \operatorname{ROLL}\left(\mathbf{k}_i, \tau\right)\).
- queries \(Q=\left[\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_N\right]\)
- keys \(K=\) \(\left[\mathbf{k}_1, \mathbf{k}_2, \ldots, \mathbf{k}_N\right]\)
- where \(\mathbf{q}_i, \mathbf{k}_j \in \mathbb{R}^{T \times 1}\),
- \(\operatorname{ROLL}\left(\mathbf{k}_i, \tau\right)\): shifts the elements of \(\mathbf{k}_i\) by \(\tau\) along the time dimension (elements shifted past the end wrap around to the front)
This idea was also harnessed in Autoformer [Wu et al., 2021].
Similarly, we can compute the lagged cross-correlation between variates \(i\) and \(j\) by
- \(\text { LAGGED-COR }\left(\mathbf{q}_i, \mathbf{k}_j\right)=\mathbf{q}_i \odot \operatorname{ROLL}\left(\mathbf{k}_j, \tau\right)\).
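A minimal sketch for one pair of variates (assuming ROLL is a circular shift, as in `torch.roll`; summing the Hadamard product over time gives the scalar score for one lag):

```python
import torch

T = 96
q_i = torch.randn(T)        # query series of variate i
k_j = torch.randn(T)        # key series of variate j

def lagged_cor(q, k, tau):
    # ROLL: circular shift of k by tau along time, then a
    # Hadamard product with q, summed over the time axis.
    return (q * torch.roll(k, shifts=tau, dims=0)).sum()

scores = torch.stack([lagged_cor(q_i, k_j, tau) for tau in range(1, T + 1)])
print(scores.shape)         # torch.Size([96]): one score per lag tau = 1..T
```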
b) Scores Aggregation
Total correlation of variates \(i\) and \(j\)
= aggregate over lags \(\tau\) from 1 to \(T\)
( with learnable parameters \(\lambda=\left[\lambda_1, \lambda_2, \ldots, \lambda_T\right]\) )
- \(\operatorname{COR}\left(\mathbf{q}_i, \mathbf{k}_j\right)=\sum_{\tau=1}^T \lambda_\tau R_{\mathbf{q}_i, \mathbf{k}_j}(\tau)\).
VCA performs softmax on the learned multivariate correlation map \(\mathbf{A} \in \mathbb{R}^{N \times N}\) at each row and obtains the output via …
- \(\operatorname{VCA}(\mathbf{Q}, \mathbf{K}, \mathbf{V})=\operatorname{SOFTMAX}(\operatorname{COR}(\mathbf{Q}, \mathbf{K})) \mathbf{V}\).
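Putting the pieces together, a sketch of the full VCA scoring under the same assumptions (variable names are mine; the per-lag loop is for clarity only, and the paper handles efficiency separately in its Efficient Computation section):

```python
import torch

N, T = 8, 96
Q = torch.randn(N, T)       # one query series per variate
K = torch.randn(N, T)
V = torch.randn(N, T)
lam = torch.randn(T, requires_grad=True)    # learnable lag weights λ_1..λ_T

# R[tau-1, i, j] = sum_t (q_i)_t * ROLL(k_j, tau)_t  for every variate pair (i, j)
R = torch.stack([Q @ torch.roll(K, shifts=tau, dims=1).T
                 for tau in range(1, T + 1)])            # (T, N, N)

A = torch.einsum("t,tij->ij", lam, R)                    # aggregate lags: (N, N) map
out = torch.softmax(A, dim=-1) @ V                       # row-wise softmax, then mix V
print(out.shape)                                         # torch.Size([8, 96])
```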
(4) Koopman Temporal Detector (KTD)
Pass
(5) Efficient Computation
Pass