MCformer: Multivariate Time Series Forecasting with Mixed-Channels Transformer
Contents
- Abstract
- Introduction
- Method
0. Abstract
Channel Dependence (CD)
- Treats each channel as a univariate sequence, but the model processes all channels together
Channel Independence (CI)
- Treats all channels as a single channel: the same model handles each channel independently as a univariate series
- Challenge of interchannel correlation forgetting
MCformer
(MTS forecasting model with mixed channel features)
- Innovative Mixed Channels strategy
- Combines:
- (1) The data expansion advantages of the CI strategy
- (2) The ability to counteract inter-channel correlation forgetting
- Details:
- Blends a specific number of channels
- Attention mechanism to effectively capture inter-channel correlation information
1. Introduction
Success of the CI strategy
- DLinear: a simple linear model that surpassed existing Transformer-based models
- PatchTST: a CI-strategy model
- Expands the dataset and enhances the model’s generalization capability
Research on PETformer
- CI > CD, because multivariate features can interfere with the extraction of long sequence features.
- This result goes against intuition: in deep learning, more information typically improves model generalization.
Two main reasons why CI > CD
- CI can expand the dataset to improve the generalization performance of the model (feat. PatchTST)
- CI can avoid the destruction of long-term feature information by channel-wise correlation information (feat. PETformer)
Drawbacks of CI strategy
- Overlooks inter-channel feature information
- With a large number of channels, the issue of inter-channel correlation forgetting may arise
Mixed Channels strategy
- (CI) Retains the advantages of the CI strategy in expanding the dataset
- (CD) Avoids the disruption of long-term feature information by inter-channel correlations.
- Addresses the issue of inter-channel correlation forgetting.
MCformer
- Multi-channel TS forecasting model with mixed channel features.
- Procedure
- Step 1) Expands the data using the CI strategy
- Step 2) Mixes a specific number of channels
- Step 3) Attention mechanism
- Capture the correlation information between channels
- Step 4) Encoder result is unflattened to obtain the predicted values of all channels
2. Method
(1) Problem Definition
Input: \(X=\left\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_t\right\} \in \mathbb{R}^{t \times M}\),
- \(\mathbf{x}_t=\left[x_t^1, x_t^2, \ldots, x_t^M\right]^{\top}\).
Target: \(\left\{\mathbf{x}_{t+1}, \ldots, \mathbf{x}_{t+h}\right\}\)
- \(M\) : number of channels, \(h\) : forecast horizon
Incorporate a Mixed-Channels Block into the vanilla Transformer Encoder
- to expand the dataset
- to blend inter-channel dependency information
(2) RevIN
Reversible Instance Normalization (RevIN)
- Address the issue of distribution shift.
Before the Mixed Channels module, we apply RevIN to normalize each channel’s data.
Notation
- Single channel : \(\mathbf{x}^i=\) \(\left[x_1^i, x_2^i, \ldots, x_t^i\right]\),
- For each channel \(\mathbf{x}^i\), calculate its mean and variance, then normalize with a learnable affine transform:
- \(\operatorname{RevIN}\left(\mathbf{x}^i\right)=\left\{\gamma_i \frac{\mathbf{x}^i-\operatorname{Mean}\left(\mathbf{x}^i\right)}{\sqrt{\operatorname{Var}\left(\mathbf{x}^i\right)+\varepsilon}}+\beta_i\right\}, i=1,2, \cdots, M\),
- where \(\gamma_i\) and \(\beta_i\) are learnable affine parameters.
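Below is a minimal PyTorch sketch of this normalization step. It follows the standard RevIN recipe (per-instance statistics plus a learnable affine transform, with statistics stored so forecasts can later be de-normalized); class and parameter names are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class RevIN(nn.Module):
    """Minimal RevIN sketch: per-channel instance normalization with a
    learnable affine transform; stored statistics let us invert the
    normalization on the model's forecasts."""

    def __init__(self, num_channels: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(num_channels))   # scale
        self.beta = nn.Parameter(torch.zeros(num_channels))   # shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, t, M); statistics are computed per instance and channel.
        self.mean = x.mean(dim=1, keepdim=True)
        self.std = torch.sqrt(x.var(dim=1, keepdim=True, unbiased=False) + self.eps)
        return (x - self.mean) / self.std * self.gamma + self.beta

    def inverse(self, y: torch.Tensor) -> torch.Tensor:
        # De-normalize predictions (B, h, M) back to the original scale.
        return (y - self.beta) / (self.gamma + self.eps) * self.std + self.mean
```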
(3) Mixed-Channels Block
a) Flatten
Apply the Channel Independence (CI) strategy to flatten the input:
\(X_F=\operatorname{Flatten}(X) \in \mathbb{R}^{tM \times 1}\).
- Treated as if it were \(M\) individual samples.
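A small sketch of the CI flatten for a batched tensor; the shapes (B, t, M) are illustrative. Each channel becomes a separate univariate training sample:

```python
import torch

B, t, M = 32, 96, 7                    # batch, look-back length, channels (illustrative)
X = torch.randn(B, t, M)
# CI flatten: each channel becomes its own univariate sample,
# (B, t, M) -> (B*M, t, 1), i.e., M individual samples per batch element.
X_F = X.permute(0, 2, 1).reshape(B * M, t, 1)
```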
b) Mixed Channels
Combining data from different channels
Procedure (see the sketch after this list)
- Step 1) Compute the interval size \(\left\lfloor\frac{M}{m}\right\rfloor\)
- where \(m\) is the number of channels to be mixed.
- Step 2) Mixed Channels operation:
- For a given time step \(t\), starting from the target channel, stack one channel at every interval stride to form \(U^i \in \mathbb{R}^{t \times m}\):
- \(U^i =\operatorname{MixedChannels}\left(\mathbf{x}^i, m\right) =\left[\operatorname{stack}\left(\mathbf{x}^i, C^1, C^2, \ldots, C^m\right)\right]\),
- where \(C^j\) denotes the channel sampled at the \(j\)-th interval from the target channel, \(1 \leq j \leq m\).
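The sketch below implements this stacking for one target channel. The modulo (wrap-around) indexing used to pick a channel at each interval is an assumption about the sampling details, and `mixed_channels` is a hypothetical helper; note that stacking the target \(\mathbf{x}^i\) together with \(C^1, \ldots, C^m\) yields \(m+1\) columns.

```python
import torch

def mixed_channels(X: torch.Tensor, i: int, m: int) -> torch.Tensor:
    """Build U^i for target channel i from a (t, M) window X."""
    t, M = X.shape
    interval = M // m                          # Step 1: interval size floor(M/m)
    # Step 2: starting from the target channel, sample one channel at each
    # interval stride (wrapping around the channel axis -- an assumption).
    idx = [(i + k * interval) % M for k in range(1, m + 1)]
    # Stack the target channel with the sampled channels C^1..C^m.
    return torch.stack([X[:, i]] + [X[:, j] for j in idx], dim=-1)

X = torch.randn(96, 8)                 # t = 96, M = 8 (illustrative)
U0 = mixed_channels(X, i=0, m=4)       # columns: x^0, C^1..C^4
```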
c) Patch and Projection
\(\mathcal{P}^i=\operatorname{Projection}\left(\operatorname{Patch}\left(U^i\right)\right)\).
- \(\mathcal{P}^i \in \mathbb{R}^{P \times N}\),
- \(P\) : length after projection
- \(N\) : number of patches
- \(N=\left\lfloor\frac{(L-p)}{S}\right\rfloor+2\),
- \(L\) : input sequence length
- \(p\) : patch length
- \(S\) : stride
Details
- [Patching] To aggregate the sequence after mixing channels
- [Projection] A single-layer MLP projects each patch, embedding channel dependencies as well as adjacent temporal dependencies.
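A sketch of patching and projection in the PatchTST style: the sequence end is replication-padded by one stride (which accounts for the \(+2\) in the formula for \(N\)), patches are extracted with `unfold`, and a single linear layer projects each patch. That the projection jointly embeds all \(m\) mixed channels of a patch is an assumption based on the "channel dependencies" remark above.

```python
import torch
import torch.nn as nn

class PatchProjection(nn.Module):
    """Sketch: split the mixed-channel sequence into patches, then project."""

    def __init__(self, patch_len: int, stride: int, d_model: int, m: int):
        super().__init__()
        self.patch_len, self.stride = patch_len, stride
        self.pad = nn.ReplicationPad1d((0, stride))      # pad end by one stride -> the "+2" in N
        self.proj = nn.Linear(patch_len * m, d_model)    # single-layer MLP

    def forward(self, U: torch.Tensor) -> torch.Tensor:
        # U: (B, L, m) mixed-channel sequence.
        x = self.pad(U.transpose(1, 2))                   # (B, m, L + S)
        x = x.unfold(-1, self.patch_len, self.stride)     # (B, m, N, p), N = floor((L-p)/S) + 2
        x = x.permute(0, 2, 1, 3).flatten(2)              # (B, N, m * p): one vector per patch
        return self.proj(x)                               # (B, N, P = d_model)

U = torch.randn(32, 96, 5)                     # B = 32, L = 96, m = 5 (illustrative)
pp = PatchProjection(patch_len=16, stride=8, d_model=128, m=5)
P_i = pp(U)                                    # (32, 12, 128): N = (96 - 16) // 8 + 2 = 12
```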
(4) Encoder
The vanilla Transformer encoder
- does not explicitly model the sequence’s order
\(\rightarrow\) Learnable additive positional encoding \(\mathcal{W}_{\text {pos }} \in \mathbb{R}^{P \times N}\).
\(\mathcal{X}_{i n}^i=\mathcal{P}^i+\mathcal{W}_{\text {pos }}\).
\(\mathcal{X}_{in}^i\) is then passed through the encoder; the output is unflattened to obtain the predicted values of all channels (Step 4 above).
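A sketch wiring these pieces together: a learnable additive positional encoding, a vanilla Transformer encoder, and a linear head that flattens the encoder output and maps it to the forecast horizon \(h\); the hyperparameters (nhead, num_layers, h) are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

d_model, n_patches, h = 128, 12, 96          # P, N, forecast horizon (illustrative)
W_pos = nn.Parameter(torch.randn(n_patches, d_model) * 0.02)  # learnable positional encoding

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=3)
head = nn.Linear(n_patches * d_model, h)     # per-series forecast head

P_i = torch.randn(32 * 7, n_patches, d_model)  # projected patches for B*M flattened series
z = encoder(P_i + W_pos)                       # X_in^i = P^i + W_pos, then self-attention
y = head(z.flatten(1))                         # (B*M, h); reshape ("unflatten") back to (B, h, M)
```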