Bi-Mamba+: Bidirectional Mamba for Time Series Forecasting
Contents
- Abstract
- Introduction
- Related Work
    - TSF
    - SSM-based models
- Methodology
    - Preliminaries
    - Overview
    - Instance Normalization
    - Token Generation
    - Mamba+ Block
    - Bidirectional Mamba+ Encoder
    - Loss Function
0. Abstract
Mamba
- (1) “SELECTIVE” capability on input data
- (2) Hardware-aware “PARALLEL” computing algorithm
\(\rightarrow\) Balances a) predictive performance and b) computational efficiency compared to Transformers.
Bi-Mamba+
- Preserves historical information over a longer range
- Adds a “FORGET gate” inside Mamba: selectively combines the new & historical features
- Bi-Mamba+ = Mamba+ applied both forward and backward
- Emphasis on both intra- and inter-series dependencies \(\rightarrow\) proposes a “series-relation-aware (SRA) decider”
    - controls the use of (1) channel-independent or (2) channel-mixing tokenization strategies for specific datasets
1. Introduction
a) Challenges of Transformers
- Quadratic complexity of the self-attention mechanism \(\rightarrow\) slow training and inference speeds
- Do not explicitly capture inter-series dependencies
b) State-space models (SSM)
Promising architecture for sequence modeling
Mamba
- Remarkable results in sequence processing tasks
- Key point: “selective” scanning
\(\rightarrow\) Potentially suitable for the LTSF task
c) Limited utilization of SSMs in LTSF
Stems from the inherent challenges of TS analysis tasks
- (1) Long-term TS modeling.
- (2) Emphasis on intra- or inter-series dependencies.
(1) Long-term time series modeling
- Affected by data non-stationarity, noise and outliers
- Need for patching. Why?
    - Semantic information density of TS data at individual time points is lower than in other types of sequence data
    - Reduces the number of sequence elements and leads to lower computational complexity
- iTransformer (2024)
    - Simple FC layer to map the whole sequence to hidden states
    - Captures coarse-grained (O) but not fine-grained (X) evolutionary patterns inside the TS
- This paper: models the TS in a “patching” manner
(2) Emphasis on intra- or inter-series dependencies
- Complex correlations between multiple variables: whether CI or CD works better is not well established (differs by dataset)
- TimeMachine (Ahamed and Cheng 2024)
    - Unified structure for (1) channel-independent (CI) and (2) channel-mixing (CM, CD) tokenization strategies
    - Handles both (1) intra-series-emphasis and (2) inter-series-emphasis cases
    - Limitations
        - Boundary for the selection of tokenization strategies is ambiguous
        - Statistical characteristics of the datasets are overlooked
d) Mamba+
Mamba+ = Improved Mamba block
- Adds a forget gate to Mamba
    - How: selectively combines the new features with historical features
    - Result: preserves historical information over a longer range
Bidirectional Mamba+ (Bi-Mamba+)
- Models the MTS data from both the forward and backward directions
- Result: enhances the …
    - (1) Model’s robustness
    - (2) Ability to capture interactions between TS elements
- Series-Relation-Aware (SRA) decider
    - Inspired by the Spearman correlation coefficient
    - Why? To address the varying emphasis on
        - intra-series evolutionary patterns
        - and inter-series interactions
    - How? Measures the proportion of highly correlated series pairs in the MTS data
        - to automatically choose the CI or CM tokenization strategy
- Patch-wise tokens
    - generated based on the (CI or CM) tokenization strategy
    - contain richer semantic information & encourage the model to learn long-term dependencies of the TS at a finer granularity
e) Contributions
- Bi-Mamba+ for the LTSF task
    - Improved Mamba+ block & models the MTS data from both the forward and backward directions
- SRA decider & patching
    - (SRA decider) Based on the Spearman correlation coefficient, to automatically choose channel-independent or channel-mixing tokenization strategies
    - (Patching) To capture long-term dependencies at a finer granularity
- Extensive experiments on 8 real-world datasets
2. Related Work
(1) TSF
a) Transformer-based models (Vaswani et al. 2017)
Self-attention mechanism
= Quadratic complexity in the length of the sequence
\(\rightarrow\) Limitation on LTSF
b) Improvements of the Transformer
- Informer (Zhou et al. 2021)
    - proposes a ProbSparse mechanism which selects the top-k elements of the attention weight matrix to perform a distillation operation on self-attention.
- Autoformer (Wu et al. 2021)
    - uses time series decomposition and proposes an AutoCorrelation mechanism inspired by stochastic process theory.
- Pyraformer (Liu et al. 2021)
    - introduces a pyramidal attention module to summarize features at different resolutions and model temporal dependencies of different ranges.
- FEDformer (Zhou et al. 2022)
    - develops a frequency enhanced Transformer through frequency domain mapping.
- PatchTST (Nie et al. 2023)
    - divides each univariate sequence into patches and uses patch-wise self-attention to model temporal dependencies.
- Crossformer (Zhang and Yan 2023)
    - adopts a similar patching operation but additionally employs Cross-Dimension attention to capture inter-series dependencies.
Patching
- helps reduce the number of sequence elements to be processed
- extracts richer semantic information
\(\rightarrow\) Still, the self-attention layers are only applied to the simplified sequences.
c) iTransformer (Liu et al. 2023)
- Inverts the attention layers to directly model inter-series dependencies
- Limitation: the tokenization approach simply passes the whole sequence through an MLP \(\rightarrow\) overlooks the complex evolutionary patterns inside the TS
Transformer-based models still face challenges in computational efficiency and predictive performance.
(2) SSM-based models
a) RNNs
- Process the sequence elements step by step
- Maintain a hidden state
    - updated with each input element
 
- Pros & Cons
    - Pros) Simple and have excellent inference speed
    - Cons) Limits the training speed and leads to forgetting long-term information
 
b) CNNs
- Convolutional kernel to emphasize local information
- Pros & Cons
    - Pros) Parallel computing and faster training speed
    - Cons) Limits the inference speed & overlooks long-term global information
 
c) State Space Models (SSM)
( Inspired by continuous systems )
- (Like CNN) Fast training (Trained in parallel)
- (Like RNN) Fast inference
SSM in TSF
- SSDNet (Lin et al. 2021)
    - combines the Transformer architecture with SSM
    - provides probabilistic and interpretable forecasts
- SpaceTime (Zhang et al. 2023)
    - proposes a new SSM parameterization based on the companion matrix
    - enhances the expressivity of the model and introduces a “closed-loop” variation of the companion SSM
 
- Mamba (Gu and Dao 2023)
    - adds data-dependent parameterized matrices and a hardware-aware parallel computing algorithm to SSM
- S-Mamba (Wang et al. 2024)
    - explores using Mamba to capture inter-series dependencies of MTS
    - Procedure
        - step 1) embeds each UTS like iTransformer
        - step 2) feeds the embeddings into Mamba blocks
    - Limitation: the tokenization approach may overlook the complex evolutionary patterns inside the TS
- MambaMixer (Behrouz et al. 2024)
    - adjusts the Mamba block to be bidirectional
    - uses two improved blocks to capture inter- & intra-series dependencies
    - Limitation: a single gating branch is used to filter the new features of both the forward and backward directions, which may cause challenges for extracting new features
- TimeMachine (Ahamed and Cheng 2024)
    - proposes a multi-scale quadruple-Mamba architecture to unify the handling of CI & CM situations
    - Limitation: CM & CI strategies are chosen simply based on the length of historical observations and the number of variables of different datasets \(\rightarrow\) the characteristics of the MTS data are not fully considered
3. Methodology
(1) Preliminaries
Notation
- \(\mathbf{X}_{\text {in }}=\left[x_1, x_2, \ldots, x_L\right] \in \mathbb{R}^{L \times M}\): historical observations
- \(\mathbf{X}_{\text {out }}=\left[x_{L+1}, x_{L+2}, \ldots, x_{L+H}\right] \in \mathbb{R}^{H \times M}\): future values to predict
- \(L\): lookback length, \(H\): prediction horizon, \(M\): number of variables
State Space Models
- \(h^{\prime}(t)=\mathbf{A} h(t)+\mathbf{B} x(t), \quad y(t)=\mathbf{C} h(t)\)
    - where \(\mathbf{A} \in \mathbb{R}^{N \times N}, \mathbf{B} \in \mathbb{R}^{D \times N}\) and \(\mathbf{C} \in \mathbb{R}^{N \times D}\)
- Notation
    - \(N\): state expansion factor
    - \(D\): dimension factor
- Continuous parameters \(\mathbf{A}, \mathbf{B}\) \(\rightarrow\) discretized to \(\overline{\mathbf{A}}, \overline{\mathbf{B}}\)
    - by zero-order hold & time sampling at intervals of \(\Delta\)
Discretization
\(\begin{aligned} & \overline{\mathbf{A}}=\exp (\Delta \mathbf{A}), \\ & \overline{\mathbf{B}}=(\Delta \mathbf{A})^{-1}(\exp (\Delta \mathbf{A})-\mathbf{I}) \cdot \Delta \mathbf{B} . \end{aligned}\).
Discretized SSM
- \(h_k=\overline{\mathbf{A}} h_{k-1}+\overline{\mathbf{B}} x_k, \quad y_k=\mathbf{C} h_k\)
- (1) Can be trained in parallel
    - in a convolutional operation way
- (2) Efficient inference
    - in an RNN manner
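
As a concrete illustration of the discretization and the recurrent form above, here is a minimal NumPy sketch; it assumes \(\mathbf{B}\) is used as the state-input map (shape \(N \times D\)) and \(\mathbf{C}\) as the state-output map, and all variable names are illustrative rather than the paper's code.

```python
import numpy as np
from scipy.linalg import expm

def discretize(A, B, delta):
    """Zero-order-hold discretization: A_bar = exp(dA), B_bar = (dA)^-1 (exp(dA) - I) dB."""
    dA = delta * A                                    # A: (N, N), B: (N, D)
    A_bar = expm(dA)
    B_bar = np.linalg.solve(dA, A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, x):
    """RNN-style inference pass: h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_k in x:                                     # x: (L, D)
        h = A_bar @ h + B_bar @ x_k
        ys.append(C @ h)                              # C: (D_out, N)
    return np.stack(ys)

N, D, L = 8, 4, 32
A, B, C = -np.eye(N), np.random.randn(N, D), np.random.randn(D, N)
y = ssm_recurrence(*discretize(A, B, delta=0.1), C, np.random.randn(L, D))  # (L, D)
```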
 
HIPPO Matrix (Gu et al. 2020)
- Applied to the initialization of matrix \(\mathbf{A}\)
- Namely the structured state space model (S4) (Gu et al. 2021b)
- Improvement on the ability to model long-term dependencies
Mamba (Gu and Dao 2023)
- Parameterizes the matrices \(\mathbf{B}, \mathbf{C}\) and \(\Delta\) in a data-driven manner
- Introduces a “selection” mechanism into the S4 model
- Uses a novel hardware-aware parallel computing algorithm
- Linear computational complexity & outstanding capability in modeling long-term dependencies
(2) Overview

Bi-Mamba+
- Step 1) Calculate the tokenization strategy indicator
    - through the SRA decider
- Step 2) Divide the TS into patches & generate patch-wise tokens
    - based on the tokenization strategy indicator (CI or CM)
- Step 3) Feed the tokens into multiple Bi-Mamba+ encoders
- Step 4) Flatten head & linear projector produce the forecast (see the sketch below)
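
A compact, hedged sketch of this Step 1–4 pipeline for the channel-independent path only; the class name, the GRU stand-in for the Bi-Mamba+ encoders, and the hyperparameter defaults are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ToyBiMambaPlusCI(nn.Module):
    """Steps 1-4, CI path: patch -> tokenize -> encoder stack -> flatten head + projector."""
    def __init__(self, seq_len=96, pred_len=96, patch_len=16, stride=8, d_model=64, n_layers=2):
        super().__init__()
        self.patch_len, self.stride = patch_len, stride
        n_patches = (seq_len - patch_len) // stride + 1
        self.tokenizer = nn.Linear(patch_len, d_model)            # Step 2: patch-wise tokens
        # Step 3 stand-in: any sequence model over the J patch tokens
        # (the real model stacks Bi-Mamba+ encoders here).
        self.encoders = nn.ModuleList(
            [nn.GRU(d_model, d_model, batch_first=True) for _ in range(n_layers)]
        )
        self.head = nn.Linear(n_patches * d_model, pred_len)      # Step 4: flatten head + projector

    def forward(self, x):                                         # x: (batch, L, M), instance-normalized
        bsz, _, n_vars = x.shape
        x = x.permute(0, 2, 1)                                    # (batch, M, L): CI treats each series alone
        patches = x.unfold(-1, self.patch_len, self.stride)       # (batch, M, J, P)
        tokens = self.tokenizer(patches)                          # (batch, M, J, D)
        tokens = tokens.reshape(bsz * n_vars, -1, tokens.shape[-1])
        for enc in self.encoders:
            tokens, _ = enc(tokens)
        out = self.head(tokens.flatten(1))                        # (batch*M, H)
        return out.reshape(bsz, n_vars, -1).permute(0, 2, 1)      # (batch, H, M)

y = ToyBiMambaPlusCI()(torch.randn(4, 96, 7))                     # -> torch.Size([4, 96, 7])
```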
(3) Instance Normalization
Distribution shift
- Statistical properties of time series data usually change over time
RevIN (Kim et al. 2022)
- Eliminates the non-stationary statistics in the input TS and restores them on the output
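
A minimal sketch of RevIN-style instance normalization, assuming the plain normalize-then-denormalize scheme without the optional learnable affine parameters; function names are illustrative.

```python
import torch

def instance_norm(x, eps=1e-5):
    """x: (batch, L, M); statistics are computed per series, per instance."""
    mean = x.mean(dim=1, keepdim=True)
    std = x.std(dim=1, keepdim=True) + eps
    return (x - mean) / std, (mean, std)

def instance_denorm(y, stats):
    """y: (batch, H, M) model output; re-apply the stored statistics."""
    mean, std = stats
    return y * std + mean
```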
(4) Token Generation
a) SRA Decider
Both CI & CD (CM) strategies can achieve SOTA accuracy
- CI wins on datasets with fewer variables
- CM wins on datasets with more variables
\(\rightarrow\) Need to balance the emphasis between INTER- & INTRA-series dependencies
SRA decider
- Automatically control the tokenization process

- Step 1) Extract the training set data \(T=\left\{t^1, t^2, \ldots, t^M\right\}\)
- Step 2) Calculate the Spearman correlation coefficients \(\rho_{i, j}\)
    - of different series \(t^i\) and \(t^j\)
    - where \(i\) and \(j\) are the indexes of the series, ranging from 1 to \(M\)
- Step 3) Set thresholds \(\lambda\) and 0
    - to filter out series pairs whose correlation exceeds each threshold
- Step 4) Count the maximum number of relevant series, \(\rho_{\max }^\lambda\) and \(\rho_{\max }^0\)
- Step 5) Calculate the relation ratio \(r=\rho_{\max }^\lambda / \rho_{\max }^0\)
- Step 6) Select the strategy (see the sketch below)
    - CM strategy to generate sequence tokens for datasets with \(r \geq 1-\lambda\)
    - CI strategy otherwise
Spearman coefficient
- Nonparametric statistical indicator for evaluating the monotonic relationship between two sequences
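
A hedged sketch of the SRA decider steps above; the Spearman coefficients come from scipy, the threshold bookkeeping reflects my reading of the notes, and the default \(\lambda\) is illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

def sra_decider(train_data, lam=0.6):
    """train_data: (T, M) training split of the MTS (M > 2); lam: correlation threshold λ."""
    rho, _ = spearmanr(train_data)                    # (M, M) Spearman correlations ρ_{i,j}
    np.fill_diagonal(rho, 0.0)                        # ignore self-correlation
    rho_max_lam = (rho >= lam).sum(axis=1).max()      # ρ_max^λ: most partners with ρ >= λ
    rho_max_0 = (rho > 0).sum(axis=1).max()           # ρ_max^0: most positively correlated partners
    r = rho_max_lam / max(rho_max_0, 1)               # relation ratio r
    return "CM" if r >= 1 - lam else "CI"

print(sra_decider(np.random.randn(500, 7)))           # likely "CI" on uncorrelated noise
```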
b) Tokenization Process
Generate patch-wise tokens to emphasize capturing local evolutionary patterns of the TS
Procedures
- Step 1) Patch each UTS
    - Input: \(x_{1: L}^i\)
    - Output: \(p^i \in \mathbb{R}^{J \times P}\)
        - \(J\): total number of patches
        - \(P\): length of each patch
        - \(S\): stride
- Step 2-1) CI strategy
    - Each UTS’s patch tokens are concatenated into \(\mathbb{E}_{\text {ind }} \in \mathbb{R}^{M \times J \times D}\)
        - \(D\): hidden state dimension
- Step 2-2) CM strategy
    - Group patches with the same index across different series
    - Pass each group through the tokenization layer
    - Output: \(\mathbb{E}_{\text {mix }} \in \mathbb{R}^{J \times M \times D}\) (shape sketch below)
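
A shape-only sketch of the two tokenization strategies, assuming a single shared linear patch-to-token projection; the paper's CM tokenization layer may mix channels differently, so this only illustrates the resulting tensor layouts.

```python
import torch
import torch.nn as nn

L, M, P, S, D = 96, 7, 16, 8, 64
x = torch.randn(L, M)                                 # one multivariate lookback window

patches = x.T.unfold(-1, P, S)                        # (M, J, P), J = (L - P) // S + 1
tokenizer = nn.Linear(P, D)                           # patch-to-token projection (assumed shared)

E_ind = tokenizer(patches)                            # CI tokens: (M, J, D)
E_mix = tokenizer(patches).transpose(0, 1)            # CM tokens: (J, M, D), same-index patches grouped
print(E_ind.shape, E_mix.shape)                       # torch.Size([7, 11, 64]) torch.Size([11, 7, 64])
```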
 
(5) Mamba+ Block
a) Mamba block
- 2 branches process the input features, \(b_1\) & \(b_2\)
    - (branch 1) \(b_1\): passes the input features through a 1-D CNN & SSM block
    - (branch 2) \(b_2\): passes the input features into a SiLU activation function to serve as a gate
- HIPPO matrix (embedded in the SSM block)
    - Retains fairly long-term historical information
    - Still, the obtained result is filtered directly through the gate of the other branch, resulting in a tendency to prioritize proximal information (Wang et al. 2024)
- Solution: improved Mamba+ block
    - specifically designed for LTSF
b) Mamba+ block
Adds a forget gate
\(\text{gate}_f=1-\text{gate}_{b_2}\)
- \(\text{gate}_f\): forget gate
- \(\text{gate}_{b_2}\): result of the gating activation in \(b_2\)
\(x^{\prime}\): output of the 1-D CNN
- Step 1) Multiplied with \(\text{gate}_f\)
- Step 2) Added to the filtered result of the SSM
\(\text{gate}_f\) & \(\text{gate}_{b_2}\) selectively combine the added new features with the forgotten historical features in a complementary manner (see the sketch below).
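
A minimal sketch of just the gating arithmetic described above (the selective SSM, convolution and projections are omitted); function and tensor names are assumptions.

```python
import torch
import torch.nn.functional as F

def mamba_plus_gating(x_conv, ssm_out, z):
    # x_conv: x', output of the 1-D CNN           (B, W, D)
    # ssm_out: output of the selective SSM on x'  (B, W, D)
    # z: output of the b2 linear branch           (B, W, D)
    gate_b2 = F.silu(z)                          # b2 gate (the notes mention SiLU / sigmoid)
    gate_f = 1.0 - gate_b2                       # forget gate: complement of the b2 gate
    return gate_f * x_conv + gate_b2 * ssm_out   # combine new & historical features

y = mamba_plus_gating(torch.randn(2, 11, 64), torch.randn(2, 11, 64), torch.randn(2, 11, 64))
```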

(6) Bidirectional Mamba+ Encoder

Original Mamba block
- Processes the 1-D sequence in one direction only
Bidirectional Mamba+
- Structure to comprehensively model the MTS
- Encoder input: \(\mathbb{E}_x^{(l)} \in \mathbb{R}^{B \times W \times D}\)
    - \(l\): encoder layer index
    - \(B\) and \(W\) correspond to \(M\) or \(J\) depending on the tokenization strategy
    - If \(ts=1\) (CM): \(\mathbb{E}_x^{(l)} \in \mathbb{R}^{J \times M \times D}\) and \(\mathbb{E}_x^{(0)}=\mathbb{E}_{\text{mix}}\)
    - Else (CI): \(\mathbb{E}_x^{(l)} \in \mathbb{R}^{M \times J \times D}\) and \(\mathbb{E}_x^{(0)}=\mathbb{E}_{\text{ind}}\)
- Two Mamba+ blocks in one Bi-Mamba+ encoder
    - model the input sequence from the forward and backward directions respectively
    - \(\mathbb{E}_{x, dir}^{(l)}\) where \(dir \in\{\text{forward}, \text{backward}\}\)
- \(\mathbb{E}_x^{(l+1)}=\sum_{dir}^{\{\text{forward}, \text{backward}\}} \mathcal{F}\left(\mathbb{E}_{y, dir}^{(l)}, \mathbb{E}_{x, dir}^{(l)}\right)\)
    - ( = input of the next Bi-Mamba+ encoder layer; see the sketch below )
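
A hedged sketch of one bidirectional encoder layer: a sequence block is run on the tokens and on their time-reversed copy, and the two directions are merged into the next layer's input. The GRU stands in for the Mamba+ block and \(\mathcal{F}\) is taken to be a residual add plus LayerNorm; both are assumptions, not the paper's exact definitions.

```python
import torch
import torch.nn as nn

class ToyBiEncoderLayer(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.fwd = nn.GRU(d_model, d_model, batch_first=True)   # forward Mamba+ stand-in
        self.bwd = nn.GRU(d_model, d_model, batch_first=True)   # backward Mamba+ stand-in
        self.norm = nn.LayerNorm(d_model)

    def forward(self, e_x):                       # e_x: (B, W, D)
        y_fwd, _ = self.fwd(e_x)                  # forward direction
        y_bwd, _ = self.bwd(e_x.flip(dims=[1]))   # backward: reverse along the token axis
        y_bwd = y_bwd.flip(dims=[1])              # restore the original order
        # E_x^(l+1) = sum over directions of F(E_y_dir, E_x_dir); here F = add & norm
        return self.norm(y_fwd + e_x) + self.norm(y_bwd + e_x)

out = ToyBiEncoderLayer()(torch.randn(8, 11, 64))  # -> (8, 11, 64)
```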
 
(7) Loss Function
MSE
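For completeness, a standard form of the MSE objective over the \(H\)-step, \(M\)-variable forecast (the paper may additionally average over batches):
\(\mathcal{L}_{\text{MSE}}=\frac{1}{H \times M} \sum_{t=L+1}^{L+H} \sum_{m=1}^{M}\left(\hat{x}_t^m-x_t^m\right)^2\)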
