LLM-Mixer: Multiscale Mixing in LLMs for Time Series Forecasting
Contents
- Abstract
- Introduction
- LLM-Mixer
- Experiments
0. Abstract
LLM-Mixer
- Combination of multiscale time-series (TS) decomposition with pretrained LLMs
- Captures both a) short-term fluctuations and b) long-term trends by decomposing the data into multiple temporal resolutions
- Processes them with a frozen LLM (guided by a textual prompt)
1. Introduction & Related Work
Challenges of “LLMs for TS forecasting”
- (1) Difference between the text and TS modalities
- (2) TS has “multiple time scales”
  - LLMs typically process fixed-length sequences ( \(\rightarrow\) only capture short-term dependencies )
- Solution: LLM-Mixer
  - Breaks down TS into multiple time scales
2. LLM-Mixer
(1) Multi-scale View of Time Data
Apply a multiscale mixing strategy
Procedure (a code sketch follows this list)
- Step 1) Downsample the series \(\tau\) times, halving the temporal resolution each time
  - Result: multiscale representation \(\mathcal{X}=\left\{\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_\tau\right\}\), with each \(\mathbf{x}_i \in \mathbb{R}^{\frac{T}{2^i} \times M}\)
  - \(\mathbf{x}_0\): the finest temporal details (the original series)
  - \(\mathbf{x}_\tau\): the broadest trends
- Step 2) Project the multiscale series into the model dimension with a DNN, combining
  - Token embedding
  - Temporal embedding
  - Positional embedding
- Step 3) Pass through stacked Past-Decomposable-Mixing (PDM) blocks
  - (For the \(l\)-th layer) \(\mathcal{X}^l=\mathrm{PDM}\left(\mathcal{X}^{l-1}\right), \quad l \in \{1, \ldots, L\}\)
  - \(\mathcal{X}^l=\left\{\mathbf{x}_0^l, \mathbf{x}_1^l, \ldots, \mathbf{x}_\tau^l\right\}\), with each \(\mathbf{x}_i^l \in \mathbb{R}^{\frac{T}{2^i} \times d}\)
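A minimal PyTorch sketch of Steps 1–3, assuming average pooling for the downsampler and a small MLP for the cross-scale mixing; `downsample`, `Embed`, and `PDM` are hypothetical names, and the PDM internals are an illustrative simplification, not the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def downsample(x: torch.Tensor, tau: int) -> list[torch.Tensor]:
    """Step 1: build {x_0, ..., x_tau}, with x_i of shape (B, T / 2^i, M)."""
    scales = [x]
    for _ in range(tau):
        # Halve the temporal resolution with average pooling (assumed choice).
        scales.append(F.avg_pool1d(scales[-1].transpose(1, 2), 2).transpose(1, 2))
    return scales

class Embed(nn.Module):
    """Step 2: project each scale from M variables into the d-dim model space."""
    def __init__(self, n_vars: int, d_model: int, max_len: int):
        super().__init__()
        self.token = nn.Linear(n_vars, d_model)    # token embedding
        self.pos = nn.Embedding(max_len, d_model)  # positional embedding
        # (A temporal embedding of calendar features would be added the same way.)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T_i, M)
        positions = torch.arange(x.shape[1], device=x.device)
        return self.token(x) + self.pos(positions)

class PDM(nn.Module):
    """Step 3 (sketch): update each coarser scale from the next-finer one,
    so information is mixed across temporal resolutions."""
    def __init__(self, d_model: int):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model))

    def forward(self, xs: list[torch.Tensor]) -> list[torch.Tensor]:
        out = [xs[0]]
        for i in range(1, len(xs)):
            # Pool the finer scale down to this scale's length, then mix it in.
            finer = F.avg_pool1d(out[-1].transpose(1, 2), 2).transpose(1, 2)
            out.append(xs[i] + self.mix(finer))
        return out

# Usage (shapes only): x is (B, T, M) with T divisible by 2^tau; embedding
# each scale and applying L PDM blocks yields X^L, a list of (B, T/2^i, d).
```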
(2) Prompt Embedding
Prompting
- An effective technique for guiding LLMs
- Supplies task-specific information
Use one textual description shared by all samples in a dataset
\(\rightarrow\) Generate its embedding using the pretrained LLM’s embedding matrix \(E \in \mathbb{R}^{V \times d}\) (\(V\): vocabulary size)
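A minimal sketch of this step, assuming a HuggingFace GPT-2 backbone as the frozen LLM; the model choice and prompt text are illustrative, not the paper's exact configuration:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
llm = AutoModel.from_pretrained("gpt2")
llm.requires_grad_(False)  # the pretrained LLM stays frozen

# One dataset-level description, reused for every sample (illustrative text).
prompt = "Forecast the next values of this multivariate time series."

token_ids = tokenizer(prompt, return_tensors="pt").input_ids  # (1, n_tokens)
E = llm.get_input_embeddings()   # the LLM's embedding matrix, V x d
prompt_emb = E(token_ids)        # (1, n_tokens, d)
```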
(3) Multi-scale Mixing in LLM
(After \(L\) PDM blocks) Obtain the multiscale past information \(\mathcal{X}^L\)
Feed the prompt embedding concatenated with it to the frozen LLM \(\mathbb{F}\): the result is \(\mathbb{F}\left(E \oplus \mathcal{X}^L\right)\), where \(\oplus\) denotes concatenation
Apply a trainable decoder (a simple linear transformation) to the LLM’s last hidden layer to produce the forecast
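A minimal sketch of this final stage, reusing `llm` and `prompt_emb` from the prompt-embedding sketch and a list `x_scales` of PDM outputs; `ForecastDecoder`, the last-token readout, and the output reshaping are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class ForecastDecoder(nn.Module):
    """Trainable linear decoder on top of the frozen LLM's last hidden layer."""
    def __init__(self, d_model: int, horizon: int, n_vars: int):
        super().__init__()
        self.proj = nn.Linear(d_model, horizon * n_vars)
        self.horizon, self.n_vars = horizon, n_vars

    def forward(self, llm: nn.Module, prompt_emb: torch.Tensor,
                x_scales: list[torch.Tensor]) -> torch.Tensor:
        B = x_scales[0].shape[0]
        # E ⊕ X^L: prepend the prompt tokens to the multiscale tokens.
        tokens = torch.cat([prompt_emb.expand(B, -1, -1), *x_scales], dim=1)
        hidden = llm(inputs_embeds=tokens).last_hidden_state  # (B, n_tok, d)
        # Read out the final hidden state (one assumed choice) and decode.
        out = self.proj(hidden[:, -1])                        # (B, H * M)
        return out.view(B, self.horizon, self.n_vars)
```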