Ti-MAE : Self-Supervised Masked Time Series Auto Encoders
( https://openreview.net/pdf?id=9AuIMiZhkL2 )
Contents
- Abstract
- Introduction
- Related Works
- Methodology
- Problem Definition
- Model Architecture
- Experiments
- Experimental Setup
- TS Forecasting
- TS Classification
Abstract
Contrastive learning & Transformer-based models : good performance on long-term TS forecasting
Problems :
- (1) Contrastive learning :
  - training paradigm of contrastive learning and downstream prediction tasks are inconsistent
- (2) Transformer-based models :
  - resort to similar patterns in historical time series data for predicting future values
    \(\rightarrow\) induce severe distribution shift problems
  - do not fully leverage the sequence information ( compared to SSL )
Ti-MAE
- input TS : assumed to follow an integrated distribution
- randomly masks out TS & reconstructs them
- adopts "mask modeling" ( rather than contrastive learning ) as the auxiliary task
- bridges the connection between existing representation learning & generative Transformer-based methods
  \(\rightarrow\) reducing the difference between upstream and downstream forecasting tasks
1. Introduction
generative Transformer-based models
= a special kind of denoising autoencoders
( where we only mask the future values and reconstruct them )
2 problems in continuous masking strategy
- (1) captures only the information of the visible sequence and some mapping relationship between the historical and the future segments
- (2) induces severe distribution shift problems
  - especially when the prediction horizon is longer than the input sequence
    ( In reality : most TS are non-stationary )
Disentangled TS
- \(\boldsymbol{y}(t)=\operatorname{Trend}(t)+\operatorname{Seasonality}(t)+\text{Noise}\)
- Trend : \(\sum_n t^n\)
  - the moments of the trend part change continuously over time
- Seasonality : \(\sum_n \cos^n t\)
  - stationary, when we set a proper observation horizon
The size of the sliding window used to obtain the trend is vital !
Problems in decomposition ( see the sketch below )
- Natural time series data generally have more complex periodic patterns
  \(\rightarrow\) have to employ longer sliding windows or other hierarchical disposals
- Ends of a sequence need to be padded for alignment
  \(\rightarrow\) causes inevitable data distortion at the head and tail
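To make this concrete, here is a minimal moving-average decomposition sketch (my own illustration, not the paper's code; the window size is arbitrary). It extracts the trend with a sliding window and uses replicate-padding at both ends, which is exactly where the head/tail distortion appears:

```python
import torch

def moving_average_decompose(y: torch.Tensor, window: int):
    """y: (T,) series -> (trend, remainder). Ends are replicate-padded
    so the output keeps length T, which distorts the head and tail."""
    pad = (window - 1) // 2
    y_pad = torch.cat([y[:1].repeat(pad), y, y[-1:].repeat(window - 1 - pad)])
    trend = y_pad.unfold(0, window, 1).mean(dim=-1)   # sliding-window mean
    remainder = y - trend                             # seasonality + noise left over
    return trend, remainder

t = torch.arange(200, dtype=torch.float32)
y = 0.01 * t + torch.sin(0.3 * t) + 0.1 * torch.randn(200)   # trend + seasonality + noise
trend, rest = moving_average_decompose(y, window=25)
```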
Ti-MAE
proposes a novel Transformer-based framework : Ti-MAE
- randomly masks out parts of embedded TS
- learns an AE to reconstruct them ( at the point-level )
Random masking ( vs. fixed continuous masking )
- (1) takes the overall distribution of inputs
  \(\rightarrow\) alleviates the distribution shift problem
  ( a toy comparison of the two masking schemes is sketched after this list )
- (2) Encoder-Decoder structure : provides a universal scheme for both forecasting and classification
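A toy comparison of the two masking schemes (the sequence length and ratio are assumptions for illustration): fixed continuous masking always hides the latest steps, whereas uniform random masking spreads the visible steps over the whole input.

```python
import torch

T, mask_ratio = 96, 0.75
num_masked = int(T * mask_ratio)

# (a) fixed continuous masking : only the future segment is hidden,
#     so the model always sees the same (earliest) part of the window
continuous_masked = torch.arange(T - num_masked, T)

# (b) random masking : masked steps are drawn uniformly over the whole
#     sequence, so visible tokens cover the overall input distribution
random_masked = torch.randperm(T)[:num_masked]
```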
Contributions
- Novel perspective to bridge the connection between existing (1) contrastive learning and (2) generative Transformer-based models on TS
  & points out their inconsistency and deficiencies on downstream tasks
- Propose Ti-MAE
  - a masked time series autoencoder
  - learns strong representations with less inductive bias or hierarchical tricks
  - pros 1) adequately leverages the input TS & successfully alleviates the distribution shift problem
  - pros 2) due to the flexible setting of the masking ratio
    \(\rightarrow\) can adapt to complex scenarios which require the trained model to forecast simultaneously over multiple time windows of various sizes, without re-training
- Achieves excellent performance on both (1) forecasting and (2) classification tasks
2. Related Works
(1) Transformer-based TS model
Transformer : can capture long-range dependencies
ex 1) Song et al. (2018); Ma et al. (2019); Li et al. (2019)
- directly apply the vanilla Transformer to TS
- failed in long-sequence TS forecasting tasks
  \(\because\) the self-attention operation scales quadratically with the input TS length
ex 2) Child et al. (2019); Zhou et al. (2021); Liu et al. (2022)
- noticed the long tail distribution in self-attention feature map
- utilized sparse attention mechanism to reduce time complexity and memory usage
- but applying too long an input TS in the training stage will degrade the forecasting accuracy of the model
ex 3) latest works : ETSformer (Woo et al., 2022b) & FEDformer (Zhou et al., 2022)
- rely heavily on disentanglement and extra introduced domain knowledge
(2) TS Representation Learning
SSL : good performance in TS domain ( especially contrastive learning )
ex 1) Lei et al. (2019); Franceschi et al. (2019)
- used loss function of metric learning to preserve pairwise similarities in the time domain
ex 2) CPC (van den Oord et al., 2018)
- first proposed contrastive predictive coding and InfoNCE
- which treats :
- data from the same sequence : POS pairs
- different noise data from the mini-batch : NEG pairs
ex 3) DA on TS data
- capture transformation-invariant features at semantic level (Eldele et al., 2021; Yue et al., 2022)
ex 4) **CoST (Woo et al., 2022a)**
- introduced extra inductive biases in frequency domain through DFT
- separately processed disentangled “trend and seasonality” parts of the original TS
- to encourage discriminative seasonal and trend representations
\(\rightarrow\) Almost all of these methods rely heavily on data augmentation or other domain knowledge
3. Methodology
(1) Problem Definition
MTS : \(\mathcal{X}=\left(\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_T\right) \in \mathbb{R}^{T \times m}\)
Forecasting Task : ( see the windowing sketch below )
- input : \(\mathcal{X}_h \in \mathbb{R}^{h \times m}\) with length \(h\)
- target : \(\mathcal{X}_f \in \mathbb{R}^{k \times n}\) : values of the next \(k\) steps, where \(n \leq m\)
Classification Task : (pass)
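A small windowing sketch of the forecasting setup above (variable names and sizes are hypothetical): slice a history window \(\mathcal{X}_h\) of length \(h\) and take the following \(k\) steps as the target.

```python
import torch

T, m, h, k = 1000, 7, 96, 24      # series length, channels, history, horizon
X = torch.randn(T, m)             # full multivariate series, X in R^{T x m}

def make_window(X: torch.Tensor, start: int, h: int, k: int):
    X_h = X[start : start + h]            # input  : (h, m)
    X_f = X[start + h : start + h + k]    # target : (k, m); in general n <= m channels are predicted
    return X_h, X_f

X_h, X_f = make_window(X, start=0, h=h, k=k)
```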
(2) Model Architecture
Encoder : maps \(\mathcal{X} \in \mathbb{R}^{T \times m}\) to \(\mathcal{H} \in \mathbb{R}^{T \times n}\)
Decoder : reconstructs the original sequence from the embedding
+ adopt an asymmetric design
- [Encoder] only operates on visible tokens after applying masking to the input embedding
- [Decoder]
  - processes encoded tokens padded with mask tokens
  - reconstructs the original TS at the point-level
a) Input Embedding
Model : 1d-conv
- ( does not adopt any multi-scale or complex convolution scheme ( ex. dilated convolution ) )
- extracts local temporal features at each timestamp across channels
Positional Embeddings :
- fixed sinusoidal positional embeddings
+ do not add any handcrafted task-specific / date-specific embeddings
  ( so as to introduce as little inductive bias as possible )
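A minimal sketch of this embedding stage, assuming a single Conv1d with kernel size 3 and the standard fixed sinusoidal table (the exact kernel size and module names are my assumptions, not the released code):

```python
import math
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    def __init__(self, in_channels: int, d_model: int = 64, max_len: int = 5000):
        super().__init__()
        # 1d-conv mixes the m channels at each timestamp into one d_model token
        self.conv = nn.Conv1d(in_channels, d_model, kernel_size=3, padding=1)
        pe = torch.zeros(max_len, d_model)                       # fixed sinusoidal table
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)                           # not learned

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (B, T, m)
        tokens = self.conv(x.transpose(1, 2)).transpose(1, 2)    # (B, T, d_model)
        return tokens + self.pe[: tokens.size(1)]                # add positional embedding
```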
b) Masking
After tokenizing the input :
(1) randomly sample a subset of tokens
- without replacement + from uniform distribution
(2) mask the remaining parts
Masking Ratio
(He et al., 2021; Feichtenhofer et al., 2022)
- related to the information density and redundancy of the data
- immense impact on the performance of the AE
- applications
  - natural language : higher information density ( due to its highly discrete word distn )
    - BERT : 15%
  - images : heavy spatial redundancy ( a single pixel in one image has lower semantic information )
    - MAE for images : 75%
    - MAE for videos : 90%
- data with lower information density : should use a higher masking ratio
- TS data : ( similar to images ) also have local continuity
  \(\rightarrow\) determine a high masking ratio ( = 75% ) ( a masking sketch follows below )
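A sketch of MAE-style per-sample random masking following the recipe of He et al. (2021) (the function signature and variable names are mine): draw uniform noise per token, keep the 25% of tokens with the smallest noise as visible, and remember the inverse permutation so the decoder can restore the original order.

```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """tokens: (B, T, D) -> visible tokens, binary mask, restore indices."""
    B, T, D = tokens.shape
    len_keep = int(T * (1 - mask_ratio))
    noise = torch.rand(B, T)                            # uniform noise per token
    ids_shuffle = noise.argsort(dim=1)                  # random permutation (no replacement)
    ids_restore = ids_shuffle.argsort(dim=1)            # inverse permutation
    ids_keep = ids_shuffle[:, :len_keep]                # indices of visible tokens
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, T)                             # 1 = masked, 0 = visible
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)           # back to original order
    return visible, mask, ids_restore
```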
c) Ti-MAE Encoder
vanilla Transformer blocks
- utilizes pre-norm ( instead of post-norm )
- applied only on visible tokens after embedding and random masking
- significantly reduces time complexity and memory usage ( compared to full encoding )
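A generic pre-norm Transformer block in PyTorch (a sketch, not the released implementation; the sizes follow the reported 64-dim, 4-head setting). In Ti-MAE the encoder stacks such blocks and runs them only on the visible tokens returned by the random masking above.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T_visible, d_model)
        h = self.norm1(x)                                 # normalize BEFORE attention (pre-norm)
        x = x + self.attn(h, h, h)[0]                     # residual self-attention
        return x + self.mlp(self.norm2(x))                # pre-norm residual MLP
```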
d) Ti-MAE Decoder
vanilla Transformer blocks
- applied on the union of the (1) encoded visible tokens & (2) learnable randomly initialized mask tokens
- smaller than the encoder
- add positional embeddings to all tokens after padding
- last layer : linear projection
- reconstructs the input by predicting all the values at the point-level
Loss function : MSE ( computed only on masked regions; see the decoder sketch below )
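Putting the decoder pieces together as a sketch (the helper signature and shapes are my assumptions): pad the encoded visible tokens with a shared learnable mask token, un-shuffle back to the original order, add positional embeddings to all tokens, run the decoder blocks, project to point-level values, and average the squared error over masked positions only.

```python
import torch

def decode_and_loss(visible, mask, ids_restore, mask_token, pos_embed,
                    decoder_blocks, head, target):
    """visible: (B, L_keep, D); mask, ids_restore: (B, T);
    mask_token: (1, 1, D) learnable parameter; target: (B, T, m)."""
    B, T = mask.shape
    D = visible.size(-1)
    mask_tokens = mask_token.expand(B, T - visible.size(1), D)           # shared learnable token
    x = torch.cat([visible, mask_tokens], dim=1)                         # pad to full length
    x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))  # un-shuffle
    x = x + pos_embed[:T]                                                # positions for ALL tokens
    for blk in decoder_blocks:
        x = blk(x)
    pred = head(x)                                 # linear projection -> point-level values
    loss = ((pred - target) ** 2).mean(dim=-1)     # per-timestep MSE
    return (loss * mask).sum() / mask.sum()        # averaged over masked positions only
```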
encoder and decoder of Ti-MAE :
- both agnostic to the sequential data, using as little domain knowledge as possible
- no date-specific embedding, hierarchy or disentanglement
- point-level modeling rather than patch embedding
- for the consistency between (1) masked modeling & (2) downstream forecasting tasks
4. Experiments
(1) Experimental Setup
a) Dataset & Tasks
Task :
- (1) TS forecasting
- (2) TS classification
Dataset :
- (1) ETT (Electricity Transformer Temperature) (Zhou et al., 2021)
- data collected from electricity transformers
- recording six power load features and oil temperature
- (2) Weather1
- contains 21 meteorological indicators ( e.g., humidity, pressure ) recorded in 2020, from nearly 1,600 locations in the U.S.
- (3) Exchange (Lai et al., 2018)
- collection of exchange rates among eight different countries from 1990 to 2016
- (4) ILI2
- records weekly influenza-like illness (ILI) patient data
- (5) The UCR archive (Dau et al., 2019)
- 128 different datasets covering multiple domains
Data Split
- (ETT) 6 : 2 : 2
- (others) 7 : 1 : 2
Classification ( UCR archive ) : already divided into training and test sets
- where the test set is much larger than the training set
b) Baselines
2 types of baselines
- (1) Transformer-based end-to-end
- (2) Representation learning methods ( which have public official codes )
Time series forecasting
- 4 SOTA Representation learning models :
- CoST (Woo et al., 2022a)
- TS2Vec (Yue et al., 2022)
- TNC (Tonekaboni et al., 2021)
- MoCo (Chen et al., 2021)
- 4 SOTA Transformer-based end-to-end models :
- FEDformer (Zhou et al., 2022)
- ETSformer (Woo et al., 2022b)
- Autoformer (Wu et al., 2021)
- Informer (Zhou et al., 2021)
Time series classification
- include more competitive unsupervised representation learning methods
- TS2Vec (Yue et al., 2022)
- T-Loss (Franceschi et al., 2019)
- TS-TCC (Eldele et al., 2021)
- TST (Zerveas et al., 2021)
- TNC (Tonekaboni et al., 2021)
- DTW (Chen et al., 2013)
c) Implementation Details
Encoder and Decoder :
- 2 layers of vanilla Transformer blocks
- with 4-head self-attention
- hidden dimension = 64
Others
- Optimizer : Adam ( lr = 1e-3 )
- Batch Size : 64
- Sampling times : 30 in each iteration
Evaluation metric
- forecasting : MSE & MAE
- classification : average ACC ( + critical difference (CD) )
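For reference, the reported setup and the two forecasting metrics gathered into one short sketch (the dict keys and helper names are mine, the values are the ones listed above):

```python
import torch

# reported hyper-parameters gathered into one dict (keys are hypothetical)
config = dict(
    encoder_layers=2, decoder_layers=2,    # vanilla Transformer blocks
    n_heads=4, d_model=64,                 # 4-head self-attention, hidden dim 64
    optimizer="Adam", lr=1e-3,
    batch_size=64, samples_per_iter=30,    # sampling times in each iteration
)

def mse(pred: torch.Tensor, true: torch.Tensor) -> torch.Tensor:
    return ((pred - true) ** 2).mean()

def mae(pred: torch.Tensor, true: torch.Tensor) -> torch.Tensor:
    return (pred - true).abs().mean()
```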
(2) TS Forecasting
under different future horizons ( for both short & long term )
a) Table 1 ( vs. representation learning models )
- does not require any extra regressor after pre-training
  ( \(\because\) the decoder can directly generate future TS )
b) Table 2 ( vs. Transformer-based models )
- pre-trained only one Ti-MAE model
  ( \(\leftrightarrow\) end-to-end supervised models : should be trained separately for different settings )
- Ti-MAE† ( fine-tuned version ) : just utilizes its (frozen) encoder with an additional linear projection layer ( see the sketch below )
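A sketch of how the frozen-encoder variant could be wired up (this is my reading of the linear-probing setup, not the official code): the pre-trained encoder is kept frozen and only a linear projection from the encoded history window to the k future steps is trained.

```python
import torch
import torch.nn as nn

class LinearForecastHead(nn.Module):
    """Hypothetical wrapper: frozen pre-trained encoder + one linear layer."""
    def __init__(self, encoder: nn.Module, h: int, k: int, d_model: int, n_out: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False                  # keep the encoder frozen
        self.proj = nn.Linear(h * d_model, k * n_out)
        self.k, self.n_out = k, n_out

    def forward(self, x_h: torch.Tensor) -> torch.Tensor:   # x_h: (B, h, m) history window
        z = self.encoder(x_h)                        # (B, h, d_model); no masking at this stage
        out = self.proj(z.flatten(1))                # (B, k * n_out)
        return out.view(-1, self.k, self.n_out)      # (B, k, n_out) future values
```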
c) Ablation Study
(3) TS Classification
Instance-level representation on classification tasks.
Dataset : 128 datasets from the UCR archive
a) Accuracy
b) Critical Difference diagram (Demsar, 2006)
classifiers connected by a bold line do not have a significant difference.
5. Conclusion
novel self-supervised framework : Ti-MAE
- randomly masks out tokenized TS
- learns an AE to reconstruct them at the point-level
- bridges the connection between
  - (1) contrastive representation learning
  - (2) generative Transformer-based methods
- improves the performance on forecasting tasks
  - due to reducing the inconsistency of upstream and downstream tasks
- Random masking strategy :
  - leverages all of the input sequence
  - alleviates the distribution shift problem
  \(\rightarrow\) makes Ti-MAE more adaptive to various prediction scenarios with different time steps