Ti-MAE : Self-Supervised Masked Time Series Auto Encoders
( https://openreview.net/pdf?id=9AuIMiZhkL2 )
Contents
- Abstract
- Introduction
- Related Works
- Methodology
- Problem Definition
- Model Architecture
- Experiments
- Experimental Setup
- TS Forecasting
- TS Classification
Abstract
Contrastive learning & Transformer-based models : good performance on long-term TS forecasting
Problems :
- (1) Contrastive learning :
  - training paradigm of contrastive learning and downstream prediction tasks are inconsistent
- (2) Transformer-based models :
  - resort to similar patterns in historical time series data for predicting future values
    \(\rightarrow\) induce severe distribution shift problems
  - do not fully leverage the sequence information ( compared to SSL )
Ti-MAE
- input TS : assumed to follow an integrated distribution
- randomly masks out TS & reconstructs them
- adopts "mask modeling" ( rather than contrastive learning ) as the auxiliary task
- bridges the connection between existing representation learning & generative Transformer-based methods
  \(\rightarrow\) reducing the difference between upstream and downstream forecasting tasks
1. Introduction
generative Transformer-based models
= a special kind of denoising autoencoders
( where we only mask the future values and reconstruct them )
2 problems in continuous masking strategy
- (1) captures only the information of the visible sequence and some mapping relationship between the historical and the future segments
- (2) induces severe distribution shift problems
  - especially when the prediction horizon is longer than the input sequence
    ( In reality : most TS are non-stationary )
Disentangled TS
- \(\boldsymbol{y}(t)=\operatorname{Trend}(t)+\operatorname{Seasonality}(t)+\text{Noise}\)
- Trend : \(\sum_n t^n\)
  - the moments of the trend part change continuously over time
- Seasonality : \(\sum_n \cos^n t\)
  - stationary, when we set a proper observation horizon
The size of the sliding window used to obtain the trend is vital !
Problems in decomposition ( see the sketch below )
- Natural time series data generally have more complex periodic patterns
  \(\rightarrow\) have to employ longer sliding windows or other hierarchical disposals
- Ends of a sequence need to be padded for alignment
  \(\rightarrow\) causes inevitable data distortion at the head and tail
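To make this concrete, here is a minimal moving-average decomposition sketch (my own illustration, not the paper's code; the window size is arbitrary). It extracts the trend with a sliding window and uses replicate-padding at both ends, which is exactly where the head/tail distortion appears:

```python
import torch

def moving_average_decompose(y: torch.Tensor, window: int):
    """y: (T,) series -> (trend, remainder). Ends are replicate-padded
    so the output keeps length T, which distorts the head and tail."""
    pad = (window - 1) // 2
    y_pad = torch.cat([y[:1].repeat(pad), y, y[-1:].repeat(window - 1 - pad)])
    trend = y_pad.unfold(0, window, 1).mean(dim=-1)   # sliding-window mean
    remainder = y - trend                             # seasonality + noise left over
    return trend, remainder

t = torch.arange(200, dtype=torch.float32)
y = 0.01 * t + torch.sin(0.3 * t) + 0.1 * torch.randn(200)   # trend + seasonality + noise
trend, rest = moving_average_decompose(y, window=25)
```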
Ti-MAE
proposes a novel Transformer-based framework : Ti-MAE
- randomly masks out parts of embedded TS
- learns an AE to reconstruct them ( at the point-level )
Random masking ( vs. fixed continuous masking )
- (1) takes the overall distribution of inputs
  \(\rightarrow\) alleviates the distribution shift problem
  ( a toy comparison of the two masking schemes is sketched after this list )
- (2) Encoder-Decoder structure : provides a universal scheme for both forecasting and classification
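A toy comparison of the two masking schemes (the sequence length and ratio are assumptions for illustration): fixed continuous masking always hides the latest steps, whereas uniform random masking spreads the visible steps over the whole input.

```python
import torch

T, mask_ratio = 96, 0.75
num_masked = int(T * mask_ratio)

# (a) fixed continuous masking : only the future segment is hidden,
#     so the model always sees the same (earliest) part of the window
continuous_masked = torch.arange(T - num_masked, T)

# (b) random masking : masked steps are drawn uniformly over the whole
#     sequence, so visible tokens cover the overall input distribution
random_masked = torch.randperm(T)[:num_masked]
```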
Contributions
- Novel perspective to bridge the connection between existing (1) contrastive learning and (2) generative Transformer-based models on TS
  & points out their inconsistency and deficiencies on downstream tasks
- Propose Ti-MAE
  - a masked time series autoencoder
  - learns strong representations with less inductive bias or hierarchical tricks
  - pros 1) adequately leverages the input TS & successfully alleviates the distribution shift problem
  - pros 2) due to the flexible setting of the masking ratio
    \(\rightarrow\) can adapt to complex scenarios which require the trained model to forecast simultaneously over multiple time windows of various sizes, without re-training
- Achieves excellent performance on both (1) forecasting and (2) classification tasks
2. Related Works
(1) Transformer-based TS model
Transformer : can capture long-range dependencies
ex 1) Song et al. (2018); Ma et al. (2019); Li et al. (2019)
- directly apply the vanilla Transformer to TS
- failed in long-sequence TS forecasting tasks
  \(\because\) the self-attention operation scales quadratically with the input TS length
ex 2) Child et al. (2019); Zhou et al. (2021); Liu et al. (2022)
- noticed the long tail distribution in self-attention feature map
- utilized sparse attention mechanism to reduce time complexity and memory usage
- but applying too long an input TS in the training stage will degrade the forecasting accuracy of the model
ex 3) latest works : ETSformer (Woo et al., 2022b) & FEDformer (Zhou et al., 2022)
- rely heavily on disentanglement and extra introduced domain knowledge
(2) TS Representation Learning
SSL : good performance in TS domain ( especially contrastive learning )
ex 1) Lei et al. (2019); Franceschi et al. (2019)
- used loss function of metric learning to preserve pairwise similarities in the time domain
ex 2) CPC (van den Oord et al., 2018)
- first proposed contrastive predictive coding and InfoNCE
- which treats :
- data from the same sequence : POS pairs
- different noise data from the mini-batch : NEG pairs
ex 3) DA on TS data
- capture transformation-invariant features at semantic level (Eldele et al., 2021; Yue et al., 2022)
ex 4) **CoST (Woo et al., 2022a)**
- introduced extra inductive biases in frequency domain through DFT
- separately processed disentangled “trend and seasonality” parts of the original TS
- to encourage discriminative seasonal and trend representations
\(\rightarrow\) Almost all of these methods rely heavily on data augmentation or other domain knowledge
3. Methodology
(1) Problem Definition
MTS : \(\mathcal{X}=\left(\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_T\right) \in \mathbb{R}^{T \times m}\)
Forecasting Task : ( see the windowing sketch below )
- input : \(\mathcal{X}_h \in \mathbb{R}^{h \times m}\) with length \(h\)
- target : \(\mathcal{X}_f \in \mathbb{R}^{k \times n}\) : values of the next \(k\) steps, where \(n \leq m\)
Classification Task : (pass)
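A small windowing sketch of the forecasting setup above (variable names and sizes are hypothetical): slice a history window \(\mathcal{X}_h\) of length \(h\) and take the following \(k\) steps as the target.

```python
import torch

T, m, h, k = 1000, 7, 96, 24      # series length, channels, history, horizon
X = torch.randn(T, m)             # full multivariate series, X in R^{T x m}

def make_window(X: torch.Tensor, start: int, h: int, k: int):
    X_h = X[start : start + h]            # input  : (h, m)
    X_f = X[start + h : start + h + k]    # target : (k, m); in general n <= m channels are predicted
    return X_h, X_f

X_h, X_f = make_window(X, start=0, h=h, k=k)
```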
(2) Model Architecture
Encoder : maps \(\mathcal{X} \in \mathbb{R}^{T \times m}\) to \(\mathcal{H} \in \mathbb{R}^{T \times n}\)
Decoder : reconstructs the original sequence from the embedding
+ adopt an asymmetric design
- [Encoder] only operates on visible tokens after applying masking to the input embedding
- [Decoder]
  - processes encoded tokens padded with mask tokens
  - reconstructs the original TS at the point-level
a) Input Embedding
Model : 1d-conv
- ( does not adopt any multi-scale or complex convolution scheme ( ex. dilated convolution ) )
- extracts local temporal features at each timestamp across channels
Positional Embeddings :
- fixed sinusoidal positional embeddings
+ do not add any handcrafted task-specific / date-specific embeddings
  ( so as to introduce as little inductive bias as possible )
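A minimal sketch of this embedding stage, assuming a single Conv1d with kernel size 3 and the standard fixed sinusoidal table (the exact kernel size and module names are my assumptions, not the released code):

```python
import math
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    def __init__(self, in_channels: int, d_model: int = 64, max_len: int = 5000):
        super().__init__()
        # 1d-conv mixes the m channels at each timestamp into one d_model token
        self.conv = nn.Conv1d(in_channels, d_model, kernel_size=3, padding=1)
        pe = torch.zeros(max_len, d_model)                       # fixed sinusoidal table
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)                           # not learned

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (B, T, m)
        tokens = self.conv(x.transpose(1, 2)).transpose(1, 2)    # (B, T, d_model)
        return tokens + self.pe[: tokens.size(1)]                # add positional embedding
```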
b) Masking
After tokenizing the input :
(1) randomly sample a subset of tokens
- without replacement + from uniform distribution
(2) mask the remaining parts
Masking Ratio
(He et al., 2021; Feichtenhofer et al., 2022)
- related to the information density and redundancy of the data
- immense impact on the performance of the AE
- applications
  - natural language : higher information density ( due to its highly discrete word distn )
    - BERT : 15%
  - images : heavy spatial redundancy ( a single pixel in one image has lower semantic information )
    - MAE for images : 75%
    - MAE for videos : 90%
- data with lower information density : should use a higher masking ratio
- TS data : ( similar to images ) also have local continuity
  \(\rightarrow\) determine a high masking ratio ( = 75% ) ( a masking sketch follows below )
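A sketch of MAE-style per-sample random masking following the recipe of He et al. (2021) (the function signature and variable names are mine): draw uniform noise per token, keep the 25% of tokens with the smallest noise as visible, and remember the inverse permutation so the decoder can restore the original order.

```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """tokens: (B, T, D) -> visible tokens, binary mask, restore indices."""
    B, T, D = tokens.shape
    len_keep = int(T * (1 - mask_ratio))
    noise = torch.rand(B, T)                            # uniform noise per token
    ids_shuffle = noise.argsort(dim=1)                  # random permutation (no replacement)
    ids_restore = ids_shuffle.argsort(dim=1)            # inverse permutation
    ids_keep = ids_shuffle[:, :len_keep]                # indices of visible tokens
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, T)                             # 1 = masked, 0 = visible
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)           # back to original order
    return visible, mask, ids_restore
```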
c) Ti-MAE Encoder
vanilla Transformer blocks
- utilizes pre-norm ( instead of post-norm )
- applied only on visible tokens after embedding and random masking
- significantly reduces time complexity and memory usage ( compared to full encoding )
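A generic pre-norm Transformer block in PyTorch (a sketch, not the released implementation; the sizes follow the reported 64-dim, 4-head setting). In Ti-MAE the encoder stacks such blocks and runs them only on the visible tokens returned by the random masking above.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T_visible, d_model)
        h = self.norm1(x)                                 # normalize BEFORE attention (pre-norm)
        x = x + self.attn(h, h, h)[0]                     # residual self-attention
        return x + self.mlp(self.norm2(x))                # pre-norm residual MLP
```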
d) Ti-MAE Decoder
vanilla Transformer blocks
- applied on the union of the (1) encoded visible tokens & (2) learnable randomly initialized mask tokens
- smaller than the encoder
- add positional embeddings to all tokens after padding
- last layer : linear projection
- reconstructs the input by predicting all the values at the point-level
Loss function : MSE ( computed only on masked regions; see the decoder sketch below )
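Putting the decoder pieces together as a sketch (the helper signature and shapes are my assumptions): pad the encoded visible tokens with a shared learnable mask token, un-shuffle back to the original order, add positional embeddings to all tokens, run the decoder blocks, project to point-level values, and average the squared error over masked positions only.

```python
import torch

def decode_and_loss(visible, mask, ids_restore, mask_token, pos_embed,
                    decoder_blocks, head, target):
    """visible: (B, L_keep, D); mask, ids_restore: (B, T);
    mask_token: (1, 1, D) learnable parameter; target: (B, T, m)."""
    B, T = mask.shape
    D = visible.size(-1)
    mask_tokens = mask_token.expand(B, T - visible.size(1), D)           # shared learnable token
    x = torch.cat([visible, mask_tokens], dim=1)                         # pad to full length
    x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))  # un-shuffle
    x = x + pos_embed[:T]                                                # positions for ALL tokens
    for blk in decoder_blocks:
        x = blk(x)
    pred = head(x)                                 # linear projection -> point-level values
    loss = ((pred - target) ** 2).mean(dim=-1)     # per-timestep MSE
    return (loss * mask).sum() / mask.sum()        # averaged over masked positions only
```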
encoder and decoder of Ti-MAE :
- both agnostic to the sequential data, using as little domain knowledge as possible
- no date-specific embedding, hierarchy or disentanglement
- point-level modeling rather than patch embedding
- for the consistency between (1) masked modeling & (2) downstream forecasting tasks
4. Experiments
(1) Experimental Setup
a) Dataset & Tasks
Task :
- (1) TS forecasting
- (2) TS classification
Dataset :
- (1) ETT (Electricity Transformer Temperature) (Zhou et al., 2021)
- data collected from electricity transformers
- recording six power load features and oil temperature
- (2) Weather1
- contains 21 meteorological indicators ( e.g., humidity, pressure ) recorded in 2020, from nearly 1,600 locations in the U.S.
- (3) Exchange (Lai et al., 2018)
- collection of exchange rates among eight different countries from 1990 to 2016
- (4) ILI2
- records weekly influenza-like illness (ILI) patient data
- (5) The UCR archive (Dau et al., 2019)
- 128 different datasets covering multiple domains
Data Split
- (ETT) 6 : 2 : 2
- (others) 7 : 1 : 2
Classification ( UCR archive ) : already divided into training and test sets
- where the test set is much larger than the training set
b) Baselines
2 types of baselines
- (1) Transformer-based end-to-end
- (2) Representation learning methods ( which have public official codes )
Time series forecasting
- 4 SOTA Representation learning models :
- CoST (Woo et al., 2022a)
- TS2Vec (Yue et al., 2022)
- TNC (Tonekaboni et al., 2021)
- MoCo (Chen et al., 2021)
- 4 SOTA Transformer-based end-to-end models :
- FEDformer (Zhou et al., 2022)
- ETSformer (Woo et al., 2022b)
- Autoformer (Wu et al., 2021)
- Informer (Zhou et al., 2021)
Time series classification
- include more competitive unsupervised representation learning methods
- TS2Vec (Yue et al., 2022)
- T-Loss (Franceschi et al., 2019)
- TS-TCC (Eldele et al., 2021)
- TST (Zerveas et al., 2021)
- TNC (Tonekaboni et al., 2021)
- DTW (Chen et al., 2013)
c) Implementation Details
Encoder and Decoder :
- 2 layers of vanilla Transformer blocks
- with 4-head self-attention
- hidden dimension = 64
Others
- Optimizer : Adam ( lr = 1e-3 )
- Batch Size : 64
- Sampling times : 30 in each iteration
Evaluation metric
- forecasting : MSE & MAE
- classification : average ACC ( + critical difference (CD) )
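For reference, the reported setup and the two forecasting metrics gathered into one short sketch (the dict keys and helper names are mine, the values are the ones listed above):

```python
import torch

# reported hyper-parameters gathered into one dict (keys are hypothetical)
config = dict(
    encoder_layers=2, decoder_layers=2,    # vanilla Transformer blocks
    n_heads=4, d_model=64,                 # 4-head self-attention, hidden dim 64
    optimizer="Adam", lr=1e-3,
    batch_size=64, samples_per_iter=30,    # sampling times in each iteration
)

def mse(pred: torch.Tensor, true: torch.Tensor) -> torch.Tensor:
    return ((pred - true) ** 2).mean()

def mae(pred: torch.Tensor, true: torch.Tensor) -> torch.Tensor:
    return (pred - true).abs().mean()
```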
(2) TS Forecasting
under different future horizons ( for both short & long term )
a) Table 1 ( vs. representation learning models )
- does not require any extra regressor after pre-training
  ( \(\because\) the decoder can directly generate future TS )
b) Table 2 ( vs. Transformer-based models )
- pre-trained only one Ti-MAE model
  ( \(\leftrightarrow\) end-to-end supervised models : should be trained separately for different settings )
- Ti-MAE† ( fine-tuned version ) : just utilizes its (frozen) encoder with an additional linear projection layer ( see the sketch below )
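A sketch of how the frozen-encoder variant could be wired up (this is my reading of the linear-probing setup, not the official code): the pre-trained encoder is kept frozen and only a linear projection from the encoded history window to the k future steps is trained.

```python
import torch
import torch.nn as nn

class LinearForecastHead(nn.Module):
    """Hypothetical wrapper: frozen pre-trained encoder + one linear layer."""
    def __init__(self, encoder: nn.Module, h: int, k: int, d_model: int, n_out: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False                  # keep the encoder frozen
        self.proj = nn.Linear(h * d_model, k * n_out)
        self.k, self.n_out = k, n_out

    def forward(self, x_h: torch.Tensor) -> torch.Tensor:   # x_h: (B, h, m) history window
        z = self.encoder(x_h)                        # (B, h, d_model); no masking at this stage
        out = self.proj(z.flatten(1))                # (B, k * n_out)
        return out.view(-1, self.k, self.n_out)      # (B, k, n_out) future values
```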
c) Ablation Study
(3) TS Classification
Instance-level representation on classification tasks.
Dataset : 128 datasets from the UCR archive
a) Accuracy
b) Critical Difference diagram (Demsar, 2006)
classifiers connected by a bold line do not have a significant difference.
5. Conclusion
novel self-supervised framework : Ti-MAE
- randomly masks out tokenized TS
- learns an AE to reconstruct them at the point-level
- bridges the connection between
  - (1) contrastive representation learning
  - (2) generative Transformer-based methods
- improves the performance on forecasting tasks
  - due to reducing the inconsistency of upstream and downstream tasks
- Random masking strategy :
  - leverages all of the input sequence
  - alleviates the distribution shift problem
  \(\rightarrow\) makes Ti-MAE more adaptive to various prediction scenarios with different time steps