SimTS: Rethinking Contrastive Representation Learning for Time Series Forecasting


Contents

  1. Abstract
  2. Introduction
    1. Problems of CL in TS
    2. SimTS
    3. Contribution
  3. Methods
    1. Notation
    2. Four parts of SimTS
    3. Process
    4. Multi-scale Encoder
    5. Stop-gradient
    6. Final Loss


0. Abstract

Contrastive learning in TS

Problem (1)

  • GOOD for TS classification
  • BAD for TS forecasting … Reason?
    • Optimization of instance discrimination is not directly applicable to predicting the future state from the history context.


Problem (2)

  • Construction of positive and negative pairs strongly relies on specific time series characteristics, restricting their generalization across diverse types of time series data


Proposal : SimTS ( = simple representation learning approach for improving time series forecasting )

  • by learning to predict the future from the past in the latent space
  • does not rely on negative pairs or specific assumptions about the characteristics of TS


1. Introduction

(1) Problems of CL in TS

Problem 1) Bad for TSF

Mostly rely on instance discrimination

  • can discriminate well between different instances of TS ( good for TSC )
  • but features learned by instance discrimination may not be sufficient for TSF


Problem 2) Defining POS & NEG

identifying positive and negative pairs for time series forecasting is challenging

  • previous works : several assumptions

    • (1) the similarity between segments of the same time series decreases as the time lag increases
    • (2) segments of distinctive time series are dissimilar

    However, certain time series do not adhere to these assumptions

    • ex) TS with strong seasonality : segments far apart in time can still be very similar, so assumption (1) breaks down


[ Figure 2 ]


(2) SimTS

aims to answer the following key question:

Q1) “What is important for TSF with CL, and how can we adapt contrastive ideas more effectively to TSF tasks?”

  • Beyond CL, propose Simple Representation Learning Framework for Time Series Forecasting (SimTS)

    • inspired by predictive coding
    • learn a representation such that the latent representation of the future time windows can be predicted from the latent representation of the history time windows
    • build upon a siamese network structure
  • Details : propose key refinements

    • (1) divide a given TS into history and future segments

    • (2) ENCODER : map to latent space

    • (3) PREDICTIVE layer : predict the latent representation of the future segment from the history segment.

      • ( predicted representation & encoded representation ) = positive pairs

        representations learned in this way encode features that are useful for forecasting tasks.


Q2) Questions existing assumptions and techniques used for constructing POS & NEG pairs.

  • detailed discussion and several experiments showing their shortcomings

    • ex) question the idea of augmentation
  • SimTS does not use negative pairs to avoid false repulsion

  • hypothesize that the most important mechanism behind representation learning for TSF

    = maximizing the shared information between representations of history and future time windows.


(3) Contribution

(1) propose a novel method (SimTS) for TSF

  • employs a siamese structure and a simple convolutional encoder
  • learn representations in latent space without requiring negative pairs


(2) Experiments on multiple types of benchmark datasets.

  • SOTA : outperforms state-of-the-art methods for multivariate time series forecasting

    ( BUT still worse than Supervised TSF models & MTM )


(3) extensive ablation experiments


2. Methods

(1) Notation

  • input TS : $X = [x_1, x_2, \dots, x_T] \in \mathbb{R}^{C \times T}$
    • $C$ : the number of features (i.e., variables)
    • $T$ : the sequence length
  • Segmented sub-TS
    • history segment : $X^h = [x_1, x_2, \dots, x_K]$, where $0 < K < T$
    • future segment : $X^f = [x_{K+1}, x_{K+2}, \dots, x_T]$
  • Encoder : $F_\theta$
    • maps historical and future segments to their corresponding latent representations
    • learn an informative latent representation $Z^h = F_\theta(X^h) = [z^h_1, z^h_2, \dots, z^h_K] \in \mathbb{R}^{C' \times K}$, where $C'$ is the latent dimension
      • will be used to predict the latent representation of the future through a prediction network
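
To make the split concrete, here is a minimal sketch, assuming the series is stored as a `(C, T)` tensor and the split point `K` is chosen arbitrarily:

```python
import torch

C, T, K = 7, 96, 48          # hypothetical feature count, length, and split point
X = torch.randn(C, T)        # toy multivariate time series X in R^{C x T}

X_h = X[:, :K]               # history segment X^h = [x_1, ..., x_K]
X_f = X[:, K:]               # future segment  X^f = [x_{K+1}, ..., x_T]
print(X_h.shape, X_f.shape)  # torch.Size([7, 48]) torch.Size([7, 48])
```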


(2) Four parts of SimTS

Objective : learns time series representations by maximizing the similarity between ..

  • (1) predicted latent features
  • (2) encoded latent features

for each timestamp.


( Consists of FOUR main parts )

(1) Siamese network

  • consists of two identical networks that share parameters.

  • TS is divided into the (a) history segment $X^h$ & (b) future segment $X^f$,

    which are given as inputs to the siamese network.

  • learns to map them to their latent representations $Z^h, Z^f$.

(2) Multi-scale encoder

  • consists of a projection layer, which maps the raw features into a high-dimensional space, and multiple CNN blocks with different kernel sizes.

(3) Predictor network $G_\phi$

  • takes the last column of the encoded history view as input and predicts the future in latent space.

(4) Cosine similarity loss

  • considers only positive samples


[ Figure 2 ]
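
Wiring these four parts together, a rough PyTorch sketch (an assumption about the structure, not the official implementation; the `encoder` and `predictor` placeholders are detailed in the subsections below):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimTSSketch(nn.Module):
    """Siamese wrapper: one shared encoder F_theta, one predictor G_phi,
    and a cosine loss on positive pairs only (no negatives)."""

    def __init__(self, encoder: nn.Module, predictor: nn.Module):
        super().__init__()
        self.encoder = encoder      # F_theta, shared by both branches (siamese)
        self.predictor = predictor  # G_phi

    def forward(self, x_h: torch.Tensor, x_f: torch.Tensor) -> torch.Tensor:
        z_h = self.encoder(x_h)                 # (B, C', K)    history latents
        z_f = self.encoder(x_f)                 # (B, C', T-K)  future latents, same weights
        z_f_hat = self.predictor(z_h[..., -1])  # predict the future from the last history column
        # negative cosine similarity per timestamp; stop-gradient on the future branch
        return -F.cosine_similarity(z_f_hat, z_f.detach(), dim=1).mean()
```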


(3) Process

Encoding

  • History encoding : $Z^h = F_\theta(X^h)$

  • Future encoding : $Z^f = F_\theta(X^f) = [z^f_{K+1}, z^f_{K+2}, \dots, z^f_T] \in \mathbb{R}^{C' \times (T-K)}$


Prediction

  • use predictor network $G_\phi$ ( = MLP ) on the last column of $Z^h$ ( = $z^h_K$ )
  • to predict the future latent representations : $\hat{Z}^f = G_\phi(z^h_K) = [\hat{z}^f_{K+1}, \hat{z}^f_{K+2}, \dots, \hat{z}^f_T] \in \mathbb{R}^{C' \times (T-K)}$
    • last column : allows the encoder to condense the history information into a summary by properly choosing the kernel size
  • positive pair = ( $Z^f$, $\hat{Z}^f$ )
  • calculate the negative cosine similarity between them

    $$\mathrm{Sim}(\hat{Z}^f, Z^f) = -\frac{1}{T-K} \sum_{i=K+1}^{T} \frac{\hat{z}^f_i}{\lVert \hat{z}^f_i \rVert_2} \cdot \frac{z^f_i}{\lVert z^f_i \rVert_2}$$
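
The same quantity in code, as a minimal sketch assuming both arguments are laid out as `(C', T-K)` matrices with one latent column per future timestamp:

```python
import torch
import torch.nn.functional as F

def neg_cosine_sim(z_f_hat: torch.Tensor, z_f: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity, averaged over the T-K future timestamps.

    Both inputs are assumed to have shape (C', T-K).
    """
    # cosine_similarity normalizes each column and takes the column-wise dot product
    per_step = F.cosine_similarity(z_f_hat, z_f, dim=0)  # shape (T-K,)
    return -per_step.mean()
```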


(4) Multi-scale Encoder

[ Figure 2 ]

Structure of $F_\theta$ plays a vital role!

  • should extract temporal dependencies ranging from local to global patterns
    • for SHORT-term forecasting : local patterns
    • for LONG-term forecasting : global patterns
  • thus, propose to use a CNN with multiple kernel sizes ( $m$ in total )


Details of $F_\theta$ :

  • Step 1) each TS is passed through a CNN projection layer

  • Step 2) for a time series $X$ with length $K$, we have $m = \lfloor \log_2 K \rfloor + 1$ parallel CNN layers on top of the projection layer
    • the $i$-th convolution has kernel size $2^i$, where $i \in \{0, 1, \dots, m\}$
    • each convolution $i$ takes the latent features from the projection layer and generates a representation $\hat{Z}^{(i)}$
  • Step 3) Averaging
    • the final multi-scale representation $Z$ is obtained by averaging across $\hat{Z}^{(0)}, \hat{Z}^{(1)}, \dots, \hat{Z}^{(m)}$
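
A sketch of such an encoder. Assumptions not fixed by the notes above: `nn.Conv1d` layers with left (causal) padding, and a 1x1 convolution as the projection layer.

```python
import math
import torch
import torch.nn as nn

class MultiScaleEncoderSketch(nn.Module):
    """Sketch of F_theta: projection layer + parallel convolutions with kernel
    sizes 2^0, 2^1, ..., 2^m, whose outputs are averaged (m = floor(log2 K) + 1)."""

    def __init__(self, in_channels: int, latent_channels: int, seq_len: int):
        super().__init__()
        m = int(math.floor(math.log2(seq_len))) + 1
        self.project = nn.Conv1d(in_channels, latent_channels, kernel_size=1)
        self.branches = nn.ModuleList([
            # pad on both sides, then trim the right side -> causal convolution
            nn.Conv1d(latent_channels, latent_channels,
                      kernel_size=2 ** i, padding=2 ** i - 1)
            for i in range(m + 1)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, C, L)
        h = self.project(x)                                 # (B, C', L)
        outs = [conv(h)[..., : x.size(-1)] for conv in self.branches]
        return torch.stack(outs, dim=0).mean(dim=0)         # average across scales
```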


(5) Stop-gradient

  • apply a stop-gradient operation to the future encoding path

  • the encoder should constrain the latent representation of the past to be predictive of the latent representation of the future,

    so only $\hat{Z}^f$ can move towards $Z^f$ in the latent space ( and not vice versa )
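
In an autograd framework this is a single `detach()` (stop-gradient) on the future encoding; a toy illustration with made-up shapes:

```python
import torch
import torch.nn.functional as F

z_f_hat = torch.randn(64, 24, requires_grad=True)  # predicted future latents, toy shape (C', T-K)
z_f = torch.randn(64, 24, requires_grad=True)      # encoded future latents

# sg(Z^f): detach blocks gradients through the future branch, so only the
# history/prediction path is updated by the loss.
loss = -F.cosine_similarity(z_f_hat, z_f.detach(), dim=0).mean()
loss.backward()
print(z_f_hat.grad is not None, z_f.grad is None)  # True True
```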


(6) Final Loss

( for one sample $X = [X^h, X^f]$ )

$$L_{\theta,\phi}(X^h, X^f) = \mathrm{Sim}\big(G_\phi(F_\theta(X^h)),\, F_{\mathrm{sg}(\theta)}(X^f)\big) = \mathrm{Sim}\big(\hat{Z}^f, \mathrm{sg}(Z^f)\big)$$


( Loss for a mini-batch $D = \{X^h_i, X^f_i\}_{i \in [1:N]}$ )

$$L_{\theta,\phi}(D) = \frac{1}{N} \sum_{i=1}^{N} L_{\theta,\phi}(X^h_i, X^f_i)$$
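
Putting it together for a mini-batch, a sketch of one training step, reusing the hypothetical `MultiScaleEncoderSketch` and `SimTSSketch` classes from the sketches above (shapes and hyperparameters are arbitrary):

```python
import torch
import torch.nn as nn

N, C, T, K = 32, 7, 96, 48                      # batch size, channels, length, split point
X = torch.randn(N, C, T)
X_h, X_f = X[:, :, :K], X[:, :, K:]             # history / future segments

encoder = MultiScaleEncoderSketch(in_channels=C, latent_channels=64, seq_len=K)
predictor = nn.Sequential(                      # simple stand-in for the MLP G_phi
    nn.Linear(64, 64 * (T - K)), nn.Unflatten(1, (64, T - K)))
model = SimTSSketch(encoder, predictor)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss = model(X_h, X_f)                          # already the mean over the N samples
optimizer.zero_grad()
loss.backward()
optimizer.step()
```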
