Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows (2021)

Contents

  1. Abstract
  2. Introduction
  3. Background
    1. Density Estimation via NF
    2. Self-Attention
  4. Temporal Conditioned NF
    1. RNN Conditioned Real NVP
    2. Transformer Conditioned Real NVP
  5. Training
    1. Training
    2. Covariates


0. Abstract

With multiple time series, the easy way out : assume Independence between them?

\(\rightarrow\) ignores the “interaction effect” between series


These interactions CAN be captured via DL methods…

BUT multivariate models often assume a simple parametric distn & do not scale to high dim


[ Proposal ]

model the multivariate temporal dynamics of t.s via an Autoregressive DL model,

where the data distn is represented by a conditioned NF ( Normalizing Flow )


1. Introduction

(1) Classical t.s

  • univariate forecast
  • require hand-tuned features


(2) DL t.s

  • RNN ( LSTM, GRU )
  • popular due to ..
    • 1) ‘end-to-end’ training
    • 2) incorporating exogenous covariates
    • 3) automatic feature extraction


Output can be

  • (a) deterministic ( = point estimation )
  • (b) probabilistic

\(\rightarrow\) w.o probabilistic modeling, the importance of the forecast in regions of “low noise” vs “high noise” cannot be distinguished


Proposal

end-to-end trainable AUTOREGRESSIVE DL architecture for PROBABILISTIC FORECASTING

that explicitly models MTS and their temporal dynamics by employing NORMALIZING FLOWS ( ex. MAF, Real NVP )


2. Background

(1) Density Estimation via NF

NF (Normalizing Flow)

  • mapping \(\mathbb{R}^{D}\) \(\rightarrow\) \(\mathbb{R}^{D}\)

  • \(f: \mathcal{X} \mapsto \mathcal{Z}\) : composed of a sequence of bijections or invertible functions

  • change of variables formula :

    \(p_{\mathcal{X}}(\mathbf{x})=p_{\mathcal{Z}}(\mathbf{z})\left|\operatorname{det}\left(\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\right)\right|\).

  • inverse (should be) easy to evaluate ( \(\mathbf{x}=f^{-1}(\mathbf{z})\) )

  • computing the Jacobian determinant (should) take \(O(D)\) time, not the naive \(O(D^{3})\)
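
A minimal sketch of this change-of-variables computation, using a toy elementwise bijection ( all names here are illustrative, not from the paper ) :

```python
import torch

# Toy bijection f(x) = x * exp(s) + t (elementwise), so the Jacobian is
# diagonal and log|det(df/dx)| = sum(s). Illustrative only.
D = 3
s = torch.randn(D)   # fixed log-scales
t = torch.randn(D)   # fixed shifts
base = torch.distributions.Normal(torch.zeros(D), torch.ones(D))

def f(x):            # x -> z ; invertible: x = (z - t) * exp(-s)
    return x * torch.exp(s) + t

def log_px(x):
    z = f(x)
    log_det = s.sum()                     # log|det| of a diagonal Jacobian
    return base.log_prob(z).sum(-1) + log_det

print(log_px(torch.randn(D)))             # log-density of x under the flow
```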


ex) Real NVP (2017)

  • use coupling layer

  • \(\left\{\begin{array}{l} \mathbf{y}^{1: d}=\mathbf{x}^{1: d} \\ \mathbf{y}^{d+1: D}=\mathbf{x}^{d+1: D} \odot \exp \left(s\left(\mathbf{x}^{1: d}\right)\right)+t\left(\mathbf{x}^{1: d}\right) \end{array}\right.\).

  • change of variables formula :

    \(\log p_{\mathcal{X}}(\mathbf{x})=\log p_{\mathcal{Z}}(\mathbf{z})+\log \left|\operatorname{det}(\partial \mathbf{z} / \partial \mathbf{x})\right|=\log p_{\mathcal{Z}}(\mathbf{z})+\sum_{i=1}^{K} \log \left|\operatorname{det}\left(\partial \mathbf{y}_{i} / \partial \mathbf{y}_{i-1}\right)\right|\).

    ( for \(K\) stacked coupling layers, with \(\mathbf{y}_{0}=\mathbf{x}\) and \(\mathbf{y}_{K}=\mathbf{z}\) )

  • Jacobian = “block-triangular matrix”

    \(\therefore\) \(\log \left|\operatorname{det}\left(\partial \mathbf{y}_{i} / \partial \mathbf{y}_{i-1}\right)\right|=\log \left|\exp \left(\operatorname{sum}\left(s_{i}\left(\mathbf{y}_{i-1}^{1: d}\right)\right)\right)\right|=\operatorname{sum}\left(s_{i}\left(\mathbf{y}_{i-1}^{1: d}\right)\right)\)

  • maximize average log likelihood, \(\mathcal{L}=\frac{1}{|\mathcal{D}|} \sum_{\mathbf{x} \in \mathcal{D}} \log p_{\mathcal{X}}(\mathbf{x} ; \theta)\).
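
A minimal PyTorch sketch of one coupling layer implementing the two equations above ( class / argument names are mine, not the paper's code ) :

```python
import torch
import torch.nn as nn

class Coupling(nn.Module):
    """y^{1:d} = x^{1:d} ; y^{d+1:D} = x^{d+1:D} * exp(s(x^{1:d})) + t(x^{1:d})."""
    def __init__(self, D, d, hidden=32):
        super().__init__()
        self.d = d
        self.s_net = nn.Sequential(nn.Linear(d, hidden), nn.Tanh(),
                                   nn.Linear(hidden, D - d))
        self.t_net = nn.Sequential(nn.Linear(d, hidden), nn.Tanh(),
                                   nn.Linear(hidden, D - d))

    def forward(self, x):
        x1, x2 = x[..., :self.d], x[..., self.d:]
        s = self.s_net(x1)
        y2 = x2 * torch.exp(s) + self.t_net(x1)
        log_det = s.sum(-1)               # log|det J| = sum(s(x^{1:d}))
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):                 # cheap inverse, as required above
        y1, y2 = y[..., :self.d], y[..., self.d:]
        x2 = (y2 - self.t_net(y1)) * torch.exp(-self.s_net(y1))
        return torch.cat([y1, x2], dim=-1)
```

Stacking \(K\) such layers ( permuting the dimensions in between, so every coordinate gets transformed ) and summing the per-layer `log_det` terms gives exactly the log-likelihood sum above.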


ex) MAF (Masked Autoregressive Flows) (2017)

  • generalization of Real NVP

  • transformation layer = “AUTOREGRESSIVE” NN

    \(\rightarrow\) makes the Jacobian triangular \(\rightarrow\) reduces computational cost ( tractable Jacobian determinant! )
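
A loop-based sketch of the MAF idea ( a real MAF uses MADE-style masked layers for a single parallel pass; names here are illustrative ) :

```python
import torch
import torch.nn as nn

class TinyMAF(nn.Module):
    """MAF-style transform with an explicit Python loop, to keep the
    autoregressive structure visible. Illustrative sketch only."""
    def __init__(self, D, hidden=32):
        super().__init__()
        # one conditioner per dimension: (mu_i, log_sigma_i) from x_{<i}
        self.nets = nn.ModuleList(
            nn.Sequential(nn.Linear(max(i, 1), hidden), nn.Tanh(),
                          nn.Linear(hidden, 2)) for i in range(D))

    def forward(self, x):                  # x -> z plus log|det dz/dx|
        zs, log_det = [], 0.0
        for i, net in enumerate(self.nets):
            ctx = x[..., :i] if i > 0 else torch.zeros_like(x[..., :1])
            mu, log_sigma = net(ctx).unbind(-1)
            zs.append((x[..., i] - mu) * torch.exp(-log_sigma))
            log_det = log_det - log_sigma  # triangular Jacobian: diag entries exp(-log_sigma)
        return torch.stack(zs, dim=-1), log_det

x = torch.randn(5, 4)
z, log_det = TinyMAF(D=4)(x)               # z: (5, 4), log_det: (5,)
```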


(2) Self-Attention

Transformer’s self-attention

\(\rightarrow\) enables the model to capture both long- & short-term dependencies

\(\mathbf{O}_{h}=\operatorname{Attention}\left(\mathbf{Q}_{h}, \mathbf{K}_{h}, \mathbf{V}_{h}\right)=\operatorname{softmax}\left(\frac{\mathbf{Q}_{h} \mathbf{K}_{h}^{\top}}{\sqrt{d_{K}}} \cdot \mathbf{M}\right) \mathbf{V}_{h}\).
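
The formula writes the causal mask as a multiplication by \(\mathbf{M}\); a common way to realize it in code is additive \(-\infty\) masking before the softmax, as in this sketch :

```python
import torch

def masked_attention(Q, K, V):
    """Scaled dot-product attention where step t attends only to steps <= t."""
    scores = Q @ K.transpose(-2, -1) / K.size(-1) ** 0.5   # (T, T)
    causal = torch.tril(torch.ones(scores.shape[-2:], dtype=torch.bool))
    scores = scores.masked_fill(~causal, float('-inf'))    # mask the future
    return torch.softmax(scores, dim=-1) @ V

T, d_k = 8, 16
Q, K, V = (torch.randn(T, d_k) for _ in range(3))
out = masked_attention(Q, K, V)    # (T, d_k); preserves the autoregressive order
```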


3. Temporal Conditioned NF

Notation

  • MTS : \(\mathbf{x}_{t} \in \mathbb{R}^{D}\), with entries \(x_{t}^{i} \in \mathbb{R}\)
    • \(i \in\{1, \ldots, D\}\) = series index
    • \(t\) = time index
  • (for training) split this time series into :
    • 1) context window \(\left[1, t_{0}\right)\)
    • 2) prediction window \(\left[t_{0}, T\right]\)


Simple model for MTS data : using “factorizing distn” in the emission model

  • shared param : learn patterns across individual time series

  • to capture dependencies…. “full joint distn”

    \(\rightarrow\) BUT, full covariance matrix :

    • number of params in the NN : \(O\left(D^{2}\right)\)
    • computing loss is TOO expensive


Wish to have a “SCALABLE” model & flexible distn model on the emission

\(\rightarrow\) model the conditional joint distn of all time series at time \(t\), \(p_{\mathcal{X}}\left(\mathbf{x}_{t} \mid \mathbf{h}_{t} ; \theta\right)\), with an NF conditioned on either the hidden state of an RNN at time \(t\), or an embedding of the t.s up to \(t-1\) from an attention module


(1) RNN Conditioned Real NVP

Ex) autoregressive RNN (LSTM,GRU)..

  • \(\mathbf{h}_{t}=\mathrm{RNN}\left(\operatorname{concat}\left(\mathbf{x}_{t-1}, \mathbf{c}_{t-1}\right), \mathbf{h}_{t-1}\right)\).


For a powerful emission distn model… “stack \(K\) layers” of conditional flow ( Real NVP, MAF … )

  • \(p_{\mathcal{X}}\left(\mathbf{x}_{t_{0}: T} \mid \mathbf{x}_{1: t_{0}-1}, \mathbf{c}_{1: T} ; \theta\right)=\prod_{t=t_{0}}^{T} p_{\mathcal{X}}\left(\mathbf{x}_{t} \mid \mathbf{h}_{t} ; \theta\right)\).

    ( \(\theta\) : set of all param of both flow & RNN )


( Figure 2 )
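
A condensed sketch of this conditioning ( covariates \(\mathbf{c}_t\) omitted for brevity; class and variable names are mine, not the paper's code ) :

```python
import torch
import torch.nn as nn

class CondCoupling(nn.Module):
    """Coupling layer whose scale/shift nets also see the RNN state h_t,
    so each time step gets its own flow p(x_t | h_t)."""
    def __init__(self, D, d, H):
        super().__init__()
        self.d = d
        self.s_net = nn.Linear(d + H, D - d)   # conditioner input: [x^{1:d}, h_t]
        self.t_net = nn.Linear(d + H, D - d)

    def log_prob(self, x, h):
        x1, x2 = x[..., :self.d], x[..., self.d:]
        inp = torch.cat([x1, h], dim=-1)
        s = self.s_net(inp)
        z = torch.cat([x1, x2 * torch.exp(s) + self.t_net(inp)], dim=-1)
        base = torch.distributions.Normal(0.0, 1.0)
        return base.log_prob(z).sum(-1) + s.sum(-1)

D, H = 4, 16
rnn = nn.LSTM(input_size=D, hidden_size=H, batch_first=True)
flow = CondCoupling(D, d=2, H=H)

x = torch.randn(1, 10, D)                  # (batch, T, D) window
h, _ = rnn(x)                              # h_t summarizes x up to time t
ll = flow.log_prob(x[:, 1:], h[:, :-1])    # x_t conditioned on state from x_{<t}
print(ll.sum())
```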


(2) Transformer Conditioned Real NVP

using Transformer

\(\rightarrow\) allows the model to “access any part of the historic time series”, regardless of temporal distance

( better than RNN! )


( Figure 2 )


4. Training

(1) Training

maximize log-likelihood ( given in [2. Background] )

  • \(\mathcal{L}=\frac{1}{|\mathcal{D}| T} \sum_{\mathbf{x}_{1: T} \in \mathcal{D}} \sum_{t=1}^{T} \log p_{\mathcal{X}}\left(\mathbf{x}_{t} \mid \mathbf{h}_{t} ; \theta\right)\).


  • time series \(\mathbf{x}_{1:T}\) in a batch \(\mathcal{D}\) are selected from random time windows of size \(T\)

    ( + this increases the size of training data)

  • absolute time is only available to RNN/Transformer via the “covariates”

    ( not the relative position of \(x_t\) in training data )
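
A sketch of this random-window batching plus one training step ( `rnn` and `flow` are assumed to be the modules from the conditioned-flow sketch above; all names illustrative ) :

```python
import torch

def sample_windows(series, T, batch_size):
    """series: (N, D) full history -> (batch_size, T, D) random windows."""
    starts = torch.randint(0, series.size(0) - T + 1, (batch_size,))
    return torch.stack([series[i:i + T] for i in starts])

series = torch.randn(1000, 4)              # toy multivariate history
batch = sample_windows(series, T=24, batch_size=32)
h, _ = rnn(batch)                          # hidden state h_t per step
loss = -flow.log_prob(batch[:, 1:], h[:, :-1]).mean()   # avg NLL over t
loss.backward()                            # then optimizer.step()
```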


Computational complexity

  • Transformer : \(O\left(T^{2} D\right)\)
  • RNN : \(O\left(T D^2 \right)\)

\(\rightarrow\) for \(D>T\) ( large multivariate time series ) … Transformer has much smaller complexity!
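
( e.g. \(D=1000\) series with windows of \(T=24\) steps : \(T^{2} D=5.76 \times 10^{5}\) vs \(T D^{2}=2.4 \times 10^{7}\), a roughly 40x gap in favor of the Transformer )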


(2) Covariates

  • use embeddings from “categorical features”
  • covariates \(\mathbf{c}_t\)
    • 1) time-dependent ( ex. day of week, hour of day, etc. )
    • 2) time-independent
  • all covariates “are KNOWN” for time periods we want to forecast
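
A sketch of building \(\mathbf{c}_t\) from categorical features via learned embeddings ( the concrete features and sizes are illustrative choices, not the paper's ) :

```python
import torch
import torch.nn as nn

# Both features below are known for future steps, so c_t is available
# over the whole prediction window.
dow_emb = nn.Embedding(7, 4)      # day-of-week index  -> 4-dim vector
hod_emb = nn.Embedding(24, 4)     # hour-of-day index  -> 4-dim vector

dow = torch.tensor([4, 4, 4])     # e.g. three consecutive hours on a Friday
hod = torch.tensor([9, 10, 11])
c = torch.cat([dow_emb(dow), hod_emb(hod)], dim=-1)   # (T=3, 8) covariates c_t
```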
