Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows (2021)
Contents
- Abstract
- Introduction
- Background
- Density Estimation via NF
- Self-Attention
- Temporal Conditioned NF
- RNN Conditioned Real NVP
- Transformer Conditioned Real NVP
- Training
- Training
- Covariates
0. Abstract
When there are multiple time series, the easy way out is to assume independence
\(\rightarrow\) but this ignores the "interaction effect"
DL methods can capture these interactions…
BUT multivariate models often assume a simple parametric distn & do not scale to high dim
[ Proposal ]
model the multivariate temporal dynamics of the t.s via an Autoregressive DL model,
where the data distn is represented by a conditioned NF (Normalizing Flow)
1. Introduction
(1) Classical t.s
- univariate forecast
- require hand-tuned features
(2) DL t.s
- RNN ( LSTM, GRU )
- popular due to ..
- 1) ‘end-to-end’ training
- 2) incorporating exogenous covariates
- 3) automatic feature extraction
Output can be
- (a) deterministic ( = point estimation )
- (b) probabilistic
\(\rightarrow\) w/o probabilistic modeling, the importance of the forecast in regions of "low noise" vs "high noise" cannot be distinguished
Proposal
end-to-end trainable AUTOREGRESSIVE DL architecture for PROBABILISTIC FORECASTING
that explicitly models MTS and their temporal dynamics by employing NORMALIZING FLOWS ( ex. MAF, Real NVP )
2. Background
(1) Density Estimation via NF
NF (Normalizing Flow)
- mapping \(\mathbb{R}^{D} \rightarrow \mathbb{R}^{D}\)
- \(f: \mathcal{X} \mapsto \mathcal{Z}\) : composed of a sequence of bijections ( invertible functions )
- change of variables formula :
  \(p_{\mathcal{X}}(\mathbf{x})=p_{\mathcal{Z}}(\mathbf{z})\left|\operatorname{det}\left(\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\right)\right|\)
- the inverse should be easy to evaluate ( \(\mathbf{x}=f^{-1}(\mathbf{z})\) )
- computing the Jacobian determinant should take \(O(D)\) time ( by design, e.g. a triangular Jacobian )
ex) Real NVP (2017)
- uses coupling layers :
  \(\left\{\begin{array}{l} \mathbf{y}^{1: d}=\mathbf{x}^{1: d} \\ \mathbf{y}^{d+1: D}=\mathbf{x}^{d+1: D} \odot \exp \left(s\left(\mathbf{x}^{1: d}\right)\right)+t\left(\mathbf{x}^{1: d}\right) \end{array}\right.\)
- change of variables formula ( for \(K\) stacked layers, with \(\mathbf{y}_0=\mathbf{x}\) and \(\mathbf{y}_K=\mathbf{z}\) ) :
  \(\log p_{\mathcal{X}}(\mathbf{x})=\log p_{\mathcal{Z}}(\mathbf{z})+\log \left|\operatorname{det}(\partial \mathbf{z} / \partial \mathbf{x})\right|=\log p_{\mathcal{Z}}(\mathbf{z})+\sum_{i=1}^{K} \log \left|\operatorname{det}\left(\partial \mathbf{y}_{i} / \partial \mathbf{y}_{i-1}\right)\right|\)
- the Jacobian of each coupling layer is a "block-triangular matrix"
  \(\therefore\) \(\log \left|\operatorname{det}\left(\partial \mathbf{y}_{i} / \partial \mathbf{y}_{i-1}\right)\right|=\log \left|\exp \left(\operatorname{sum}\left(s_{i}\left(\mathbf{y}_{i-1}^{1: d}\right)\right)\right)\right|=\operatorname{sum}\left(s_{i}\left(\mathbf{y}_{i-1}^{1: d}\right)\right)\)
- training : maximize the average log-likelihood \(\mathcal{L}=\frac{1}{|\mathcal{D}|} \sum_{\mathbf{x} \in \mathcal{D}} \log p_{\mathcal{X}}(\mathbf{x} ; \theta)\)
  ( a minimal coupling-layer sketch follows below )
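A minimal PyTorch sketch of one affine coupling layer, matching the equations above; the MLPs for \(s\) and \(t\) and the sizes `D`, `d`, `hidden` are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One Real NVP coupling layer: y^{1:d} = x^{1:d},
    y^{d+1:D} = x^{d+1:D} * exp(s(x^{1:d})) + t(x^{1:d})."""

    def __init__(self, D, d, hidden=64):
        super().__init__()
        self.d = d
        self.s = nn.Sequential(nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, D - d))
        self.t = nn.Sequential(nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, D - d))

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s = self.s(x1)
        y2 = x2 * torch.exp(s) + self.t(x1)
        # block-triangular Jacobian => log|det| = sum of s over transformed dims
        log_det = s.sum(dim=-1)
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):
        # cheap to invert: solve for x2 given y1 = x1
        y1, y2 = y[:, :self.d], y[:, self.d:]
        x2 = (y2 - self.t(y1)) * torch.exp(-self.s(y1))
        return torch.cat([y1, x2], dim=-1)
```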
ex) MAF (Masked Autoregressive Flow) (2017)
- a generalization of Real NVP
- transformation layer = "AUTOREGRESSIVE" NN ( see the sketch below )
  \(\rightarrow\) makes the Jacobian triangular \(\rightarrow\) reduces computational cost ( tractable Jacobian! )
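A sketch of the autoregressive (MAF-style) transformation. For clarity this uses a deliberately naive per-dimension conditioner; the actual MAF shares one masked (MADE) network, so density evaluation is a single parallel pass:

```python
import torch
import torch.nn as nn

class NaiveMAFLayer(nn.Module):
    """Autoregressive affine flow: z_i = (x_i - mu_i(x_{<i})) * exp(-a_i(x_{<i})).
    z_i depends only on x_{<=i}, so the Jacobian is triangular and
    log|det| = sum_i(-a_i)."""

    def __init__(self, D, hidden=64):
        super().__init__()
        # one tiny conditioner per dimension (MAF would share one masked net)
        self.cond = nn.ModuleList(
            [nn.Sequential(nn.Linear(max(i, 1), hidden), nn.ReLU(), nn.Linear(hidden, 2))
             for i in range(D)])

    def forward(self, x):
        zs, log_det = [], 0.0
        for i, net in enumerate(self.cond):
            ctx = x[:, :i] if i > 0 else torch.zeros(x.size(0), 1, device=x.device)
            mu, a = net(ctx).chunk(2, dim=-1)
            zs.append((x[:, i:i + 1] - mu) * torch.exp(-a))
            log_det = log_det - a.squeeze(-1)   # triangular Jacobian: sum of -a_i
        return torch.cat(zs, dim=-1), log_det
```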
(2) Self-Attention
Transformer’s self-attention
\(\rightarrow\) enables the model to capture both long- & short-term dependencies
\(\mathbf{O}_{h}=\operatorname{Attention}\left(\mathbf{Q}_{h}, \mathbf{K}_{h}, \mathbf{V}_{h}\right)=\operatorname{softmax}\left(\frac{\mathbf{Q}_{h} \mathbf{K}_{h}^{\top}}{\sqrt{d_{K}}} \cdot \mathbf{M}\right) \mathbf{V}_{h}\).
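A sketch of the masked attention above; note that in practice the mask \(\mathbf{M}\) is typically applied by setting future scores to \(-\infty\) before the softmax rather than by multiplication:

```python
import torch
import torch.nn.functional as F

def masked_self_attention(Q, K, V):
    """Scaled dot-product attention where position t only attends to
    positions <= t, preserving the autoregressive property."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5             # (B, T, T)
    T = scores.size(-1)
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                   device=Q.device), diagonal=1)
    scores = scores.masked_fill(causal, float('-inf'))        # hide the future
    return F.softmax(scores, dim=-1) @ V
```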
3. Temporal Conditioned NF
Notation
- MTS : \(x_{t}^{i} \in \mathbb{R}\) for \(i \in\{1, \ldots, D\}\) ( \(t\) = time index ), stacked into \(\mathbf{x}_{t} \in \mathbb{R}^{D}\)
- (for training) split this time series into..
  - 1) context window \(\left[1, t_{0}\right)\)
  - 2) prediction window \(\left[t_{0}, T\right]\)
Simple model for MTS data : using “factorizing distn” in the emission model
- shared params : learn patterns across the individual time series
- to capture dependencies…. need the "full joint distn"
  \(\rightarrow\) BUT, with a full covariance matrix :
  - number of params in the NN : \(O\left(D^{2}\right)\)
  - computing the loss is TOO expensive
We wish to have a "SCALABLE" model & a flexible emission distn
\(\rightarrow\) model the conditional joint distn of all time series at time \(t\), \(p_{\mathcal{X}}\left(\mathbf{x}_{t} \mid \mathbf{h}_{t} ; \theta\right)\), with an NF, conditioned on either the hidden state of an RNN at time \(t\) or an embedding of the t.s up to \(t-1\) from an attention module
(1) RNN Conditioned Real NVP
ex) autoregressive RNN (LSTM, GRU)..
- \(\mathbf{h}_{t}=\mathrm{RNN}\left(\operatorname{concat}\left(\mathbf{x}_{t-1}, \mathbf{c}_{t-1}\right), \mathbf{h}_{t-1}\right)\)
For a powerful emission distn model… "stack \(K\) layers" of a conditional flow ( Real NVP, MAF … ), as sketched below
- \(p_{\mathcal{X}}\left(\mathbf{x}_{t_{0}: T} \mid \mathbf{x}_{1: t_{0}-1}, \mathbf{c}_{1: T} ; \theta\right)=\prod_{t=t_{0}}^{T} p_{\mathcal{X}}\left(\mathbf{x}_{t} \mid \mathbf{h}_{t} ; \theta\right)\)
  ( \(\theta\) : set of all params of both the flow & the RNN )
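A minimal sketch of the RNN-conditioned flow, assuming the coupling layer from section 2 extended with the conditioner \(\mathbf{h}_t\); covariates \(\mathbf{c}_t\) and the between-layer permutations of Real NVP are omitted for brevity:

```python
import math
import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    """Affine coupling whose scale/shift net also sees the conditioner h_t."""
    def __init__(self, D, d, h_dim, hidden=64):
        super().__init__()
        self.d = d
        self.st = nn.Sequential(nn.Linear(d + h_dim, hidden), nn.Tanh(),
                                nn.Linear(hidden, 2 * (D - d)))

    def forward(self, x, h):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.st(torch.cat([x1, h], dim=-1)).chunk(2, dim=-1)
        # block-triangular Jacobian => log|det| = sum(s)
        return torch.cat([x1, x2 * torch.exp(s) + t], dim=-1), s.sum(-1)

class RNNConditionedFlow(nn.Module):
    """h_t = RNN(x_{t-1}, h_{t-1}) conditions K stacked coupling layers."""
    def __init__(self, D, h_dim=32, K=4):
        super().__init__()
        self.rnn = nn.LSTM(D, h_dim, batch_first=True)
        self.flows = nn.ModuleList(
            [ConditionalCoupling(D, D // 2, h_dim) for _ in range(K)])

    def log_prob(self, x):                        # x: (B, T, D)
        h, _ = self.rnn(x[:, :-1])                # h_t after seeing x_{t-1}
        z = x[:, 1:]                              # model p(x_t | h_t) for t >= 2
        B, Tm1, D = z.shape
        z, h = z.reshape(-1, D), h.reshape(-1, h.size(-1))
        log_det = torch.zeros(z.size(0))
        for f in self.flows:
            z, ld = f(z, h)
            log_det = log_det + ld
        # standard-normal base density p_Z(z) + accumulated log|det|
        base = -0.5 * (z ** 2).sum(-1) - 0.5 * D * math.log(2 * math.pi)
        return (base + log_det).reshape(B, Tm1)   # log p(x_t | h_t) per step
```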
(2) Transformer Conditioned Real NVP
using a Transformer instead
\(\rightarrow\) allows the model to "access any part of the historic time series", regardless of temporal distance
( better than an RNN! )
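The conditioner swap is the only change: a sketch, assuming a causally masked `nn.TransformerEncoder` in place of the LSTM (positional/covariate embeddings omitted):

```python
import torch
import torch.nn as nn

class TransformerConditioner(nn.Module):
    """Embeds x_{1:t-1} with causally masked self-attention, so every h_t
    can attend to ANY past position, unlike an RNN's recurrent state."""
    def __init__(self, D, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(D, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x_past):                    # (B, T-1, D)
        T = x_past.size(1)
        # additive causal mask: -inf above the diagonal blocks future positions
        mask = torch.triu(torch.full((T, T), float('-inf'),
                                     device=x_past.device), diagonal=1)
        return self.encoder(self.proj(x_past), mask=mask)   # h_t: (B, T-1, d_model)
```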
4. Training
(1) Training
maximize the log-likelihood ( given in [2. Background] )
- \(\mathcal{L}=\frac{1}{|\mathcal{D}| T} \sum_{\mathbf{x}_{1: T} \in \mathcal{D}} \sum_{t=1}^{T} \log p_{\mathcal{X}}\left(\mathbf{x}_{t} \mid \mathbf{h}_{t} ; \theta\right)\)
- the time series \(\mathbf{x}_{1:T}\) in a batch \(\mathcal{D}\) are selected from random time windows of size \(T\)
  ( + this increases the size of the training data ; see the sketch after this list )
- absolute time is only available to the RNN/Transformer via the "covariates"
  ( not via the relative position of \(x_t\) in the training data )
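A sketch of the random-window batching and one training step, reusing the hypothetical `RNNConditionedFlow` from above:

```python
import torch

def sample_windows(series, T, batch_size):
    """series: (T_total, D) full multivariate history. Returns a
    (batch_size, T, D) batch of windows with random start offsets, so
    absolute position in the data is never visible to the model
    (only the covariates carry time information)."""
    starts = torch.randint(0, series.size(0) - T + 1, (batch_size,))
    return torch.stack([series[s:s + T] for s in starts])

# one gradient step (model/optimizer names are from the sketches above):
# x = sample_windows(series, T=48, batch_size=32)
# loss = -model.log_prob(x).mean()     # maximize average log-likelihood
# loss.backward(); opt.step(); opt.zero_grad()
```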
Computational complexity
- Transformer : \(O\left(T^{2} D\right)\)
- RNN : \(O\left(T D^2 \right)\)
\(\rightarrow\) for \(D>T\) ( large multivariate time series ) … Transformer has much smaller complexity!
(2) Covariates
- use embeddings for "categorical features" ( a small embedding sketch follows below )
- covariates \(\mathbf{c}_t\) :
  - 1) time-dependent ( ex. day of week, hour of day )
  - 2) time-independent
- all covariates "are KNOWN" for the time periods we want to forecast
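A sketch of the categorical-feature embeddings, with hypothetical cardinalities (7 for day of week, 24 for hour of day):

```python
import torch
import torch.nn as nn

class CovariateEmbedding(nn.Module):
    """One nn.Embedding per categorical feature; outputs are concatenated
    to form the covariate vector c_t fed to the RNN/Transformer conditioner."""
    def __init__(self, cardinalities, dim=8):
        super().__init__()
        self.embs = nn.ModuleList([nn.Embedding(c, dim) for c in cardinalities])

    def forward(self, cats):                     # cats: (B, T, num_feats), long
        return torch.cat([emb(cats[..., i])
                          for i, emb in enumerate(self.embs)], dim=-1)

# ex) c_t = CovariateEmbedding([7, 24])(feats)   # day-of-week, hour-of-day
```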