Multivariate Time Series Forecasting with Transfer Entropy Graph (2020, 5)
Contents
- Abstract
- Introduction
- Preliminaries
  - Neural Granger Causality
  - Graph Neural Network
- Methodology
  - Problem Formulation
  - Causality Graph Structure with Transfer Entropy
  - Feature Extraction of Multiple Receptive Fields
  - Node Embedding based on Causality Matrix
Abstract
most methods :
- assume that the predicted value of a single variable is affected by ALL other variables
\(\rightarrow\) ignore the causal relationships among variables
propose CauGNN (graph neural network with Neural Granger Causality)
- node = each variable
- edge = causal relationship between variables
CNN filters with different perception scales are used ( for feature extraction )
1. Introduction
LSTNet
- encodes short-term local information into low-dim vectors, using a 1d-CNN
- decodes the vectors through an RNN
\(\rightarrow\) but existing models cannot explicitly model the pairwise causal dependencies among MTS variables
G-causality ( Granger Causality Analysis )
- quantitative characterization of TS causality
- but cannot handle non-linear relationships well
Transfer Entropy (TE)
- causal analysis, which can deal with non-linear cases
CauGNN ( proposed method )
- 1) after pairwise TE is calculated, the TE matrix can be obtained
\(\rightarrow\) used as the “adjacency matrix”
- 2) CNN filters with different perception scales
\(\rightarrow\) for TS feature extraction
- 3) GNN for embedding & forecasting
[ Contributions ]
- consider MTS as a graph structure with causality
- use transfer entropy to extract causality & construct the TE graph
2. Preliminaries
(1) Neural Granger Causality
Granger Causality
if the addition of historical info of variable \(i\) improves the prediction of variable \(j\),
\(\rightarrow\) variable \(i\) is the cause of variable \(j\)
Granger Causality vs Neural Granger Causality
- Granger Causality
- traditional linear model
- use the AR model for prediction
- Neural Granger Causality
- use DL & regularization for non-linearity
Neural Granger Network
consists of 2 parts
- 1) variable selection module
- 2) prediction module
a) variable selection module
- FC layer ( input = historical TS )
- select key variables by group Lasso regularization
- if a specific group ( variable ) is not significant \(\rightarrow\) its weights will be zero (0)
b) prediction module
- MLP or LSTM
- general prediction method
Objective function
- \(\min _{W} \sum_{t=K}^{T}\left(x_{i t}-g_{i}\left(\boldsymbol{x}_{(t-1):(t-K)}\right)\right)^{2}+\lambda \sum_{j=1}^{p}\left\|\left(\boldsymbol{W}_{: j}^{11}, \ldots, \boldsymbol{W}_{: j}^{1 K}\right)\right\|_{F}\).
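a minimal PyTorch sketch of this objective ( names, sizes & \(\lambda\) are illustrative assumptions, not the paper's code ) :

```python
import torch
import torch.nn as nn

# Sketch of the Neural Granger objective above: predict one target series from
# K lags of all p variables; the group-lasso penalty groups the first-layer
# weights per input variable, so a whole variable can be shrunk to zero
# (= judged not to Granger-cause the target).
class NeuralGrangerMLP(nn.Module):
    def __init__(self, p, K, hidden=32):
        super().__init__()
        self.p, self.K = p, K
        self.first = nn.Linear(p * K, hidden)            # holds W^{11}, ..., W^{1K}
        self.rest = nn.Sequential(nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x_lags):                           # x_lags : (batch, p * K)
        return self.rest(self.first(x_lags))

    def group_lasso(self):
        # Frobenius norm of all first-layer weights touching variable j's K lags
        W = self.first.weight.view(-1, self.p, self.K)   # (hidden, p, K)
        return W.pow(2).sum(dim=(0, 2)).sqrt().sum()

model = NeuralGrangerMLP(p=5, K=8)
x, y = torch.randn(16, 5 * 8), torch.randn(16, 1)
loss = ((model(x) - y) ** 2).sum() + 0.01 * model.group_lasso()
loss.backward()
```

( note : driving groups exactly to zero usually needs proximal gradient updates, not plain SGD on this penalty )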
(2) Graph Neural Network
most GNN approaches fit within the framework of “neural message passing”
GNN can be viewed as… message passing algorithm,
where node representations are iteratively computed from the features of neighbor nodes
( using differentiable aggregation function )
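e.g., one round of message passing with sum aggregation ( a toy NumPy sketch, not any specific library's API ) :

```python
import numpy as np

# One message-passing round: each node aggregates its neighbors' features
# and updates its own representation with a differentiable function.
def message_passing_step(H, A, W_self, W_neigh):
    # H : (n, d) node features / A : (n, n) adjacency, A[i, j] = 1 if j is a neighbor of i
    messages = A @ H                                   # sum over neighbor features
    return np.tanh(H @ W_self + messages @ W_neigh)    # update node representations

H = np.random.randn(4, 3)
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
H1 = message_passing_step(H, A, np.random.randn(3, 3), np.random.randn(3, 3))
```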
GCN
- can be regarded as…
- 1) approximation of spectral-domain convolution of graph signals
- 2) sampling & aggregating of neighborhood information
GraphSAGE
- enabling training in batches
GIN (Graph Isomorphism Network)
- enable more complex forms of aggregation
GAT
- attending over the neighbors via self-attention
3. Methodology
(1) Problem Formulation
- observed TS : \(\boldsymbol{X}_{n}=\left[\boldsymbol{x}_{1}, \boldsymbol{x}_{2}, \ldots, \boldsymbol{x}_{t}\right]\)
- \(\boldsymbol{x}_{i} \in \mathbb{R}^{n}\) \((i=1, \ldots, t)\)
- \(n\) : number of variables
- goal : predict \(\boldsymbol{x}_{t+h}\)
(2) Causality Graph Structure with Transfer Entropy
(Information) Entropy : \(H(\boldsymbol{X})=-\sum p(x) \log _{2} p(x)\).
- larger entropy \(\rightarrow\) more information
Conditional Entropy : \(H(\boldsymbol{X} \mid \boldsymbol{Y})=-\sum \sum p(x, y) \log _{2} p(x \mid y)\)
- remaining uncertainty about \(\mathbf{X}\), given that \(\mathbf{Y}\) is known
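quick numeric check of both quantities ( a small NumPy sketch ) :

```python
import numpy as np

def H(p):
    """Shannon entropy (bits) of a probability vector."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print(H([0.5, 0.5]))   # 1.0   : fair coin, maximal uncertainty
print(H([0.9, 0.1]))   # ~0.47 : biased coin, more predictable

# H(X|Y) = -sum_{x,y} p(x,y) log2 p(x|y), from a joint distribution p(x, y)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])                # rows = x, columns = y
p_y = p_xy.sum(axis=0)
print(-(p_xy * np.log2(p_xy / p_y)).sum())   # ~0.72 < H(X) = 1 : knowing Y helps
```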
TE of variables \(\mathbf{Y}\) to \(\mathbf{X}\) :
\(\begin{aligned} T_{\boldsymbol{Y} \rightarrow \boldsymbol{X}}=& \sum p\left(x_{t+1}, \boldsymbol{x}_{t}^{(k)}, \boldsymbol{y}_{t}^{(l)}\right) \log _{2} p\left(x_{t+1} \mid \boldsymbol{x}_{t}^{(k)}, \boldsymbol{y}_{t}^{(l)}\right) \\ &-\sum p\left(x_{t+1}, \boldsymbol{x}_{t}^{(k)}\right) \log _{2} p\left(x_{t+1} \mid \boldsymbol{x}_{t}^{(k)}\right) \\ =& \sum p\left(x_{t+1}, \boldsymbol{x}_{t}^{(k)}, \boldsymbol{y}_{t}^{(l)}\right) \log _{2} \frac{p\left(x_{t+1} \mid \boldsymbol{x}_{t}^{(k)}, \boldsymbol{y}_{t}^{(l)}\right)}{p\left(x_{t+1} \mid \boldsymbol{x}_{t}^{(k)}\right)} \\ =& H\left(\boldsymbol{X}_{t+1} \mid \boldsymbol{X}_{t}\right)-H\left(\boldsymbol{X}_{t+1} \mid \boldsymbol{X}_{t}, \boldsymbol{Y}_{t}\right) \end{aligned}\).
- “increase in information about \(X\)”, when \(Y\) changes from “unknown” to “known”
- asymmetric : \(T_{\boldsymbol{Y} \rightarrow \boldsymbol{X}} \neq T_{\boldsymbol{X} \rightarrow \boldsymbol{Y}}\) in general
Causal relationship between \(X\) & \(Y\)
- \(T_{\boldsymbol{X}, \boldsymbol{Y}}=T_{\boldsymbol{X} \rightarrow \boldsymbol{Y}}-T_{\boldsymbol{Y} \rightarrow \boldsymbol{X}}\).
- \(T_{\boldsymbol{X}, \boldsymbol{Y}} > 0\) : \(\boldsymbol{X}\) is the cause of \(\boldsymbol{Y}\)
- \(T_{\boldsymbol{X}, \boldsymbol{Y}} < 0\) : \(\boldsymbol{Y}\) is the cause of \(\boldsymbol{X}\)
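a rough histogram-based estimator of \(T_{\boldsymbol{Y} \rightarrow \boldsymbol{X}}\) with history lengths \(k=l=1\) ( binning scheme & names are assumptions — the paper does not spell out an estimator; kernel / KSG estimators are common in practice ) :

```python
import numpy as np
from collections import Counter

def transfer_entropy(y, x, bins=8):
    """Histogram estimate of T_{Y -> X} (bits), with history lengths k = l = 1."""
    xd = np.digitize(x, np.histogram_bin_edges(x, bins))
    yd = np.digitize(y, np.histogram_bin_edges(y, bins))
    triples = Counter(zip(xd[1:], xd[:-1], yd[:-1]))   # (x_{t+1}, x_t, y_t)
    pairs = Counter(zip(xd[1:], xd[:-1]))              # (x_{t+1}, x_t)
    hist_x = Counter(xd[:-1])                          # x_t
    hist_xy = Counter(zip(xd[:-1], yd[:-1]))           # (x_t, y_t)
    N = len(xd) - 1
    te = 0.0
    for (x1, x0, y0), c in triples.items():
        p_joint = c / N                                # p(x_{t+1}, x_t, y_t)
        p_cond_xy = c / hist_xy[(x0, y0)]              # p(x_{t+1} | x_t, y_t)
        p_cond_x = pairs[(x1, x0)] / hist_x[x0]        # p(x_{t+1} | x_t)
        te += p_joint * np.log2(p_cond_xy / p_cond_x)
    return te

# net causality : T_{X,Y} = T_{X -> Y} - T_{Y -> X} ( > 0 means X causes Y )
rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
y = np.roll(x, 1) + 0.1 * rng.standard_normal(5000)      # y_t follows x_{t-1}
print(transfer_entropy(x, y) - transfer_entropy(y, x))   # > 0 : x is the cause of y
```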
use Neural Granger causality to capture causality
causality matrix \(\mathbf{T}\) of MTS \(X_n\) can be formulated with…
\(t_{i j}=\left\{\begin{array}{lc} T_{\boldsymbol{x}_{i}, \boldsymbol{x}_{j}}, & T_{\boldsymbol{x}_{i}, \boldsymbol{x}_{j}}>c \\ 0, & \text { otherwise } \end{array}\right.\).
\(\rightarrow\) used as the “adjacency matrix” of the MTS graph
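putting it together — building \(\mathbf{T}\) as a thresholded adjacency matrix ( a sketch reusing the `transfer_entropy` helper above; the threshold \(c\) is a hyperparameter ) :

```python
import numpy as np

def causality_matrix(X, c=0.05):
    # X : (t, n) multivariate time series, one column per variable
    n = X.shape[1]
    T = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # net causality T_{x_i, x_j} = T_{x_i -> x_j} - T_{x_j -> x_i}
            t_ij = transfer_entropy(X[:, i], X[:, j]) - transfer_entropy(X[:, j], X[:, i])
            T[i, j] = t_ij if t_ij > c else 0.0        # keep only edges above c
    return T
```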
(3) Feature Extraction of Multiple Receptive Fields
necessary to consider trend & seasonality
\(\rightarrow\) thus, need to extract the features of TS in units of multiple certain periods
\(\rightarrow\) use multiple CNN filters
- \(\boldsymbol{h}_{i}=\operatorname{ReLU}\left(\boldsymbol{W}_{i} * \boldsymbol{x}+\boldsymbol{b}_{i}\right), \quad \boldsymbol{h}=\left[\boldsymbol{h}_{1} \oplus \boldsymbol{h}_{2} \oplus \ldots \oplus \boldsymbol{h}_{p}\right]\).
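a minimal PyTorch sketch of this step ( kernel sizes, channel count & the max-pooling used to flatten each \(\boldsymbol{h}_i\) are illustrative assumptions ) :

```python
import torch
import torch.nn as nn

# p parallel 1d-CNN filters with different receptive fields; their pooled
# outputs are concatenated into one feature vector per variable.
class MultiScaleCNN(nn.Module):
    def __init__(self, kernel_sizes=(3, 5, 7), channels=16):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(1, channels, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):              # x : (n_vars, window), one row per variable
        x = x.unsqueeze(1)             # (n_vars, 1, window)
        hs = [torch.relu(conv(x)).max(dim=-1).values for conv in self.convs]
        return torch.cat(hs, dim=-1)   # H : (n_vars, d) feature matrix

H = MultiScaleCNN()(torch.randn(5, 24))   # 5 variables, window 24 -> (5, 48)
```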
(4) Node Embedding based on Causality Matrix
after feature extraction,
input MTS is converted into feature matrix \(\boldsymbol{H} \in \mathbb{R}^{n \times d}\)
- feature matrix with \(n\) nodes
adjacency of nodes is determined by causality matrix \(T\)
inspired by k-GNNs
- only perform information fusion between “neighbors”
( ignore information of non-neighbors )
Forward-pass update :
- \(\boldsymbol{h}_{i}^{(l+1)}=\sigma\left(\boldsymbol{h}_{i}^{(l)} \boldsymbol{W}_{1}^{(l)}+\sum_{j \in N(i)} \boldsymbol{h}_{j}^{(l)} \boldsymbol{W}_{2}^{(l)}\right)\).
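a PyTorch sketch of this update, with the TE matrix supplying \(N(i)\) ( layer name & sizes are illustrative ) :

```python
import torch
import torch.nn as nn

# h_i^{(l+1)} = sigma( h_i W1 + sum_{j in N(i)} h_j W2 ), where N(i) is read
# off the (thresholded) causality matrix.
class CausalGNNLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W1 = nn.Linear(d_in, d_out, bias=False)   # self transform
        self.W2 = nn.Linear(d_in, d_out, bias=False)   # neighbor transform

    def forward(self, H, T):
        # H : (n, d_in) node features / T : (n, n) causality (adjacency) matrix
        neigh = (T != 0).float() @ H                   # sum only over neighbors N(i)
        return torch.relu(self.W1(H) + self.W2(neigh))

H = torch.randn(5, 48)
T = (torch.rand(5, 5) > 0.7).float() * torch.rand(5, 5)   # sparse toy TE matrix
H1 = CausalGNNLayer(48, 32)(H, T)
```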