SOM-VAE : Interpretable Discrete Representation Learning on Time Series (2019)

Contents

  1. Abstract
  2. Introduction
  3. Probabilistic SOM-VAE
    1. Introducing Topological Structure in the latent space
    2. Encouraging Smoothness over Time


0. Abstract

HIGH-DIM time series are common!

  • could benefit from interpretable LOW-DIM representations

    ( but most representation learning algorithms for time series are difficult to interpret )


Propose a new representation learning framework

  • allows learning a DISCRETE representation

    \(\rightarrow\) smooth & interpretable embeddings with superior clustering performance

  • non-differentiability of the discrete representation?

    \(\rightarrow\) introduce a gradient-based version of the SOM

  • allow for a probabilistic interpretation of the method, by using a Markov model in the latent space


1. Introduction

Interpretable representation learning = uncovering the latent structure

Many unsupervised methods ….. neglect “temporal structure” & “smooth behaviour over time”

\(\rightarrow\) need clustering, where the clusters assume a topological structure in a low-dim space

( such that representations of the time series retain their smoothness in that space )


Define “Temporal Smoothness” in discrete representation space

  • “topological neighborhood relationship”
  • ex) SOM
    • map states from an uninterpretable continuous space to a low-dim space with a pre-defined, topologically interpretable structure


Propose a novel deep architecture,

1) that learns ”topologically interpretable discrete representations” in a probabilistic fashion

2) develop a ”gradient-based version of SOM”


2. Probabilistic SOM-VAE

2-1. Introducing Topological Structure in the latent space

( Figure 2 )

Notation

  • input : \(x \in \mathbb{R}^{d}\)
  • latent encoding 1 : \(z_{e} \in \mathbb{R}^{m}\) ……….. \(z_{e}=f_{\theta}(x)\)
  • latent encoding 2 : \(z_{q} \in \mathbb{R}^{m}\) ……….. dictionary \(E=\left\{e_{1}, \ldots, e_{k} \mid e_{i} \in \mathbb{R}^{m}\right\}\) by \(z_{q} \sim p\left(z_{q} \mid z_{e}\right)\).
    • categorical : \(p\left(z_{q} \mid z_{e}\right)=\mathbb{1}\left[z_{q}=\arg \min _{e \in E}\left\|z_{e}-e\right\|^{2}\right]\)
  • reconstruction of input : \(\hat{x}\) ……….. \(\hat{x}=g_{\phi}(z)\)
    • \(\hat{x}_{e}=g_{\phi}\left(z_{e}\right)\) .
    • \(\hat{x}_{q}=g_{\phi}\left(z_{q}\right)\).
  • \(k\) nodes of SOM : \(V=\left\{v_{1}, \ldots, v_{k}\right\}\), where every node corresponds to….
    • embedding in the data space \(e_{v} \in \mathbb{R}^{d}\)
    • representation in a lower-dimensional discrete space \(m_{v} \in M\)
  • training data : \(\mathcal{D}=\left\{x_{1}, \ldots, x_{n}\right\}\)
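
A minimal PyTorch sketch of this setup ( not the authors' implementation ; the layer sizes and the names `SOMVAESketch`, `num_embeddings` are illustrative assumptions ) :

```python
import torch
import torch.nn as nn

class SOMVAESketch(nn.Module):
    def __init__(self, input_dim=28 * 28, latent_dim=64, num_embeddings=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))
        # dictionary E = {e_1, ..., e_k}, e_i in R^m
        self.embeddings = nn.Parameter(torch.randn(num_embeddings, latent_dim))

    def forward(self, x):
        z_e = self.encoder(x)                       # z_e = f_theta(x)
        dists = torch.cdist(z_e, self.embeddings)   # ||z_e - e_i|| for every e_i
        idx = dists.argmin(dim=1)                   # categorical assignment
        z_q = self.embeddings[idx]                  # z_q
        x_hat_e = self.decoder(z_e)                 # x_hat_e = g_phi(z_e)
        x_hat_q = self.decoder(z_q)                 # x_hat_q = g_phi(z_q)
        return z_e, z_q, idx, x_hat_e, x_hat_q
```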


training procedure : ( use 2D SOM )

  • winner node \(\tilde{v}\) is chosen for every point \(x_{i}\)
    • according to \(\tilde{v}=\arg \min _{v \in V}\left\|e_{v}-x_{i}\right\|^{2}\)
  • embedding vector for every node \(u \in V\) is then updated
    • according to \(e_{u} \leftarrow e_{u}+N\left(m_{u}, m_{\tilde{v}}\right) \eta\left(x_{i}-e_{u}\right)\),
    • \(N\left(m_{u}, m_{\tilde{v}}\right)\) : neighborhood function
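
A small NumPy sketch of one step of this classical ( non-differentiable ) SOM update, assuming a Gaussian neighborhood function over the 2D grid ; `eta` and `sigma` are illustrative hyperparameters, not values from the paper :

```python
import numpy as np

def som_update(embeddings, grid_pos, x_i, eta=0.1, sigma=1.0):
    """embeddings: (k, d) node vectors e_v; grid_pos: (k, 2) grid coordinates m_v."""
    winner = np.argmin(((embeddings - x_i) ** 2).sum(axis=1))     # v~ = argmin_v ||e_v - x_i||^2
    grid_dist = ((grid_pos - grid_pos[winner]) ** 2).sum(axis=1)  # distance to m_v~ on the grid
    N = np.exp(-grid_dist / (2 * sigma ** 2))                     # neighborhood N(m_u, m_v~)
    return embeddings + eta * N[:, None] * (x_i - embeddings)     # e_u <- e_u + N * eta * (x_i - e_u)
```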


end-to-end

  • cannot use the standard SOM training algorithm

  • devise a loss function term whose gradient corresponds to a weighted version of the original SOM update

    • any time an embedding \(e_{i, j}\) at position \((i, j)\) in the map gets updated,

      it also updates all the embeddings in its immediate neighborhood \(N\left(e_{i, j}\right)\).

    • neighborhood : \(N\left(e_{i, j}\right)=\left\{e_{i-1, j}, e_{i+1, j}, e_{i, j-1}, e_{i, j+1}\right\}\)
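
For reference, the discrete neighborhood \(N\left(e_{i, j}\right)\) on an \(H \times W\) grid can be computed as below ( the grid size is an assumption ; border nodes simply have fewer neighbors ) :

```python
def grid_neighbors(i, j, H, W):
    """Up/down/left/right neighbors of node (i, j) on an H x W SOM grid."""
    candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(a, b) for a, b in candidates if 0 <= a < H and 0 <= b < W]
```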


loss function for a single \(x\) :

  • \(\mathcal{L}_{\text {SOM-VAE }}\left(x, \hat{x}_{q}, \hat{x}_{e}\right)=\mathcal{L}_{\text {reconstruction }}\left(x, \hat{x}_{q}, \hat{x}_{e}\right)+\alpha \mathcal{L}_{\text {commitment }}(x)+\beta \mathcal{L}_{\text {SOM }}(x)\).
    • term 1) \(\mathcal{L}_{\text {reconstruction }}\left(x, \hat{x}_{q}, \hat{x}_{e}\right)=\left\|x-\hat{x}_{q}\right\|^{2}+\left\|x-\hat{x}_{e}\right\|^{2}\)

      • first term : discrete reconstruction loss

      • corresponds to the ELBO of the VAE

        ( assuming a uniform prior over \(z_{q}\), the KL term in the ELBO is constant and can be ignored )

    • term 2) \(\mathcal{L}_{\text {commitment }}(x)=\left\|z_{e}(x)-z_{q}(x)\right\|^{2}\)

      • due to the non-differentiability of the embedding assignment, the \(\mathcal{L}_{\text {commitment }}\) term has to be added explicitly ( to keep \(z_{e}(x)\) close to its assigned \(z_{q}(x)\) )
    • term 3) \(\mathcal{L}_{\text {SOM }}(x)=\sum_{\tilde{e} \in N\left(z_{q}(x)\right)}\left\|\tilde{e}-\operatorname{sg}\left[z_{e}(x)\right]\right\|^{2}\)

      • \(\mathrm{sg}[\cdot]\) : gradient stopping operator
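
A hedged PyTorch sketch of the per-sample loss above ( reductions and weights are illustrative ; `neighbor_embeddings` is assumed to hold the dictionary vectors in \(N\left(z_{q}(x)\right)\), e.g. gathered with `grid_neighbors` above ) :

```python
import torch.nn.functional as F

def som_vae_loss(x, x_hat_q, x_hat_e, z_e, z_q, neighbor_embeddings,
                 alpha=1.0, beta=1.0):
    # term 1) reconstruction from both the discrete and the continuous encoding
    l_rec = F.mse_loss(x_hat_q, x) + F.mse_loss(x_hat_e, x)
    # term 2) commitment: pulls z_e(x) and its assigned z_q(x) together
    l_commit = F.mse_loss(z_e, z_q)
    # term 3) SOM loss: moves the neighbor embeddings toward sg[z_e(x)];
    #         the stop-gradient operator sg[.] is detach() in PyTorch
    diff = neighbor_embeddings - z_e.detach().unsqueeze(1)   # (batch, |N|, m)
    l_som = (diff ** 2).sum(-1).sum(-1).mean()
    return l_rec + alpha * l_commit + beta * l_som
```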


2-2. Encouraging Smoothness over Time

ultimate goal : predict the development of the time series in an interpretable way

1) should be interpretable 2) should predict well

\(\rightarrow\) use a temporal probabilistic model


Exploit the low-dim discrete space induced by the SOM to learn a temporal model

the model is learned jointly with the SOM-VAE


Loss Function

\(\mathcal{L}\left(x^{t-1}, x^{t}, \hat{x}_{q}^{t}, \hat{x}_{e}^{t}\right)=\mathcal{L}_{\text {SOM-VAE }}\left(x^{t}, \hat{x}_{q}^{t}, \hat{x}_{e}^{t}\right)+\gamma \mathcal{L}_{\text {transitions }}\left(x^{t-1}, x^{t}\right)+\tau \mathcal{L}_{\text {smoothness }}\left(x^{t-1}, x^{t}\right)\).

  • term 1) cluster

  • term 2) prediction

    • \(\mathcal{L}_{\text {transitions }}\left(x^{t-1}, x^{t}\right)=-\log P_{M}\left(z_{q}\left(x^{t-1}\right) \rightarrow z_{q}\left(x^{t}\right)\right)\),

      ( \(P_{M}\left(z_{q}\left(x^{t-1}\right) \rightarrow z_{q}\left(x^{t}\right)\right)\) : probability of a transition from state \(z_{q}\left(x^{t-1}\right)\) to state \(z_{q}\left(x^{t}\right)\) in the Markov model. )

  • term 3) smoothing

    • \(\mathcal{L}_{\text {smoothness }}\left(x^{t-1}, x^{t}\right)=\mathbb{E}_{P_{M}\left(z_{q}\left(x^{t-1}\right) \rightarrow \tilde{e}\right)}\left[\left\|\tilde{e}-z_{e}\left(x^{t}\right)\right\|^{2}\right]\)
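
A sketch of the two temporal terms, assuming the Markov model is parameterized by a learnable \(k \times k\) matrix of transition logits ( one row per SOM node ) ; `idx_prev` / `idx_curr` are the discrete node indices of \(z_{q}\left(x^{t-1}\right)\) / \(z_{q}\left(x^{t}\right)\) :

```python
import torch
import torch.nn.functional as F

def temporal_losses(transition_logits, embeddings, idx_prev, idx_curr, z_e_curr):
    log_P = F.log_softmax(transition_logits, dim=1)    # log P_M(i -> j), rows sum to 1
    # term 2) negative log-probability of the observed transition
    l_trans = -log_P[idx_prev, idx_curr].mean()
    # term 3) expected squared distance between z_e(x^t) and the embeddings e~,
    #         weighted by P_M(z_q(x^{t-1}) -> e~)
    P_prev = log_P[idx_prev].exp()                     # (batch, k)
    sq_dist = torch.cdist(z_e_curr, embeddings) ** 2   # ||e~ - z_e(x^t)||^2
    l_smooth = (P_prev * sq_dist).sum(dim=1).mean()
    return l_trans, l_smooth
```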
