Learning Representations by Maximizing Mutual Information Across Views


Contents

  0. Abstract
  1. Method Description
    1. Local DIM
    2. NCE (Noise-Contrastive Estimation) Loss
    3. Data Augmentation
    4. Multiscale Mutual Information


0. Abstract

Proposes AMDIM, which…

maximizes mutual information between features extracted from multiple views of a shared context

\(\rightarrow\) requires capturing information about high-level factors whose influence spans multiple views


1. Method Description

AMDIM ( = Augmented Multiscale DIM )

( DIM = Deep InfoMax )


Step 1) maximize mutual information between the features \(z\) of two augmented views of the same input

Step 2) maximize mutual information between multiple feature scales simultaneously


(1) Local DIM

Local DIM

= maximize MI between global features & local features


Notation

  • global features : \(f_{1}(x)\)
  • local features : \(\left\{f_{7}(x)_{i j}: \forall i, j\right\}\)
    • produced by an intermediate layer in the encoder \(f\)
  • subscript \(d \in\{1,7\}\) : denotes features from the top-most encoder layer with spatial dimension \(d \times d\)
  • \(i\) and \(j\) : index the 2 spatial dimensions of the array of activations in layer \(d\)


Meaning of MI

= how much better we can guess the value of \(f_{7}(x)_{i j}\) when we know the value of \(f_{1}(x)\) than when we do not know the value of \(f_{1}(x)\).
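
Written out, this is the standard definition of mutual information applied to the two features, for a joint distribution \(p\) over them. The entropy notation \(H\) is not used elsewhere in these notes and appears here only to make the "how much better we can guess" reading explicit ( \(f_{1}\) and \(f_{7}\) abbreviate \(f_{1}(x)\) and \(f_{7}(x)_{i j}\) ) :

\(I\left(f_{1} ; f_{7}\right)=\underset{p\left(f_{1}, f_{7}\right)}{\mathbb{E}}\left[\log \frac{p\left(f_{1}, f_{7}\right)}{p\left(f_{1}\right) p\left(f_{7}\right)}\right]=H\left(f_{7}\right)-H\left(f_{7} \mid f_{1}\right)\).

  • \(H\left(f_{7}\right)\) : uncertainty about the local feature without knowing \(f_{1}\)
  • \(H\left(f_{7} \mid f_{1}\right)\) : uncertainty that remains once \(f_{1}\) is known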


Term change

  • global \(\rightarrow\) antecedent features
  • local \(\rightarrow\) consequent features


Construct a distribution : \(p\left(f_{1}(x), f_{7}(x)_{i j}\right)\)

  • via ancestral sampling
  • process
    • step 1) sample an input \(x \sim \mathcal{D}\)
    • step 2) sample spatial indices \(i \sim u(i)\) and \(j \sim u(j)\)
    • step 3) compute features \(f_{1}(x)\) and \(f_{7}(x)_{i j}\).
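
The three sampling steps above can be sketched in a few lines of Python. This is only an illustration: `toy_encoder`, `GRID`, and the two-element inputs are hypothetical placeholders, not the encoder used in the paper.

```python
import random

GRID = 7  # spatial size of the local feature map f_7


def toy_encoder(x):
    """Hypothetical stand-in for the encoder f; returns (f1, f7).

    f7 is a GRID x GRID array of local feature vectors and f1 is one
    global vector, here simply the average of all local vectors.
    """
    f7 = [[[v + 0.1 * i - 0.1 * j for v in x] for j in range(GRID)]
          for i in range(GRID)]
    f1 = [sum(f7[i][j][k] for i in range(GRID) for j in range(GRID)) / GRID**2
          for k in range(len(x))]
    return f1, f7


def sample_pair(dataset, rng):
    """Draw one (f_1(x), f_7(x)_ij) pair by ancestral sampling."""
    x = rng.choice(dataset)                          # step 1: x ~ D
    i, j = rng.randrange(GRID), rng.randrange(GRID)  # step 2: i ~ u(i), j ~ u(j)
    f1, f7 = toy_encoder(x)                          # step 3: compute features
    return f1, f7[i][j]


rng = random.Random(0)
dataset = [[0.0, 1.0], [2.0, 3.0]]  # toy "images" (assumption)
f1, f7_ij = sample_pair(dataset, rng)
```

Repeated calls to `sample_pair` give samples from the joint distribution; pairing `f1` with a local feature from a different draw gives a sample from the product of marginals.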


Given \(p\left(f_{1}(x)\right), p\left(f_{7}(x)_{i j}\right)\) and \(p\left(f_{1}(x), f_{7}(x)_{i j}\right)\)

\(\rightarrow\) local DIM seeks an encoder that maximizes MI \(I\left(f_{1}(x) ; f_{7}(x)_{i j}\right)\) in \(p\left(f_{1}(x), f_{7}(x)_{i j}\right)\)


(2) NCE (Noise-Contrastive Estimation) Loss

Maximize the NCE lower bound on \(I\left(f_{1}(x) ; f_{7}(x)_{i j}\right)\), by minimizing …

  • \(\underset{\left(f_{1}(x), f_{7}(x)_{i j}\right)}{\mathbb{E}}\left[\underset{N_{7}}{\mathbb{E}}\left[\mathcal{L}_{\Phi}\left(f_{1}(x), f_{7}(x)_{i j}, N_{7}\right)\right]\right]\).
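
For background, the bound being maximized is usually stated in the following form (from the analysis of contrastive predictive coding, quoted here as context rather than taken from these notes), where \(\left|N_{7}\right|\) is the number of negative samples :

\(I\left(f_{1}(x) ; f_{7}(x)_{i j}\right) \geq \log \left(\left|N_{7}\right|+1\right)-\underset{\left(f_{1}(x), f_{7}(x)_{i j}\right)}{\mathbb{E}}\left[\underset{N_{7}}{\mathbb{E}}\left[\mathcal{L}_{\Phi}\left(f_{1}(x), f_{7}(x)_{i j}, N_{7}\right)\right]\right]\).

  • lowering the expected loss raises the lower bound on MI
  • the bound can never exceed \(\log \left(\left|N_{7}\right|+1\right)\), so more negatives allow a tighter estimate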


Positive & Negative

  • Positive : from joint distn \(\rightarrow\) \(\left(f_{1}(x), f_{7}(x)_{i j}\right) \sim p\left(f_{1}(x), f_{7}(x)_{i j}\right)\)
  • Negative ( = \(N_{7}\) ) : from marginal distn \(\rightarrow\) \(p\left(f_{7}(x)_{i j}\right)\)


\(\mathcal{L}_{\Phi}\left(f_{1}, f_{7}, N_{7}\right)=-\log \frac{\exp \left(\Phi\left(f_{1}, f_{7}\right)\right)}{\sum_{\tilde{f}_{7} \in N_{7} \cup\left\{f_{7}\right\}} \exp \left(\Phi\left(f_{1}, \tilde{f}_{7}\right)\right)}\).
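
The loss above is a softmax cross-entropy in which the positive pair must be picked out from among the negatives. A minimal sketch follows, assuming \(\Phi\) is a plain dot product (the notes do not specify \(\Phi\), so `dot` is a placeholder choice) and using the log-sum-exp trick for numerical stability.

```python
import math


def dot(u, v):
    """Placeholder score function Phi (an assumption, not from the notes)."""
    return sum(a * b for a, b in zip(u, v))


def nce_loss(f1, f7, negatives, phi=dot):
    """-log( exp(phi(f1, f7)) / sum over N_7 and {f7} of exp(phi(f1, .)) )."""
    scores = [phi(f1, f7)] + [phi(f1, n) for n in negatives]
    m = max(scores)  # subtract the max before exponentiating
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - scores[0]


f1 = [1.0, 0.0]
positive = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss = nce_loss(f1, positive, negatives)
```

When every candidate receives the same score, the loss equals \(\log \left(\left|N_{7}\right|+1\right)\), i.e. chance level; it falls toward 0 as the positive score dominates.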


(3) Data Augmentation

Local DIM \(\rightarrow\) Local DIM + Data Augmentation

= extends local DIM, by maximizing MI between features from augmented views of each input.


Construct the AUGMENTED feature distn \(p_{\mathcal{A}}\left(f_{1}\left(x^{1}\right), f_{7}\left(x^{2}\right)_{i j}\right)\) as …

  • Step 1) sample an input \(x \sim \mathcal{D}\)

  • Step 2) sample augmented images \(x^{1} \sim \mathcal{A}(x)\) and \(x^{2} \sim \mathcal{A}(x)\)
    • \(\mathcal{A}(x)\) : distn of images generated by applying stochastic DA to \(x\)
  • Step 3) sample spatial indices \(i \sim u(i)\) and \(j \sim u(j)\)

  • Step 4) ompute features \(f_{1}\left(x^{1}\right)\) and \(f_{7}\left(x^{2}\right)_{i j}\)
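
The four steps above differ from plain local DIM only in that the global and local features come from two independently augmented copies of the same input. A rough sketch, where `toy_augment` (random jitter) and `toy_encoder` are hypothetical placeholders rather than the augmentations or encoder from the paper:

```python
import random

GRID = 7  # spatial size of the local feature map f_7


def toy_augment(x, rng):
    """Hypothetical stochastic augmentation A(x): small random jitter."""
    return [v + rng.uniform(-0.05, 0.05) for v in x]


def toy_encoder(x):
    """Hypothetical encoder returning (f1, f7); a placeholder only."""
    f7 = [[[v + 0.1 * i - 0.1 * j for v in x] for j in range(GRID)]
          for i in range(GRID)]
    f1 = [sum(f7[i][j][k] for i in range(GRID) for j in range(GRID)) / GRID**2
          for k in range(len(x))]
    return f1, f7


def sample_augmented_pair(dataset, rng):
    """Draw one (f_1(x^1), f_7(x^2)_ij) pair from the augmented distribution."""
    x = rng.choice(dataset)                            # step 1: x ~ D
    x1, x2 = toy_augment(x, rng), toy_augment(x, rng)  # step 2: two draws from A(x)
    i, j = rng.randrange(GRID), rng.randrange(GRID)    # step 3: spatial indices
    f1, _ = toy_encoder(x1)                            # step 4: global from x^1
    _, f7 = toy_encoder(x2)                            #         local from x^2
    return f1, f7[i][j]


rng = random.Random(0)
f1, f7_ij = sample_augmented_pair([[0.0, 1.0], [2.0, 3.0]], rng)
```

Because `x1` and `x2` are drawn separately, the encoder can only score the pair highly by relying on content that survives augmentation.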


(4) Multiscale Mutual Information

Local DIM + Data Augmentation \(\rightarrow\) AMDIM ( Augmented Multiscale DIM )

= extends local DIM, by maximizing MI across multiple feature scales


\(n\)-to-\(m\) infomax costs :

\(\underset{\left(f_{n}\left(x^{1}\right)_{i j}, f_{m}\left(x^{2}\right)_{k l}\right)}{\mathbb{E}}\left[\underset{N_{m}}{\mathbb{E}}\left[\mathcal{L}_{\Phi}\left(f_{n}\left(x^{1}\right)_{i j}, f_{m}\left(x^{2}\right)_{k l}, N_{m}\right)\right]\right]\).

  • ex) \(p_{\mathcal{A}}\left(f_{5}\left(x^{1}\right)_{i j}, f_{7}\left(x^{2}\right)_{k l}\right)\)
  • ex) \(p_{\mathcal{A}}\left(f_{5}\left(x^{1}\right)_{i j}, f_{5}\left(x^{2}\right)_{k l}\right)\)
  • ex) \(p_{\mathcal{A}}\left(f_{1}\left(x^{1}\right), f_{5}\left(x^{2}\right)_{k l}\right)\)
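
One way to picture the multiscale objective is as a sum of \(n\)-to-\(m\) costs over a chosen set of scale pairs. The sketch below uses the three example pairs listed above; adding the costs with equal weight, the dot-product score, and the dictionary layout `scale -> grid of vectors` are all assumptions made for illustration.

```python
import math
import random


def nce_loss(anchor, positive, negatives):
    """NCE loss with a dot-product score (placeholder choice for Phi)."""
    def phi(u, v):
        return sum(a * b for a, b in zip(u, v))
    scores = [phi(anchor, positive)] + [phi(anchor, n) for n in negatives]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores)) - scores[0]


def multiscale_cost(feats1, feats2, negs, pairs, rng):
    """Single-sample estimate of the summed n-to-m costs.

    feats1, feats2: dict scale -> (scale x scale) grid of feature vectors
                    for the views x^1 and x^2.
    negs:           dict scale -> list of negative vectors N_m.
    """
    total = 0.0
    for n, m in pairs:
        i, j = rng.randrange(n), rng.randrange(n)  # position in f_n(x^1)
        k, l = rng.randrange(m), rng.randrange(m)  # position in f_m(x^2)
        total += nce_loss(feats1[n][i][j], feats2[m][k][l], negs[m])
    return total


def constant_grid(scale, vec):
    """Helper for the toy example: a scale x scale grid filled with vec."""
    return [[list(vec) for _ in range(scale)] for _ in range(scale)]


scales = (1, 5, 7)
feats1 = {s: constant_grid(s, [1.0, 0.0]) for s in scales}
feats2 = {s: constant_grid(s, [1.0, 0.0]) for s in scales}
negs = {s: [[0.0, 1.0], [-1.0, 0.0]] for s in scales}
pairs = [(5, 7), (5, 5), (1, 5)]
cost = multiscale_cost(feats1, feats2, negs, pairs, random.Random(0))
```

The scale-1 case is a \(1 \times 1\) grid, so the global feature \(f_{1}\left(x^{1}\right)\) needs no special handling.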
