Learning Representations by Maximizing Mutual Information Across Views
Contents
- Abstract
- Method Description
- Local DIM
- NCE (Noise-Contrastive Estimation) Loss
- Data Augmentation
- Multiscale Mutual Information
0. Abstract
Proposes AMDIM, which…
maximizes mutual information between features extracted from multiple views
\(\rightarrow\) requires capturing information about high-level factors
1. Method Description
AMDIM ( = Augmented Multiscale DIM )
( DIM = Deep InfoMax )
Step 1) maximize mutual information between two \(z\)'s
Step 2) maximize mutual information between multiple feature scales simultaneously
(1) Local DIM
Local DIM
= maximize MI between global features & local features
Notation
- global features : \(f_{1}(x)\)
- local features : \(\left\{f_{7}(x)_{i j}: \forall i, j\right\}\)
- produced by an intermediate layer in the encoder
- \(d \in\{1,7\}\) : denotes features from the top-most encoder layer with dim \(d \times d\)
- \(i\) and \(j\) : index the 2 spatial dimensions of the array of activations in layer \(d\)
Meaning of MI
= how much better we can guess the value of \(f_{7}(x)_{i j}\) when we know the value of \(f_{1}(x)\) than when we do not know the value of \(f_{1}(x)\).
Term change
- global \(\rightarrow\) antecedent features
- local \(\rightarrow\) consequent features
Construct a distribution : \(p\left(f_{1}(x), f_{7}(x)_{i j}\right)\)
- via ancestral sampling
- process
- step 1) sample an input \(x \sim \mathcal{D}\)
- step 2) sample spatial indices \(i \sim u(i)\) and \(j \sim u(j)\)
- step 3) compute features \(f_{1}(x)\) and \(f_{7}(x)_{i j}\).
Given \(p\left(f_{1}(x)\right), p\left(f_{7}(x)_{i j}\right)\) and \(p\left(f_{1}(x), f_{7}(x)_{i j}\right)\)
\(\rightarrow\) local DIM seeks an encoder that maximizes MI \(I\left(f_{1}(x) ; f_{7}(x)_{i j}\right)\) in \(p\left(f_{1}(x), f_{7}(x)_{i j}\right)\)
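The three-step ancestral sampling process above can be sketched in code. The encoder here is a hypothetical stand-in (random features of an assumed width 8), purely to show where each sample in \(p(f_{1}(x), f_{7}(x)_{ij})\) comes from:

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x):
    """Hypothetical stand-in for the real conv encoder: returns a global
    feature f1(x) (shape [c]) and a 7x7 grid of local features f7(x)
    (shape [7, 7, c]). Feature width c=8 is an assumption for illustration."""
    c = 8
    local = np.tanh(x.mean() + rng.standard_normal((7, 7, c)))
    global_ = local.mean(axis=(0, 1))
    return global_, local

# Step 1) sample an input x ~ D (here: a random 32x32 "image")
x = rng.standard_normal((32, 32))
# Step 2) sample spatial indices i ~ u(i) and j ~ u(j)
i, j = rng.integers(0, 7, size=2)
# Step 3) compute features f1(x) and f7(x)_ij
f1, f7 = encoder(x)
pair = (f1, f7[i, j])  # one sample from the joint p(f1(x), f7(x)_ij)
```

Repeating this over the dataset yields the joint distribution whose MI local DIM maximizes.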
(2) NCE (Noise-Contrastive Estimation) Loss
Maximize the NCE lower bound on \(I\left(f_{1}(x) ; f_{7}(x)_{i j}\right)\), by minimizing …
- \(\underset{\left(f_{1}(x), f_{7}(x)_{i j}\right)}{\mathbb{E}}\left[\underset{N_{7}}{\mathbb{E}}\left[\mathcal{L}_{\Phi}\left(f_{1}(x), f_{7}(x)_{i j}, N_{7}\right)\right]\right]\).
Positive & Negative
- Positive : from joint distn \(\rightarrow\) \(\left(f_{1}(x), f_{7}(x)_{i j}\right) \sim p\left(f_{1}(x), f_{7}(x)_{i j}\right)\)
- Negative ( = \(N_{7}\) ) : from marginal distn \(\rightarrow\) \(p\left(f_{7}(x)_{i j}\right)\)
\(\mathcal{L}_{\Phi}\left(f_{1}, f_{7}, N_{7}\right)=-\log \frac{\exp \left(\Phi\left(f_{1}, f_{7}\right)\right)}{\sum_{\tilde{f}_{7} \in N_{7} \cup\left\{f_{7}\right\}} \exp \left(\Phi\left(f_{1}, \tilde{f}_{7}\right)\right)}\).
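The loss \(\mathcal{L}_{\Phi}\) is a softmax cross-entropy that scores the positive pair against the negatives. A minimal numpy sketch, assuming a simple dot-product critic \(\Phi\) (the actual critic is learned; this choice is only for illustration):

```python
import numpy as np

def nce_loss(f1, f7_pos, negatives, phi):
    """NCE loss L_Phi(f1, f7, N7): negative log-probability of the positive
    pair under a softmax over the positive and all negatives."""
    scores = np.array([phi(f1, f) for f in negatives + [f7_pos]])
    scores -= scores.max()  # numerical stability
    log_prob_pos = scores[-1] - np.log(np.exp(scores).sum())
    return -log_prob_pos

# Placeholder dot-product critic (an assumption, not the paper's critic).
phi = lambda a, b: float(a @ b)

rng = np.random.default_rng(0)
f1 = rng.standard_normal(8)
f7_pos = f1 + 0.1 * rng.standard_normal(8)       # correlated positive
negs = [rng.standard_normal(8) for _ in range(16)]  # samples from the marginal
loss = nce_loss(f1, f7_pos, negs, phi)
```

Minimizing this loss pushes \(\Phi\) to score the jointly-sampled pair above the marginal negatives, which tightens the NCE lower bound on the MI.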
(3) Data Augmentation
Local DIM \(\rightarrow\) Local DIM + Data Augmentation
= extends local DIM, by maximizing MI between features from augmented views of each input.
Construct the AUGMENTED feature distn \(p_{\mathcal{A}}\left(f_{1}\left(x^{1}\right), f_{7}\left(x^{2}\right)_{i j}\right)\) as …
- Step 1) sample an input \(x \sim \mathcal{D}\)
- Step 2) sample augmented images \(x^{1} \sim \mathcal{A}(x)\) and \(x^{2} \sim \mathcal{A}(x)\)
- \(\mathcal{A}(x)\) : distn of images generated by applying stochastic DA to \(x\)
- Step 3) sample spatial indices \(i \sim u(i)\) and \(j \sim u(j)\)
- Step 4) compute features \(f_{1}\left(x^{1}\right)\) and \(f_{7}\left(x^{2}\right)_{i j}\)
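The four steps can be sketched as follows; `augment` is a hypothetical stochastic augmentation standing in for the paper's random crop / jitter / flip pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x):
    """Hypothetical A(x): a random shift plus small noise, standing in
    for the actual stochastic data augmentation."""
    shift = rng.integers(-2, 3, size=2)
    x_aug = np.roll(x, shift, axis=(0, 1))
    return x_aug + 0.05 * rng.standard_normal(x.shape)

# Step 1) sample an input x ~ D
x = rng.standard_normal((32, 32))
# Step 2) sample two augmented views x1, x2 ~ A(x)
x1, x2 = augment(x), augment(x)
# Step 3) sample spatial indices i ~ u(i) and j ~ u(j)
i, j = rng.integers(0, 7, size=2)
# Step 4) the encoder would then compute f1(x1) and f7(x2)_ij
```

Because the two views come from independent draws of \(\mathcal{A}(x)\), the features must capture information that survives the augmentation noise.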
(4) Multiscale Mutual Information
Local DIM + Data Augmentation \(\rightarrow\) AMDIM ( Augmented Multiscale DIM )
= extend local DIM, by maximizing MI across multiple feature scales
\(n\)-to-\(m\) infomax costs :
\(\underset{\left(f_{n}\left(x^{1}\right)_{i j}, f_{m}\left(x^{2}\right)_{k l}\right)}{\mathbb{E}}\left[\underset{N_{m}}{\mathbb{E}}\left[\mathcal{L}_{\Phi}\left(f_{n}\left(x^{1}\right)_{i j}, f_{m}\left(x^{2}\right)_{k l}, N_{m}\right)\right]\right]\).
- ex) \(p_{\mathcal{A}}\left(f_{5}\left(x^{1}\right)_{i j}, f_{7}\left(x^{2}\right)_{k l}\right)\)
- ex) \(p_{\mathcal{A}}\left(f_{5}\left(x^{1}\right)_{i j}, f_{5}\left(x^{2}\right)_{k l}\right)\)
- ex) \(p_{\mathcal{A}}\left(f_{1}\left(x^{1}\right), f_{5}\left(x^{2}\right)_{k l}\right)\)
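The \(n\)-to-\(m\) costs over the scale pairs above sum into one objective. A minimal sketch, with a placeholder dot-product critic and random stand-in features (the scale pairs 5-to-7, 5-to-5, and 1-to-5 follow the three examples):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = lambda a, b: float(a @ b)  # placeholder critic, not the learned one

def nce_loss(fa, fb_pos, negatives):
    """Same NCE loss as before, applied to one antecedent/consequent pair."""
    scores = np.array([phi(fa, f) for f in negatives + [fb_pos]])
    scores -= scores.max()
    return -(scores[-1] - np.log(np.exp(scores).sum()))

def feat(d, c=8):
    """Placeholder for one feature vector f_d(x)_ij drawn from the
    d x d grid; real features would come from the encoder."""
    return rng.standard_normal(c)

# AMDIM sums the n-to-m infomax costs over several scale pairs,
# e.g. 5-to-7, 5-to-5, and 1-to-5 as in the examples above.
scale_pairs = [(5, 7), (5, 5), (1, 5)]
negs = [rng.standard_normal(8) for _ in range(16)]
total = sum(nce_loss(feat(n), feat(m), negs) for n, m in scale_pairs)
```

Each term reuses the same NCE machinery; only the scales of the antecedent and consequent features change.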