IntroVAE : Introspective Variational Autoencoders for Photographic Image Synthesis (NeurIPS 2018)

Abstract

introduce IntroVAE for synthesizing high-resolution photographic images

IntroVAE

Introspective VAE
inference & generator are JOINTLY trained
- generator : required to reconstruct input ( from the noisy output of inference model )
- inference model : encourage to classify REAL vs GENERATED

1. Introduction

generative models example

VAEs, GANs, RealNVP, GMMNs ( Generative moment matching networks )
2 most prominent models : VAEs & GANs

VAE & GAN

(GAN) require multi-scale discriminators to decompose “high” \(\rightarrow\) “from-low-to-high” resolution tasks
(GAN/VAE) imposes discriminator on data space to improve the quality of the result generated
also “hybrid models” exists…. still lage behind GANs in image quality

Introduce IntroVAE

simple, yet effective approach to training VAEs!
model can self-estimate the differences between REAL vs GENERATED
(1) Inference model
- MINIMIZE divergence of “approximate posterior” & “prior for REAL data”
- MAXIMIZE divergence of “approximate posterior” & “prior for FAKE data”
(2) Generator model
- mislead the inference model
- MINIMIZE the divergence of generated samples
acts like VAE for real data

acts like GAN for generated data

Contribution :

1) new training technique for VAEs, in introspective manner

( model itself estimates the difference between REAL vs FAKE )
2) propose a single-stage adversarial model

2. Background

VAEs

\(\log p_{\theta}(x) \geq E_{q_{\phi}(z \mid x)} \log p_{\theta}(x \mid z)-D_{K L}\left(q_{\phi}(z \mid x) \mid \mid p(z)\right)\).
limitation : generated samples are BLURRY

GANs

min-max game

( 2 models : \(G\) ( generator ) & \(D\) ( discriminator ) )
\(\min _{G} \max _{D} E_{x \sim p_{\text {data }}(x)}[\log D(x)]+E_{z \sim p_{z}(z)}[\log (1-D(G(z)))]\).
promising tools for generating sharp images, but difficult to train

Hybrid Models of VAEs and GANs

usually consists of 3 components

encoder
decoder
discriminator

[Ulyanov et al] adversarial generator-encoder networks (AGE)

share similarities with IntroVAE

[Brock et al] Introspective Adversarial Network (IAN)

encoder & discriminator share most of the layers except last layer
adversarial loss is a “variation of the standard GAN loss”

3. Approach

how to train VAEs in introspective manner?

1) needs to discriminate REAL vs FAKE
2) should mislead 1)

Overview

Select inference model (=encoder) as “discriminator of GANs”
Select generator model as “generator of GANs”
Train Jointly!

2 components in ELBO of VAEs

\(L_{A E}=-E_{q_{\phi}(z \mid x)} \log p_{\theta}(x \mid z)\).
\(L_{R E G}=D_{K L}\left(q_{\phi}(z \mid x) \mid \mid p(z)\right)\).

\(\rightarrow\) modified combination of these 2 terms

3-1. Adversarial Distribution Matching

(1) Inference model

minimize \(L_{R E G}\)

( encourage posterior of REAL data to match prior )
maximize \(L_{R E G}\)

( encourage posterior of FAKE data to deviate from prior)

(2) Generator

produce FAKE that have small \(L_{R E G}\)

2 different losses :

to train inference model \(E\) : \(L_{E}(x, z)=E(x)+[m-E(G(z))]^{+}\)
to train generator \(G\) : \(L_{G}(z)=E(G(z))\)

where \(E(x)=D_{K L}\left(q_{\phi}(z \mid x) \mid \mid p(z)\right)\) & \([\cdot]^{+}=\max (0, \cdot)\)

Relationships with other GANs

proposed method appears to be similar to Energy-based GANs (EBGAN)
proposed KL-divergence can be seen as a specific type of energy function

3-2. Introspective Variational Inference

(1) Prior : \(N(0, I)\)

(2) Posterior : \(q_{\phi}(z \mid x)=N\left(z ; \mu, \sigma^{2}\right)\)

input \(z\) of \(G\) is sampled from posterior, using reparam trick

(3) KL-divergence ( \(L_{R E G}\) )

\(L_{R E G}(z ; \mu, \sigma)=\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M_{z}}\left(1+\log \left(\sigma_{i j}^{2}\right)-\mu_{i j}^{2}-\sigma_{i j}^{2}\right)\).

(4) Reconstruction error ( \(L_{A E}\) )…. MSE

\[L_{A E}\left(x, x_{r}\right)=\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M_{x}} \mid \mid x_{r, i j}-x_{i j}\mid \mid_{F}^{2}\]

Like VAE/GAN..

train to discriminate samples from both the (1) model samples & (2) reconstructions

combined use of samples from \(p(z)\) and \(q_{\phi}(z \mid x)\) is expected to provide a more useful signal for the model to learn more expressive latent code and synthesize more realistic samples

Total Loss :

\(\begin{aligned} L_{E} &=L_{R E G}(z)+\alpha \sum_{s=r, p}\left[m-L_{R E G}\left(z_{s}\right)\right]^{+}+\beta L_{A E}\left(x, x_{r}\right) \\ &=L_{R E G}(\operatorname{Enc}(x))+\alpha \sum_{s=r, p}\left[m-L_{R E G}\left(\operatorname{Enc}\left(n g\left(x_{s}\right)\right)\right)\right]^{+}+\beta L_{A E}\left(x, x_{r}\right) \end{aligned}\).

Twitter Facebook LinkedIn

74.IntroVAE, Introspective Variational Autoencoders for Photographic Image SynthesisNatural Gradient

Seunghan Lee