IntroVAE : Introspective Variational Autoencoders for Photographic Image Synthesis (NeurIPS 2018)

figure2

Abstract

introduce IntroVAE for synthesizing high-resolution photographic images

IntroVAE

  • Introspective VAE
  • inference & generator are JOINTLY trained
    • generator : required to reconstruct input ( from the noisy output of inference model )
    • inference model : encouraged to classify REAL vs GENERATED


1. Introduction

examples of generative models

  • VAEs, GANs, RealNVP, GMMNs ( Generative moment matching networks )

  • 2 most prominent models : VAEs & GANs


VAE & GAN

  • (GAN) requires multi-scale discriminators to decompose the “high-resolution” task into “from-low-to-high” resolution sub-tasks

  • (GAN/VAE) hybrids impose a discriminator on the data space to improve the quality of the generated results
  • such “hybrid models” exist…. but they still lag behind GANs in image quality


Introduce IntroVAE

  • simple, yet effective approach to training VAEs!
  • model can self-estimate the differences between REAL vs GENERATED
  • (1) Inference model
    • MINIMIZE divergence between “approximate posterior” & “prior” for REAL data
    • MAXIMIZE divergence between “approximate posterior” & “prior” for GENERATED data
  • (2) Generator model

    • mislead the inference model
    • MINIMIZE the divergence between the posterior of GENERATED samples & the prior
  • acts like a VAE for real data & like a GAN for generated data


Contribution :

  • 1) new training technique for VAEs, in an introspective manner

    ( model itself estimates the difference between REAL vs FAKE )

  • 2) propose a single-stage adversarial model


2. Background

VAEs

  • \(\log p_{\theta}(x) \geq E_{q_{\phi}(z \mid x)} \log p_{\theta}(x \mid z)-D_{K L}\left(q_{\phi}(z \mid x) \mid \mid p(z)\right)\).
  • limitation : generated samples are BLURRY
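
A minimal sketch of this objective, assuming a Gaussian encoder that outputs \(\mu, \log\sigma^2\) and an MSE (Gaussian-likelihood) decoder; the function names and shapes below are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """z = mu + sigma * eps, eps ~ N(0, I)  (reparameterization trick)."""
    return mu + (0.5 * logvar).exp() * torch.randn_like(mu)

def negative_elbo(x, x_recon, mu, logvar):
    """-ELBO = reconstruction term + D_KL( q_phi(z|x) || N(0, I) )."""
    recon = F.mse_loss(x_recon, x, reduction="sum")               # ~ -E_q[ log p_theta(x|z) ]
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL to the standard normal prior
    return recon + kl
```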


GANs

  • min-max game

    ( 2 models : \(G\) ( generator ) & \(D\) ( discriminator ) )

  • \(\min _{G} \max _{D} E_{x \sim p_{\text {data }}(x)}[\log D(x)]+E_{z \sim p_{z}(z)}[\log (1-D(G(z)))]\).

  • promising tools for generating sharp images, but difficult to train
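
A minimal sketch of this min-max game, assuming \(D\) outputs a probability in \((0,1)\); for the generator, the non-saturating loss commonly used in practice is shown instead of the literal \(\log(1-D(G(z)))\) term:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, x_real, x_fake):
    """Maximize log D(x) + log(1 - D(G(z))), written as two BCE terms to minimize."""
    d_real = D(x_real)
    d_fake = D(x_fake.detach())        # do not backprop into the generator here
    return F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
           F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))

def generator_loss(D, x_fake):
    """Non-saturating variant: maximize log D(G(z))."""
    d_fake = D(x_fake)
    return F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
```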


Hybrid Models of VAEs and GANs

usually consist of 3 components

  • encoder
  • decoder
  • discriminator


[Ulyanov et al] adversarial generator-encoder networks (AGE)

  • share similarities with IntroVAE

[Brock et al] Introspective Adversarial Network (IAN)

  • encoder & discriminator share most of the layers except last layer
  • adversarial loss is a “variation of the standard GAN loss”


3. Approach

how to train VAEs in an introspective manner?

  • 1) the inference model needs to discriminate REAL vs GENERATED samples

  • 2) the generator should mislead 1)


Overview

  • Select inference model (=encoder) as “discriminator of GANs”

  • Select generator model as “generator of GANs”

  • Train Jointly!


2 components in ELBO of VAEs

  • \(L_{A E}=-E_{q_{\phi}(z \mid x)} \log p_{\theta}(x \mid z)\).
  • \(L_{R E G}=D_{K L}\left(q_{\phi}(z \mid x) \mid \mid p(z)\right)\).

\(\rightarrow\) modified combination of these 2 terms


3-1. Adversarial Distribution Matching

(1) Inference model

  • minimize \(L_{R E G}\)

    ( encourage posterior of REAL data to match prior )

  • maximize \(L_{R E G}\)

    ( encourage posterior of FAKE data to deviate from prior)


(2) Generator

  • produce FAKE samples that have small \(L_{R E G}\)


2 different losses :

  • to train inference model \(E\) : \(L_{E}(x, z)=E(x)+[m-E(G(z))]^{+}\)
  • to train generator \(G\) : \(L_{G}(z)=E(G(z))\)

where \(E(x)=D_{K L}\left(q_{\phi}(z \mid x) \mid \mid p(z)\right)\) & \([\cdot]^{+}=\max (0, \cdot)\)
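
A hedged sketch of these two losses, using the encoder's KL to the prior as the energy \(E(x)\); the assumption that `encoder` returns \((\mu, \log\sigma^2)\) and the margin value are illustrative:

```python
import torch

def energy(encoder, x):
    """E(x) = D_KL( q_phi(z|x) || N(0, I) ) for a Gaussian posterior."""
    mu, logvar = encoder(x)
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

def inference_loss(encoder, generator, x_real, z, m=10.0):
    """L_E(x, z) = E(x) + [m - E(G(z))]^+ : keep real energy low, push fake energy up toward m."""
    e_fake = energy(encoder, generator(z).detach())   # the generator is not updated here
    return energy(encoder, x_real) + torch.clamp(m - e_fake, min=0.0)

def generator_loss(encoder, generator, z):
    """L_G(z) = E(G(z)) : generated samples should obtain low energy, i.e. look real to E."""
    return energy(encoder, generator(z))
```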


Relationships with other GANs

  • proposed method appears to be similar to Energy-based GANs (EBGAN)
  • proposed KL-divergence can be seen as a specific type of energy function


3-2. Introspective Variational Inference

(1) Prior : \(N(0, I)\)

(2) Posterior : \(q_{\phi}(z \mid x)=N\left(z ; \mu, \sigma^{2}\right)\)

  • input \(z\) of \(G\) is sampled from posterior, using reparam trick

(3) KL-divergence ( \(L_{R E G}\) )

  • \(L_{R E G}(z ; \mu, \sigma)=-\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M_{z}}\left(1+\log \left(\sigma_{i j}^{2}\right)-\mu_{i j}^{2}-\sigma_{i j}^{2}\right)\).

(4) Reconstruction error ( \(L_{A E}\) )…. MSE

  • \[L_{A E}\left(x, x_{r}\right)=\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M_{x}}\left(x_{r, i j}-x_{i j}\right)^{2}\]
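
The two terms above written out directly as code (a sketch; `mu` / `logvar` are the encoder outputs and the sums run over the batch and the latent / pixel dimensions):

```python
import torch

def l_reg(mu, logvar):
    """L_REG = D_KL( N(mu, sigma^2) || N(0, I) ) = -1/2 * sum( 1 + log sigma^2 - mu^2 - sigma^2 )."""
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

def l_ae(x, x_r):
    """L_AE = 1/2 * sum (x_r - x)^2  (pixel-wise squared error)."""
    return 0.5 * torch.sum((x_r - x).pow(2))
```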


Like VAE/GAN..

  • train the inference model to discriminate both (1) model samples ( decoded from the prior ) & (2) reconstructions

combined use of samples from \(p(z)\) and \(q_{\phi}(z \mid x)\) is expected to provide a more useful signal for the model to learn a more expressive latent code and synthesize more realistic samples


Total Loss ( for the inference model \(E\) ) :

\(\begin{aligned} L_{E} &=L_{R E G}(z)+\alpha \sum_{s=r, p}\left[m-L_{R E G}\left(z_{s}\right)\right]^{+}+\beta L_{A E}\left(x, x_{r}\right) \\ &=L_{R E G}(\operatorname{Enc}(x))+\alpha \sum_{s=r, p}\left[m-L_{R E G}\left(\operatorname{Enc}\left(n g\left(x_{s}\right)\right)\right)\right]^{+}+\beta L_{A E}\left(x, x_{r}\right) \end{aligned}\).

where \(x_{r}\) is the reconstruction of \(x\), \(x_{p}\) is a sample decoded from the prior, and \(n g(\cdot)\) denotes stop-gradient.
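
A hedged sketch of one alternating training step built from \(L_E\) above; the paper pairs it with a generator loss of the form \(L_{G}=\alpha \sum_{s=r, p} L_{R E G}\left(\operatorname{Enc}\left(x_{s}\right)\right)+\beta L_{A E}\left(x, x_{r}\right)\). The encoder/decoder interfaces and the hyperparameter values (\(m, \alpha, \beta\)) below are assumptions for illustration:

```python
import torch

def kl(mu, logvar):
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

def reparameterize(mu, logvar):
    return mu + (0.5 * logvar).exp() * torch.randn_like(mu)

def hinge(t):
    return torch.clamp(t, min=0.0)

def train_step(enc, dec, opt_e, opt_g, x, m=120.0, alpha=0.25, beta=0.5):
    # ---- inference-model update ( L_E above ) ------------------------------
    mu, logvar = enc(x)
    z = reparameterize(mu, logvar)
    z_p = torch.randn_like(z)                       # sample from the prior N(0, I)
    x_r, x_p = dec(z), dec(z_p)                     # reconstruction & prior sample
    l_rec = 0.5 * torch.sum((x_r - x).pow(2))       # L_AE(x, x_r)
    l_adv = hinge(m - kl(*enc(x_r.detach()))) + hinge(m - kl(*enc(x_p.detach())))  # ng(.) = detach
    loss_e = kl(mu, logvar) + alpha * l_adv + beta * l_rec
    opt_e.zero_grad(); loss_e.backward(); opt_e.step()

    # ---- generator update ( L_G ), with a fresh forward pass ---------------
    mu, logvar = enc(x)
    z = reparameterize(mu, logvar)
    x_r, x_p = dec(z), dec(torch.randn_like(z))
    l_rec = 0.5 * torch.sum((x_r - x).pow(2))
    loss_g = alpha * (kl(*enc(x_r)) + kl(*enc(x_p))) + beta * l_rec
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_e.item(), loss_g.item()
```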

figure2

