[Paper Review] 02.GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium


Contents

  0. Abstract
  1. Introduction
  2. TTUR ( Two Time-Scale Update Rule ) for GANs
  3. Experiments


0. Abstract

GANs : excel at creating realistic images

BUT, “convergence” has still not been proven!

\(\rightarrow\) propose a two time-scale update rule (TTUR)


TTUR : uses an INDIVIDUAL LEARNING RATE for each of the discriminator & the generator


1. Introduction

  • FID score = LOWER, the BETTER
  • Original GAN vs TTUR GAN

( Figure 2 : original GAN vs TTUR GAN )


Contributions

  • 1) two time-scale update rule for GANs

  • 2) GANs trained with TTUR converge to a stationary local Nash equilibrium

  • 3) introduce FID ( Fréchet Inception Distance ) to evaluate GANs

    ( more consistent than Inception score )


2. TTUR ( Two Time-Scale Update Rule ) for GANs

Notation

  • \(D(\cdot\, ; \boldsymbol{w})\) : discriminator
  • \(G(\cdot\, ; \boldsymbol{\theta})\) : generator


Gradients

Parameter updates are based on..

  • 1) a stochastic gradient \(\tilde{\boldsymbol{g}}(\boldsymbol{\theta}, \boldsymbol{w})\) of the discriminator’s loss function \(\mathcal{L}_{D}\)
  • 2) a stochastic gradient \(\tilde{\boldsymbol{h}}(\boldsymbol{\theta}, \boldsymbol{w})\) of the generator’s loss function \(\mathcal{L}_{G}\)

( Gradients \(\tilde{\boldsymbol{g}}(\boldsymbol{\theta}, \boldsymbol{w})\) and \(\tilde{\boldsymbol{h}}(\boldsymbol{\theta}, \boldsymbol{w})\) are stochastic, since they use randomly chosen mini-batches of \(m\) real-world samples \(\boldsymbol{x}^{(i)}, 1 \leqslant i \leqslant m\) and \(m\) synthetic samples \(\boldsymbol{z}^{(i)}, 1 \leqslant i \leqslant m\) )


True gradients :

  • \(\boldsymbol{g}(\boldsymbol{\theta}, \boldsymbol{w})=\nabla_{\boldsymbol{w}} \mathcal{L}_{D}\) , so that \(\tilde{\boldsymbol{g}}(\boldsymbol{\theta}, \boldsymbol{w})=\boldsymbol{g}(\boldsymbol{\theta}, \boldsymbol{w})+\boldsymbol{M}^{(w)}\)

  • \(\boldsymbol{h}(\boldsymbol{\theta}, \boldsymbol{w})=\nabla_{\boldsymbol{\theta}} \mathcal{L}_{G}\) , so that \(\tilde{\boldsymbol{h}}(\boldsymbol{\theta}, \boldsymbol{w})=\boldsymbol{h}(\boldsymbol{\theta}, \boldsymbol{w})+\boldsymbol{M}^{(\theta)}\)

    ( where \(\boldsymbol{M}^{(w)}\) and \(\boldsymbol{M}^{(\theta)}\) are random noise variables )

\(\rightarrow\) \(\tilde{\boldsymbol{g}}(\boldsymbol{\theta}, \boldsymbol{w})\) and \(\tilde{\boldsymbol{h}}(\boldsymbol{\theta}, \boldsymbol{w})\) are stochastic approximations to the true gradients!


Learning Rate \(b(n), a(n)\)

  • discriminator’s LR : \(b(n)\)
  • generator’s LR : \(a(n)\)
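
With these learning rates, the TTUR update ( following the paper's sign convention ) reads :

\(\boldsymbol{w}_{n+1}=\boldsymbol{w}_{n}+b(n)\left(\boldsymbol{g}\left(\boldsymbol{\theta}_{n}, \boldsymbol{w}_{n}\right)+\boldsymbol{M}_{n}^{(w)}\right)\)

\(\boldsymbol{\theta}_{n+1}=\boldsymbol{\theta}_{n}+a(n)\left(\boldsymbol{h}\left(\boldsymbol{\theta}_{n}, \boldsymbol{w}_{n}\right)+\boldsymbol{M}_{n}^{(\theta)}\right)\)

In practice this just means one optimizer per network, each with its own learning rate. A minimal PyTorch sketch ( the tiny networks and the learning-rate values are illustrative assumptions, not the paper's settings ) :

```python
import torch
from torch import nn

# tiny stand-in networks; real G and D would be deep conv nets
G = nn.Linear(16, 32)   # generator  G(. ; theta)
D = nn.Linear(32, 1)    # discriminator  D(. ; w)

# TTUR in practice : one optimizer per network, with individual learning rates
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)  # generator LR  a(n)
opt_D = torch.optim.Adam(D.parameters(), lr=4e-4)  # discriminator LR  b(n)
```

In the paper's analysis, the discriminator runs on the faster of the two time scales, so it can track the slowly changing generator.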


3. Experiments

Performance Measures

defining appropriate performance measures for generative models is HARD

  • ex) likelihood ( estimated by annealed importance sampling )

    \(\rightarrow\) [drawback] heavily depends on noise assumptions

  • ex) Inception Score

    • correlates with human judgement
    • generated samples \(\rightarrow\) fed to an Inception model trained on ImageNet
    • meaningful image = LOW entropy of the label distribution \(p(y \mid \boldsymbol{x})\)

    \(\rightarrow\) [drawback] statistics of real-world samples are NOT USED, and thus never compared to the statistics of synthetic samples
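
For context, the Inception Score referenced above is standardly defined as ( this definition is well-known but not restated in the summary ) :

\(\operatorname{IS}(G)=\exp \left(\mathbb{E}_{\boldsymbol{x} \sim p_{g}}\left[D_{\mathrm{KL}}(p(y \mid \boldsymbol{x}) \,\|\, p(y))\right]\right)\)

Only synthetic samples \(\boldsymbol{x} \sim p_{g}\) enter the formula, which is exactly why the statistics of real-world samples never get used.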


Improve the Inception Score!

FID ( Fréchet Inception Distance ) :

Fréchet distance \(d(\cdot, \cdot)\) between

  • 1) the Gaussian with mean & covariance \((\boldsymbol{m}, \boldsymbol{C})\) obtained from the real-world distribution \(p(\cdot)\)
  • 2) the Gaussian with mean & covariance \(\left(\boldsymbol{m}_{w}, \boldsymbol{C}_{w}\right)\) obtained from the model distribution \(p_{w}(\cdot)\)

\(d^{2}\left((\boldsymbol{m}, \boldsymbol{C}),\left(\boldsymbol{m}_{w}, \boldsymbol{C}_{w}\right)\right)=\left\|\boldsymbol{m}-\boldsymbol{m}_{w}\right\|_{2}^{2}+\operatorname{Tr}\left(\boldsymbol{C}+\boldsymbol{C}_{w}-2\left(\boldsymbol{C} \boldsymbol{C}_{w}\right)^{1 / 2}\right)\).
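
A minimal sketch of this formula in Python ( the function name and toy statistics are my own; a real FID computation would take \(\boldsymbol{m}, \boldsymbol{C}\) from Inception pooling-layer activations of real images and \(\boldsymbol{m}_{w}, \boldsymbol{C}_{w}\) from generated ones ) :

```python
import numpy as np
from scipy import linalg

def frechet_distance(m, C, m_w, C_w):
    """Squared Frechet distance d^2((m, C), (m_w, C_w)) between two Gaussians."""
    # matrix square root of the product of the covariances
    covmean, _ = linalg.sqrtm(C @ C_w, disp=False)
    if np.iscomplexobj(covmean):   # drop tiny imaginary parts from numerical error
        covmean = covmean.real
    diff = m - m_w
    return diff @ diff + np.trace(C + C_w - 2.0 * covmean)

# toy usage with random feature statistics
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
fake = rng.normal(loc=0.5, size=(500, 8))
m, C = real.mean(axis=0), np.cov(real, rowvar=False)
m_w, C_w = fake.mean(axis=0), np.cov(fake, rowvar=False)
print(frechet_distance(m, C, m_w, C_w))   # lower = more similar distributions
```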
