[Implicit DGM] 12. Difference of Two Probability Distributions

( Reference : Prof. Il-Chul Moon (문일철), KAIST : Machine Learning Lecture in Korean, Advanced 3 )


Contents

  1. Difference of Two Probability Distributions
  2. Integral Probability Metric (IPM)
    1. Total Variation Distance
    2. Wasserstein metric
    3. Maximum Mean Discrepancy (MMD)
  3. GAN + MMD


1. Difference of Two Probability Distributions

The "difference" between two distributions can be measured via :

  • (1) a ratio ( \(p/q\) )
  • (2) a difference ( \(|p-q|\) )


\(f\)-divergence is not the only method!

  • \(f\)-divergence : \(D_{f}(P \,\|\, Q)=\int_{x} q(x) f\left(\frac{p(x)}{q(x)}\right) d x\).
    • ratio method ( ratio = \(\frac{p(x)}{q(x)}\) ; see the KL example below )
  • But… what if "support of \(q\)" \(\neq\) "support of \(p\)" ??
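
For example, choosing \(f(t)=t \log t\) recovers the KL divergence, which makes the dependence on the ratio explicit:

\(D_{KL}(P \,\|\, Q)=\int_{x} q(x) \frac{p(x)}{q(x)} \log \frac{p(x)}{q(x)} d x=\int_{x} p(x) \log \frac{p(x)}{q(x)} d x\).

The \(\log\) of the ratio blows up wherever \(q(x)=0\) but \(p(x)>0\), which is exactly the support-mismatch problem above.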


Requirements ( of \(f\)-divergence )

  • the support of \(q(x)\) needs to be wider than ( i.e., cover ) the support of \(p(x)\)

    ( if not … numerical instability! the ratio can diverge ; a numerical sketch follows this list )

  • Mode collapse

    • the integrand \(q(x) f\left(\frac{p(x)}{q(x)}\right)\) is weighted by \(q(x)\), so the ratio is effectively ignored wherever \(q(x) \rightarrow 0\) ( the generator can drop modes of \(p\) )
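
A minimal numerical sketch of both issues ( the toy uniform densities below are my own assumption, not from the lecture ) : when the supports do not match, the ratio-based KL divergence blows up, while a difference-based quantity stays finite.

```python
import numpy as np

# Toy densities on a grid: p is Uniform(0, 1), q is Uniform(0.5, 1.5),
# so part of the support of p is not covered by q.
grid = np.linspace(-1.0, 2.0, 3001)
dx = grid[1] - grid[0]

def uniform_pdf(x, lo, hi):
    """Density of Uniform(lo, hi) evaluated on the grid."""
    return np.where((x >= lo) & (x <= hi), 1.0 / (hi - lo), 0.0)

p = uniform_pdf(grid, 0.0, 1.0)
q = uniform_pdf(grid, 0.5, 1.5)

# Ratio method (KL as the f-divergence with f(t) = t log t):
# wherever p > 0 but q = 0, the ratio p/q diverges -> KL is infinite.
overlap = (p > 0) & (q > 0)
if np.any((p > 0) & (q == 0)):
    kl = np.inf
else:
    kl = np.sum(p[overlap] * np.log(p[overlap] / q[overlap])) * dx
print("KL(p||q) :", kl)   # inf (supports do not match)

# Difference method (total variation) stays finite and still informative.
tv = 0.5 * np.sum(np.abs(p - q)) * dx
print("TV(p, q) :", tv)   # ~ 0.5
```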


Alternative :

why not use the "DIFFERENCE" method?

\(\rightarrow\) IPM ( Integral Probability Metrics )


2. Integral Probability Metric (IPM)

\(d_{\mathcal{G}}(\mu, \nu)=\sup _{g \in \mathcal{G}}\left\{\left|\int g \, d \mu-\int g \, d \nu\right|\right\}\).

  • “difference” method
  • different function class \(\mathcal{G}\) \(\rightarrow\) various types of IPM
    • Ex) Total variation distance, Wasserstein metric, Maximum Mean Discrepancy


(1) Total Variation Distance

\(\mathcal{G}\) : class of all measurable functions taking values in \([-1,1]\)

  • \(\delta\left(P_{r}, P_{g}\right)=\sup _{A \in \Sigma}\left|P_{r}(A)-P_{g}(A)\right|\).
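
A toy sketch ( the discrete distributions are my own example ) : for distributions over a finite set, the supremum over events \(A\) is attained and equals half of the \(L_1\) distance between the probability vectors.

```python
import numpy as np

# Two discrete distributions over the same 4 outcomes (values assumed).
p_r = np.array([0.4, 0.3, 0.2, 0.1])
p_g = np.array([0.1, 0.2, 0.3, 0.4])

# sup_A |P_r(A) - P_g(A)| is attained by A = {outcomes where p_r > p_g},
# which gives exactly half the L1 distance.
tv = 0.5 * np.abs(p_r - p_g).sum()
print(tv)   # 0.4
```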


(2) Wasserstein metric

\(\mathcal{G}\) : class of 1-Lipschitz functions

  • ex) Wasserstein-1, a.k.a. Earth-Mover Distance (EMD)

  • \(W\left(P_{r}, P_{g}\right)=\inf _{\gamma \in \Pi\left(P_{r}, P_{g}\right)} \mathrm{E}_{(x, y) \sim \gamma}[\|x-y\|]\).

    • ( this optimal-transport form equals the IPM form over 1-Lipschitz functions, by Kantorovich-Rubinstein duality )
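
A 1-D sketch ( toy Gaussians of my own choosing ) using `scipy.stats.wasserstein_distance`, which computes the Wasserstein-1 / EMD between two empirical samples:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=5000)   # samples ~ P_r
y = rng.normal(loc=2.0, scale=1.0, size=5000)   # samples ~ P_g

# The two distributions have the same shape but means 2 apart,
# so the "earth-moving" cost should come out close to 2.
print(wasserstein_distance(x, y))
```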


(3) Maximum Mean Discrepancy (MMD)

\(\mathcal{G}\) : unit ball of an RKHS

  • Kernel / basis mapping function : the modeler can choose this directly
  • \(\operatorname{MMD}\left(P_{r}, P_{g}\right)=\left\|E_{x \sim P_{r}}[\psi(x)]-E_{y \sim P_{g}}[\psi(y)]\right\|_{\mathcal{H}}\).


3. GAN + MMD

( Review ) GAN's \(f\)-divergence objective

\(\begin{aligned}D_{f}(P \,\|\, Q)&=\int_{x} q(x) f\left(\frac{p(x)}{q(x)}\right) d x \\& \geq \sup _{\tau \in \mathcal{T}}\left\{E_{x \sim p(x)}[\tau(x)]-E_{x \sim q(x)}\left[f^{*}(\tau(x))\right]\right\}\end{aligned}\).
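
For example, with the KL choice \(f(t)=t \log t\), the convex conjugate is \(f^{*}(t)=e^{t-1}\), and the bound becomes:

\(D_{KL}(P \,\|\, Q) \geq \sup _{\tau \in \mathcal{T}}\left\{E_{x \sim p(x)}[\tau(x)]-E_{x \sim q(x)}\left[e^{\tau(x)-1}\right]\right\}\).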


In the objective above, let's replace the \(f\)-divergence with an IPM!

( \(MMD\left(P_{r}, P_{g}\right)\) instead of \(D_{f}(P \,\|\, Q)\) )


First, let's write down the \(MMD^2\) expression :

\(MMD^{2}\left(P_{r}, P_{g}\right)=\left\|E_{x \sim P_{r}}[\psi(x)]-E_{y \sim P_{g}}[\psi(y)]\right\|_{\mathcal{H}}^{2}=\left\|\mu_{p}-\mu_{q}\right\|_{\mathcal{H}}^{2}\).


In the expression above, \(\psi\) is the feature ( basis ) map that defines the kernel ( a small sketch follows this list ).

  • \(\psi(x)=x\) : compares the "means"
  • \(\psi(x)=\left(x, x^{2}\right)\) : compares the "means & variances"
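
A small sketch of this ( toy Gaussian data of my own choosing, with an explicit finite-dimensional \(\psi\) ) :

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10000)   # P_r : N(0, 1)
y = rng.normal(0.0, 2.0, size=10000)   # P_g : N(0, 4) -- same mean, larger variance

def mmd_with_feature_map(psi, x, y):
    """|| E[psi(x)] - E[psi(y)] || with an explicit feature map psi."""
    return np.linalg.norm(psi(x).mean(axis=0) - psi(y).mean(axis=0))

# psi(x) = x : only the means are compared, so the two look identical.
print(mmd_with_feature_map(lambda t: t[:, None], x, y))                     # ~ 0

# psi(x) = (x, x^2) : second moments are compared too, so they differ clearly.
print(mmd_with_feature_map(lambda t: np.stack([t, t ** 2], axis=1), x, y))  # ~ 3
```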


In the expression above, \(\mu\) is the kernel mean embedding ( derivation below ) :

  • \(\mu_{p}=\int k(\cdot, x)\, p(d x) \in \mathcal{H}\)
  • however, since we cannot access \(p\) and \(q\) directly, we rely on the fact that \(E[f(X)]=\left\langle f, \mu_{p}\right\rangle_{\mathcal{H}}\) for every \(f \in \mathcal{H}\), so everything can be estimated from samples through the kernel
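
The second bullet is the standard reproducing-property argument, spelled out for completeness. Since \(f(x)=\langle f, k(\cdot, x)\rangle_{\mathcal{H}}\) for every \(f \in \mathcal{H}\),

\(E_{x \sim p}[f(x)]=E_{x \sim p}\left[\langle f, k(\cdot, x)\rangle_{\mathcal{H}}\right]=\left\langle f, E_{x \sim p}[k(\cdot, x)]\right\rangle_{\mathcal{H}}=\left\langle f, \mu_{p}\right\rangle_{\mathcal{H}}\),

so expectations under \(p\) reduce to inner products with the mean embedding \(\mu_{p}\), which can be estimated from samples.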


Expanding the \(MMD^2\) expression above ( the last line is the empirical estimate from samples \(x_{1:N} \sim P_r\), \(y_{1:M} \sim P_g\) ) :

\(\begin{aligned}MMD^{2}\left(P_{r}, P_{g}\right)&=\left\|E_{x \sim P_{r}}[\psi(x)]-E_{y \sim P_{g}}[\psi(y)]\right\|_{\mathcal{H}}^{2} \\&=E_{x, x^{\prime}}\left[k\left(x, x^{\prime}\right)\right]-2 E_{x, y}[k(x, y)]+E_{y, y^{\prime}}\left[k\left(y, y^{\prime}\right)\right] \\& \approx \frac{1}{N(N-1)} \sum_{n \neq n^{\prime}} k\left(x_{n}, x_{n^{\prime}}\right)+\frac{1}{M(M-1)} \sum_{m \neq m^{\prime}} k\left(y_{m}, y_{m^{\prime}}\right)-\frac{2}{M N} \sum_{m=1}^{M} \sum_{n=1}^{N} k\left(y_{m}, x_{n}\right) \end{aligned}\).
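
A minimal NumPy sketch of this unbiased estimator ( the RBF kernel, its bandwidth, and the sample sizes are my own assumptions ) :

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased estimate of MMD^2 following the expansion above.

    x : (N, d) samples from P_r,  y : (M, d) samples from P_g.
    """
    n, m = len(x), len(y)
    k_xx = rbf_kernel(x, x, sigma)
    k_yy = rbf_kernel(y, y, sigma)
    k_xy = rbf_kernel(x, y, sigma)
    # Drop the diagonals so the first two sums run over n != n' and m != m'.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (n * (n - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * k_xy.mean()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(500, 1))                        # P_r
print(mmd2_unbiased(x, rng.normal(1.0, 1.0, size=(500, 1))))   # clearly > 0 (shifted mean)
print(mmd2_unbiased(x, rng.normal(0.0, 1.0, size=(500, 1))))   # close to 0 (same distribution)
```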


In the expression above, where does the training of \(D\) ( the discriminator ) come in?

  • In \(f\)-GAN, approximating the optimal \(\tau\) was exactly the process of finding the optimized \(D\).
  • In the IPM view, choosing / learning the kernel \(k\) can be seen as the process of optimizing \(D\)
    • i.e., a process of parameter tuning + hyperparameter selection ( a generator-side sketch with a fixed kernel follows )
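
A minimal sketch of the other side of this idea, under a fixed kernel ( the 1-D "generator", learning rate, and finite-difference gradient are my own simplifications, not the lecture's setup ) : the generator is trained simply by pushing \(MMD^2\) between real and generated samples toward zero, with no separate discriminator network.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased (V-statistic) MMD^2 estimate; good enough for this illustration."""
    return (rbf_kernel(x, x, sigma).mean() + rbf_kernel(y, y, sigma).mean()
            - 2.0 * rbf_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
real = rng.normal(3.0, 1.0, size=1000)    # data ~ P_r
noise = rng.normal(0.0, 1.0, size=1000)   # latent z
theta = 0.0                               # toy generator : G(z) = z + theta

for step in range(200):
    # Finite-difference gradient of MMD^2 w.r.t. theta (stand-in for backprop).
    eps, lr = 1e-3, 2.0
    grad = (mmd2(real, noise + theta + eps) - mmd2(real, noise + theta - eps)) / (2 * eps)
    theta -= lr * grad

print(theta)   # moves toward 3.0, the mean of the real data
```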
