[Implicit DGM] 12. Difference of Two Probability Distributions

( Reference : Prof. Il-Chul Moon (문일철), KAIST : Machine Learning Lecture in Korean, Advanced 3 )


Contents

  1. Difference of Two Probability Distributions
  2. Integral Probability Metric (IPM)
    1. Total Variation Distance
    2. Wasserstein metric
    3. Maximum Mean Discrepancy (MMD)
  3. GAN + MMD


1. Difference of Two Probability Distributions

The "difference" between two distributions can be measured via :

  • (1) a ratio ( \(p/q\) )
  • (2) a difference ( \(|p-q|\) )


\(f\)-divergence is not the only method!

  • \(f\)-divergence : \(D_{f}(P \,\|\, Q)=\int_{x} q(x) f\left(\frac{p(x)}{q(x)}\right) d x\).
    • ratio method ( ratio = \(\frac{p(x)}{q(x)}\) ; see the KL example below )
  • But… what if "support of \(q\)" \(\neq\) "support of \(p\)" ??
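
For example, choosing \(f(t)=t \log t\) recovers the KL divergence, which makes the dependence on the ratio explicit:

\(D_{KL}(P \,\|\, Q)=\int_{x} q(x) \frac{p(x)}{q(x)} \log \frac{p(x)}{q(x)} d x=\int_{x} p(x) \log \frac{p(x)}{q(x)} d x\).

The \(\log\) of the ratio blows up wherever \(q(x)=0\) but \(p(x)>0\), which is exactly the support-mismatch problem above.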


Requirements ( of \(f\)-divergence )

  • the support of \(q(x)\) needs to be wider than ( i.e., cover ) the support of \(p(x)\)

    ( if not … numerical instability! the ratio can diverge ; a numerical sketch follows this list )

  • Mode collapse

    • the integrand \(q(x) f\left(\frac{p(x)}{q(x)}\right)\) is weighted by \(q(x)\), so the ratio is effectively ignored wherever \(q(x) \rightarrow 0\) ( the generator can drop modes of \(p\) )
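
A minimal numerical sketch of both issues ( the toy uniform densities below are my own assumption, not from the lecture ) : when the supports do not match, the ratio-based KL divergence blows up, while a difference-based quantity stays finite.

```python
import numpy as np

# Toy densities on a grid: p is Uniform(0, 1), q is Uniform(0.5, 1.5),
# so part of the support of p is not covered by q.
grid = np.linspace(-1.0, 2.0, 3001)
dx = grid[1] - grid[0]

def uniform_pdf(x, lo, hi):
    """Density of Uniform(lo, hi) evaluated on the grid."""
    return np.where((x >= lo) & (x <= hi), 1.0 / (hi - lo), 0.0)

p = uniform_pdf(grid, 0.0, 1.0)
q = uniform_pdf(grid, 0.5, 1.5)

# Ratio method (KL as the f-divergence with f(t) = t log t):
# wherever p > 0 but q = 0, the ratio p/q diverges -> KL is infinite.
overlap = (p > 0) & (q > 0)
if np.any((p > 0) & (q == 0)):
    kl = np.inf
else:
    kl = np.sum(p[overlap] * np.log(p[overlap] / q[overlap])) * dx
print("KL(p||q) :", kl)   # inf (supports do not match)

# Difference method (total variation) stays finite and still informative.
tv = 0.5 * np.sum(np.abs(p - q)) * dx
print("TV(p, q) :", tv)   # ~ 0.5
```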


Alternative :

why not use the "DIFFERENCE" method?

\(\rightarrow\) IPM ( Integral Probability Metrics )


2. Integral Probability Metric (IPM)

\(d_{\mathcal{G}}(\mu, \nu)=\sup _{g \in \mathcal{G}}\left\{\left|\int g \, d \mu-\int g \, d \nu\right|\right\}\).

  • “difference” method
  • different function class \(\mathcal{G}\) \(\rightarrow\) various types of IPM
    • Ex) Total variation distance, Wasserstein metric, Maximum Mean Discrepancy


(1) Total Variation Distance

\(\mathcal{G}\) : class of all measurable functions taking values in \([-1,1]\)

  • \(\delta\left(P_{r}, P_{g}\right)=\sup _{A \in \Sigma}\left|P_{r}(A)-P_{g}(A)\right|\).
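
A toy sketch ( the discrete distributions are my own example ) : for distributions over a finite set, the supremum over events \(A\) is attained and equals half of the \(L_1\) distance between the probability vectors.

```python
import numpy as np

# Two discrete distributions over the same 4 outcomes (values assumed).
p_r = np.array([0.4, 0.3, 0.2, 0.1])
p_g = np.array([0.1, 0.2, 0.3, 0.4])

# sup_A |P_r(A) - P_g(A)| is attained by A = {outcomes where p_r > p_g},
# which gives exactly half the L1 distance.
tv = 0.5 * np.abs(p_r - p_g).sum()
print(tv)   # 0.4
```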


(2) Wasserstein metric

\(\mathcal{G}\) : class of 1-Lipschitz functions

  • ex) Wasserstein-1, a.k.a. Earth-Mover Distance (EMD)

  • \(W\left(P_{r}, P_{g}\right)=\inf _{\gamma \in \Pi\left(P_{r}, P_{g}\right)} \mathrm{E}_{(x, y) \sim \gamma}[\|x-y\|]\).

    • ( this optimal-transport form equals the IPM form over 1-Lipschitz functions, by Kantorovich-Rubinstein duality )
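
A 1-D sketch ( toy Gaussians of my own choosing ) using `scipy.stats.wasserstein_distance`, which computes the Wasserstein-1 / EMD between two empirical samples:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=5000)   # samples ~ P_r
y = rng.normal(loc=2.0, scale=1.0, size=5000)   # samples ~ P_g

# The two distributions have the same shape but means 2 apart,
# so the "earth-moving" cost should come out close to 2.
print(wasserstein_distance(x, y))
```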


(3) Maximum Mean Discrepancy (MMD)

\(\mathcal{G}\) : unit ball of an RKHS

  • Kernel / basis mapping function : the modeler can choose this directly
  • \(\operatorname{MMD}\left(P_{r}, P_{g}\right)=\left\|E_{x \sim P_{r}}[\psi(x)]-E_{y \sim P_{g}}[\psi(y)]\right\|_{\mathcal{H}}\).


3. GAN + MMD

( Review ) GAN's \(f\)-divergence objective

\(\begin{aligned}D_{f}(P \,\|\, Q)&=\int_{x} q(x) f\left(\frac{p(x)}{q(x)}\right) d x \\& \geq \sup _{\tau \in \mathcal{T}}\left\{E_{x \sim p(x)}[\tau(x)]-E_{x \sim q(x)}\left[f^{*}(\tau(x))\right]\right\}\end{aligned}\).
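
For example, with the KL choice \(f(t)=t \log t\), the convex conjugate is \(f^{*}(t)=e^{t-1}\), and the bound becomes:

\(D_{KL}(P \,\|\, Q) \geq \sup _{\tau \in \mathcal{T}}\left\{E_{x \sim p(x)}[\tau(x)]-E_{x \sim q(x)}\left[e^{\tau(x)-1}\right]\right\}\).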


In the objective above, let's replace the \(f\)-divergence with an IPM!

( \(MMD\left(P_{r}, P_{g}\right)\) instead of \(D_{f}(P \,\|\, Q)\) )


First, let's write down the \(MMD^2\) expression :

\(MMD^{2}\left(P_{r}, P_{g}\right)=\left\|E_{x \sim P_{r}}[\psi(x)]-E_{y \sim P_{g}}[\psi(y)]\right\|_{\mathcal{H}}^{2}=\left\|\mu_{p}-\mu_{q}\right\|_{\mathcal{H}}^{2}\).


In the expression above, \(\psi\) is the feature ( basis ) map that defines the kernel ( a small sketch follows this list ).

  • \(\psi(x)=x\) : compares the "means"
  • \(\psi(x)=\left(x, x^{2}\right)\) : compares the "means & variances"
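
A small sketch of this ( toy Gaussian data of my own choosing, with an explicit finite-dimensional \(\psi\) ) :

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10000)   # P_r : N(0, 1)
y = rng.normal(0.0, 2.0, size=10000)   # P_g : N(0, 4) -- same mean, larger variance

def mmd_with_feature_map(psi, x, y):
    """|| E[psi(x)] - E[psi(y)] || with an explicit feature map psi."""
    return np.linalg.norm(psi(x).mean(axis=0) - psi(y).mean(axis=0))

# psi(x) = x : only the means are compared, so the two look identical.
print(mmd_with_feature_map(lambda t: t[:, None], x, y))                     # ~ 0

# psi(x) = (x, x^2) : second moments are compared too, so they differ clearly.
print(mmd_with_feature_map(lambda t: np.stack([t, t ** 2], axis=1), x, y))  # ~ 3
```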


In the expression above, \(\mu\) is the kernel mean embedding ( derivation below ) :

  • \(\mu_{p}=\int k(\cdot, x)\, p(d x) \in \mathcal{H}\)
  • however, since we cannot access \(p\) and \(q\) directly, we rely on the fact that \(E[f(X)]=\left\langle f, \mu_{p}\right\rangle_{\mathcal{H}}\) for every \(f \in \mathcal{H}\), so everything can be estimated from samples through the kernel
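
The second bullet is the standard reproducing-property argument, spelled out for completeness. Since \(f(x)=\langle f, k(\cdot, x)\rangle_{\mathcal{H}}\) for every \(f \in \mathcal{H}\),

\(E_{x \sim p}[f(x)]=E_{x \sim p}\left[\langle f, k(\cdot, x)\rangle_{\mathcal{H}}\right]=\left\langle f, E_{x \sim p}[k(\cdot, x)]\right\rangle_{\mathcal{H}}=\left\langle f, \mu_{p}\right\rangle_{\mathcal{H}}\),

so expectations under \(p\) reduce to inner products with the mean embedding \(\mu_{p}\), which can be estimated from samples.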


Expanding the \(MMD^2\) expression above ( the last line is the empirical estimate from samples \(x_{1:N} \sim P_r\), \(y_{1:M} \sim P_g\) ) :

\(\begin{aligned}MMD^{2}\left(P_{r}, P_{g}\right)&=\left\|E_{x \sim P_{r}}[\psi(x)]-E_{y \sim P_{g}}[\psi(y)]\right\|_{\mathcal{H}}^{2} \\&=E_{x, x^{\prime}}\left[k\left(x, x^{\prime}\right)\right]-2 E_{x, y}[k(x, y)]+E_{y, y^{\prime}}\left[k\left(y, y^{\prime}\right)\right] \\& \approx \frac{1}{N(N-1)} \sum_{n \neq n^{\prime}} k\left(x_{n}, x_{n^{\prime}}\right)+\frac{1}{M(M-1)} \sum_{m \neq m^{\prime}} k\left(y_{m}, y_{m^{\prime}}\right)-\frac{2}{M N} \sum_{m=1}^{M} \sum_{n=1}^{N} k\left(y_{m}, x_{n}\right) \end{aligned}\).
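
A minimal NumPy sketch of this unbiased estimator ( the RBF kernel, its bandwidth, and the sample sizes are my own assumptions ) :

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased estimate of MMD^2 following the expansion above.

    x : (N, d) samples from P_r,  y : (M, d) samples from P_g.
    """
    n, m = len(x), len(y)
    k_xx = rbf_kernel(x, x, sigma)
    k_yy = rbf_kernel(y, y, sigma)
    k_xy = rbf_kernel(x, y, sigma)
    # Drop the diagonals so the first two sums run over n != n' and m != m'.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (n * (n - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * k_xy.mean()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(500, 1))                        # P_r
print(mmd2_unbiased(x, rng.normal(1.0, 1.0, size=(500, 1))))   # clearly > 0 (shifted mean)
print(mmd2_unbiased(x, rng.normal(0.0, 1.0, size=(500, 1))))   # close to 0 (same distribution)
```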


In the expression above, where does the training of \(D\) ( the discriminator ) come in?

  • In \(f\)-GAN, approximating the optimal \(\tau\) was exactly the process of finding the optimized \(D\).
  • In the IPM view, choosing / learning the kernel \(k\) can be seen as the process of optimizing \(D\)
    • i.e., a process of parameter tuning + hyperparameter selection ( a generator-side sketch with a fixed kernel follows )
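
A minimal sketch of the other side of this idea, under a fixed kernel ( the 1-D "generator", learning rate, and finite-difference gradient are my own simplifications, not the lecture's setup ) : the generator is trained simply by pushing \(MMD^2\) between real and generated samples toward zero, with no separate discriminator network.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased (V-statistic) MMD^2 estimate; good enough for this illustration."""
    return (rbf_kernel(x, x, sigma).mean() + rbf_kernel(y, y, sigma).mean()
            - 2.0 * rbf_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
real = rng.normal(3.0, 1.0, size=1000)    # data ~ P_r
noise = rng.normal(0.0, 1.0, size=1000)   # latent z
theta = 0.0                               # toy generator : G(z) = z + theta

for step in range(200):
    # Finite-difference gradient of MMD^2 w.r.t. theta (stand-in for backprop).
    eps, lr = 1e-3, 2.0
    grad = (mmd2(real, noise + theta + eps) - mmd2(real, noise + theta - eps)) / (2 * eps)
    theta -= lr * grad

print(theta)   # moves toward 3.0, the mean of the real data
```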
