[Implicit DGM] 12. Difference of Two Probability Distributions
( 참고 : KAIST 문일철 교수님 : 한국어 기계학습 강좌 심화 3)
Contents
- Difference of Two Probability Distributions
- Integral Probability Metric (IPM)
- Total Variation Distance
- Wasserstein metric
- Maximum Mean Discrepancy (MMD)
- GAN + MMD
1. Difference of Two Probability Distributions
Difference =
- (1) ratio ( \(p/q\) )
- (2) difference ( \(\mid p-q \mid\) )
\(f\)-divergence is not the only method!
- \(f\)-divergence : \(D_{f}(P \mid \mid Q)=\int_{x} q(x) f\left(\frac{p(x)}{q(x)}\right) d x\).
- ratio method ( ratio = \(\frac{p(x)}{q(x)}\) )
- But….what if “support of \(q\) \(\neq\) support of \(p\) ” ..??
Requirements ( of \(f\)-divergence )
-
\(q(x)\) needs to be wider than \(p(w)\)
( if not … numerical instability! ratio can diverge )
-
Mode collapse
- ratio in \(f\left(\frac{p(x)}{q(x)}\right)\) could be ignored if \(q(x)\) \(\rightarrow 0\)
Alternative :
why not use “DIFFERENCE method”?
\(\rightarrow\) IPM ( Integral Probability Metrics )
2. Integral Probability Metric (IPM)
\(d_{G}(\mu, v)=\sup _{g \in \mathcal{G}}\left\{ \mid \int g d \mu-\int g d v \mid \right\}\).
- “difference” method
- different \(g\) \(\rightarrow\) various types of IPM
- Ex) Total variation distance, Wasserstein metric, Maximum Mean Discrepancy
(1) Total Variation Distance
\(\mathcal{G}\) : class of all measurable functions taking value in \([-1,1]\)
- \(\delta\left(P_{r}, P_{g}\right)=\sup _{A \in \Sigma} \mid P_{r}(A)-P_{g}(A) \mid\).
(2) Wasserstein metric
\(\mathcal{G}\) : class of 1-Lipschitz functions
-
ex) Wassertein-1 or Earth-Mover Distance (EMD)
-
\(W\left(P_{r}, P_{g}\right)=\inf _{\gamma \in \Pi\left(P_{r}, P_{g}\right)} \mathrm{E}_{(\mathrm{x}, \mathrm{y}) \sim \gamma}[ \mid \mid \mathrm{x}-\mathrm{y} \mid \mid ]\).
(3) Maximum Mean Discrepancy (MMD)
\(\mathcal{G}\) : unit ball of RKHS
- Kernel / basis mapping function : 모델러가 직접 설정 가능
- \(\operatorname{MMD}\left(P_{r}, P_{g}\right)= \mid E_{x \sim P_{r}}[\psi(x)]-E_{y \sim P_{g}}[\psi(x)] \mid _{\mathcal{H}}\).
3. GAN + MMD
( 복습 ) GAN의 \(f\)-divergence 목표
\(\begin{aligned}D_{f}(P \mid \mid Q)&=\int_{x} q(x) f\left(\frac{p(x)}{q(x)}\right) d x \\& \geq \sup _{\tau \in \mathrm{T}}\left\{E_{x \sim p(x)}[\tau(x)]-E_{x \sim q(x)}\left[f^{*}(\tau(x))\right]\right\}\end{aligned}\).
위 식에서, \(f\)-divergence를 IPM으로 바꿔보자!
( \(D_{f}(P \mid \mid Q)\) 대신 \(M M D\left(P_{r}, P_{g}\right)\) )
우선, \(MMD^2\) 식을 정리해보자!
\(M M D^{2}\left(P_{r}, P_{g}\right)= \mid E_{x \sim P_{r}}[\psi(x)]-E_{y \sim P_{g}}[\psi(y)] \mid _{\mathcal{H}}^{2}= \mid \mu_{p}-\mu_{q} \mid _{\mathcal{H}}^{2}\).
위 식에서, \(\psi\)는 커널 함수이다.
- \(\psi(\mathrm{x})=\mathrm{x}\) : “평균” 비교
- \(\psi(x)=\left(x, x^{2}\right)\) : “평균 & 분산” 비교
위 식에서, \(\mu\)는 아래와 같다.
- \(\mu_{p}=\int k(x,) p(d x) \in \mathcal{H}\),
- 하지만, \(p\), \(q\)를 직접적으로 알 수 없으므로, \(E[f(X)]=\left\langle f, \mu_{p}\right\rangle_{\mathcal{H}}\)
위 \(MMD^2\) 식을 전개해보면,
\(\begin{aligned}M M D^{2}\left(P_{r}, P_{g}\right)&= \mid E_{x \sim P_{r}}[\psi(x)]-E_{y \sim P_{g}}[\psi(y)] \mid _{\mathcal{H}}^{2} \\&=E_{x, x^{\prime}}\left[k\left(x, x^{\prime}\right)\right]-2 E_{x, y}[k(x, y)]+E_{y, y^{\prime}}\left[k\left(y, y^{\prime}\right)\right] \\&=\frac{1}{N(N-1)} \sum_{n \neq n \prime} k\left(x_{n}, x_{n^{\prime}}\right)+\frac{1}{M(M-1)} \sum_{m \neq m \prime} k\left(y_{m}, y_{m^{\prime}}\right)-\frac{2}{M N} \sum_{m=1}^{M} \sum_{n=1}^{N} k\left(y_{m}, x_{n}\right) \end{aligned}\).
위 식에서, \(D\) ( discriminator )에 대한 학습은 어디에 있는가?
- \(f\)-GAN에서는, optimal \(\tau\)를 근사하는 과정이 곧 optimized \(D\)를 찾는 과정이었다.
- IPM에서는, \(k\) ( kernel )을 찾는 과정이 곧 \(D\)를 최적화하는 과정이라고 볼 수 있다
- 파라미터 튜닝 + 하이퍼파라미터 설정 과정