( Basic material and less important content are skipped. )
2. Probability Distributions
2-1. Binary Variables
Bernoulli
- (original) \(\operatorname{Bern}(x \mid \mu)=\mu^{x}(1-\mu)^{1-x}\)
- (log form) \(\ln p(\mathcal{D} \mid \mu)=\sum_{n=1}^{N} \ln p\left(x_{n} \mid \mu\right)=\sum_{n=1}^{N}\left\{x_{n} \ln \mu+\left(1-x_{n}\right) \ln (1-\mu)\right\}\)
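A minimal sketch (assuming NumPy; the data and the true \(\mu\) are hypothetical) checking that the empirical mean maximizes this log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)  # hypothetical coin flips with true mu = 0.3

def log_lik(mu, x):
    # sum_n { x_n ln(mu) + (1 - x_n) ln(1 - mu) }
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

mu_ml = x.mean()  # closed-form maximizer: mu_ML = (1/N) sum_n x_n
grid = np.linspace(0.01, 0.99, 99)
best = grid[np.argmax([log_lik(m, x) for m in grid])]
print(mu_ml, best)  # both close to 0.3
```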
Binomial
- \(\operatorname{Bin}(m \mid N, \mu)=\left(\begin{array}{c} N \\ m \end{array}\right) \mu^{m}(1-\mu)^{N-m}\) where \(\left(\begin{array}{l} N \\ m \end{array}\right) \equiv \frac{N !}{(N-m) ! m !}\)
Beta
- \[\operatorname{Beta}(\mu \mid a, b)=\frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} \mu^{a-1}(1-\mu)^{b-1}\]
Beta & Binomial : conjugate
\[p(\mu \mid m, l, a, b) \propto \mu^{m+a-1}(1-\mu)^{l+b-1}\]
\[p(\mu \mid m, l, a, b)=\frac{\Gamma(m+a+l+b)}{\Gamma(m+a) \Gamma(l+b)} \mu^{m+a-1}(1-\mu)^{l+b-1}\]
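A minimal sketch of this conjugate update (assuming SciPy; the hyperparameters and counts are hypothetical):

```python
from scipy import stats

a, b = 2.0, 2.0          # hypothetical Beta(mu | a, b) prior
m, l = 7, 3              # observed heads / tails

# conjugacy: posterior is again a Beta, with counts added to the hyperparameters
posterior = stats.beta(a + m, b + l)   # Beta(mu | m + a, l + b)
print(posterior.mean())  # (m + a) / (m + a + l + b) = 9/14
```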
2-2. Multinomial Variables
\(p(\mathbf{x} \mid \boldsymbol{\mu})=\prod_{k=1}^{K} \mu_{k}^{x_{k}}\) with constraint \(\sum_{\mathbf{x}} p(\mathbf{x} \mid \boldsymbol{\mu})=\sum_{k=1}^{K} \mu_{k}=1\)
with a Lagrange multiplier…
maximize \(\sum_{k=1}^{K} m_{k} \ln \mu_{k}+\lambda\left(\sum_{k=1}^{K} \mu_{k}-1\right)\)
Multinomial distribution
- \[\operatorname{Mult}\left(m_{1}, m_{2}, \ldots, m_{K} \mid \boldsymbol{\mu}, N\right)=\left(\begin{array}{c} N \\ m_{1}\, m_{2} \ldots m_{K} \end{array}\right) \prod_{k=1}^{K} \mu_{k}^{m_{k}}\]
where \(\left(\begin{array}{c} N \\ m_{1} m_{2} \ldots m_{K} \end{array}\right)=\frac{N !}{m_{1} ! m_{2} ! \ldots m_{K} !}\) and \(\sum_{k=1}^{K} m_{k}=N\)
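A quick numerical check of the multinomial coefficient and pmf (assuming SciPy; \(N\), \(\boldsymbol{\mu}\), and the counts are hypothetical):

```python
from math import factorial
from scipy import stats

N, mu = 10, [0.2, 0.3, 0.5]
m = [2, 3, 5]                     # hypothetical counts, sum to N

# N! / (m_1! m_2! m_3!) times prod_k mu_k^{m_k}
coef = factorial(N) / (factorial(2) * factorial(3) * factorial(5))
pmf = coef * (0.2**2) * (0.3**3) * (0.5**5)
print(pmf, stats.multinomial.pmf(m, n=N, p=mu))  # identical
```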
Dirichlet distribution
- \[p(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \mu_{k}^{\alpha_{k}-1}\]
- \[\operatorname{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha})=\frac{\Gamma\left(\alpha_{0}\right)}{\Gamma\left(\alpha_{1}\right) \cdots \Gamma\left(\alpha_{K}\right)} \prod_{k=1}^{K} \mu_{k}^{\alpha_{k}-1}\] where \(\alpha_{0}=\sum_{k=1}^{K} \alpha_{k}\)
Multinomial & Dirichlet : conjugate
\[\begin{aligned} p(\boldsymbol{\mu} \mid \mathcal{D}, \boldsymbol{\alpha}) &=\operatorname{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha}+\mathbf{m}) \\ &=\frac{\Gamma\left(\alpha_{0}+N\right)}{\Gamma\left(\alpha_{1}+m_{1}\right) \cdots \Gamma\left(\alpha_{K}+m_{K}\right)} \prod_{k=1}^{K} \mu_{k}^{\alpha_{k}+m_{k}-1} \end{aligned}\]
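A minimal sketch of the Dirichlet-multinomial update (assuming NumPy/SciPy; \(\boldsymbol{\alpha}\) and the counts \(\mathbf{m}\) are hypothetical):

```python
import numpy as np
from scipy import stats

alpha = np.array([1.0, 2.0, 3.0])   # hypothetical Dir(mu | alpha) prior
m = np.array([10, 5, 15])           # observed counts, N = 30

alpha_post = alpha + m              # conjugacy: Dir(mu | alpha + m)
# posterior mean: (alpha_k + m_k) / (alpha_0 + N)
print(alpha_post / alpha_post.sum())
# posterior samples lie on the simplex (each row sums to 1)
print(stats.dirichlet(alpha_post).rvs(3).sum(axis=1))
```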
2-3. Gaussian Distribution
2-3-1. Introduction
MVN (Multivariate Normal)
\[\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})=\frac{1}{(2 \pi)^{D / 2}} \frac{1}{|\boldsymbol{\Sigma}|^{1 / 2}} \exp \left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}\]
Let \(\Delta^{2}=(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\)
( \(\Delta\) is called the “Mahalanobis distance” from \(\boldsymbol{\mu}\) to \(\mathbf{x}\) )
covariance : \(\boldsymbol{\Sigma}=\sum_{i=1}^{D} \lambda_{i} \mathbf{u}_{i} \mathbf{u}_{i}^{\mathrm{T}}\)
Mahalanobis distance : \(\Delta^{2}=(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\) = \(\sum_{i=1}^{D} \frac{y_{i}^{2}}{\lambda_{i}}\)
where \(y_{i}=\mathbf{u}_{i}^{\mathrm{T}}(\mathbf{x}-\boldsymbol{\mu})\)
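A minimal sketch (assuming NumPy; \(\boldsymbol{\Sigma}\) and \(\mathbf{x}\) are randomly generated) verifying that the eigenbasis form matches the direct computation:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)      # hypothetical SPD covariance
mu = np.zeros(3)
x = rng.normal(size=3)

# direct computation of Delta^2
d2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)

# via the eigendecomposition Sigma = sum_i lambda_i u_i u_i^T
lam, U = np.linalg.eigh(Sigma)       # columns of U are the u_i
y = U.T @ (x - mu)                   # y_i = u_i^T (x - mu)
print(d2, np.sum(y**2 / lam))        # identical
```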
Transforming the Gaussian into the new coordinate system:
- Jacobian \(J\) : \[J_{i j}=\frac{\partial x_{i}}{\partial y_{j}}=U_{j i}\]
- determinant of \(J\) : \[|\mathbf{J}|^{2}=\left|\mathbf{U}^{\mathrm{T}}\right|^{2}=\left|\mathbf{U}^{\mathrm{T}}\right||\mathbf{U}|=\left|\mathbf{U}^{\mathrm{T}} \mathbf{U}\right|=|\mathbf{I}|=1\]
- therefore… \[p(\mathbf{y})=p(\mathbf{x})|\mathbf{J}|=\prod_{j=1}^{D} \frac{1}{\left(2 \pi \lambda_{j}\right)^{1 / 2}} \exp \left\{-\frac{y_{j}^{2}}{2 \lambda_{j}}\right\}\]
2-3-2. Conditional Gaussian
\[\begin{aligned} \boldsymbol{\mu}_{a \mid b} &=\boldsymbol{\mu}_{a}+\boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1}\left(\mathbf{x}_{b}-\boldsymbol{\mu}_{b}\right) \\ \boldsymbol{\Sigma}_{a \mid b} &=\boldsymbol{\Sigma}_{a a}-\boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1} \boldsymbol{\Sigma}_{b a} \end{aligned}\]
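A minimal sketch of these formulas (assuming NumPy; the joint covariance and the observed \(\mathbf{x}_b\) are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
Sigma = B @ B.T + 4 * np.eye(4)      # hypothetical SPD joint covariance
mu = rng.normal(size=4)
a, b = slice(0, 2), slice(2, 4)      # partition x = (x_a, x_b)

x_b = rng.normal(size=2)             # observed value of x_b
S_ab_Sbb_inv = Sigma[a, b] @ np.linalg.inv(Sigma[b, b])
mu_cond = mu[a] + S_ab_Sbb_inv @ (x_b - mu[b])       # mu_{a|b}
Sigma_cond = Sigma[a, a] - S_ab_Sbb_inv @ Sigma[b, a]  # Sigma_{a|b}
print(mu_cond, Sigma_cond)
```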
2-3-3. Bayes’ theorem for Gaussian variables
marginal : \(p(\mathbf{x}) =\mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1}\right)\)
conditional : \(p(\mathbf{y} \mid \mathbf{x}) =\mathcal{N}\left(\mathbf{y} \mid \mathbf{A} \mathbf{x}+\mathbf{b}, \mathbf{L}^{-1}\right)\)
log joint : ( let \(z = (x \;\; y)^T\))
\[\begin{aligned} \ln p(\mathbf{z}) &= \ln p(\mathbf{x})+\ln p(\mathbf{y} \mid \mathbf{x}) \\ &=-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Lambda}(\mathbf{x}-\boldsymbol{\mu}) -\frac{1}{2}(\mathbf{y}-\mathbf{A} \mathbf{x}-\mathbf{b})^{\mathrm{T}} \mathbf{L}(\mathbf{y}-\mathbf{A} \mathbf{x}-\mathbf{b})+\mathrm{const} \end{aligned}\]
Second-order terms of \(\ln p(\mathbf{z})\) : ( to find the precision )
\[-\frac{1}{2} \mathbf{x}^{\mathrm{T}}\left(\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}} \mathbf{L} \mathbf{A}\right) \mathbf{x}-\frac{1}{2} \mathbf{y}^{\mathrm{T}} \mathbf{L} \mathbf{y}+\frac{1}{2} \mathbf{y}^{\mathrm{T}} \mathbf{L} \mathbf{A} \mathbf{x}+\frac{1}{2} \mathbf{x}^{\mathrm{T}} \mathbf{A}^{\mathrm{T}} \mathbf{L} \mathbf{y}=-\frac{1}{2}\left(\begin{array}{l} \mathbf{x} \\ \mathbf{y} \end{array}\right)^{\mathrm{T}}\left(\begin{array}{cc} \boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}} \mathbf{L} \mathbf{A} & -\mathbf{A}^{\mathrm{T}} \mathbf{L} \\ -\mathbf{L} \mathbf{A} & \mathbf{L} \end{array}\right)\left(\begin{array}{l} \mathbf{x} \\ \mathbf{y} \end{array}\right)=-\frac{1}{2} \mathbf{z}^{\mathrm{T}} \mathbf{R} \mathbf{z}\]
\(\therefore\) the Gaussian distribution over \(\mathbf{z}\) has the precision matrix \(\mathbf{R}\) below!
precision
- \(\mathbf{R}=\left(\begin{array}{cc} \Lambda+\mathbf{A}^{\mathrm{T}} \mathbf{L} \mathbf{A} & -\mathbf{A}^{\mathrm{T}} \mathbf{L} \\ -\mathbf{L} \mathbf{A} & \mathbf{L} \end{array}\right)\).
covariance
- \(\operatorname{cov}[\mathbf{z}]=\mathbf{R}^{-1}=\left(\begin{array}{cc} \mathbf{\Lambda}^{-1} & \boldsymbol{\Lambda}^{-1} \mathbf{A}^{\mathrm{T}} \\ \mathbf{A} \Lambda^{-1} & \mathbf{L}^{-1}+\mathbf{A} \mathbf{\Lambda}^{-1} \mathbf{A}^{\mathrm{T}} \end{array}\right)\).
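A numerical check (assuming NumPy; \(\boldsymbol{\Lambda}\), \(\mathbf{L}\), and \(\mathbf{A}\) are hypothetical) that \(\mathbf{R}^{-1}\) equals the stated block covariance:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 2
Lam = np.eye(D) * 2.0                # hypothetical precision of p(x)
L = np.eye(D) * 3.0                  # hypothetical precision of p(y|x)
A = rng.normal(size=(D, D))

# precision matrix R of the joint Gaussian over z = (x, y)
R = np.block([[Lam + A.T @ L @ A, -A.T @ L],
              [-L @ A,            L       ]])
# claimed block form of cov[z] = R^{-1}
cov = np.block([[np.linalg.inv(Lam),     np.linalg.inv(Lam) @ A.T],
                [A @ np.linalg.inv(Lam), np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T]])
print(np.allclose(np.linalg.inv(R), cov))  # True
```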
2-3-4. Bayesian Inference for the Gaussian
Mean
(1) prior : \(p(\mu)=\mathcal{N}\left(\mu \mid \mu_{0}, \sigma_{0}^{2}\right)\)
(2) likelihood : \(p(\mathbf{X} \mid \mu)=\prod_{n=1}^{N} p\left(x_{n} \mid \mu\right)=\frac{1}{\left(2 \pi \sigma^{2}\right)^{N / 2}} \exp \left\{-\frac{1}{2 \sigma^{2}} \sum_{n=1}^{N}\left(x_{n}-\mu\right)^{2}\right\}\)
(3) posterior : \(p(\mu \mid \mathbf{X})=\mathcal{N}\left(\mu \mid \mu_{N}, \sigma_{N}^{2}\right)\)
- \(\mu_{N}=\frac{\sigma^{2}}{N \sigma_{0}^{2}+\sigma^{2}} \mu_{0}+\frac{N \sigma_{0}^{2}}{N \sigma_{0}^{2}+\sigma^{2}} \mu_{\mathrm{ML}}\).
- \(\frac{1}{\sigma_{N}^{2}}=\frac{1}{\sigma_{0}^{2}}+\frac{N}{\sigma^{2}}\).
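A minimal sketch of this posterior update (assuming NumPy; the prior hyperparameters and data are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0                       # known noise standard deviation
mu0, sigma0 = 0.0, 2.0            # hypothetical prior N(mu | mu0, sigma0^2)
x = rng.normal(1.5, sigma, 50)    # N = 50 observations, true mean 1.5
N, mu_ml = len(x), x.mean()

# posterior mean: a weighted average of the prior mean and mu_ML
mu_N = (sigma**2 / (N * sigma0**2 + sigma**2)) * mu0 \
     + (N * sigma0**2 / (N * sigma0**2 + sigma**2)) * mu_ml
var_N = 1.0 / (1.0 / sigma0**2 + N / sigma**2)  # precisions add
print(mu_N, var_N)   # mu_N close to 1.5; variance shrinks as N grows
```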
Can make sequential update!
\[p(\boldsymbol{\mu} \mid \mathcal{D}) \propto\left[p(\boldsymbol{\mu}) \prod_{n=1}^{N-1} p\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}\right)\right] p\left(\mathbf{x}_{N} \mid \boldsymbol{\mu}\right)\]
Precision
\[p(\mathbf{X} \mid \lambda)=\prod_{n=1}^{N} \mathcal{N}\left(x_{n} \mid \mu, \lambda^{-1}\right) \propto \lambda^{N / 2} \exp \left\{-\frac{\lambda}{2} \sum_{n=1}^{N}\left(x_{n}-\mu\right)^{2}\right\}\]
conjugate prior : “Gamma distribution”
- \(\operatorname{Gam}(\lambda \mid a, b)=\frac{1}{\Gamma(a)} b^{a} \lambda^{a-1} \exp (-b \lambda)\).
posterior distribution :
- \[p(\lambda \mid \mathbf{X}) \propto \lambda^{a_{0}-1} \lambda^{N / 2} \exp \left\{-b_{0} \lambda-\frac{\lambda}{2} \sum_{n=1}^{N}\left(x_{n}-\mu\right)^{2}\right\}\]
- \(=\operatorname{Gam}\left(\lambda \mid a_{N}, b_{N}\right)\) where
- \(a_{N} =a_{0}+\frac{N}{2}\).
- \(b_{N} =b_{0}+\frac{1}{2} \sum_{n=1}^{N}\left(x_{n}-\mu\right)^{2}=b_{0}+\frac{N}{2} \sigma_{\mathrm{ML}}^{2}\).
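A minimal sketch (assuming NumPy; \(a_0\), \(b_0\), and the data are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, lam_true = 0.0, 4.0                    # known mean, true precision
x = rng.normal(mu, 1 / np.sqrt(lam_true), 100)

a0, b0 = 1.0, 1.0                          # hypothetical Gam(lambda | a0, b0) prior
a_N = a0 + len(x) / 2
b_N = b0 + 0.5 * np.sum((x - mu) ** 2)
print(a_N / b_N)                           # posterior mean of lambda, near 4
```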
(in case of MVN)
conjugate prior of precision matrix \(\Lambda\)
- \(\mathcal{W}(\boldsymbol{\Lambda} \mid \mathbf{W}, \nu)=B|\boldsymbol{\Lambda}|^{(\nu-D-1) / 2} \exp \left(-\frac{1}{2} \operatorname{Tr}\left(\mathbf{W}^{-1} \boldsymbol{\Lambda}\right)\right)\), where \(B=B(\mathbf{W}, \nu)\) is a normalization constant ( the “Wishart distribution” )
2-3-5. Student’s t-distribution
- likelihood : univariate Gaussian ( \(\mathcal{N}\left(x \mid \mu, \tau^{-1}\right)\) )
- prior : Gamma prior ( \(\operatorname{Gam}(\tau \mid a, b)\) )
then, integrate out the precision!
\(\operatorname{St}(x \mid \mu, \lambda, \nu)=\frac{\Gamma(\nu / 2+1 / 2)}{\Gamma(\nu / 2)}\left(\frac{\lambda}{\pi \nu}\right)^{1 / 2}\left[1+\frac{\lambda(x-\mu)^{2}}{\nu}\right]^{-\nu / 2-1 / 2}\).
\(\operatorname{St}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu)=\int_{0}^{\infty} \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu},(\eta \boldsymbol{\Lambda})^{-1}\right) \operatorname{Gam}(\eta \mid \nu / 2, \nu / 2) \mathrm{d} \eta\).
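A numerical check (assuming SciPy; the parameter values are hypothetical) that integrating out the precision indeed yields the Student's t density, here in the univariate case:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

mu, lam, nu, x = 0.0, 2.0, 5.0, 1.3   # hypothetical parameters

# integrate the Gaussian-Gamma mixture over the scaling variable eta;
# Gam(eta | nu/2, nu/2) has rate nu/2, i.e. scale = 2/nu in scipy
integrand = lambda eta: stats.norm.pdf(x, mu, 1 / np.sqrt(eta * lam)) \
                      * stats.gamma.pdf(eta, a=nu / 2, scale=2 / nu)
lhs, _ = quad(integrand, 0, np.inf)

# closed form via the standard t density with scale 1/sqrt(lam)
rhs = np.sqrt(lam) * stats.t.pdf(np.sqrt(lam) * (x - mu), df=nu)
print(lhs, rhs)  # identical up to quadrature error
```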
2-3-6. Mixture of Gaussians
component & mixing coefficients
\(K\) Gaussian densities : \(p(\mathbf{x})=\sum_{k=1}^{K} \pi_{k} \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{k}, \mathbf{\Sigma}_{k}\right)\)
By using of sum & product rules : \(p(\mathbf{x})=\sum_{k=1}^{K} p(k) p(\mathbf{x} \mid k)\)
- \(\pi_{k}=p(k)\).
- \(\mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{k}, \mathbf{\Sigma}_{k}\right)= p(\mathbf{x}\mid k)\).
Responsibilities \(\gamma_{k}(\mathbf{x})\) : ( \(p(k \mid \mathbf{x})\) is called the “responsibility” )
\(\begin{aligned} \gamma_{k}(\mathbf{x}) & \equiv p(k \mid \mathbf{x}) \\ &=\frac{p(k) p(\mathbf{x} \mid k)}{\sum_{l} p(l) p(\mathbf{x} \mid l)} \\ &=\frac{\pi_{k} \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{k}, \mathbf{\Sigma}_{k}\right)}{\sum_{l} \pi_{l} \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{l}, \mathbf{\Sigma}_{l}\right)} \end{aligned}\).
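A minimal sketch of computing responsibilities (assuming SciPy; the mixture parameters and query point are hypothetical):

```python
import numpy as np
from scipy.stats import multivariate_normal

# hypothetical 2-component mixture in 2-D
pis = [0.4, 0.6]
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2 * np.eye(2)]
x = np.array([1.0, 1.0])

# numerator pi_k N(x | mu_k, Sigma_k) for each component
weighted = np.array([pi * multivariate_normal.pdf(x, m, S)
                     for pi, m, S in zip(pis, mus, Sigmas)])
gamma = weighted / weighted.sum()   # responsibilities p(k | x)
print(gamma, gamma.sum())           # responsibilities sum to 1
```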
2-4. The Exponential Family
\[p(\mathbf{x} \mid \boldsymbol{\eta})=h(\mathbf{x}) g(\boldsymbol{\eta}) \exp \left\{\boldsymbol{\eta}^{\mathrm{T}} \mathbf{u}(\mathbf{x})\right\}\]
- \(\boldsymbol{\eta}\) : natural parameters
- \(\mathbf{u}(\mathbf{x})\) : some function of \(\mathbf{x}\)
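As a concrete check, the Bernoulli distribution from 2-1 can be written in this form:
\[p(x \mid \mu)=\mu^{x}(1-\mu)^{1-x}=(1-\mu) \exp \left\{x \ln \left(\frac{\mu}{1-\mu}\right)\right\}\]
so \(\eta=\ln \left(\frac{\mu}{1-\mu}\right)\), \(u(x)=x\), \(h(x)=1\), and \(g(\eta)=\frac{1}{1+e^{\eta}}\) ( the logistic sigmoid of \(-\eta\) ).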
2-4-1. Maximum Likelihood and Sufficient Statistics
\[-\nabla \ln g(\boldsymbol{\eta})=\mathbb{E}[\mathbf{u}(\mathbf{x})]\]
\[-\nabla \ln g\left(\boldsymbol{\eta}_{\mathrm{ML}}\right)=\frac{1}{N} \sum_{n=1}^{N} \mathbf{u}\left(\mathbf{x}_{n}\right)\]
The solution for the MLE depends on the data only through \(\sum_{n} \mathbf{u}\left(\mathbf{x}_{n}\right)\) ( = the sufficient statistic )
2-4-2. Conjugate prior
\[p(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu) \propto g(\boldsymbol{\eta})^{\nu} \exp \left\{\nu \boldsymbol{\eta}^{\mathrm{T}} \boldsymbol{\chi}\right\}\]
\[p(\boldsymbol{\eta} \mid \mathbf{X}, \boldsymbol{\chi}, \nu) \propto g(\boldsymbol{\eta})^{\nu+N} \exp \left\{\boldsymbol{\eta}^{\mathrm{T}}\left(\sum_{n=1}^{N} \mathbf{u}\left(\mathbf{x}_{n}\right)+\nu \boldsymbol{\chi}\right)\right\}\]
2-4-3. Noninformative priors
If the prior assigns zero probability to some value, the posterior will too!
Noninformative prior
- intended to have little influence on the posterior
- “letting the data speak for themselves”
In the case of continuous parameters….
improper : if the domain of \(\lambda\) is unbounded, the prior cannot be correctly normalized
( in practice : improper priors are used when the corresponding posterior is proper )
2 examples of noninformative priors
- (1) \(p(\mu)\) is a constant
- location parameter : \(p(x \mid \mu)=f(x-\mu)\)
- \(p(\mu-c)=p(\mu)\).
- (2) \(p(\sigma) \propto 1 / \sigma\)
- scale parameter : \(p(x \mid \sigma)=\frac{1}{\sigma} f\left(\frac{x}{\sigma}\right)\)
- \(p(\sigma)=p\left(\frac{1}{c} \sigma\right) \frac{1}{c}\).
2-5. Nonparametric Methods
2-5-1. Kernel Density Estimators
\[K=\sum_{n=1}^{N} k\left(\frac{\mathbf{x}-\mathbf{x}_{n}}{h}\right)\]
which leads to the kernel density estimate \(p(\mathbf{x})=\frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^{D}} k\left(\frac{\mathbf{x}-\mathbf{x}_{n}}{h}\right)\)
\(h\) : plays the role of a smoothing parameter
- small \(h\) : sensitive to noise
- large \(h\) : over-smoothing
any kernel function \(k(\mathbf{u})\) can be used, subject to :
- \(k(\mathbf{u}) \geqslant 0\).
- \(\int k(\mathbf{u}) \mathrm{d} \mathbf{u} =1\).
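A minimal sketch of a kernel density estimator with a Gaussian kernel (assuming NumPy; the sample and bandwidth are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
x_n = rng.normal(0.0, 1.0, 200)   # hypothetical 1-D sample
h = 0.3                           # smoothing parameter

def p_hat(x):
    # Gaussian kernel: k(u) >= 0 and integrates to 1
    u = (x - x_n) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return np.mean(k / h)         # (1/N) sum_n (1/h) k((x - x_n)/h), D = 1

print(p_hat(0.0))                 # close to the true N(0,1) density, ~0.399
```

Smaller \(h\) makes the estimate spikier (noise-sensitive); larger \(h\) smooths it out, matching the two bullets above.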