( Basic parts and less important content are skipped )

2. Probability Distributions

2-1. Binary Variables

Bernoulli

  • (original) \(\operatorname{Bern}(x \mid \mu)=\mu^{x}(1-\mu)^{1-x}\)
  • (log form) \(\ln p(\mathcal{D} \mid \mu)=\sum_{n=1}^{N} \ln p\left(x_{n} \mid \mu\right)=\sum_{n=1}^{N}\left\{x_{n} \ln \mu+\left(1-x_{n}\right) \ln (1-\mu)\right\}\)

Binomial

  • \(\operatorname{Bin}(m \mid N, \mu)=\left(\begin{array}{c} N \\ m \end{array}\right) \mu^{m}(1-\mu)^{N-m}\) where \(\left(\begin{array}{l} N \\ m \end{array}\right) \equiv \frac{N !}{(N-m) ! m !}\)

Beta

  • \[\operatorname{Beta}(\mu \mid a, b)=\frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} \mu^{a-1}(1-\mu)^{b-1}\]

Beta & Binomial : conjugate

\[p(\mu \mid m, l, a, b) \propto \mu^{m+a-1}(1-\mu)^{l+b-1}\] \[p(\mu \mid m, l, a, b)=\frac{\Gamma(m+a+l+b)}{\Gamma(m+a) \Gamma(l+b)} \mu^{m+a-1}(1-\mu)^{l+b-1}\]
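
A quick numerical check of this conjugate update — a minimal SciPy sketch, with the prior hyperparameters `a, b` and the counts `m, l` chosen arbitrarily for illustration:

```python
from scipy.stats import beta

# assumed prior hyperparameters and observed counts (illustrative only)
a, b = 2.0, 2.0        # Beta(mu | a, b) prior
m, l = 7, 3            # m successes, l failures

posterior = beta(a + m, b + l)     # Beta(mu | m + a, l + b)
print(posterior.mean())            # posterior mean: (m + a) / (m + a + l + b)
print(posterior.interval(0.95))    # 95% credible interval for mu
```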

2-2. Multinomial Variables

\(p(\mathbf{x} \mid \boldsymbol{\mu})=\prod_{k=1}^{K} \mu_{k}^{x_{k}}\) with constraint \(\sum_{\mathbf{x}} p(\mathbf{x} \mid \boldsymbol{\mu})=\sum_{k=1}^{K} \mu_{k}=1\)

with Lagrange Multiplier…

maximize \(\sum_{k=1}^{K} m_{k} \ln \mu_{k}+\lambda\left(\sum_{k=1}^{K} \mu_{k}-1\right)\) \(\rightarrow\) solution : \(\mu_{k}^{\mathrm{ML}}=m_{k} / N\)

Multinomial distribution

  • \[\operatorname{Mult}\left(m_{1}, m_{2}, \ldots, m_{K} \mid \boldsymbol{\mu}, N\right)=\left(\begin{array}{c} N \\ m_{1} m_{2} \ldots m_{K} \end{array}\right) \prod_{k=1}^{K} \mu_{k}^{m_{k}}\]

    where \(\left(\begin{array}{c} N \\ m_{1} m_{2} \ldots m_{K} \end{array}\right)=\frac{N !}{m_{1} ! m_{2} ! \ldots m_{K} !}\) and \(\sum_{k=1}^{K} m_{k}=N\)

Dirichlet distribution

  • \[p(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \mu_{k}^{\alpha_{k}-1}\]
  • \[\operatorname{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha})=\frac{\Gamma\left(\alpha_{0}\right)}{\Gamma\left(\alpha_{1}\right) \cdots \Gamma\left(\alpha_{K}\right)} \prod_{k=1}^{K} \mu_{k}^{\alpha_{k}-1}\]

    where \(\alpha_{0}=\sum_{k=1}^{K} \alpha_{k}\)

Multinomial & Dirichlet : conjugate

\[\begin{aligned} p(\boldsymbol{\mu} \mid \mathcal{D}, \boldsymbol{\alpha}) &=\operatorname{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha}+\mathbf{m}) \\ &=\frac{\Gamma\left(\alpha_{0}+N\right)}{\Gamma\left(\alpha_{1}+m_{1}\right) \cdots \Gamma\left(\alpha_{K}+m_{K}\right)} \prod_{k=1}^{K} \mu_{k}^{\alpha_{k}+m_{k}-1} \end{aligned}\]
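
Analogously, the Dirichlet posterior just adds the observed counts \(\mathbf{m}\) to \(\boldsymbol{\alpha}\) — a minimal SciPy sketch, with \(\boldsymbol{\alpha}\) and \(\mathbf{m}\) assumed for illustration:

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([1.0, 1.0, 1.0])    # assumed prior hyperparameters
m = np.array([10, 3, 7])             # assumed observed counts m_k

alpha_post = alpha + m               # Dir(mu | alpha + m)
print(dirichlet(alpha_post).mean())  # posterior mean: (alpha_k + m_k) / (alpha_0 + N)
```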

2-3. Gaussian Distribution

2-3-1. Introduction

MVN (Multivariate Normal)

\[\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \mathbf{\Sigma})=\frac{1}{(2 \pi)^{D / 2}} \frac{1}{\mid\boldsymbol{\Sigma}\mid^{1 / 2}} \exp \left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}\]

let \(\Delta^{2}=(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\)

( this is called the “Mahalanobis distance” from \(\boldsymbol{\mu}\) to \(\mathbf{x}\) )

covariance : \(\boldsymbol{\Sigma}=\sum_{i=1}^{D} \lambda_{i} \mathbf{u}_{i} \mathbf{u}_{i}^{\mathrm{T}}\)

Mahalanobis distance : \(\Delta^{2}=(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\) = \(\sum_{i=1}^{D} \frac{y_{i}^{2}}{\lambda_{i}}\)

​ where \(y_{i}=\mathbf{u}_{i}^{\mathrm{T}}(\mathbf{x}-\boldsymbol{\mu})\)

The Gaussian distribution in the new coordinate system :

  • Jacobian \(J\) :

    \[J_{i j}=\frac{\partial x_{i}}{\partial y_{j}}=U_{j i}\]
  • determinant of \(J\) matrix :

    \[\mid\mathbf{J}\mid^{2}=\mid\mathbf{U}^{\mathrm{T}}\mid^{2}=\mid\mathbf{U}^{\mathrm{T}}\mid\mid\mathbf{U}\mid=\mid\mathbf{U}^{\mathrm{T}} \mathbf{U}\mid=\mid\mathbf{I}\mid=1\]
  • therefore…

    \[p(\mathbf{y})=p(\mathbf{x})\mid\mathbf{J}\mid=\prod_{j=1}^{D} \frac{1}{\left(2 \pi \lambda_{j}\right)^{1 / 2}} \exp \left\{-\frac{y_{j}^{2}}{2 \lambda_{j}}\right\}\]
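
A minimal NumPy sketch ( with \(\boldsymbol{\mu}\), \(\boldsymbol{\Sigma}\) and the point \(\mathbf{x}\) assumed for illustration ) checking that the Mahalanobis distance computed through the eigendecomposition matches the direct form:

```python
import numpy as np

mu = np.array([1.0, -1.0])                 # assumed mean
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])             # assumed covariance
x = np.array([0.5, 0.3])                   # an arbitrary point

# Sigma = sum_i lambda_i u_i u_i^T  (columns of U are the u_i)
lam, U = np.linalg.eigh(Sigma)

d2_direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
y = U.T @ (x - mu)                         # y_i = u_i^T (x - mu)
d2_eig = np.sum(y**2 / lam)                # sum_i y_i^2 / lambda_i
print(np.isclose(d2_direct, d2_eig))       # True
```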

2-3-2. Conditional Gaussian

\[\begin{aligned} \boldsymbol{\mu}_{a \mid b} &=\boldsymbol{\mu}_{a}+\boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1}\left(\mathbf{x}_{b}-\boldsymbol{\mu}_{b}\right) \\ \boldsymbol{\Sigma}_{a \mid b} &=\boldsymbol{\Sigma}_{a a}-\boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1} \boldsymbol{\Sigma}_{b a} \end{aligned}\]
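
These partitioned formulas translate directly into code — a minimal sketch with an assumed joint Gaussian over \((x_a, x_b)\), both one-dimensional here:

```python
import numpy as np

# assumed partitioned joint Gaussian over (x_a, x_b)
mu_a, mu_b = np.array([0.0]), np.array([1.0])
S_aa = np.array([[1.0]])
S_ab = np.array([[0.5]])
S_bb = np.array([[2.0]])
x_b  = np.array([2.0])                            # observed value of x_b

S_bb_inv = np.linalg.inv(S_bb)
mu_cond  = mu_a + S_ab @ S_bb_inv @ (x_b - mu_b)  # conditional mean
S_cond   = S_aa - S_ab @ S_bb_inv @ S_ab.T        # conditional covariance
print(mu_cond, S_cond)
```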

2-3-3. Bayes’ theorem for Gaussian variables

marginal : \(p(\mathbf{x}) =\mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1}\right)\)

conditional : \(p(\mathbf{y} \mid \mathbf{x}) =\mathcal{N}\left(\mathbf{y} \mid \mathbf{A} \mathbf{x}+\mathbf{b}, \mathbf{L}^{-1}\right)\)

log joint : ( let \(\mathbf{z} = (\mathbf{x}^{\mathrm{T}} \;\; \mathbf{y}^{\mathrm{T}})^{\mathrm{T}}\) )

\[\begin{aligned} \ln p(\mathbf{z})=& \ln p(\mathbf{x})+\ln p(\mathbf{y} \mid \mathbf{x}) \\ =&-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Lambda}(\mathbf{x}-\boldsymbol{\mu}) -\frac{1}{2}(\mathbf{y}-\mathbf{A} \mathbf{x}-\mathbf{b})^{\mathrm{T}} \mathbf{L}(\mathbf{y}-\mathbf{A} \mathbf{x}-\mathbf{b})+\mathrm{const} \end{aligned}\]

second order term of \(\ln p(\mathbf{z})\) : ( to find the precision )

\[\begin{aligned} -\frac{1}{2} \mathbf{x}^{\mathrm{T}}\left(\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}} \mathbf{L} \mathbf{A}\right) \mathbf{x}-\frac{1}{2} \mathbf{y}^{\mathrm{T}} \mathbf{L} \mathbf{y}+\frac{1}{2} \mathbf{y}^{\mathrm{T}} \mathbf{L} \mathbf{A} \mathbf{x}+\frac{1}{2} \mathbf{x}^{\mathrm{T}} \mathbf{A}^{\mathrm{T}} \mathbf{L} \mathbf{y} &=-\frac{1}{2}\left(\begin{array}{l} \mathbf{x} \\ \mathbf{y} \end{array}\right)^{\mathrm{T}}\left(\begin{array}{cc} \boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}} \mathbf{L} \mathbf{A} & -\mathbf{A}^{\mathrm{T}} \mathbf{L} \\ -\mathbf{L} \mathbf{A} & \mathbf{L} \end{array}\right)\left(\begin{array}{l} \mathbf{x} \\ \mathbf{y} \end{array}\right) \\ &=-\frac{1}{2} \mathbf{z}^{\mathrm{T}} \mathbf{R} \mathbf{z} \end{aligned}\]

\(\therefore\) the Gaussian distribution over \(\mathbf{z}\) has the precision matrix \(\mathbf{R}\) given below.

precision

  • \(\mathbf{R}=\left(\begin{array}{cc} \Lambda+\mathbf{A}^{\mathrm{T}} \mathbf{L} \mathbf{A} & -\mathbf{A}^{\mathrm{T}} \mathbf{L} \\ -\mathbf{L} \mathbf{A} & \mathbf{L} \end{array}\right)\).

covariance

  • \(\operatorname{cov}[\mathbf{z}]=\mathbf{R}^{-1}=\left(\begin{array}{cc} \mathbf{\Lambda}^{-1} & \boldsymbol{\Lambda}^{-1} \mathbf{A}^{\mathrm{T}} \\ \mathbf{A} \Lambda^{-1} & \mathbf{L}^{-1}+\mathbf{A} \mathbf{\Lambda}^{-1} \mathbf{A}^{\mathrm{T}} \end{array}\right)\).
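
A quick numerical check of these block formulas — a minimal NumPy sketch with \(\boldsymbol{\Lambda}\), \(\mathbf{A}\), \(\mathbf{L}\) assumed for illustration:

```python
import numpy as np

Lam = np.array([[2.0, 0.3],
                [0.3, 1.0]])              # assumed precision of p(x)
A   = np.array([[1.0, -1.0]])             # assumed linear map
L   = np.array([[4.0]])                   # assumed precision of p(y | x)

# precision matrix R of the joint z = (x, y)
R = np.block([[Lam + A.T @ L @ A, -A.T @ L],
              [-L @ A,            L       ]])

# its inverse should equal the stated covariance of z
Lam_inv = np.linalg.inv(Lam)
cov_z = np.block([[Lam_inv,      Lam_inv @ A.T],
                  [A @ Lam_inv,  np.linalg.inv(L) + A @ Lam_inv @ A.T]])
print(np.allclose(np.linalg.inv(R), cov_z))   # True
```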

2-3-4. Bayesian Inference for the Gaussian

Mean

(1) prior : \(p(\mu)=\mathcal{N}\left(\mu \mid \mu_{0}, \sigma_{0}^{2}\right)\)

(2) likelihood : \(p(\mathbf{X} \mid \mu)=\prod_{n=1}^{N} p\left(x_{n} \mid \mu\right)=\frac{1}{\left(2 \pi \sigma^{2}\right)^{N / 2}} \exp \left\{-\frac{1}{2 \sigma^{2}} \sum_{n=1}^{N}\left(x_{n}-\mu\right)^{2}\right\}\)

(3) posterior : \(p(\mu \mid \mathbf{X})=\mathcal{N}\left(\mu \mid \mu_{N}, \sigma_{N}^{2}\right)\)

  • \(\mu_{N}=\frac{\sigma^{2}}{N \sigma_{0}^{2}+\sigma^{2}} \mu_{0}+\frac{N \sigma_{0}^{2}}{N \sigma_{0}^{2}+\sigma^{2}} \mu_{\mathrm{ML}}\).
  • \(\frac{1}{\sigma_{N}^{2}}=\frac{1}{\sigma_{0}^{2}}+\frac{N}{\sigma^{2}}\).

This admits a sequential update :

\[p(\boldsymbol{\mu} \mid D) \propto\left[p(\boldsymbol{\mu}) \prod_{n=1}^{N-1} p\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}\right)\right] p\left(\mathbf{x}_{N} \mid \boldsymbol{\mu}\right)\]
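
A minimal NumPy sketch of both the batch and the sequential update ( \(\mu_0\), \(\sigma_0^2\), \(\sigma^2\) and the data are all assumed for illustration ); the two give the same posterior:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma = 0.8, 1.0                 # sigma (likelihood std) assumed known
x = rng.normal(mu_true, sigma, size=50)

mu_0, sigma2_0 = 0.0, 10.0                # assumed prior N(mu | mu_0, sigma_0^2)
sigma2 = sigma**2

# batch update
N = len(x)
mu_ml = x.mean()
mu_N = (sigma2 / (N * sigma2_0 + sigma2)) * mu_0 \
     + (N * sigma2_0 / (N * sigma2_0 + sigma2)) * mu_ml
sigma2_N = 1.0 / (1.0 / sigma2_0 + N / sigma2)

# sequential update, one observation at a time, gives the same posterior
m, s2 = mu_0, sigma2_0
for xn in x:
    m  = (sigma2 / (s2 + sigma2)) * m + (s2 / (s2 + sigma2)) * xn
    s2 = 1.0 / (1.0 / s2 + 1.0 / sigma2)
print(np.isclose(mu_N, m), np.isclose(sigma2_N, s2))   # True True
```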

Precision

\[p(\mathbf{X} \mid \lambda)=\prod_{n=1}^{N} \mathcal{N}\left(x_{n} \mid \mu, \lambda^{-1}\right) \propto \lambda^{N / 2} \exp \left\{-\frac{\lambda}{2} \sum_{n=1}^{N}\left(x_{n}-\mu\right)^{2}\right\}\]

conjugate prior : “Gamma distribution”

  • \(\operatorname{Gam}(\lambda \mid a, b)=\frac{1}{\Gamma(a)} b^{a} \lambda^{a-1} \exp (-b \lambda)\).

posterior distribution :

  • \[p(\lambda \mid \mathbf{X}) \propto \lambda^{a_{0}-1} \lambda^{N / 2} \exp \left\{-b_{0} \lambda-\frac{\lambda}{2} \sum_{n=1}^{N}\left(x_{n}-\mu\right)^{2}\right\}\]
  • \[\operatorname{Gam}\left(\lambda \mid a_{N}, b_{N}\right)\]
    • \(a_{N} =a_{0}+\frac{N}{2}\).
    • \(b_{N} =b_{0}+\frac{1}{2} \sum_{n=1}^{N}\left(x_{n}-\mu\right)^{2}=b_{0}+\frac{N}{2} \sigma_{\mathrm{ML}}^{2}\).
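
A minimal SciPy sketch of this update ( known mean \(\mu\); \(a_0\), \(b_0\) and the data are assumed for illustration ). Note that SciPy's gamma is parameterized by shape \(a\) and scale \(=1/b\):

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(2)
mu, lam_true = 0.0, 2.0                    # mean assumed known
x = rng.normal(mu, 1.0 / np.sqrt(lam_true), size=100)

a0, b0 = 1.0, 1.0                          # assumed Gam(lambda | a0, b0) prior
aN = a0 + len(x) / 2.0
bN = b0 + 0.5 * np.sum((x - mu) ** 2)

posterior = gamma(a=aN, scale=1.0 / bN)    # Gam(lambda | aN, bN)
print(posterior.mean())                    # posterior mean aN / bN, close to lam_true
```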

( in the case of an MVN )

conjugate prior of the precision matrix \(\boldsymbol{\Lambda}\) : the Wishart distribution

  • \(\mathcal{W}(\boldsymbol{\Lambda} \mid \mathbf{W}, \nu)=B\mid\boldsymbol{\Lambda}\mid^{(\nu-D-1) / 2} \exp \left(-\frac{1}{2} \operatorname{Tr}\left(\mathbf{W}^{-1} \boldsymbol{\Lambda}\right)\right)\), where \(B(\mathbf{W}, \nu)\) is a normalization constant

2-3-5. Student’s t-distribution

  • likelihood : univariate Gaussian ( \(\mathcal{N}\left(x \mid \mu, \tau^{-1}\right)\) )
  • prior : Gamma prior ( \(\operatorname{Gam}(\tau \mid a, b)\) )

then, integrate out the precision!

\(\operatorname{St}(x \mid \mu, \lambda, \nu)=\frac{\Gamma(\nu / 2+1 / 2)}{\Gamma(\nu / 2)}\left(\frac{\lambda}{\pi \nu}\right)^{1 / 2}\left[1+\frac{\lambda(x-\mu)^{2}}{\nu}\right]^{-\nu / 2-1 / 2}\).

\(\operatorname{St}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu)=\int_{0}^{\infty} \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu},(\eta \boldsymbol{\Lambda})^{-1}\right) \operatorname{Gam}(\eta \mid \nu / 2, \nu / 2) \mathrm{d} \eta\).
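
A sampling-based check of this construction — a minimal NumPy/SciPy sketch ( parameters assumed ) that draws the precision from the Gamma, then \(x\) from the corresponding Gaussian, and compares the resulting moments with a Student's t of the same \(\mu\), \(\lambda\), \(\nu\):

```python
import numpy as np
from scipy.stats import t as student_t

rng = np.random.default_rng(3)
mu, lam, nu = 0.0, 1.5, 5.0                    # assumed parameters

# eta ~ Gam(nu/2, nu/2)   (numpy's gamma takes shape and scale = 1/rate)
eta = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=200_000)
# x | eta ~ N(mu, (eta * lam)^(-1))
x = rng.normal(mu, 1.0 / np.sqrt(eta * lam))

# marginally x ~ St(x | mu, lam, nu): a Student's t with df = nu, scale = lam^(-1/2)
ref = student_t(df=nu, loc=mu, scale=1.0 / np.sqrt(lam))
print(x.var(), ref.var())                      # should be close: nu / ((nu - 2) * lam)
```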

2-3-6. Mixture of Gaussians

component & mixing coefficients

\(K\) Gaussian densities : \(p(\mathbf{x})=\sum_{k=1}^{K} \pi_{k} \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{k}, \mathbf{\Sigma}_{k}\right)\)


Using the sum and product rules : \(p(\mathbf{x})=\sum_{k=1}^{K} p(k) p(\mathbf{x} \mid k)\)

  • \(\pi_{k}=p(k)\).
  • \(\mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{k}, \mathbf{\Sigma}_{k}\right)= p(\mathbf{x}\mid k)\).


Responsibilities \(\gamma_{k}(\mathbf{x})\) : ( \(p(k \mid \mathbf{x})\) is called the “responsibility” )

\(\begin{aligned} \gamma_{k}(\mathbf{x}) & \equiv p(k \mid \mathbf{x}) \\ &=\frac{p(k) p(\mathbf{x} \mid k)}{\sum_{l} p(l) p(\mathbf{x} \mid l)} \\ &=\frac{\pi_{k} \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{k}, \mathbf{\Sigma}_{k}\right)}{\sum_{l} \pi_{l} \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{l}, \mathbf{\Sigma}_{l}\right)} \end{aligned}\).
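
A minimal SciPy sketch of computing these responsibilities ( a 2-component, 2-D mixture with assumed parameters ):

```python
import numpy as np
from scipy.stats import multivariate_normal

# assumed 2-component mixture in 2-D
pi  = np.array([0.4, 0.6])
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]

def responsibilities(x):
    # gamma_k(x) = pi_k N(x | mu_k, Sigma_k) / sum_l pi_l N(x | mu_l, Sigma_l)
    dens = np.array([multivariate_normal(m, S).pdf(x) for m, S in zip(mus, Sigmas)])
    num = pi * dens
    return num / num.sum()

print(responsibilities(np.array([1.5, 1.5])))   # the two entries sum to 1
```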

2-4. The Exponential Family

\[p(\mathbf{x} \mid \boldsymbol{\eta})=h(\mathbf{x}) g(\boldsymbol{\eta}) \exp \left\{\boldsymbol{\eta}^{\mathrm{T}} \mathbf{u}(\mathbf{x})\right\}\]
  • \(\eta\) : natural parameters
  • \(\mathbf{u}(\mathbf{x})\) : some function of \(x\)

2-4-1. Maximum Likelihood and Sufficient Statistics

\[-\nabla \ln g(\boldsymbol{\eta})=\mathbb{E}[\mathbf{u}(\mathbf{x})]\] \[-\nabla \ln g\left(\boldsymbol{\eta}_{\mathrm{ML}}\right)=\frac{1}{N} \sum_{n=1}^{N} \mathbf{u}\left(\mathbf{x}_{n}\right)\]

The ML solution depends on the data only through \(\sum_{n=1}^{N} \mathbf{u}\left(\mathbf{x}_{n}\right)\) ( = the sufficient statistic )
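
For instance, for a Bernoulli written in exponential-family form the sufficient statistic is \(\sum_{n} x_{n}\), so the ML fit needs nothing else from the data — a minimal sketch with assumed data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)   # assumed Bernoulli data with mu = 0.3

# the ML estimate depends on the data only through sum_n u(x_n) = sum_n x_n
suff_stat = x.sum()
mu_ml = suff_stat / len(x)
print(mu_ml)                          # close to 0.3
```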

2-4-2. Conjugate prior

\[p(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu) \propto g(\boldsymbol{\eta})^{\nu} \exp \left\{\nu \boldsymbol{\eta}^{\mathrm{T}} \boldsymbol{\chi}\right\}\] \[p(\boldsymbol{\eta} \mid \mathbf{X}, \boldsymbol{\chi}, \nu) \propto g(\boldsymbol{\eta})^{\nu+N} \exp \left\{\boldsymbol{\eta}^{\mathrm{T}}\left(\sum_{n=1}^{N} \mathbf{u}\left(\mathbf{x}_{n}\right)+\nu \boldsymbol{\chi}\right)\right\}\]

2-4-3. Noninformative priors

if the prior assigns zero probability to some value, the posterior will as well!

Noninformative prior

  • intended to have little influence on the posterior
  • “letting the data speak for themselves”

In the case of continuous parameters….

improper : if the domain of \(\lambda\) is unbounded, the prior cannot be correctly normalized

( in practice : improper priors can be used as long as the corresponding posterior is proper )

2 examples of noninformative priors

  • (1) \(p(\mu)\) is a constant
    • location parameter : \(p(x \mid \mu)=f(x-\mu)\)
    • \(p(\mu-c)=p(\mu)\).
  • (2) \(p(\sigma) \propto 1 / \sigma\)
    • scale parameter : \(p(x \mid \sigma)=\frac{1}{\sigma} f\left(\frac{x}{\sigma}\right)\)
    • \(p(\sigma)=p\left(\frac{1}{c} \sigma\right) \frac{1}{c}\).

2-5. Nonparametric Methods

2-5-1. Kernel Density Estimators

\[K=\sum_{n=1}^{N} k\left(\frac{\mathbf{x}-\mathbf{x}_{n}}{h}\right) \quad \Rightarrow \quad p(\mathbf{x})=\frac{K}{N h^{D}}=\frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^{D}} k\left(\frac{\mathbf{x}-\mathbf{x}_{n}}{h}\right)\]

\(h\) : plays the role of a smoothing parameter

  • small \(h\) : sensitive to noise
  • large \(h\) : over-smoothing

Any other kernel function \(k(\mathbf{u})\) can be used ( see the Gaussian-kernel sketch after the conditions below ), provided that

  • \(k(\mathbf{u}) \geqslant 0\).
  • \(\int k(\mathbf{u}) \mathrm{d} \mathbf{u} =1\).
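
A minimal NumPy sketch of a kernel density estimator with a Gaussian kernel ( which satisfies both conditions ); the data and the bandwidth \(h\) are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(1, 1.0, 200)])

def kde(x, data, h):
    # p(x) = (1/N) sum_n (1/h) k((x - x_n)/h) with a Gaussian kernel (D = 1)
    u = (x[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

grid = np.linspace(-6.0, 5.0, 400)
p = kde(grid, data, h=0.3)
print((p * (grid[1] - grid[0])).sum())   # ≈ 1 : the estimate integrates to one
```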