( Skip the basic parts + not important contents )

1. Introduction

1-1. The Rules of Probability

Sum rule : \(p(X)=\sum_{Y} p(X, Y)\)

Product rule : \(p(X, Y)=p(Y \mid X) p(X)\)

1-2. Variable Transformation

\[\begin{aligned} p_{y}(y) &=p_{x}(x)\mid\frac{\mathrm{d} x}{\mathrm{~d} y}\mid \\ &=p_{x}(g(y))\mid g^{\prime}(y)\mid \end{aligned}\]

1-3. Bayesian Framework

\(p(\mathbf{w} \mid \mathcal{D})=\frac{p(\mathcal{D} \mid \mathbf{w}) p(\mathbf{w})}{p(\mathcal{D})}\).

posterior \(\propto\) likelihood \(\times\) prior

1-4. Gaussian Distribution

univariate :

  • \(\mathcal{N}\left(x \mid \mu, \sigma^{2}\right)=\frac{1}{\left(2 \pi \sigma^{2}\right)^{1 / 2}} \exp \left\{-\frac{1}{2 \sigma^{2}}(x-\mu)^{2}\right\}\).

multivariate :

  • \(\mathcal{N}(\mathrm{x} \mid \boldsymbol{\mu}, \mathbf{\Sigma})=\frac{1}{(2 \pi)^{D / 2}} \frac{1}{\mid\boldsymbol{\Sigma}\mid^{1 / 2}} \exp \left\{-\frac{1}{2}(\mathrm{x}-\boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1}(\mathrm{x}-\boldsymbol{\mu})\right\}\).

probability of the dataset :

  • (original) \(p\left(\mathbf{x} \mid \mu, \sigma^{2}\right)=\prod_{n=1}^{N} \mathcal{N}\left(x_{n} \mid \mu, \sigma^{2}\right)\)

  • (log) \(\ln p\left(\mathbf{x} \mid \mu, \sigma^{2}\right)=-\frac{1}{2 \sigma^{2}} \sum_{n=1}^{N}\left(x_{n}-\mu\right)^{2}-\frac{N}{2} \ln \sigma^{2}-\frac{N}{2} \ln (2 \pi)\)

1-5. Curve Fitting with Gaussian

1-5-1. without Bayesian approach

likelihood

  • (original) \(p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)=\prod_{n=1}^{N} \mathcal{N}\left(t_{n} \mid y\left(x_{n}, \mathbf{w}\right), \beta^{-1}\right)\)

  • (log form) \(\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)=-\frac{\beta}{2} \sum_{n=1}^{N}\left\{y\left(x_{n}, \mathbf{w}\right)-t_{n}\right\}^{2}+\frac{N}{2} \ln \beta-\frac{N}{2} \ln (2 \pi)\)

MLE of \(\beta\)​ ( precision ) :

  • \(\frac{1}{\beta_{\mathrm{ML}}}=\frac{1}{N} \sum_{n=1}^{N}\left\{y\left(x_{n}, \mathbf{w}_{\mathrm{ML}}\right)-t_{n}\right\}^{2}\).

after finding MLE, we can make prediction!

predictive distribution

  • \(p\left(t \mid x, \mathbf{w}_{\mathrm{ML}}, \beta_{\mathrm{ML}}\right)=\mathcal{N}\left(t \mid y\left(x, \mathbf{w}_{\mathrm{ML}}\right), \beta_{\mathrm{ML}}^{-1}\right)\).

1-5-2. with Bayesian approach

Bayesian approach

  • bayes rule : \(p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) p(\mathbf{w} \mid \alpha)\)

  • prior : \(p(\mathbf{w} \mid \alpha)=\mathcal{N}\left(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}\right)=\left(\frac{\alpha}{2 \pi}\right)^{(M+1) / 2} \exp \left\{-\frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}\right\}\)

Find MAP by minimizing the loss function below

  • \(\frac{\beta}{2} \sum_{n=1}^{N}\left\{y\left(x_{n}, \mathbf{w}\right)-t_{n}\right\}^{2}+\frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}\).

predictive distribution

  • \(p(t \mid x, \mathbf{x}, \mathbf{t})=\mathcal{N}\left(t \mid m(x), s^{2}(x)\right)\).

    • \(m(x) =\beta \phi(x)^{\mathrm{T}} \mathrm{S} \sum_{n=1}^{N} \phi\left(x_{n}\right) t_{n}\).

    • \(s^{2}(x) =\beta^{-1}+\phi(x)^{\mathrm{T}} \mathrm{S} \phi(x)\).

      • \(\beta^{-1}\) : uncertainty in the predicted value of \(t\) ( due to the noise )
      • \(\phi(x)^{\mathrm{T}} \mathrm{S} \phi(x)\) : uncertainty in the parameters \(w\), where \(\mathrm{S}^{-1}=\alpha \mathbf{I}+\beta \sum_{n=1}^{N} \phi\left(x_{n}\right) \phi(x)^{\mathrm{T}}\)

1-6. Model Selection

ex) K-fold CV,AIC,BIC

AIC (Akaike Information Criterion)

  • \(\ln p\left(\mathcal{D} \mid \mathbf{w}_{\mathrm{ML}}\right)-M\) .

BIC (Bayesian Information Criterion)

  • in chapter 4

\(\rightarrow\) AIC, BIC : the larger, the better

1-7. Curse of Dimensionality

\[y(\mathbf{x}, \mathbf{w})=w_{0}+\sum_{i=1}^{D} w_{i} x_{i}+\sum_{i=1}^{D} \sum_{j=1}^{D} w_{i j} x_{i} x_{j}+\sum_{i=1}^{D} \sum_{j=1}^{D} \sum_{k=1}^{D} w_{i j k} x_{i} x_{j} x_{k}\]

for a polynomial order of \(M\), the number of coefficients grows with \(D^M\)

1-8. Decision Theory

Probability theory provides us a way to quantify and manipulate uncertainty!

Inference \(\&\) Decision

  • Inference : Determination of \(p(x,t)\) from a set of training data

  • Decision : making an optimal choice

1-8-1. Minimizing the misclassification rate

  • \(\mathcal{R}_{k}\) : decision regions
  • \(\mathcal{C}_{k} .\) : class of \(k\)
\[\begin{aligned} p(\text { mistake }) &=p\left(\mathrm{x} \in \mathcal{R}_{1}, \mathcal{C}_{2}\right)+p\left(\mathrm{x} \in \mathcal{R}_{2}, \mathcal{C}_{1}\right) \\ &=\int_{\mathcal{R}_{1}} p\left(\mathrm{x}, \mathcal{C}_{2}\right) \mathrm{d} \mathrm{x}+\int_{\mathcal{R}_{2}} p\left(\mathrm{x}, \mathcal{C}_{1}\right) \mathrm{d} \mathrm{x} \end{aligned}\] \[\begin{aligned} p(\text { correct }) &=\sum_{k=1}^{K} p\left(\mathrm{x} \in \mathcal{R}_{k}, \mathcal{C}_{k}\right) \\ &=\sum_{k=1}^{K} \int_{\mathcal{R}_{k}} p\left(\mathrm{x}, \mathcal{C}_{k}\right) \mathrm{d} \mathrm{x} \end{aligned}\]

using the product rule ( \(p\left(\mathrm{x}, \mathcal{C}_{k}\right)=\) \(p\left(\mathcal{C}_{k} \mid \mathbf{x}\right) p(\mathbf{x})\) ),

  • \(p(x)\) is common
  • thus, \(x\) should be assigned to the class which maximizes the posterior probability \(p\left(\mathcal{C}_{k} \mid \mathbf{x}\right)\)

1-8-2. Minimizing the expected loss

average loss :

  • \(\mathbb{E}[L]=\sum_{k} \sum_{j} \int_{\mathcal{R}_{j}} L_{k j} p\left(\mathbf{x}, \mathcal{C}_{k}\right) \mathrm{d} \mathbf{x}\).

  • our goal : choose the region of \(\mathcal{R}_{j}\)

  • that is, we should minimize \(\sum_{k} L_{k j} p\left(\mathrm{x}, \mathcal{C}_{k}\right)\)

    • \(p\left(\mathbf{x}, \mathcal{C}_{k}\right)=p\left(\mathcal{C}_{k} \mid \mathbf{x}\right) p(\mathbf{x})\) .
    • \(p(\mathbf{x})\) is common, so we should find \(p\left(\mathcal{C}_{k} \mid \mathbf{x}\right)\)

1-8-3. The Rejection option

In some cases, it is appropriate to avoid making decisions

\(\rightarrow\) leaving a human expert to classify the more ambiguous cases

1-8-4. Inference and Decision

Discriminant function : function that maps inputs \(x\) directly into decisions

3 distinct approaches to solving the problems

  • approach 1)
    • [step 1] solve inference problem of determining \(p(x\mid C_k)\)
    • [step 2] use Bayes’ theorem
    • [step 3] decision theory to determine the output of the new input \(x\)
  • approach 2)
    • [step 1] solve inference problem of determining \(p(C_k \mid x)\)
    • [step 2] decision theory to determine the output of the new input \(x\)
  • approach 3)
    • [step 1] use discriminant function

1-8-5. Loss function for regression

Minkowski loss ( general loss )

  • \(\mathbb{E}\left[L_{q}\right]=\iint\mid y(\mathbf{x})-t\mid^{q} p(\mathbf{x}, t) \mathrm{d} \mathbf{x} \mathrm{d}\).

common choice : squared loss

  • \(\mathbb{E}[L]=\iint\{y(\mathbf{x})-t\}^{2} p(\mathbf{x}, t) \mathrm{d} \mathbf{x} \mathrm{d} t\).

1-9. Information Theory

1-9-1. Introduction

the information gain from observing both = sum of the each

  • \(h(x, y)=h(x)+h(y)\),
  • \(h(x)\) must be given by the logarithm of \(p(x)\)

quantity : \(h(x)=-\log _{2} p(x)\)

entropy : \(\mathrm{H}[x]=-\sum_{x} p(x) \log _{2} p(x)\)

  • interpretation : average amount of information

with normalization constraint

  • use Lagrange multiplier
  • \(\widetilde{\mathrm{H}}=-\sum_{i} p\left(x_{i}\right) \ln p\left(x_{i}\right)+\lambda\left(\sum_{i} p\left(x_{i}\right)-1\right)\).

quantize continuous variable

  • ( mean value theorem ) \(\int_{i \Delta}^{(i+1) \Delta} p(x) \mathrm{d} x=p\left(x_{i}\right) \Delta\)
  • discretize : \(\mathrm{H}_{\Delta}=-\sum_{i} p\left(x_{i}\right) \Delta \ln \left(p\left(x_{i}\right) \Delta\right)=-\sum_{i} p\left(x_{i}\right) \Delta \ln p\left(x_{i}\right)-\ln \Delta\)
  • \(\lim _{\Delta \rightarrow 0}\left\{\sum_{i} p\left(x_{i}\right) \Delta \ln p\left(x_{i}\right)\right\}=-\int p(x) \ln p(x) \mathrm{d} x\).

differential entropy of the Gaussian : \(\mathrm{H}[x]=\frac{1}{2}\left\{1+\ln \left(2 \pi \sigma^{2}\right)\right\}\)

Conditional Entropy

( amount of additional Information needed )

  • given by \(-\ln p(\mathbf{y} \mid \mathrm{x})\)

  • average additional information needed to specify \(y\)

    = \(\mathrm{H}[\mathbf{y} \mid \mathbf{x}]=-\iint p(\mathbf{y}, \mathbf{x}) \ln p(\mathbf{y} \mid \mathbf{x}) \mathrm{d} \mathbf{y} \mathrm{d} \mathbf{x}\)

  • \(\mathrm{H}[\mathrm{x}, \mathrm{y}]=\mathrm{H}[\mathrm{y} \mid \mathrm{x}]+\mathrm{H}[\mathrm{x}]\).

1-9-2. Relative Entropy and Mutual Information

(1) Relative Entropy ( = KL divergence ) :

  • unknown \(p(x)\) , approximating distribution \(q(x)\)

  • average additional amount needed : \(\mathrm{H}[\mathrm{y} \mid \mathrm{x}] = \mathrm{H}[\mathrm{x}, \mathrm{y}] -\mathrm{H}[\mathrm{x}]\)

  • KL Divergence :

    \[\begin{aligned} \mathrm{KL}(p \mid q) &=-\int p(\mathrm{x}) \ln q(\mathrm{x}) \mathrm{d} \mathrm{x}-\left(-\int p(\mathrm{x}) \ln p(\mathrm{x}) \mathrm{d} \mathrm{x}\right) \\ &=-\int p(\mathrm{x}) \ln \left\{\frac{q(\mathrm{x})}{p(\mathrm{x})}\right\} \mathrm{d} \mathrm{x} \end{aligned}\]

    ( approximate with finite sum )

    \[\mathrm{KL}(p \\mid q) \simeq \sum_{n=1}^{N}\left\{-\ln q\left(\mathbf{x}_{n} \mid \boldsymbol{\theta}\right)+\ln p\left(\mathbf{x}_{n}\right)\right\}\]
  • Thus, minimizing KLD = maximizing likelihood function

(2) Mutual Information

  • \(\mathrm{I}[\mathrm{x}, \mathrm{y}]=\mathrm{H}[\mathrm{x}]-\mathrm{H}[\mathrm{x} \mid \mathrm{y}]=\mathrm{H}[\mathrm{y}]-\mathrm{H}[\mathrm{y} \mid \mathrm{x}]\).

    \[\begin{aligned} \mathrm{I}[\mathrm{x}, \mathrm{y}] & \equiv \mathrm{KL}(p(\mathrm{x}, \mathrm{y}) \mid p(\mathrm{x}) p(\mathrm{y})) \\ &=-\iint p(\mathrm{x}, \mathrm{y}) \ln \left(\frac{p(\mathrm{x}) p(\mathrm{y})}{p(\mathrm{x}, \mathrm{y})}\right) \mathrm{d} \mathrm{x} \mathrm{d} \mathrm{y} \end{aligned}\]