( Basic parts and less important content are skipped )
1. Introduction
1-1. The Rules of Probability
Sum rule : \(p(X)=\sum_{Y} p(X, Y)\)
Product rule : \(p(X, Y)=p(Y \mid X) p(X)\)
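A quick numerical check of both rules on a made-up discrete joint distribution ( the table values are only illustrative ) :

```python
import numpy as np

# hypothetical joint p(X, Y) : rows index X (2 states), columns index Y (3 states)
p_xy = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.10, 0.20]])

# sum rule : marginals by summing the joint
p_x = p_xy.sum(axis=1)                 # p(X) = sum_Y p(X, Y)
p_y = p_xy.sum(axis=0)                 # p(Y) = sum_X p(X, Y)

# product rule : p(X, Y) = p(Y | X) p(X)
p_y_given_x = p_xy / p_x[:, None]
print(np.allclose(p_y_given_x * p_x[:, None], p_xy))               # True

# combining both : p(Y) = sum_X p(Y | X) p(X)
print(np.allclose(p_y, (p_y_given_x * p_x[:, None]).sum(axis=0)))  # True
```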
1-2. Variable Transformation
\[\begin{aligned} p_{y}(y) &=p_{x}(x)\left|\frac{\mathrm{d} x}{\mathrm{d} y}\right| \\ &=p_{x}(g(y))\left|g^{\prime}(y)\right| \end{aligned}\] where \(x=g(y)\)
1-3. Bayesian Framework
\(p(\mathbf{w} \mid \mathcal{D})=\frac{p(\mathcal{D} \mid \mathbf{w}) p(\mathbf{w})}{p(\mathcal{D})}\).
posterior \(\propto\) likelihood \(\times\) prior
1-4. Gaussian Distribution
univariate :
- \(\mathcal{N}\left(x \mid \mu, \sigma^{2}\right)=\frac{1}{\left(2 \pi \sigma^{2}\right)^{1 / 2}} \exp \left\{-\frac{1}{2 \sigma^{2}}(x-\mu)^{2}\right\}\).
multivariate :
- \(\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})=\frac{1}{(2 \pi)^{D / 2}} \frac{1}{|\boldsymbol{\Sigma}|^{1 / 2}} \exp \left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}\).
probability of the dataset :
- (original) \(p\left(\mathbf{x} \mid \mu, \sigma^{2}\right)=\prod_{n=1}^{N} \mathcal{N}\left(x_{n} \mid \mu, \sigma^{2}\right)\)
- (log) \(\ln p\left(\mathbf{x} \mid \mu, \sigma^{2}\right)=-\frac{1}{2 \sigma^{2}} \sum_{n=1}^{N}\left(x_{n}-\mu\right)^{2}-\frac{N}{2} \ln \sigma^{2}-\frac{N}{2} \ln (2 \pi)\)
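A minimal sketch of evaluating this log likelihood at the standard closed-form ML estimates ( the synthetic data and variable names are my own ) :

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)       # synthetic dataset

# ML estimates : sample mean and the biased (1/N) sample variance
mu_ml = x.mean()
sigma2_ml = ((x - mu_ml) ** 2).mean()

# log likelihood at the ML estimates, matching the formula above
N = len(x)
log_lik = (-0.5 / sigma2_ml * ((x - mu_ml) ** 2).sum()
           - 0.5 * N * np.log(sigma2_ml)
           - 0.5 * N * np.log(2 * np.pi))
print(mu_ml, sigma2_ml, log_lik)
```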
1-5. Curve Fitting with Gaussian
1-5-1. without Bayesian approach
likelihood
- (original) \(p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)=\prod_{n=1}^{N} \mathcal{N}\left(t_{n} \mid y\left(x_{n}, \mathbf{w}\right), \beta^{-1}\right)\)
- (log form) \(\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)=-\frac{\beta}{2} \sum_{n=1}^{N}\left\{y\left(x_{n}, \mathbf{w}\right)-t_{n}\right\}^{2}+\frac{N}{2} \ln \beta-\frac{N}{2} \ln (2 \pi)\)
MLE of \(\beta\) ( precision ) :
- \(\frac{1}{\beta_{\mathrm{ML}}}=\frac{1}{N} \sum_{n=1}^{N}\left\{y\left(x_{n}, \mathbf{w}_{\mathrm{ML}}\right)-t_{n}\right\}^{2}\).
after finding the MLE, we can make predictions!
predictive distribution
- \(p\left(t \mid x, \mathbf{w}_{\mathrm{ML}}, \beta_{\mathrm{ML}}\right)=\mathcal{N}\left(t \mid y\left(x, \mathbf{w}_{\mathrm{ML}}\right), \beta_{\mathrm{ML}}^{-1}\right)\).
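A small end-to-end sketch of 1-5-1 : least squares for \(\mathbf{w}_{\mathrm{ML}}\), then \(\beta_{\mathrm{ML}}\) from the residuals, then the predictive distribution ( the toy data and polynomial order are my own choices ) :

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 20, 3                                   # data points, polynomial order
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)   # noisy targets

# design matrix with polynomial features phi_j(x) = x^j
Phi = np.vander(x, M + 1, increasing=True)     # shape (N, M+1)

# w_ML : least-squares solution ( = maximizing the Gaussian likelihood )
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# beta_ML from the residuals : 1/beta = (1/N) * sum of squared errors
beta_ml = 1.0 / np.mean((Phi @ w_ml - t) ** 2)

# predictive distribution at a new input : mean y(x, w_ML), variance 1/beta_ML
x_new = 0.5
y_new = (np.vander([x_new], M + 1, increasing=True) @ w_ml)[0]
print(y_new, 1.0 / beta_ml)
```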
1-5-2. with Bayesian approach
Bayesian approach
- Bayes' rule : \(p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) p(\mathbf{w} \mid \alpha)\)
- prior : \(p(\mathbf{w} \mid \alpha)=\mathcal{N}\left(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}\right)=\left(\frac{\alpha}{2 \pi}\right)^{(M+1) / 2} \exp \left\{-\frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}\right\}\)
Find MAP by minimizing the loss function below
- \(\frac{\beta}{2} \sum_{n=1}^{N}\left\{y\left(x_{n}, \mathbf{w}\right)-t_{n}\right\}^{2}+\frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}\).
predictive distribution
- \(p(t \mid x, \mathbf{x}, \mathbf{t})=\mathcal{N}\left(t \mid m(x), s^{2}(x)\right)\)
- \(m(x)=\beta \phi(x)^{\mathrm{T}} \mathrm{S} \sum_{n=1}^{N} \phi\left(x_{n}\right) t_{n}\)
- \(s^{2}(x)=\beta^{-1}+\phi(x)^{\mathrm{T}} \mathrm{S} \phi(x)\)
- \(\beta^{-1}\) : uncertainty in the predicted value of \(t\) ( due to the noise )
- \(\phi(x)^{\mathrm{T}} \mathrm{S} \phi(x)\) : uncertainty in the parameters \(\mathbf{w}\), where \(\mathrm{S}^{-1}=\alpha \mathbf{I}+\beta \sum_{n=1}^{N} \phi\left(x_{n}\right) \phi\left(x_{n}\right)^{\mathrm{T}}\)
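A minimal sketch of these predictive equations with a polynomial basis \(\phi_j(x)=x^j\) ( the toy data, \(\alpha\), \(\beta\) and the basis are assumptions of mine ) :

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 20, 3                                   # data points, polynomial order
alpha, beta = 2.0, 25.0                        # assumed precision hyperparameters
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

def phi(x_val):
    """Polynomial basis vector (1, x, x^2, ..., x^M)."""
    return np.power(x_val, np.arange(M + 1))

Phi = np.stack([phi(xn) for xn in x])          # design matrix, shape (N, M+1)

# S^{-1} = alpha * I + beta * sum_n phi(x_n) phi(x_n)^T
S = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)

def predict(x_new):
    """Predictive mean m(x) and variance s^2(x) at a new input."""
    ph = phi(x_new)
    m = beta * ph @ S @ Phi.T @ t              # m(x) = beta * phi^T S sum_n phi(x_n) t_n
    s2 = 1.0 / beta + ph @ S @ ph              # s^2(x) = 1/beta + phi^T S phi
    return m, s2

print(predict(0.5))
```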
1-6. Model Selection
ex) K-fold CV, AIC, BIC
AIC (Akaike Information Criterion)
- \(\ln p\left(\mathcal{D} \mid \mathbf{w}_{\mathrm{ML}}\right)-M\) .
BIC (Bayesian Information Criterion)
- in chapter 4
\(\rightarrow\) AIC, BIC : the larger, the better
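A rough sketch of comparing polynomial orders with the AIC in the form above ( Gaussian noise model; counting the noise precision as a parameter is my own convention here ) :

```python
import numpy as np

rng = np.random.default_rng(3)
N = 30
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

def aic(order):
    """ln p(D | w_ML) - M for a polynomial fit with Gaussian noise."""
    Phi = np.vander(x, order + 1, increasing=True)
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    beta_ml = 1.0 / np.mean((Phi @ w_ml - t) ** 2)
    log_lik = (-0.5 * beta_ml * ((Phi @ w_ml - t) ** 2).sum()
               + 0.5 * N * np.log(beta_ml) - 0.5 * N * np.log(2 * np.pi))
    n_params = order + 2      # w_0..w_M plus the noise precision (a convention choice)
    return log_lik - n_params

print(max(range(1, 10), key=aic))              # order with the largest ( = best ) AIC
```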
1-7. Curse of Dimensionality
\[y(\mathbf{x}, \mathbf{w})=w_{0}+\sum_{i=1}^{D} w_{i} x_{i}+\sum_{i=1}^{D} \sum_{j=1}^{D} w_{i j} x_{i} x_{j}+\sum_{i=1}^{D} \sum_{j=1}^{D} \sum_{k=1}^{D} w_{i j k} x_{i} x_{j} x_{k}\]
for a polynomial of order \(M\), the number of coefficients grows like \(D^{M}\)
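A tiny count illustrating this growth ( the naive count keeps symmetric duplicates; merging them gives the standard \(\binom{D+M}{M}\) result ) :

```python
from math import comb

def naive_terms(D, M):
    """Term count when products x_i x_j ... are left un-merged: 1 + D + D^2 + ... + D^M."""
    return sum(D ** m for m in range(M + 1))

def independent_coeffs(D, M):
    """Distinct coefficients after merging symmetric terms: C(D + M, M)."""
    return comb(D + M, M)

for D in (3, 10, 100):
    print(D, naive_terms(D, 3), independent_coeffs(D, 3))
```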
1-8. Decision Theory
Probability theory provides us with a way to quantify and manipulate uncertainty!
Inference \(\&\) Decision
- Inference : determination of \(p(x,t)\) from a set of training data
- Decision : making an optimal choice
1-8-1. Minimizing the misclassification rate
- \(\mathcal{R}_{k}\) : decision regions
- \(\mathcal{C}_{k}\) : class \(k\)
using the product rule ( \(p\left(\mathbf{x}, \mathcal{C}_{k}\right)=p\left(\mathcal{C}_{k} \mid \mathbf{x}\right) p(\mathbf{x})\) ),
- \(p(\mathbf{x})\) is common to all classes
- thus, \(\mathbf{x}\) should be assigned to the class which maximizes the posterior probability \(p\left(\mathcal{C}_{k} \mid \mathbf{x}\right)\)
1-8-2. Minimizing the expected loss
average loss :
- \(\mathbb{E}[L]=\sum_{k} \sum_{j} \int_{\mathcal{R}_{j}} L_{k j} p\left(\mathbf{x}, \mathcal{C}_{k}\right) \mathrm{d} \mathbf{x}\)
- our goal : choose the regions \(\mathcal{R}_{j}\) so as to minimize this expected loss
- that is, each \(\mathbf{x}\) should be assigned to the class \(j\) that minimizes \(\sum_{k} L_{k j} p\left(\mathbf{x}, \mathcal{C}_{k}\right)\)
- \(p\left(\mathbf{x}, \mathcal{C}_{k}\right)=p\left(\mathcal{C}_{k} \mid \mathbf{x}\right) p(\mathbf{x})\)
- \(p(\mathbf{x})\) is common, so it is enough to know the posterior \(p\left(\mathcal{C}_{k} \mid \mathbf{x}\right)\) and minimize \(\sum_{k} L_{k j} p\left(\mathcal{C}_{k} \mid \mathbf{x}\right)\)
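A minimal sketch of this rule with a hypothetical 2-class loss matrix ( the loss values are made up; with a 0-1 loss this reduces to the max-posterior rule of 1-8-1 ) :

```python
import numpy as np

# hypothetical loss matrix L[k, j] : cost of deciding class j when the truth is class k
# ( e.g. missing a disease is far costlier than a false alarm )
L = np.array([[0.0, 1.0],
              [1000.0, 0.0]])

def decide(posterior):
    """Pick the class j minimizing sum_k L[k, j] * p(C_k | x)."""
    expected_loss = L.T @ posterior            # expected loss of each possible decision j
    return int(np.argmin(expected_loss))

print(decide(np.array([0.95, 0.05])))          # prints 1 : the large loss L[1, 0] flips the decision
```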
1-8-3. The Rejection option
In some cases, it is appropriate to avoid making decisions
\(\rightarrow\) leaving a human expert to classify the more ambiguous cases
1-8-4. Inference and Decision
Discriminant function : function that maps inputs \(x\) directly into decisions
3 distinct approaches to solving the problems
- approach 1) ( generative models )
- [step 1] solve inference problem of determining \(p(x\mid C_k)\) ( and the prior \(p(C_k)\) )
- [step 2] use Bayes’ theorem
- [step 3] decision theory to determine the output of the new input \(x\)
- approach 2) ( discriminative models )
- [step 1] solve inference problem of determining \(p(C_k \mid x)\)
- [step 2] decision theory to determine the output of the new input \(x\)
- approach 3)
- [step 1] use discriminant function
1-8-5. Loss function for regression
Minkowski loss ( general loss )
- \(\mathbb{E}\left[L_{q}\right]=\iint|y(\mathbf{x})-t|^{q} p(\mathbf{x}, t) \mathrm{d} \mathbf{x}\, \mathrm{d} t\).
common choice : squared loss
- \(\mathbb{E}[L]=\iint\{y(\mathbf{x})-t\}^{2} p(\mathbf{x}, t) \mathrm{d} \mathbf{x} \mathrm{d} t\).
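Minimizing the squared-loss expectation with respect to \(y(\mathbf{x})\) ( a calculus-of-variations step ) gives the conditional mean, i.e. the regression function :
\[\frac{\delta \mathbb{E}[L]}{\delta y(\mathbf{x})}=2 \int\{y(\mathbf{x})-t\} p(\mathbf{x}, t) \mathrm{d} t=0 \quad \Rightarrow \quad y(\mathbf{x})=\frac{\int t\, p(\mathbf{x}, t) \mathrm{d} t}{p(\mathbf{x})}=\mathbb{E}_{t}[t \mid \mathbf{x}]\]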
1-9. Information Theory
1-9-1. Introduction
for two independent events, the information gained from observing both = the sum of the information gained from each
- \(h(x, y)=h(x)+h(y)\),
- \(h(x)\) must be given by the logarithm of \(p(x)\)
quantity : \(h(x)=-\log _{2} p(x)\)
entropy : \(\mathrm{H}[x]=-\sum_{x} p(x) \log _{2} p(x)\)
- interpretation : average amount of information
with normalization constraint
- use Lagrange multiplier
- \(\widetilde{\mathrm{H}}=-\sum_{i} p\left(x_{i}\right) \ln p\left(x_{i}\right)+\lambda\left(\sum_{i} p\left(x_{i}\right)-1\right)\).
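Setting the derivative with respect to \(p(x_i)\) to zero shows that the maximum-entropy discrete distribution is uniform over the \(M\) states :
\[\frac{\partial \widetilde{\mathrm{H}}}{\partial p\left(x_{i}\right)}=-\ln p\left(x_{i}\right)-1+\lambda=0 \quad \Rightarrow \quad p\left(x_{i}\right)=\frac{1}{M}, \qquad \mathrm{H}=\ln M\]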
quantize continuous variable
- ( mean value theorem ) \(\int_{i \Delta}^{(i+1) \Delta} p(x) \mathrm{d} x=p\left(x_{i}\right) \Delta\)
- discretize : \(\mathrm{H}_{\Delta}=-\sum_{i} p\left(x_{i}\right) \Delta \ln \left(p\left(x_{i}\right) \Delta\right)=-\sum_{i} p\left(x_{i}\right) \Delta \ln p\left(x_{i}\right)-\ln \Delta\)
- \(\lim _{\Delta \rightarrow 0}\left\{-\sum_{i} p\left(x_{i}\right) \Delta \ln p\left(x_{i}\right)\right\}=-\int p(x) \ln p(x) \mathrm{d} x\).
differential entropy of the Gaussian : \(\mathrm{H}[x]=\frac{1}{2}\left\{1+\ln \left(2 \pi \sigma^{2}\right)\right\}\)
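A quick Monte Carlo sanity check of this closed form ( \(\sigma\) and the sample size are arbitrary ) :

```python
import numpy as np

sigma = 1.5
closed_form = 0.5 * (1 + np.log(2 * np.pi * sigma ** 2))

# Monte Carlo estimate of H[x] = -E[ln p(x)] under the same Gaussian
rng = np.random.default_rng(4)
x = rng.normal(scale=sigma, size=200_000)
log_p = -0.5 * (x / sigma) ** 2 - 0.5 * np.log(2 * np.pi * sigma ** 2)
print(closed_form, -log_p.mean())              # the two values should agree closely
```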
Conditional Entropy
( amount of additional information needed )
- given by \(-\ln p(\mathbf{y} \mid \mathbf{x})\)
- average additional information needed to specify \(\mathbf{y}\) : \(\mathrm{H}[\mathbf{y} \mid \mathbf{x}]=-\iint p(\mathbf{y}, \mathbf{x}) \ln p(\mathbf{y} \mid \mathbf{x}) \mathrm{d} \mathbf{y}\, \mathrm{d} \mathbf{x}\)
- \(\mathrm{H}[\mathbf{x}, \mathbf{y}]=\mathrm{H}[\mathbf{y} \mid \mathbf{x}]+\mathrm{H}[\mathbf{x}]\)
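A small numerical check of this chain rule ( reusing the made-up joint table from the 1-1 sketch ) :

```python
import numpy as np

p_xy = np.array([[0.10, 0.25, 0.15],           # hypothetical discrete joint p(x, y)
                 [0.20, 0.10, 0.20]])

p_x = p_xy.sum(axis=1)
H_x = -np.sum(p_x * np.log(p_x))
H_xy = -np.sum(p_xy * np.log(p_xy))

# H[y | x] = -sum p(x, y) ln p(y | x)
p_y_given_x = p_xy / p_x[:, None]
H_y_given_x = -np.sum(p_xy * np.log(p_y_given_x))

print(np.isclose(H_xy, H_y_given_x + H_x))     # chain rule H[x, y] = H[y | x] + H[x]
```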
1-9-2. Relative Entropy and Mutual Information
(1) Relative Entropy ( = KL divergence ) :
- unknown \(p(\mathbf{x})\), approximating distribution \(q(\mathbf{x})\)
- KL divergence = average additional amount of information needed to specify \(\mathbf{x}\) when using \(q(\mathbf{x})\) instead of the true \(p(\mathbf{x})\)
- KL Divergence :
\[\begin{aligned} \mathrm{KL}(p \| q) &=-\int p(\mathbf{x}) \ln q(\mathbf{x}) \mathrm{d} \mathbf{x}-\left(-\int p(\mathbf{x}) \ln p(\mathbf{x}) \mathrm{d} \mathbf{x}\right) \\ &=-\int p(\mathbf{x}) \ln \left\{\frac{q(\mathbf{x})}{p(\mathbf{x})}\right\} \mathrm{d} \mathbf{x} \end{aligned}\]
( approximate with a finite sum )
\[\mathrm{KL}(p \| q) \simeq \frac{1}{N} \sum_{n=1}^{N}\left\{-\ln q\left(\mathbf{x}_{n} \mid \boldsymbol{\theta}\right)+\ln p\left(\mathbf{x}_{n}\right)\right\}\]
- Thus, minimizing the KLD = maximizing the likelihood function
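A sketch of this equivalence : the \(\ln p(\mathbf{x}_n)\) term does not depend on \(\boldsymbol{\theta}\), so minimizing the finite-sum KL over \(\boldsymbol{\theta}\) is the same as minimizing the average negative log likelihood ( the Gaussian model and grid search below are only illustrative ) :

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=1.0, scale=2.0, size=5000)  # samples from the "unknown" p(x)

def avg_neg_log_q(theta):
    """Average -ln q(x_n | theta) for a Gaussian model q ; theta = (mu, sigma)."""
    mu, sigma = theta
    return np.mean(0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma) + 0.5 * np.log(2 * np.pi))

# minimizing the theta-dependent part of the finite-sum KL = maximizing the likelihood,
# so a grid search over theta should land near the ML estimates
grid = [(m, s) for m in np.linspace(0, 2, 21) for s in np.linspace(1, 3, 21)]
print(min(grid, key=avg_neg_log_q), (x.mean(), x.std()))
```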
(2) Mutual Information
- \(\mathrm{I}[\mathbf{x}, \mathbf{y}]=\mathrm{H}[\mathbf{x}]-\mathrm{H}[\mathbf{x} \mid \mathbf{y}]=\mathrm{H}[\mathbf{y}]-\mathrm{H}[\mathbf{y} \mid \mathbf{x}]\)
\[\begin{aligned} \mathrm{I}[\mathbf{x}, \mathbf{y}] & \equiv \mathrm{KL}(p(\mathbf{x}, \mathbf{y}) \| p(\mathbf{x}) p(\mathbf{y})) \\ &=-\iint p(\mathbf{x}, \mathbf{y}) \ln \left(\frac{p(\mathbf{x}) p(\mathbf{y})}{p(\mathbf{x}, \mathbf{y})}\right) \mathrm{d} \mathbf{x}\, \mathrm{d} \mathbf{y} \end{aligned}\]
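A small check that the KL form and the entropy form of mutual information agree ( same made-up joint as above ) :

```python
import numpy as np

p_xy = np.array([[0.10, 0.25, 0.15],           # hypothetical discrete joint p(x, y)
                 [0.20, 0.10, 0.20]])
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)

# I[x, y] = KL( p(x, y) || p(x) p(y) )
mi_kl = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))

# equivalently I[x, y] = H[x] - H[x | y]
H_x = -np.sum(p_x * np.log(p_x))
H_x_given_y = -np.sum(p_xy * np.log(p_xy / p_y))   # p_xy / p_y = p(x | y)
print(np.isclose(mi_kl, H_x - H_x_given_y))        # True
```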