( Basic parts and less important contents are skipped )
4. Linear Models for Classification
discriminant function :
directly assigns each vector \(\mathbf{x}\) to a specific class
\[y(\mathbf{x})=f\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}+w_{0}\right)\]
- \(f(\cdot)\) : activation function
- \(f^{-1}(\cdot)\) : link function
4-1. Discriminant Functions
linear discriminant function ( \(K\) classes in total ) :
\(y_{k}(\mathbf{x})=\mathbf{w}_{k}^{\mathrm{T}} \mathbf{x}+w_{k 0}\)
\(\rightarrow\) assign \(\mathbf{x}\) to the class \(\mathcal{C}_{k}\) with the largest \(y_{k}(\mathbf{x})\) ( see the sketch below )
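As a quick illustration, a minimal NumPy sketch of this decision rule; the weights `W` and biases `w0` below are made-up example values, not fitted parameters.

```python
import numpy as np

# Decision rule for linear discriminants: compute y_k(x) = w_k^T x + w_k0
# for every class k, then assign x to the class with the largest score.
# W (K x D) and w0 (K,) are arbitrary illustrative values.
W = np.array([[ 1.0, -0.5],
              [ 0.2,  0.8],
              [-1.0,  0.3]])          # K = 3 classes, D = 2 features
w0 = np.array([0.1, -0.2, 0.0])

x = np.array([0.5, 1.5])
y = W @ x + w0                        # all K scores at once
print("scores:", y, "-> class", np.argmax(y))
```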
4-2. Probabilistic Generative Models
skip
4-3. Probabilistic Discriminative Models
4-3-1. Fixed basis function
fixed nonlinear transformation of the inputs using a vector of basis functions \(\phi(x)\)
4-3-2. Logistic Regression
\(p\left(\mathcal{C}_{1} \mid \phi\right)=y(\phi)=\sigma\left(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\right)\) , where \(\frac{d \sigma}{d a}=\sigma(1-\sigma)\)
Likelihood function ( \(t_{n} \in \{0,1\}\) ) : \(p(\mathbf{t} \mid \mathbf{w})=\prod_{n=1}^{N} y_{n}^{t_{n}}\left\{1-y_{n}\right\}^{1-t_{n}}\)
Cross-entropy error : \(E(\mathbf{w})=-\ln p(\mathbf{t} \mid \mathbf{w})=-\sum_{n=1}^{N}\left\{t_{n} \ln y_{n}+\left(1-t_{n}\right) \ln \left(1-y_{n}\right)\right\}\)
- \(y_{n}=\sigma\left(a_{n}\right)\) , \(a_{n}=\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}_{n}\)
Gradient of cross-entropy error : \(\nabla E(\mathbf{w})=\sum_{n=1}^{N}\left(y_{n}-t_{n}\right) \phi_{n}\)
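A minimal sketch of maximum-likelihood training with this gradient, using plain gradient descent on a made-up toy dataset ( the data, the step size `eta`, and the iteration count are all assumptions for illustration ):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Made-up toy data: N points in 2-D, labels t_n in {0, 1}.
rng = np.random.default_rng(0)
N = 100
X = rng.normal(size=(N, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(float)
Phi = np.hstack([np.ones((N, 1)), X])   # identity basis plus a bias column

w = np.zeros(Phi.shape[1])
eta = 0.1                                # assumed step size
for _ in range(500):
    y = sigmoid(Phi @ w)                 # y_n = sigma(w^T phi_n)
    grad = Phi.T @ (y - t)               # sum_n (y_n - t_n) phi_n
    w -= eta * grad / N                  # averaged gradient for a stable step
print("w =", w)
```

Note that for linearly separable data the maximum-likelihood weights grow without bound ( \(\sigma\) saturates toward a step function ), so in practice a regularizer or prior is added.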
4-3-3. Iterative reweighted least squares (IRLS)
Introduction
the error function is minimized by the Newton-Raphson iterative optimization scheme :
\[\mathbf{w}^{(\text {new })}=\mathbf{w}^{(\text {old })}-\mathbf{H}^{-1} \nabla E(\mathbf{w})\]
- \(\mathbf{H}\) : Hessian matrix, whose elements are the second derivatives of \(E(\mathbf{w})\) w.r.t. the components of \(\mathbf{w}\)
For the sum-of-squares error of linear regression :
\[\begin{aligned} \nabla E(\mathbf{w}) &=\sum_{n=1}^{N}\left(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}_{n}-t_{n}\right) \boldsymbol{\phi}_{n}=\boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \mathbf{w}-\boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t} \\ \mathbf{H}=\nabla \nabla E(\mathbf{w}) &=\sum_{n=1}^{N} \boldsymbol{\phi}_{n} \boldsymbol{\phi}_{n}^{\mathrm{T}}=\boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \end{aligned}\]
Therefore, the update equation becomes :
\[\begin{aligned} \mathbf{w}^{(\text {new })} &=\mathbf{w}^{(\text {old })}-\left(\boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}\right)^{-1}\left\{\boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \mathbf{w}^{(\text {old })}-\boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t}\right\} \\ &=\left(\boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t} \end{aligned}\]
( the standard least-squares solution, reached in a single step because the error is exactly quadratic )
Now apply this to the logistic model :
\[\begin{aligned} \nabla E(\mathbf{w}) &=\sum_{n=1}^{N}\left(y_{n}-t_{n}\right) \boldsymbol{\phi}_{n}=\boldsymbol{\Phi}^{\mathrm{T}}(\mathbf{y}-\mathbf{t}) \\ \mathbf{H} &=\nabla \nabla E(\mathbf{w})=\sum_{n=1}^{N} y_{n}\left(1-y_{n}\right) \boldsymbol{\phi}_{n} \boldsymbol{\phi}_{n}^{\mathrm{T}}=\boldsymbol{\Phi}^{\mathrm{T}} \mathbf{R} \boldsymbol{\Phi} \end{aligned}\]
where \(\mathbf{R}\) is the \(N \times N\) diagonal matrix with \(R_{n n}=y_{n}\left(1-y_{n}\right)\)
Update equation :
\[\begin{aligned} \mathbf{w}^{(\mathrm{new})} &=\mathbf{w}^{(\text {old })}-\left(\boldsymbol{\Phi}^{\mathrm{T}} \mathbf{R} \boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}}(\mathbf{y}-\mathbf{t}) \\ &=\left(\boldsymbol{\Phi}^{\mathrm{T}} \mathbf{R} \boldsymbol{\Phi}\right)^{-1}\left\{\boldsymbol{\Phi}^{\mathrm{T}} \mathbf{R} \boldsymbol{\Phi} \mathbf{w}^{(\mathrm{old})}-\boldsymbol{\Phi}^{\mathrm{T}}(\mathbf{y}-\mathbf{t})\right\} \\ &=\left(\boldsymbol{\Phi}^{\mathrm{T}} \mathbf{R} \boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{R} \mathbf{z} \end{aligned}\]
where \(\mathbf{z}=\boldsymbol{\Phi} \mathbf{w}^{(\text {old })}-\mathbf{R}^{-1}(\mathbf{y}-\mathbf{t})\)
( since \(\mathbf{R}\) depends on \(\mathbf{w}\) through \(\mathbf{y}\) , the weighted least-squares problem must be re-solved at every iteration : hence "iteratively reweighted" )
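A minimal sketch of the IRLS update above ( the clipping constant `eps` is a practical safeguard I've added to keep \(\mathbf{R}\) invertible when \(y_n\) saturates, not part of the derivation ):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, n_iter=10, eps=1e-8):
    """Newton-Raphson updates w_new = (Phi^T R Phi)^{-1} Phi^T R z for
    logistic regression, with R_nn = y_n (1 - y_n) and
    z = Phi w_old - R^{-1} (y - t)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        r = np.clip(y * (1 - y), eps, None)   # diagonal of R, kept > 0
        z = Phi @ w - (y - t) / r
        # Solve the weighted least-squares normal equations directly
        # rather than forming a matrix inverse.
        w = np.linalg.solve(Phi.T @ (r[:, None] * Phi), Phi.T @ (r * z))
    return w
```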
4-3-4. Multiclass logistic regression
\[p\left(\mathcal{C}_{k} \mid \phi\right)=y_{k}(\phi)=\frac{\exp \left(a_{k}\right)}{\sum_{j} \exp \left(a_{j}\right)}\]
( the softmax function ) where the activations are \(a_{k}=\mathbf{w}_{k}^{\mathrm{T}} \boldsymbol{\phi}\)
Multiclass logistic regression :
- likelihood function ( targets \(t_{n k}\) use 1-of-\(K\) coding ) : \(p\left(\mathbf{T} \mid \mathbf{w}_{1}, \ldots, \mathbf{w}_{K}\right)=\prod_{n=1}^{N} \prod_{k=1}^{K} p\left(\mathcal{C}_{k} \mid \phi_{n}\right)^{t_{n k}}=\prod_{n=1}^{N} \prod_{k=1}^{K} y_{n k}^{t_{n k}}\)
- negative log likelihood (NLL) : \(E\left(\mathbf{w}_{1}, \ldots, \mathbf{w}_{K}\right)=-\ln p\left(\mathbf{T} \mid \mathbf{w}_{1}, \ldots, \mathbf{w}_{K}\right)=-\sum_{n=1}^{N} \sum_{k=1}^{K} t_{n k} \ln y_{n k}\) ( = cross-entropy error for multiclass )
- derivative of NLL : \(\nabla_{\mathbf{w}_{j}} E\left(\mathbf{w}_{1}, \ldots, \mathbf{w}_{K}\right)=\sum_{n=1}^{N}\left(y_{n j}-t_{n j}\right) \phi_{n}\) ( see the sketch below )
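A minimal sketch of the multiclass NLL and its gradient ( the shapes are my convention: `Phi` is N x M, `T` is N x K in 1-of-K coding, and `W` stacks the \(\mathbf{w}_{k}\) as M x K columns ):

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)    # subtract row max for stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def nll_and_grad(W, Phi, T):
    Y = softmax(Phi @ W)                    # y_nk, rows sum to 1
    nll = -np.sum(T * np.log(Y))            # -sum_n sum_k t_nk ln y_nk
    grad = Phi.T @ (Y - T)                  # column j = sum_n (y_nj - t_nj) phi_n
    return nll, grad
```

Each column of `grad` reproduces the per-class derivative above, so any gradient-based optimizer ( or IRLS with a block Hessian ) can be applied.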
4-3-5. Probit regression
probit function : \(\Phi(a)=\int_{-\infty}^{a} \mathcal{N}(\theta \mid 0,1) \mathrm{d} \theta\)
- erf function : \(\operatorname{erf}(a)=\frac{2}{\sqrt{\pi}} \int_{0}^{a} \exp \left(-\theta^{2} / 2\right) \mathrm{d} \theta\) ( note : this convention uses \(\exp(-\theta^{2}/2)\) , unlike the standard erf, which integrates \(\exp(-\theta^{2})\) )
- re-express using erf : \(\Phi(a)=\frac{1}{2}\left\{1+\frac{1}{\sqrt{2}} \operatorname{erf}(a)\right\}\)
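A minimal numerical check; note that Python's `math.erf` is the *standard* error function ( integrand \(\exp(-\theta^{2})\) ), so with it the probit function is \(\Phi(a)=\frac{1}{2}\left\{1+\operatorname{erf}(a / \sqrt{2})\right\}\) :

```python
import math

def probit(a):
    """Standard normal CDF Phi(a), via the standard math.erf."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

print(probit(0.0))    # 0.5 by symmetry
print(probit(1.96))   # ~0.975, the familiar normal quantile
```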
4-4. Laplace Approximation
aims to find a Gaussian approximation to a probability density
( find a Gaussian approximation \(q(z)\) which is centered on a mode of \(p(z)\) )
(a) 1-dim : suppose \(p(z)=\frac{1}{Z} f(z)\) , where the normalization constant \(Z\) is unknown
step 1) find a mode \(z_{0}\) of \(p(z)\)
- \(\frac{d f(z)}{d z}\Big|_{z=z_{0}}=0\)
step 2) use a Taylor series expansion of \(\ln f(z)\) around the mode
- \(\ln f(z) \simeq \ln f\left(z_{0}\right)-\frac{1}{2} A\left(z-z_{0}\right)^{2}\) , where \(A =-\frac{d^{2}}{d z^{2}} \ln f(z)\Big|_{z=z_{0}}\)
( the first-order term vanishes because \(z_{0}\) is a mode )
- take the exponential :
\[f(z) \simeq f\left(z_{0}\right) \exp \left\{-\frac{A}{2}\left(z-z_{0}\right)^{2}\right\}\]
- normalize it ( requires \(A>0\) ) :
\[q(z)=\left(\frac{A}{2 \pi}\right)^{1 / 2} \exp \left\{-\frac{A}{2}\left(z-z_{0}\right)^{2}\right\}\]
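A minimal numerical sketch of the 1-dim procedure, assuming a made-up unnormalized density \(f(z)=\exp(-z^{2}/2)\,\sigma(4z)\) ( any smooth unimodal \(f\) would do ) and finite differences for the second derivative:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def f(z):
    # made-up unnormalized density: Gaussian times a logistic sigmoid
    return np.exp(-z**2 / 2) / (1 + np.exp(-4 * z))

# step 1) find the mode z0 ( maximize ln f )
z0 = minimize_scalar(lambda z: -np.log(f(z)),
                     bounds=(-5, 5), method="bounded").x

# step 2) A = -(d^2/dz^2) ln f(z) at z0, by central finite differences
h = 1e-4
A = -(np.log(f(z0 + h)) - 2 * np.log(f(z0)) + np.log(f(z0 - h))) / h**2

# Laplace approximation q(z) = N(z | z0, 1/A)
q = lambda z: np.sqrt(A / (2 * np.pi)) * np.exp(-0.5 * A * (z - z0)**2)
print("mode z0 =", z0, ", precision A =", A)
```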
(b) M-dim
step 1) find a mode \(\mathbf{z}_{0}\) of \(p(\mathbf{z})\)
step 2) use a Taylor series expansion
- \[\ln f(\mathbf{z}) \simeq \ln f\left(\mathbf{z}_{0}\right)-\frac{1}{2}\left(\mathbf{z}-\mathbf{z}_{0}\right)^{\mathrm{T}} \mathbf{A}\left(\mathbf{z}-\mathbf{z}_{0}\right)\]
where \(\mathbf{A}=-\left.\nabla \nabla \ln f(\mathbf{z})\right|_{\mathbf{z}=\mathbf{z}_{0}}\) ( an \(M \times M\) Hessian matrix )
- take the exponential :
\[f(\mathbf{z}) \simeq f\left(\mathbf{z}_{0}\right) \exp \left\{-\frac{1}{2}\left(\mathbf{z}-\mathbf{z}_{0}\right)^{\mathrm{T}} \mathbf{A}\left(\mathbf{z}-\mathbf{z}_{0}\right)\right\}\]
- normalize it ( requires \(\mathbf{A}\) to be positive definite ) :
\[q(\mathbf{z})=\frac{|\mathbf{A}|^{1 / 2}}{(2 \pi)^{M / 2}} \exp \left\{-\frac{1}{2}\left(\mathbf{z}-\mathbf{z}_{0}\right)^{\mathrm{T}} \mathbf{A}\left(\mathbf{z}-\mathbf{z}_{0}\right)\right\}=\mathcal{N}\left(\mathbf{z} \mid \mathbf{z}_{0}, \mathbf{A}^{-1}\right)\]
Summary :
in order to apply the Laplace approximation…
- step 1) find the mode \(\mathbf{z}_{0}\)
- step 2) evaluate the Hessian matrix \(\mathbf{A}\) at the mode
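The same two steps in the M-dim case, as a sketch with a made-up log-density and a finite-difference Hessian ( any smooth unimodal \(\ln f\) would do ):

```python
import numpy as np
from scipy.optimize import minimize

def log_f(z):
    # made-up unnormalized log-density with a single mode at the origin
    return -0.5 * (z[0]**2 + 2 * z[1]**2 + z[0] * z[1]) - 0.1 * z[0]**4

def hessian(fn, z, h=1e-4):
    """Finite-difference Hessian of fn at z."""
    M = len(z)
    H = np.zeros((M, M))
    for i in range(M):
        for j in range(M):
            ei = np.zeros(M); ei[i] = h
            ej = np.zeros(M); ej[j] = h
            H[i, j] = (fn(z + ei + ej) - fn(z + ei - ej)
                       - fn(z - ei + ej) + fn(z - ei - ej)) / (4 * h * h)
    return H

z0 = minimize(lambda z: -log_f(z), x0=np.zeros(2)).x   # step 1) the mode
A = -hessian(log_f, z0)                                # step 2) A = -grad grad ln f
print("mode:", z0)
print("covariance A^{-1}:\n", np.linalg.inv(A))        # q(z) = N(z | z0, A^{-1})
```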
4-5. Bayesian Logistic Regression
Exact Bayesian inference for logistic regression is intractable!
\(\rightarrow\) use Laplace approximation
4-5-1. Laplace approximation
prior (Gaussian) : \(p(\mathbf{w})=\mathcal{N}\left(\mathbf{w} \mid \mathbf{m}_{0}, \mathbf{S}_{0}\right)\)
likelihood : \(p(\mathbf{t} \mid \mathbf{w})=\prod_{n=1}^{N} y_{n}^{t_{n}}\left\{1-y_{n}\right\}^{1-t_{n}}\)
(log) Posterior :
\[\begin{aligned} \ln p(\mathbf{w} \mid \mathbf{t})=&-\frac{1}{2}\left(\mathbf{w}-\mathbf{m}_{0}\right)^{\mathrm{T}} \mathbf{S}_{0}^{-1}\left(\mathbf{w}-\mathbf{m}_{0}\right) \\ &+\sum_{n=1}^{N}\left\{t_{n} \ln y_{n}+\left(1-t_{n}\right) \ln \left(1-y_{n}\right)\right\}+\mathrm{const} \end{aligned}\]
covariance \(\mathbf{S}_{N}\)
- the inverse of the Hessian of the negative log posterior, evaluated at the mode \(\mathbf{w}_{\mathrm{MAP}}\)
- \(\mathbf{S}_{N}^{-1}=-\nabla \nabla \ln p(\mathbf{w} \mid \mathbf{t})=\mathbf{S}_{0}^{-1}+\sum_{n=1}^{N} y_{n}\left(1-y_{n}\right) \boldsymbol{\phi}_{n} \boldsymbol{\phi}_{n}^{\mathrm{T}}\)
Result : \(q(\mathbf{w})=\mathcal{N}\left(\mathbf{w} \mid \mathbf{w}_{\mathrm{MAP}}, \mathbf{S}_{N}\right)\)
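Putting the pieces together, a sketch of this Laplace approximation: find \(\mathbf{w}_{\mathrm{MAP}}\) by Newton's method on the negative log posterior ( its gradient and Hessian follow from the expressions above ) and invert the Hessian to get \(\mathbf{S}_{N}\) :

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, m0, S0, n_iter=20):
    """q(w) = N(w | w_MAP, S_N) for Bayesian logistic regression."""
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        # gradient of the *negative* log posterior
        grad = S0_inv @ (w - m0) + Phi.T @ (y - t)
        # Hessian = S_N^{-1} = S0^{-1} + sum_n y_n (1 - y_n) phi_n phi_n^T
        H = S0_inv + Phi.T @ ((y * (1 - y))[:, None] * Phi)
        w = w - np.linalg.solve(H, grad)   # Newton step toward w_MAP
    return w, np.linalg.inv(H)             # (w_MAP, S_N)
```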
4-5-2. Predictive Distribution
Predictive distribution for class \(\mathcal{C}_{1}\)
\[p\left(\mathcal{C}_{1} \mid \boldsymbol{\phi}, \mathbf{t}\right)=\int p\left(\mathcal{C}_{1} \mid \boldsymbol{\phi}, \mathbf{w}\right) p(\mathbf{w} \mid \mathbf{t}) \mathrm{d} \mathbf{w} \simeq \int \sigma\left(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\right) q(\mathbf{w}) \mathrm{d} \mathbf{w}\]
- \(p\left(\mathcal{C}_{2} \mid \boldsymbol{\phi}, \mathbf{t}\right)=1-p\left(\mathcal{C}_{1} \mid \boldsymbol{\phi}, \mathbf{t}\right)\)
- denote \(a=\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\) , so that \(\sigma\left(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\right)=\int \delta\left(a-\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\right) \sigma(a) \mathrm{d} a\)
- Dirac delta function : \(\delta(x)=\left\{\begin{array}{ll} +\infty, & x=0 \\ 0, & x \neq 0 \end{array}\right.\) , with \(\int \delta(x) \mathrm{d} x=1\)
Therefore, \(\int \sigma\left(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\right) q(\mathbf{w}) \mathrm{d} \mathbf{w}=\int \sigma(a) p(a) \mathrm{d} a\)
- \(p(a)=\int \delta\left(a-\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\right) q(\mathbf{w}) \mathrm{d} \mathbf{w}\) , which is Gaussian since \(a\) is a linear function of the Gaussian-distributed \(\mathbf{w}\)
- \(\mu_{a}=\mathbb{E}[a]=\int p(a) a \mathrm{~d} a=\int q(\mathbf{w}) \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi} \mathrm{d} \mathbf{w}=\mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}} \boldsymbol{\phi}\)
- \(\sigma_{a}^{2}=\operatorname{var}[a]=\int p(a)\left\{a^{2}-\mathbb{E}[a]^{2}\right\} \mathrm{d} a=\int q(\mathbf{w})\left\{\left(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\right)^{2}-\left(\mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}} \boldsymbol{\phi}\right)^{2}\right\} \mathrm{d} \mathbf{w}=\boldsymbol{\phi}^{\mathrm{T}} \mathbf{S}_{N} \boldsymbol{\phi}\)
Therefore, \(p\left(\mathcal{C}_{1} \mid \mathbf{t}\right)=\int \sigma(a) p(a) \mathrm{d} a=\int \sigma(a) \mathcal{N}\left(a \mid \mu_{a}, \sigma_{a}^{2}\right) \mathrm{d} a\)
- the integral over \(a\) : a convolution of a Gaussian with a logistic sigmoid, which cannot be evaluated analytically
- approximate the logistic sigmoid \(\sigma(a)\) with the probit function \(\Phi(\lambda a)\) , where \(\lambda^{2}=\pi / 8\) ( chosen so that the two functions have the same slope at the origin )
[Tip] why use the probit function?
- its convolution with a Gaussian can be expressed analytically in terms of another probit function
- \(\int \Phi(\lambda a) \mathcal{N}\left(a \mid \mu, \sigma^{2}\right) \mathrm{d} a=\Phi\left(\frac{\mu}{\left(\lambda^{-2}+\sigma^{2}\right)^{1 / 2}}\right)\).
Therefore, \(p\left(\mathcal{C}_{1} \mid \mathbf{t}\right)=\int \sigma(a) \mathcal{N}\left(a \mid \mu_{a}, \sigma_{a}^{2}\right) \mathrm{d} a \simeq \sigma\left(\kappa\left(\sigma^{2}\right) \mu\right)\)
- \(\kappa\left(\sigma^{2}\right)=\left(1+\pi \sigma^{2} / 8\right)^{-1 / 2}\).
[Result] approximate predictive distribution :
\[p\left(\mathcal{C}_{1} \mid \phi, \mathbf{t}\right)=\sigma\left(\kappa\left(\sigma_{a}^{2}\right) \mu_{a}\right)\]
- \(\mu_{a}=\mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}} \boldsymbol{\phi}\)
- \(\sigma_{a}^{2}=\boldsymbol{\phi}^{\mathrm{T}} \mathbf{S}_{N} \boldsymbol{\phi}\)
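A sketch of this approximate predictive distribution, together with a Monte Carlo check that averages \(\sigma(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi})\) over samples \(\mathbf{w} \sim q(\mathbf{w})\) ( the sample count and seed are arbitrary choices ):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict(phi, w_map, S_N):
    """p(C1 | phi, t) ~= sigma( kappa(sigma_a^2) * mu_a )."""
    mu_a = w_map @ phi                  # mu_a = w_MAP^T phi
    var_a = phi @ S_N @ phi             # sigma_a^2 = phi^T S_N phi
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var_a / 8.0)
    return sigmoid(kappa * mu_a)

def predict_mc(phi, w_map, S_N, n=200_000, seed=0):
    """Monte Carlo estimate of the same integral, for comparison."""
    ws = np.random.default_rng(seed).multivariate_normal(w_map, S_N, size=n)
    return sigmoid(ws @ phi).mean()
```

For moderate \(\mu_{a}, \sigma_{a}^{2}\) the two estimates should agree closely, which is precisely what the probit approximation buys.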