( Basic and less important parts are skipped. )
3. Linear Models for Regression
3-1. Linear Basis Function Models
\[y(\mathbf{x}, \mathbf{w})=\sum_{j=0}^{M-1} w_{j} \phi_{j}(\mathbf{x})=\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x})\]
- \(\phi_j(\mathbf{x})\) : basis function ( with \(\phi_0(\mathbf{x})=1\) so that \(w_0\) acts as a bias )
Examples of basis functions
- spline functions : divide the input space into regions and fit a different polynomial in each region
- Gaussian basis function (RBF) : \(\phi_{j}(x)=\exp \left\{-\frac{\left(x-\mu_{j}\right)^{2}}{2 s^{2}}\right\}\)
- sigmoidal basis function : \(\phi_{j}(x)=\sigma\left(\frac{x-\mu_{j}}{s}\right)\) where \(\sigma(a)=\frac{1}{1+\exp (-a)}\)
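A minimal NumPy sketch of building a design matrix from Gaussian basis functions plus a bias term; the function name, centres, and width are illustrative choices, not fixed by the text:

```python
import numpy as np

def gaussian_design_matrix(x, mus, s):
    """Design matrix with a constant bias column phi_0(x)=1 plus Gaussian (RBF) basis functions."""
    # x: (N,) inputs, mus: (M-1,) centres mu_j, s: common width
    rbf = np.exp(-(x[:, None] - mus[None, :]) ** 2 / (2 * s ** 2))   # (N, M-1)
    return np.hstack([np.ones((len(x), 1)), rbf])                    # (N, M)

x = np.linspace(0.0, 1.0, 25)
Phi = gaussian_design_matrix(x, mus=np.linspace(0, 1, 9), s=0.1)
print(Phi.shape)   # (25, 10)
```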
3-1-1. MLE and Least squares
\[t=y(\mathbf{x}, \mathbf{w})+\epsilon, \quad \epsilon \sim \mathcal{N}\left(0, \beta^{-1}\right)\]
- likelihood : \(p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)=\prod_{n=1}^{N} \mathcal{N}\left(t_{n} \mid \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right), \beta^{-1}\right)\)
- log likelihood : \(\ln p(\mathbf{t} \mid \mathbf{w}, \beta)=\sum_{n=1}^{N} \ln \mathcal{N}\left(t_{n} \mid \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right), \beta^{-1}\right)=\frac{N}{2} \ln \beta-\frac{N}{2} \ln (2 \pi)-\beta E_{D}(\mathbf{w})\)
  where \(E_{D}(\mathbf{w})\) is the sum-of-squares error function, \(E_{D}(\mathbf{w})=\frac{1}{2} \sum_{n=1}^{N}\left\{t_{n}-\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)\right\}^{2}\)
Maximizing the log likelihood w.r.t. \(\mathbf{w}\) gives
\[\mathbf{w}_{\mathrm{ML}}=\left(\mathbf{\Phi}^{\mathrm{T}} \mathbf{\Phi}\right)^{-1} \mathbf{\Phi}^{\mathrm{T}} \mathbf{t}\]
where \(\boldsymbol{\Phi}\) is the \(N \times M\) design matrix,
\[\boldsymbol{\Phi}=\left(\begin{array}{cccc} \phi_{0}\left(\mathbf{x}_{1}\right) & \phi_{1}\left(\mathbf{x}_{1}\right) & \cdots & \phi_{M-1}\left(\mathbf{x}_{1}\right) \\ \phi_{0}\left(\mathbf{x}_{2}\right) & \phi_{1}\left(\mathbf{x}_{2}\right) & \cdots & \phi_{M-1}\left(\mathbf{x}_{2}\right) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{0}\left(\mathbf{x}_{N}\right) & \phi_{1}\left(\mathbf{x}_{N}\right) & \cdots & \phi_{M-1}\left(\mathbf{x}_{N}\right) \end{array}\right)\]
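A minimal NumPy sketch of the ML solution on toy data; the sinusoidal targets and polynomial basis are illustrative, and `lstsq` is used instead of forming \((\mathbf{\Phi}^{\mathrm{T}}\mathbf{\Phi})^{-1}\) explicitly, for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 25)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(25)   # noisy targets (illustrative)
Phi = np.vander(x, 4, increasing=True)                      # polynomial design matrix: 1, x, x^2, x^3

# w_ML = (Phi^T Phi)^{-1} Phi^T t, computed with a stable least-squares solver
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# ML estimate of the noise precision: 1/beta_ML = mean squared residual
beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)
```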
3-1-2. Sequential Learning
\[\begin{aligned}\mathbf{w}^{(\tau+1)}&=\mathbf{w}^{(\tau)}-\eta \nabla E_{n} \\&= \mathbf{w}^{(\tau)}+\eta\left(t_{n}-\mathbf{w}^{(\tau) \mathrm{T}} \boldsymbol{\phi}_{n}\right) \boldsymbol{\phi}_{n}\end{aligned}\]
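A minimal NumPy sketch of this sequential (stochastic-gradient / LMS-style) update, assuming a precomputed design matrix `Phi` and targets `t`; the learning rate and epoch count are illustrative:

```python
import numpy as np

def lms_update(w, phi_n, t_n, eta):
    """One stochastic-gradient step: w <- w + eta * (t_n - w^T phi_n) * phi_n."""
    return w + eta * (t_n - w @ phi_n) * phi_n

def fit_sequential(Phi, t, eta=0.05, epochs=50):
    """Repeated single-point updates over the data set (Phi: (N, M), t: (N,))."""
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for phi_n, t_n in zip(Phi, t):
            w = lms_update(w, phi_n, t_n, eta)
    return w
```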
3-1-3. Regularized Least Squares
to control overfitting, add a regularization ( weight-decay ) term to the error function
total error = \(E_{D}(\mathbf{w})+\lambda E_{W}(\mathbf{w})\)
ex) \(E_{W}(\mathbf{w})=\frac{1}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}\)
- then, total error function : \(\frac{1}{2} \sum_{n=1}^{N}\left\{t_{n}-\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)\right\}^{2}+\frac{\lambda}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}\)
- solution : \(\mathbf{w}=\left(\lambda \mathbf{I}+\mathbf{\Phi}^{\mathrm{T}} \mathbf{\Phi}\right)^{-1} \mathbf{\Phi}^{\mathrm{T}} \mathbf{t}\)
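A minimal NumPy sketch of this regularized (ridge) solution; the helper name is illustrative:

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Regularized least squares: w = (lam*I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```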
General regularizer :
\[\frac{1}{2} \sum_{n=1}^{N}\left\{t_{n}-\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}\left(\mathbf{x}_{n}\right)\right\}^{2}+\frac{\lambda}{2} \sum_{j=1}^{M}\left|w_{j}\right|^{q}\]
- \(q=1\) : Lasso
- \(q=2\) : Ridge
Minimizing the regularized error function is equivalent to minimizing the un-regularized error function subject to the constraint \(\sum_{j=1}^{M}\left|w_{j}\right|^{q} \leqslant \eta\) ( via a Lagrange multiplier ).
3-2. Bias-Variance Decomposition
\[\mathbb{E}_{\mathcal{D}}\left[\{y(\mathbf{x} ; \mathcal{D})-h(\mathbf{x})\}^{2}\right]=\underbrace{\left\{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]-h(\mathbf{x})\right\}^{2}}_{(\text {bias })^{2}}+\underbrace{\mathbb{E}_{\mathcal{D}}\left[\left\{y(\mathbf{x} ; \mathcal{D})-\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]\right\}^{2}\right]}_{\text {variance }}\]
where \(h(\mathbf{x})=\mathbb{E}[t \mid \mathbf{x}]\) is the optimal (conditional-mean) prediction and the expectation is over data sets \(\mathcal{D}\)
( the expected squared loss additionally contains an irreducible noise term )
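A minimal simulation sketch of this decomposition; the data generator, polynomial basis, and regularization setting are all illustrative. The same model is fit on many data sets and the (bias)\(^2\) and variance of the predictions are measured against \(h(x)=\sin(2\pi x)\):

```python
import numpy as np

rng = np.random.default_rng(0)
x_grid = np.linspace(0, 1, 100)
h = np.sin(2 * np.pi * x_grid)                      # h(x): the true regression function

def fit_and_predict(lam=1e-3, n=25):
    """Draw one data set D, fit a regularized polynomial model, predict on x_grid."""
    x = rng.uniform(0, 1, n)
    t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)
    Phi = np.vander(x, 10, increasing=True)
    w = np.linalg.solve(lam * np.eye(10) + Phi.T @ Phi, Phi.T @ t)
    return np.vander(x_grid, 10, increasing=True) @ w

Y = np.array([fit_and_predict() for _ in range(200)])    # y(x; D) over 200 data sets
y_bar = Y.mean(axis=0)                                   # E_D[y(x; D)]
bias2 = np.mean((y_bar - h) ** 2)                        # (bias)^2, averaged over x
variance = np.mean(((Y - y_bar) ** 2).mean(axis=0))      # variance, averaged over x
print(bias2, variance)
```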
3-3. Bayesian Linear Regression
3-3-1. Parameter distribution
introduce a prior over \(w\) : \(p(\mathbf{w})=\mathcal{N}\left(\mathbf{w} \mid \mathbf{m}_{0}, \mathbf{S}_{0}\right)\)
( from now on, treat \(\beta\) as a known constant )
posterior : \(p(\mathbf{w} \mid \mathbf{t})=\mathcal{N}\left(\mathbf{w} \mid \mathbf{m}_{N}, \mathbf{S}_{N}\right)\)
- \(\mathbf{m}_{N}=\mathbf{S}_{N}\left(\mathbf{S}_{0}^{-1} \mathbf{m}_{0}+\beta \Phi^{\mathrm{T}} \mathbf{t}\right)\) ( = MAP of \(w\) )
- \(\mathbf{S}_{N}^{-1}=\mathbf{S}_{0}^{-1}+\beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}\).
Simplify the prior to a zero-mean isotropic Gaussian :
\(p(\mathbf{w} \mid \alpha)=\mathcal{N}\left(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}\right)\).
Then, posterior : \(p(\mathbf{w} \mid \mathbf{t})=\mathcal{N}\left(\mathbf{w} \mid \mathbf{m}_{N}, \mathbf{S}_{N}\right)\)
- \(\mathbf{m}_{N} =\beta \mathbf{S}_{N} \mathbf{\Phi}^{\mathrm{T}} \mathbf{t}\) ( = MAP of \(w\) )
- \(\mathbf{S}_{N}^{-1} =\alpha \mathbf{I}+\beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}\).
Log of the posterior distribution :
\[\ln p(\mathbf{w} \mid \mathbf{t})=-\frac{\beta}{2} \sum_{n=1}^{N}\left\{t_{n}-\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)\right\}^{2}-\frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}+\text { const. }\]
- maximizing the posterior (MAP) is equivalent to minimizing SSE with a quadratic regularization term
  ( \(\lambda=\alpha / \beta\) compared with \(\frac{1}{2} \sum_{n=1}^{N}\left\{t_{n}-\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)\right\}^{2}+\frac{\lambda}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}\) )
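A minimal NumPy sketch computing \(\mathbf{m}_N\) and \(\mathbf{S}_N\) for the simplified (zero-mean, isotropic) prior above; the function name is illustrative:

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) with prior p(w|alpha) = N(w | 0, alpha^{-1} I)."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi   # S_N^{-1} = alpha*I + beta*Phi^T Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t                       # m_N = beta * S_N * Phi^T t
    return m_N, S_N
```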
3-3-2. Predictive Distribution
\(p(t \mid \mathbf{t}, \alpha, \beta)=\int p(t \mid \mathbf{w}, \beta) p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta) \mathrm{d} \mathbf{w}\).
\(p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta)=\mathcal{N}\left(t \mid \mathbf{m}_{N}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}), \sigma_{N}^{2}(\mathbf{x})\right)\),
where \(\sigma_{N}^{2}(\mathbf{x})=\frac{1}{\beta}+\boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \mathbf{S}_{N} \boldsymbol{\phi}(\mathbf{x})\)
- \(\frac{1}{\beta}\) : noise in the data
- \(\phi(\mathrm{x})^{\mathrm{T}} \mathrm{S}_{N} \phi(\mathrm{x})\) : uncertainty associated with \(w\)
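A minimal sketch of the predictive mean and variance for one input, reusing the hypothetical `posterior` helper's outputs `m_N`, `S_N` from the earlier block:

```python
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    """Predictive N(t | m_N^T phi(x), 1/beta + phi(x)^T S_N phi(x)) for one basis vector phi(x)."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x   # data noise + uncertainty in w
    return mean, var
```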
3-3-3. Equivalent Kernel
predictive mean :
\[y\left(\mathbf{x}, \mathbf{m}_{N}\right)=\mathbf{m}_{N}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x})=\beta \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \mathbf{S}_{N} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t}=\sum_{n=1}^{N} \beta \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \mathbf{S}_{N} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right) t_{n}\]
Express the above in another way :
\(y\left(\mathbf{x}, \mathbf{m}_{N}\right)=\sum_{n=1}^{N} k\left(\mathbf{x}, \mathbf{x}_{n}\right) t_{n}\), where \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\beta \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \mathbf{S}_{N} \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right)\)
\(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)\) : smoother matrix ( = equivalent kernel )
- depends on the input values \(\mathbf{x}_n\) through \(\mathbf{S}_{N}^{-1}=\alpha \mathbf{I}+\beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}\)
The role of the equivalent kernel can be seen from the covariance between predictions :
\[\begin{aligned} \operatorname{cov}\left[y(\mathbf{x}), y\left(\mathbf{x}^{\prime}\right)\right] &=\operatorname{cov}\left[\boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \mathbf{w}, \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right)\right] \\ &=\boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \mathbf{S}_{N} \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right)=\beta^{-1} k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) \end{aligned}\]
( predictive means at nearby points are highly correlated )
The equivalent kernel satisfies the "inner product" property of a kernel function :
\(k(\mathbf{x}, \mathbf{z})=\boldsymbol{\psi}(\mathbf{x})^{\mathrm{T}} \boldsymbol{\psi}(\mathbf{z})\),
where \(\boldsymbol{\psi}(\mathbf{x})=\beta^{1 / 2} \mathbf{S}_{N}^{1 / 2} \phi(\mathbf{x})\)
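A minimal NumPy sketch of evaluating the equivalent kernel, so the predictive mean becomes a weighted sum of the training targets; the function name and argument shapes are illustrative:

```python
import numpy as np

def equivalent_kernel(Phi_query, Phi_train, S_N, beta):
    """k(x, x_n) = beta * phi(x)^T S_N phi(x_n) for every query/training pair."""
    return beta * Phi_query @ S_N @ Phi_train.T      # shape (num queries, N)

# predictive mean as a weighted sum of training targets:
#   y(x, m_N) = sum_n k(x, x_n) t_n  ==  equivalent_kernel(Phi_query, Phi_train, S_N, beta) @ t
```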
3-4. Bayesian Model Comparison
\[p\left(\mathcal{M}_{i} \mid \mathcal{D}\right) \propto p\left(\mathcal{M}_{i}\right) p\left(\mathcal{D} \mid \mathcal{M}_{i}\right)\]
- assume the priors \(p(\mathcal{M}_i)\) are identical
- then, compare models using the "evidence" \(p(\mathcal{D} \mid \mathcal{M}_i)\)
Bayes factor
- ratio of model evidences
- \(p\left(\mathcal{D} \mid \mathcal{M}_{i}\right) / p\left(\mathcal{D} \mid \mathcal{M}_{j}\right)\).
if we average the Bayes factor over the distribution of data sets ( generated from \(\mathcal{M}_1\) ) :
\[\int p\left(\mathcal{D} \mid \mathcal{M}_{1}\right) \ln \frac{p\left(\mathcal{D} \mid \mathcal{M}_{1}\right)}{p\left(\mathcal{D} \mid \mathcal{M}_{2}\right)} \mathrm{d} \mathcal{D}\]
( this is a Kullback-Leibler divergence, so on average the Bayes factor favours the correct model )
Bayesian model comparison
- avoids the problem of over-fitting
- allows models to be compared using only the training set
- ( but has to make assumptions about the form of the models )
- \(\leftrightarrow\) in practice, it is still wise to keep an independent validation/test set
3-5. Evidence Approximation
3-5-1. Evaluation of the evidence function
\[p(\mathbf{t} \mid \alpha, \beta)=\int p(\mathbf{t} \mid \mathbf{w}, \beta) p(\mathbf{w} \mid \alpha) \mathrm{d} \mathbf{w}\]
With the Gaussian likelihood and prior above,
- evidence function : \(p(\mathbf{t} \mid \alpha, \beta)=\left(\frac{\beta}{2 \pi}\right)^{N / 2}\left(\frac{\alpha}{2 \pi}\right)^{M / 2} \int \exp \{-E(\mathbf{w})\} \mathrm{d} \mathbf{w}\)
- loss function : \(E(\mathbf{w})=\beta E_{D}(\mathbf{w})+\alpha E_{W}(\mathbf{w})=\frac{\beta}{2}\|\mathbf{t}-\boldsymbol{\Phi} \mathbf{w}\|^{2}+\frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}\)
- completing the square around \(\mathbf{m}_{N}\) ( a second-order expansion, exact here since \(E(\mathbf{w})\) is quadratic ) :
\[E(\mathbf{w})=E\left(\mathbf{m}_{N}\right)+\frac{1}{2}\left(\mathbf{w}-\mathbf{m}_{N}\right)^{\mathrm{T}} \mathbf{A}\left(\mathbf{w}-\mathbf{m}_{N}\right)\]
- \(E\left(\mathbf{m}_{N}\right)=\frac{\beta}{2}\left\|\mathbf{t}-\boldsymbol{\Phi} \mathbf{m}_{N}\right\|^{2}+\frac{\alpha}{2} \mathbf{m}_{N}^{\mathrm{T}} \mathbf{m}_{N}\)
- \(\mathbf{A}=\alpha \mathbf{I}+\beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}=\nabla \nabla E(\mathbf{w})\) ( the Hessian matrix )
- \(\mathbf{m}_{N}=\beta \mathbf{A}^{-1} \mathbf{\Phi}^{\mathrm{T}} \mathbf{t}\)
Log of marginal likelihood :
\[\ln p(\mathbf{t} \mid \alpha, \beta)=\frac{M}{2} \ln \alpha+\frac{N}{2} \ln \beta-E\left(\mathbf{m}_{N}\right)-\frac{1}{2} \ln \mid\mathbf{A}\mid-\frac{N}{2} \ln (2 \pi)\]
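A minimal NumPy sketch of this log marginal likelihood, using `np.linalg.slogdet` for \(\ln \mid\mathbf{A}\mid\); the function name is illustrative:

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """ln p(t | alpha, beta) for the linear-Gaussian model, notation as above."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))
```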
3-5-2. Maximizing the evidence function
maximize the evidence function \(p(\mathbf{t} \mid \alpha, \beta)\) w.r.t. \(\alpha\)
\((\) review : \(\ln p(\mathbf{t} \mid \alpha, \beta)=\frac{M}{2} \ln \alpha+\frac{N}{2}\ln\beta-E\left(\mathbf{m}_{N}\right)-\frac{1}{2}\ln\mid\mathbf{A}\mid-\frac{N}{2} \ln(2\pi)\) \()\)
[step 1] have to solve \(\frac{d}{d\alpha}\ln p(\mathbf{t} \mid \alpha, \beta)=0\)
- That is, \(0=\frac{M}{2\alpha}-\frac{1}{2}\mathbf{m}_{N}^{\mathrm{T}}\mathbf{m}_{N}-\frac{1}{2}\frac{d}{d\alpha}\ln \mid\mathbf{A}\mid\)
[step 2] find \(\frac{d}{d\alpha}\ln\mid\mathbf{A}\mid\)
- define the eigenvector equation \(\left(\beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}\right) \mathbf{u}_{i}=\lambda_{i} \mathbf{u}_{i}\)
- since \(\mathbf{A}=\alpha \mathbf{I}+\beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}\), \(\mathbf{A}\) has eigenvalues \(\alpha+\lambda_{i}\)
- \(\therefore\) \(\frac{d}{d \alpha} \ln \mid\mathbf{A}\mid=\frac{d}{d \alpha} \ln\prod_{i}\left(\lambda_{i}+\alpha\right)=\frac{d}{d \alpha} \sum_{i} \ln\left(\lambda_{i}+\alpha\right)=\sum_{i} \frac{1}{\lambda_{i}+\alpha}\)
[step 3] find the solution
- \(0=\frac{M}{2 \alpha}-\frac{1}{2} \mathbf{m}_{N}^{\mathrm{T}} \mathbf{m}_{N}-\frac{1}{2} \sum_{i} \frac{1}{\lambda_{i}+\alpha}\)
- \(\alpha \mathbf{m}_{N}^{\mathrm{T}} \mathbf{m}_{N}=M-\alpha \sum_{i} \frac{1}{\lambda_{i}+\alpha}=\gamma\), where \(\gamma=\sum_{i} \frac{\lambda_{i}}{\alpha+\lambda_{i}}\)
- \(\therefore\) \(\alpha=\frac{\gamma}{\mathbf{m}_{N}^{\mathrm{T}} \mathbf{m}_{N}}\)
- this is an implicit solution ( \(\gamma\) and \(\mathbf{m}_{N}\) themselves depend on \(\alpha\) ), so \(\alpha\) is found by iterative re-estimation
Find \(\beta\) in the same way ( see the sketch below ).
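A minimal NumPy sketch of the resulting iterative re-estimation of \(\alpha\) and \(\beta\); the \(\beta\) step assumes the standard re-estimation formula \(\frac{1}{\beta}=\frac{1}{N-\gamma}\sum_{n}\left\{t_{n}-\mathbf{m}_{N}^{\mathrm{T}}\boldsymbol{\phi}\left(\mathbf{x}_{n}\right)\right\}^{2}\), and the initial values and iteration count are illustrative:

```python
import numpy as np

def evidence_maximization(Phi, t, alpha=1.0, beta=1.0, iters=100):
    """Iteratively re-estimate alpha and beta by maximizing the evidence (empirical Bayes)."""
    N, M = Phi.shape
    eig = np.linalg.eigvalsh(Phi.T @ Phi)                 # eigenvalues of Phi^T Phi
    for _ in range(iters):
        lam = beta * eig                                  # lambda_i: eigenvalues of beta * Phi^T Phi
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        gamma = np.sum(lam / (alpha + lam))               # effective number of well-determined parameters
        alpha = gamma / (m_N @ m_N)                       # alpha = gamma / m_N^T m_N
        beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2) # assumed beta update (see lead-in)
    return alpha, beta, m_N
```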