( Basic parts and less important contents are skipped )

3. Linear Models for Regression

3-1. Linear Basis Function Models

\[y(\mathbf{x}, \mathbf{w})=\sum_{j=0}^{M-1} w_{j} \phi_{j}(\mathbf{x})=\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x})\]
  • \(\phi_j(x)\) : basis function

Examples of basis functions

  • spline functions : divide the input space up into regions & fit a different polynomial for each

  • Gaussian basis function (rbf) :

    \[\phi_{j}(x)=\exp \left\{-\frac{\left(x-\mu_{j}\right)^{2}}{2 s^{2}}\right\}\]
  • Sigmoidal basis function :

    \(\phi_{j}(x)=\sigma\left(\frac{x-\mu_{j}}{s}\right)\) where \(\sigma(a)=\frac{1}{1+\exp (-a)}\)
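A minimal NumPy sketch of these basis functions and the resulting design matrix ( the helper names, the bias column \(\phi_0(x)=1\), and the choice of centres \(\mu_j\) are illustrative assumptions, not from the text ) :

```python
import numpy as np

def gaussian_basis(x, mu, s=1.0):
    """Gaussian (RBF) basis: exp(-(x - mu)^2 / (2 s^2))."""
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2))

def sigmoidal_basis(x, mu, s=1.0):
    """Sigmoidal basis: sigma((x - mu) / s) with the logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))

def design_matrix(x, mus, s=1.0, basis=gaussian_basis):
    """Design matrix Phi of shape (N, M): phi_0(x) = 1, phi_j(x) centred at mu_j."""
    return np.column_stack([np.ones_like(x)] + [basis(x, mu, s) for mu in mus])
```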

3-1-1. MLE and Least squares

\[t=y(\mathbf{x}, \mathbf{w})+\epsilon\]
( \(\epsilon\) : zero-mean Gaussian noise with precision \(\beta\) )
  • likelihood : \(p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)=\prod_{n=1}^{N} \mathcal{N}\left(t_{n} \mid \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right), \beta^{-1}\right)\)

  • log likelihood : \(\begin{aligned} \ln p(\mathbf{t} \mid \mathbf{w}, \beta) &=\sum_{n=1}^{N} \ln \mathcal{N}\left(t_{n} \mid \mathbf{w}^{\mathrm{T}} \phi\left(\mathbf{x}_{n}\right), \beta^{-1}\right) =\frac{N}{2} \ln \beta-\frac{N}{2} \ln (2 \pi)-\beta E_{D}(\mathbf{w}) \end{aligned}\)

    where \(E_{D}(\mathbf{w})\) is SSE function , \(E_{D}(\mathbf{w})=\frac{1}{2} \sum_{n=1}^{N}\left\{t_{n}-\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)\right\}^{2}\)

Solving the ML problem ( maximizing the log likelihood ) gives :

\[\mathbf{w}_{\mathrm{ML}}=\left(\mathbf{\Phi}^{\mathrm{T}} \mathbf{\Phi}\right)^{-1} \mathbf{\Phi}^{\mathrm{T}} \mathbf{t}\]

where \(\Phi=\left(\begin{array}{cccc} \phi_{0}\left(\mathbf{x}_{1}\right) & \phi_{1}\left(\mathbf{x}_{1}\right) & \cdots & \phi_{M-1}\left(\mathbf{x}_{1}\right) \\ \phi_{0}\left(\mathbf{x}_{2}\right) & \phi_{1}\left(\mathbf{x}_{2}\right) & \cdots & \phi_{M-1}\left(\mathbf{x}_{2}\right) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{0}\left(\mathbf{x}_{N}\right) & \phi_{1}\left(\mathbf{x}_{N}\right) & \cdots & \phi_{M-1}\left(\mathbf{x}_{N}\right) \end{array}\right)\) is the \(N \times M\) design matrix
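A minimal sketch of this closed-form solution ( the toy sinusoidal data set, the Gaussian centres, and the width \(s\) are assumptions for illustration; `np.linalg.lstsq` applies the pseudo-inverse \(\left(\boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}}\) in a numerically stable way ) :

```python
import numpy as np

# Toy data: noisy samples of sin(2*pi*x) (an illustrative assumption)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Gaussian design matrix with a bias column
mus, s = np.linspace(0, 1, 8), 0.1
Phi = np.column_stack([np.ones_like(x)] +
                      [np.exp(-(x - mu) ** 2 / (2 * s ** 2)) for mu in mus])

# w_ML = (Phi^T Phi)^{-1} Phi^T t, solved via least squares
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```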

3-1-2. Sequential Learning

apply stochastic gradient descent, updating with one data point at a time ( the LMS algorithm ) :

\[\begin{aligned}\mathbf{w}^{(\tau+1)}&=\mathbf{w}^{(\tau)}-\eta \nabla E_{n} \\&= \mathbf{w}^{(\tau)}+\eta\left(t_{n}-\mathbf{w}^{(\tau) \mathrm{T}} \boldsymbol{\phi}_{n}\right) \boldsymbol{\phi}_{n}\end{aligned}\]
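A minimal sketch of this sequential update ( the simple basis \(\boldsymbol{\phi}(x)=(1, x)^{\mathrm{T}}\), the learning rate \(\eta\), and the toy data are assumptions ) :

```python
import numpy as np

def sgd_update(w, phi_n, t_n, eta=0.05):
    """One sequential (LMS) update: w <- w + eta * (t_n - w^T phi_n) * phi_n."""
    return w + eta * (t_n - w @ phi_n) * phi_n

# Toy run: t = 1 + 2x + noise, basis phi(x) = [1, x]
rng = np.random.default_rng(0)
x = rng.uniform(size=100)
t = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=100)

w = np.zeros(2)
for x_n, t_n in zip(x, t):
    w = sgd_update(w, np.array([1.0, x_n]), t_n)
# repeated passes over the data drive w toward (1, 2)
```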

3-1-3. Regularized Least Squares

to control overfitting! ( add a weight-decay regularizer )

total error = \(E_{D}(\mathbf{w})+\lambda E_{W}(\mathbf{w})\)

ex) \(E_{W}(\mathbf{w})=\frac{1}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}\)

  • then, total error function : \(\frac{1}{2} \sum_{n=1}^{N}\left\{t_{n}-\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)\right\}^{2}+\frac{\lambda}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}\)

  • solution : \(\mathbf{w}=\left(\lambda \mathbf{I}+\mathbf{\Phi}^{\mathrm{T}} \mathbf{\Phi}\right)^{-1} \mathbf{\Phi}^{\mathrm{T}} \mathbf{t}\)
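A minimal sketch of this regularized solution ( solving the linear system instead of forming the inverse explicitly is a numerical choice, not from the text ) :

```python
import numpy as np

def ridge_solution(Phi, t, lam):
    """w = (lambda * I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```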

general regularizer

\(\frac{1}{2} \sum_{n=1}^{N}\left\{t_{n}-\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_{n})\right\}^{2}+\frac{\lambda}{2} \sum_{j=1}^{M}\left|w_{j}\right|^{q}\).

  • \(q=1\) : Lasso
  • \(q=2\) : Ridge

Minimizing the “regularized” loss function

= Minimizing the “un-regularized” loss function

\(+\) the constraint \(\sum_{j=1}^{M}\left|w_{j}\right|^{q} \leqslant \eta\) ( shown via Lagrange multipliers )

3-2. Bias-Variance Decomposition

\[\begin{array}{l} \mathbb{E}_{\mathcal{D}}\left[\{y(\mathbf{x} ; \mathcal{D})-h(\mathbf{x})\}^{2}\right] \\ \quad=\underbrace{\left\{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]-h(\mathbf{x})\right\}^{2}}_{(\text {bias })^{2}}+\underbrace{\mathbb{E}_{\mathcal{D}}\left[\left\{y(\mathbf{x} ; \mathcal{D})-\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]\right\}^{2}\right]}_{\text {variance }} \end{array}\]
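A minimal Monte-Carlo sketch that estimates \((\text{bias})^2\) and variance by fitting many data sets \(\mathcal{D}\) ( the sinusoidal \(h(\mathbf{x})\), the polynomial basis, \(\lambda\), and the data-set sizes are all assumptions for illustration ) :

```python
import numpy as np

rng = np.random.default_rng(0)
h = lambda x: np.sin(2 * np.pi * x)          # true regression function h(x)
x_grid = np.linspace(0, 1, 100)
lam, n_datasets, N, M = 1e-3, 200, 25, 10

preds = []
for _ in range(n_datasets):
    x = rng.uniform(size=N)
    t = h(x) + rng.normal(scale=0.3, size=N)
    Phi = np.column_stack([x ** j for j in range(M)])        # polynomial basis
    w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
    preds.append(np.column_stack([x_grid ** j for j in range(M)]) @ w)

preds = np.array(preds)                      # shape (n_datasets, len(x_grid))
bias_sq = np.mean((preds.mean(axis=0) - h(x_grid)) ** 2)
variance = np.mean(preds.var(axis=0))
```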

3-3. Bayesian Linear Regression

3-3-1. Parameter distribution

introduce a prior over \(w\) : \(p(\mathbf{w})=\mathcal{N}\left(\mathbf{w} \mid \mathbf{m}_{0}, \mathbf{S}_{0}\right)\)

( from now on, treat \(\beta\) as a known constant )

posterior : \(p(\mathbf{w} \mid \mathbf{t})=\mathcal{N}\left(\mathbf{w} \mid \mathbf{m}_{N}, \mathbf{S}_{N}\right)\)

  • \(\mathbf{m}_{N}=\mathbf{S}_{N}\left(\mathbf{S}_{0}^{-1} \mathbf{m}_{0}+\beta \Phi^{\mathrm{T}} \mathbf{t}\right)\) ( = MAP of \(w\) )
  • \(\mathbf{S}_{N}^{-1}=\mathbf{S}_{0}^{-1}+\beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}\).

Simplify our prior as

\(p(\mathbf{w} \mid \alpha)=\mathcal{N}\left(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}\right)\).

Then, posterior : \(p(\mathbf{w} \mid \mathbf{t})=\mathcal{N}\left(\mathbf{w} \mid \mathbf{m}_{N}, \mathbf{S}_{N}\right)\)

  • \(\mathbf{m}_{N} =\beta \mathbf{S}_{N} \mathbf{\Phi}^{\mathrm{T}} \mathbf{t}\) ( = MAP of \(w\) )
  • \(\mathbf{S}_{N}^{-1} =\alpha \mathbf{I}+\beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}\).

Log of the posterior distribution :

\[\ln p(\mathbf{w} \mid \mathbf{t})=-\frac{\beta}{2} \sum_{n=1}^{N}\left\{t_{n}-\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)\right\}^{2}-\frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}+\text { const. }\]
  • have quadratic regularization term!

    ( \(\lambda=\alpha / \beta\) from \(\frac{1}{2} \sum_{n=1}^{N}\left\{t_{n}-\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)\right\}^{2}+\frac{\lambda}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}\) )
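A minimal sketch of the posterior computation under the simplified prior ( the function name and interface are assumptions ) :

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) for the prior N(w | 0, alpha^{-1} I)."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t      # = MAP estimate of w
    return m_N, S_N
```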

3-3-2. Predictive Distribution

\(p(t \mid \mathbf{t}, \alpha, \beta)=\int p(t \mid \mathbf{w}, \beta) p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta) \mathrm{d} \mathbf{w}\).

\(p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta)=\mathcal{N}\left(t \mid \mathbf{m}_{N}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}), \sigma_{N}^{2}(\mathbf{x})\right)\),

where \(\sigma_{N}^{2}(\mathbf{x})=\frac{1}{\beta}+\boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \mathbf{S}_{N} \boldsymbol{\phi}(\mathbf{x})\)


  • \(\frac{1}{\beta}\) : noise in the data
  • \(\boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \mathbf{S}_{N} \boldsymbol{\phi}(\mathbf{x})\) : uncertainty associated with \(\mathbf{w}\)
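A minimal sketch of the predictive mean and variance, reusing \(\mathbf{m}_N, \mathbf{S}_N\) from the posterior sketch above ( the interface is an assumption ) :

```python
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    """Predictive mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x)."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x   # noise term + uncertainty in w
    return mean, var
```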

3-3-3. Equivalent Kernel

predictive mean :

\[y\left(\mathbf{x}, \mathbf{m}_{N}\right)=\mathbf{m}_{N}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x})=\beta \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \mathbf{S}_{N} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t}=\sum_{n=1}^{N} \beta \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \mathbf{S}_{N} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right) t_{n}\]

Express the above another way!

\(y\left(\mathbf{x}, \mathbf{m}_{N}\right)=\sum_{n=1}^{N} k\left(\mathbf{x}, \mathbf{x}_{n}\right) t_{n}\), where \(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\beta \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \mathbf{S}_{N} \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right)\)

\(k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)\) : smoother matrix ( = equivalent kernel )

  • depends on the input values \(x_n\)
  • \(\mathbf{S}_{N}^{-1} =\alpha \mathbf{I}+\beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}\).
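A minimal sketch that evaluates the equivalent kernel \(k(\mathbf{x}, \mathbf{x}_n)\) at every training input ( the function name and array shapes are assumptions ) :

```python
import numpy as np

def equivalent_kernel(phi_x, Phi, alpha, beta):
    """k(x, x_n) = beta * phi(x)^T S_N phi(x_n) for every training input x_n."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    return beta * (phi_x @ S_N @ Phi.T)      # shape (N,), one weight per t_n

# predictive mean as a weighted sum of the targets: y(x) = sum_n k(x, x_n) t_n
# y_x = equivalent_kernel(phi_x, Phi, alpha, beta) @ t
```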

The role of the equivalent kernel can be seen from the covariance between \(y(\mathbf{x})\) and \(y(\mathbf{x}^{\prime})\)!

\[\begin{aligned} \operatorname{cov}\left[y(\mathbf{x}), y\left(\mathbf{x}^{\prime}\right)\right] &=\operatorname{cov}\left[\phi(\mathbf{x})^{\mathrm{T}} \mathbf{w}, \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right)\right] \\ &=\phi(\mathbf{x})^{\mathrm{T}} \mathbf{S}_{N} \phi\left(\mathbf{x}^{\prime}\right)=\beta^{-1} k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) \end{aligned}\]

( predictive mean at nearby points : highly correlated!)

The equivalent kernel satisfies the “inner product” property of a kernel function!

\(k(\mathbf{x}, \mathbf{z})=\boldsymbol{\psi}(\mathbf{x})^{\mathrm{T}} \boldsymbol{\psi}(\mathbf{z})\),

where \(\boldsymbol{\psi}(\mathbf{x})=\beta^{1 / 2} \mathbf{S}_{N}^{1 / 2} \boldsymbol{\phi}(\mathbf{x})\)

3-4. Bayesian Model Comparison

\[p\left(\mathcal{M}_{i} \mid \mathcal{D}\right) \propto p\left(\mathcal{M}_{i}\right) p\left(\mathcal{D} \mid \mathcal{M}_{i}\right)\]
  • assume the priors \(p(\mathcal{M}_i)\) are identical!

  • Then, just compare using “evidence”

Bayes factor

  • ratio of model evidences
  • \(p\left(\mathcal{D} \mid \mathcal{M}_{i}\right) / p\left(\mathcal{D} \mid \mathcal{M}_{j}\right)\).

if we average the log Bayes factor over the distribution of data sets ( drawn from \(\mathcal{M}_1\) ) :

\[\int p\left(\mathcal{D} \mid \mathcal{M}_{1}\right) \ln \frac{p\left(\mathcal{D} \mid \mathcal{M}_{1}\right)}{p\left(\mathcal{D} \mid \mathcal{M}_{2}\right)} \mathrm{d} \mathcal{D}\]

avoids the problem of over-fitting!

allows the models to be compared using only the training set

( but we have to make assumptions about the form of the model )

\(\leftrightarrow\) in practice, use validation/test dataset

3-5. Evidence Approximation

3-5-1. Evaluation of the evidence function

\[p(\mathbf{t} \mid \alpha, \beta)=\int p(\mathbf{t} \mid \mathbf{w}, \beta) p(\mathbf{w} \mid \alpha) \mathrm{d} \mathbf{w}\]

since both the prior and the likelihood are Gaussian, the integral can be evaluated by completing the square over \(\mathbf{w}\)…

  • evidence function : \(p(\mathbf{t} \mid \alpha, \beta)=\left(\frac{\beta}{2 \pi}\right)^{N / 2}\left(\frac{\alpha}{2 \pi}\right)^{M / 2} \int \exp \{-E(\mathbf{w})\} \mathrm{d} \mathbf{w}\)

  • loss function :

    \(\begin{aligned} E(\mathbf{w}) &=\beta E_{D}(\mathbf{w})+\alpha E_{W}(\mathbf{w}) \\ &=\frac{\beta}{2}\|\mathbf{t}-\boldsymbol{\Phi} \mathbf{w}\|^{2}+\frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w} \end{aligned}\).

  • loss function, expanded around \(\mathbf{m}_{N}\) ( the second-order Taylor expansion is exact here, since \(E(\mathbf{w})\) is quadratic ) :

    \[E(\mathbf{w})=E\left(\mathbf{m}_{N}\right)+\frac{1}{2}\left(\mathbf{w}-\mathbf{m}_{N}\right)^{\mathrm{T}} \mathbf{A}\left(\mathbf{w}-\mathbf{m}_{N}\right)\]
    • \(E(\mathbf{m}_{N})=\frac{\beta}{2}\|\mathbf{t}-\boldsymbol{\Phi} \mathbf{m}_{N}\|^{2}+\frac{\alpha}{2} \mathbf{m}_{N}^{\mathrm{T}} \mathbf{m}_{N}\).
    • \(\mathbf{A}=\alpha \mathbf{I}+\beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \;\left(=\nabla \nabla E(\mathbf{w})\right)\) : “Hessian matrix”
    • \(\mathbf{m}_{N}=\beta \mathbf{A}^{-1} \mathbf{\Phi}^{\mathrm{T}} \mathbf{t}\).

Log of marginal likelihood :

\[\ln p(\mathbf{t} \mid \alpha, \beta)=\frac{M}{2} \ln \alpha+\frac{N}{2} \ln \beta-E\left(\mathbf{m}_{N}\right)-\frac{1}{2} \ln |\mathbf{A}|-\frac{N}{2} \ln (2 \pi)\]
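A minimal sketch of this log marginal likelihood ( `slogdet` is used for \(\ln|\mathbf{A}|\) to avoid overflow; the function name and interface are assumptions ) :

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """ln p(t | alpha, beta) for the linear basis function model."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))
```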

3-5-2. Maximizing the evidence function

maximize the evidence function \(p(\mathbf{t} \mid \alpha, \beta)\) w.r.t. \(\alpha\)

( review : \(\ln p(\mathbf{t} \mid \alpha, \beta)=\frac{M}{2} \ln \alpha+\frac{N}{2}\ln\beta-E\left(\mathbf{m}_{N}\right)-\frac{1}{2}\ln|\mathbf{A}|-\frac{N}{2} \ln(2\pi)\) )


[step 1] have to solve \(\frac{d}{d\alpha}\ln p(\mathbf{t} \mid \alpha, \beta)=0\)

  • That is, \(0=\frac{M}{2\alpha}-\frac{1}{2}\mathbf{m}_{N}^{\mathrm{T}}\mathbf{m}_{N}-\frac{1}{2}\frac{d}{d\alpha}\ln |\mathbf{A}|\)


[step 2] find \(\frac{d}{d\alpha}\ln|\mathbf{A}|\)

  • define the eigenvalue equation \(\left(\beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}\right) \mathbf{u}_{i}=\lambda_{i} \mathbf{u}_{i}\)

  • since \(\mathbf{A}=\alpha \mathbf{I}+\beta \Phi^{\mathrm{T}} \Phi\),

    \(\mathbf{A}\) has eigenvalues of \(\alpha + \lambda_i\)

  • \(\therefore\) \(\frac{d}{d \alpha} \ln |\mathbf{A}|=\frac{d}{d \alpha} \ln\prod_{i}\left(\lambda_{i}+\alpha\right)=\frac{d}{d \alpha} \sum_{i} \ln\left(\lambda_{i}+\alpha\right)=\sum_{i} \frac{1}{\lambda_{i}+\alpha}\)


[step 3] find the solution

  • \(0=\frac{M}{2 \alpha}-\frac{1}{2} \mathbf{m}_{N}^{\mathrm{T}} \mathbf{m}_{N}-\frac{1}{2} \sum_{i} \frac{1}{\lambda_{i}+\alpha}\).

    \(\alpha \mathbf{m}_{N}^{\mathrm{T}} \mathbf{m}_{N}=M-\alpha \sum_{i} \frac{1}{\lambda_{i}+\alpha}=\gamma\).

    \(\gamma=\sum_{i} \frac{\lambda_{i}}{\alpha+\lambda_{i}}\).

    \(\therefore\) \(\alpha=\frac{\gamma}{\mathbf{m}_{N}^{\mathrm{T}} \mathbf{m}_{N}}\)

Find \(\beta\) the same way! ( maximizing w.r.t. \(\beta\) gives \(\frac{1}{\beta}=\frac{1}{N-\gamma} \sum_{n=1}^{N}\left\{t_{n}-\mathbf{m}_{N}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)\right\}^{2}\) )
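A minimal sketch of the resulting fixed-point iteration for \(\alpha\) and \(\beta\) ( since \(\gamma\) and \(\mathbf{m}_N\) themselves depend on \(\alpha\) and \(\beta\), the updates are iterated; the initial values and iteration count are assumptions ) :

```python
import numpy as np

def update_hyperparams(Phi, t, alpha=1.0, beta=1.0, n_iter=100):
    """Re-estimate alpha and beta by maximizing the evidence p(t | alpha, beta)."""
    N, M = Phi.shape
    eig0 = np.linalg.eigvalsh(Phi.T @ Phi)     # eigenvalues of Phi^T Phi
    for _ in range(n_iter):
        lam = beta * eig0                      # eigenvalues of beta * Phi^T Phi
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        gamma = np.sum(lam / (alpha + lam))
        alpha = gamma / (m_N @ m_N)                          # alpha = gamma / m_N^T m_N
        beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)    # 1/beta = SSE / (N - gamma)
    return alpha, beta, m_N
```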