( Skipping the basic parts and less important content )

4. Linear Models for Classification

discriminant function :

directly assigns each input vector $\mathbf{x}$ to a specific class

$$y(\mathbf{x}) = f(\mathbf{w}^T\mathbf{x} + w_0)$$
  • $f(\cdot)$ : activation function
  • $f^{-1}(\cdot)$ : link function

4-1. Discriminant Functions

linear discriminant function : ( total of K classes )

$$y_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + w_{k0}$$
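As a minimal sketch (with made-up weights, not fitted values), classification with K linear discriminants just assigns $\mathbf{x}$ to the class with the largest $y_k(\mathbf{x})$:

```python
import numpy as np

# Minimal sketch of K linear discriminants y_k(x) = w_k^T x + w_k0;
# x is assigned to the class with the largest y_k(x).
# The weights below are illustrative values, not fitted to data.
W = np.array([[1.0, 0.0],    # w_1
              [0.0, 1.0]])   # w_2
w0 = np.array([0.0, -0.5])   # biases w_10, w_20

def classify(x):
    """Return the index k maximizing y_k(x) = w_k^T x + w_k0."""
    y = W @ x + w0
    return int(np.argmax(y))

print(classify(np.array([2.0, 0.5])))  # class 0, since y = [2.0, 0.0]
```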

4-2. Probabilistic Generative Models

skip

4-3. Probabilistic Discriminative Models

4-3-1. Fixed basis function

fixed nonlinear transformation of the inputs, using a vector of basis functions $\phi(\mathbf{x})$

4-3-2. Logistic Regression

$$p(C_1|\phi) = y(\phi) = \sigma(\mathbf{w}^T\phi), \quad \text{where } \frac{d\sigma}{da} = \sigma(1-\sigma)$$


Likelihood function : $p(\mathbf{t}|\mathbf{w}) = \prod_{n=1}^N y_n^{t_n}\{1-y_n\}^{1-t_n}$

Cross-entropy error : $E(\mathbf{w}) = -\ln p(\mathbf{t}|\mathbf{w}) = -\sum_{n=1}^N \{t_n \ln y_n + (1-t_n)\ln(1-y_n)\}$

  • $y_n = \sigma(a_n)$
  • $a_n = \mathbf{w}^T\phi_n$

Gradient of cross-entropy error : $\nabla E(\mathbf{w}) = \sum_{n=1}^N (y_n - t_n)\phi_n$
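This gradient formula is easy to sanity-check numerically. A small sketch on random toy data (the design matrix `Phi` and targets `t` are made up for illustration), comparing $\Phi^T(\mathbf{y}-\mathbf{t})$ against finite differences of $E(\mathbf{w})$:

```python
import numpy as np

# Sketch: cross-entropy error and its gradient for logistic regression,
# checked against a finite-difference approximation. Random toy data.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 3))              # design matrix of features phi_n
t = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def E(w):
    """Cross-entropy error E(w) = -sum_n {t_n ln y_n + (1-t_n) ln(1-y_n)}."""
    y = sigmoid(Phi @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def grad_E(w):
    """Analytic gradient: sum_n (y_n - t_n) phi_n = Phi^T (y - t)."""
    return Phi.T @ (sigmoid(Phi @ w) - t)

# central-difference check of the gradient formula
eps = 1e-6
num = np.array([(E(w + eps * np.eye(3)[i]) - E(w - eps * np.eye(3)[i])) / (2 * eps)
                for i in range(3)])
print(np.max(np.abs(num - grad_E(w))))      # should be tiny
```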

4-3-3. Iterative reweighted least squares (IRLS)

Introduction

the error function is minimized by the Newton-Raphson iterative optimization scheme :

$$\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - \mathbf{H}^{-1}\nabla E(\mathbf{w})$$

$\mathbf{H}$ : Hessian matrix

( whose elements comprise the second derivatives of $E(\mathbf{w})$ w.r.t. the components of $\mathbf{w}$ )

For the sum-of-squares error :

$$\nabla E(\mathbf{w}) = \sum_{n=1}^N (\mathbf{w}^T\phi_n - t_n)\phi_n = \Phi^T\Phi\mathbf{w} - \Phi^T\mathbf{t}$$

$$\mathbf{H} = \nabla\nabla E(\mathbf{w}) = \sum_{n=1}^N \phi_n\phi_n^T = \Phi^T\Phi$$

Therefore, the update equation becomes :

$$\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - (\Phi^T\Phi)^{-1}\{\Phi^T\Phi\mathbf{w}^{(\text{old})} - \Phi^T\mathbf{t}\} = (\Phi^T\Phi)^{-1}\Phi^T\mathbf{t}$$
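A quick numerical check of this result on random toy data: because the sum-of-squares error is exactly quadratic, a single Newton step from any starting point lands on the least-squares solution:

```python
import numpy as np

# Sketch: for the sum-of-squares error, the Newton-Raphson update converges
# in one step to the least-squares solution (Phi^T Phi)^{-1} Phi^T t.
# Random toy data, illustrative only.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(30, 4))
t = rng.normal(size=30)
w = np.zeros(4)                          # arbitrary starting point

grad = Phi.T @ Phi @ w - Phi.T @ t       # nabla E(w)
H = Phi.T @ Phi                          # Hessian
w_new = w - np.linalg.solve(H, grad)     # one Newton step

w_ls = np.linalg.lstsq(Phi, t, rcond=None)[0]
print(np.allclose(w_new, w_ls))          # True
```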

Apply this to the "logistic model"

$$\nabla E(\mathbf{w}) = \sum_{n=1}^N (y_n - t_n)\phi_n = \Phi^T(\mathbf{y} - \mathbf{t})$$

$$\mathbf{H} = \nabla\nabla E(\mathbf{w}) = \sum_{n=1}^N y_n(1-y_n)\phi_n\phi_n^T = \Phi^T\mathbf{R}\Phi$$

where $R_{nn} = y_n(1-y_n)$

Update equation :

$$\begin{aligned} \mathbf{w}^{(\text{new})} &= \mathbf{w}^{(\text{old})} - (\Phi^T\mathbf{R}\Phi)^{-1}\Phi^T(\mathbf{y}-\mathbf{t}) \\ &= (\Phi^T\mathbf{R}\Phi)^{-1}\left\{\Phi^T\mathbf{R}\Phi\mathbf{w}^{(\text{old})} - \Phi^T(\mathbf{y}-\mathbf{t})\right\} \\ &= (\Phi^T\mathbf{R}\Phi)^{-1}\Phi^T\mathbf{R}\mathbf{z} \end{aligned}$$

where $\mathbf{z} = \Phi\mathbf{w}^{(\text{old})} - \mathbf{R}^{-1}(\mathbf{y}-\mathbf{t})$
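The IRLS update can be sketched as follows on toy synthetic data (the clipping of the activations is an added numerical safeguard, not part of the algorithm itself):

```python
import numpy as np

# Sketch of IRLS for logistic regression on toy synthetic data.
# Each step solves w = (Phi^T R Phi)^{-1} Phi^T R z with
# R_nn = y_n (1 - y_n) and z = Phi w - R^{-1} (y - t).
rng = np.random.default_rng(2)
Phi = rng.normal(size=(100, 2))
w_true = np.array([1.5, -2.0])                      # illustrative generator
t = (rng.random(100) < 1 / (1 + np.exp(-Phi @ w_true))).astype(float)

w = np.zeros(2)
for _ in range(20):
    a = np.clip(Phi @ w, -30, 30)                   # numerical safeguard
    y = 1 / (1 + np.exp(-a))
    R = y * (1 - y)                                 # diagonal of R
    z = Phi @ w - (y - t) / R                       # working response
    w = np.linalg.solve(Phi.T @ (R[:, None] * Phi), Phi.T @ (R * z))

grad = Phi.T @ (1 / (1 + np.exp(-Phi @ w)) - t)     # ~0 at the optimum
print(w, np.max(np.abs(grad)))
```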

4-3-4. Multiclass logistic regression

$$p(C_k|\phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$

where the activations are $a_k = \mathbf{w}_k^T\phi$

Multiclass logistic regression

  • likelihood function : $p(\mathbf{T}|\mathbf{w}_1,\dots,\mathbf{w}_K) = \prod_{n=1}^N\prod_{k=1}^K p(C_k|\phi_n)^{t_{nk}} = \prod_{n=1}^N\prod_{k=1}^K y_{nk}^{t_{nk}}$

  • negative log-likelihood (NLL) : $E(\mathbf{w}_1,\dots,\mathbf{w}_K) = -\ln p(\mathbf{T}|\mathbf{w}_1,\dots,\mathbf{w}_K) = -\sum_{n=1}^N\sum_{k=1}^K t_{nk}\ln y_{nk}$

    ( = cross entropy error )

  • derivative of NLL : $\nabla_{\mathbf{w}_j} E(\mathbf{w}_1,\dots,\mathbf{w}_K) = \sum_{n=1}^N (y_{nj} - t_{nj})\phi_n$
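A finite-difference check of this gradient on random toy data (one-hot targets and a random design matrix, assumed for illustration):

```python
import numpy as np

# Sketch: softmax probabilities and the gradient of the multiclass
# cross-entropy, nabla_{w_j} E = sum_n (y_nj - t_nj) phi_n,
# verified by finite differences on random toy data.
rng = np.random.default_rng(3)
N, D, K = 15, 3, 4
Phi = rng.normal(size=(N, D))
T = np.eye(K)[rng.integers(0, K, size=N)]   # one-hot targets t_nk
W = rng.normal(size=(D, K))                 # columns are w_k

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)    # for numerical stability
    e = np.exp(A)
    return e / e.sum(axis=1, keepdims=True)

def E(W):
    """Multiclass cross-entropy E = -sum_n sum_k t_nk ln y_nk."""
    Y = softmax(Phi @ W)
    return -np.sum(T * np.log(Y))

grad = Phi.T @ (softmax(Phi @ W) - T)       # analytic gradient, shape (D, K)

eps = 1e-6
num = np.zeros_like(W)
for i in range(D):
    for j in range(K):
        dW = np.zeros_like(W); dW[i, j] = eps
        num[i, j] = (E(W + dW) - E(W - dW)) / (2 * eps)
print(np.max(np.abs(num - grad)))           # should be tiny
```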

4-3-5. Probit regression

probit function : $\Phi(a) = \int_{-\infty}^a \mathcal{N}(\theta|0,1)\,d\theta$

  • erf function : $\operatorname{erf}(a) = \frac{2}{\sqrt{\pi}}\int_0^a \exp(-\theta^2/2)\,d\theta$
  • re-expressed using erf : $\Phi(a) = \frac{1}{2}\left\{1 + \frac{1}{\sqrt{2}}\operatorname{erf}(a)\right\}$

4-4. Laplace Approximation

aims to find a Gaussian approximation to a probability density

( find a Gaussian approximation $q(z)$ which is centered on a mode of $p(z)$ )

(a) 1-dim

step 1) find a mode of p(z)

  • $\left.\frac{df(z)}{dz}\right|_{z=z_0} = 0$

step 2) use Taylor series expansion

  • $\ln f(z) \simeq \ln f(z_0) - \frac{1}{2}A(z-z_0)^2$, where $A = -\left.\frac{d^2}{dz^2}\ln f(z)\right|_{z=z_0}$

  • take the exponential

    $f(z) \simeq f(z_0)\exp\left\{-\frac{A}{2}(z-z_0)^2\right\}$
  • normalize it

    $q(z) = \left(\frac{A}{2\pi}\right)^{1/2}\exp\left\{-\frac{A}{2}(z-z_0)^2\right\}$
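The two steps can be sketched on the density $f(z) = \exp(-z^2/2)\,\sigma(20z+4)$, the running example of PRML Fig. 4.14, with the mode found by a crude grid search and $A$ by finite differences:

```python
import math

# Sketch of the two Laplace steps on the unnormalized density
# f(z) = exp(-z^2/2) * sigma(20 z + 4), the example of PRML Fig. 4.14.
def sigma(a):
    return 1.0 / (1.0 + math.exp(-a))

def ln_f(z):
    return -0.5 * z * z + math.log(sigma(20.0 * z + 4.0))

# step 1) find the mode z0 by a crude grid search over [-3, 3]
z0 = max((i / 10000.0 for i in range(-30000, 30001)), key=ln_f)

# step 2) A = -d^2/dz^2 ln f(z) at z0, via central differences
h = 1e-4
A = -(ln_f(z0 + h) - 2.0 * ln_f(z0) + ln_f(z0 - h)) / (h * h)

def q(z):
    """Normalized Gaussian approximation q(z) = N(z | z0, 1/A)."""
    return math.sqrt(A / (2.0 * math.pi)) * math.exp(-0.5 * A * (z - z0) ** 2)

print(z0, A)   # mode slightly above 0, with precision A > 1
```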

(b) M-dim

step 1) find a mode of p(z)

step 2) use Taylor series expansion

  • $\ln f(\mathbf{z}) \simeq \ln f(\mathbf{z}_0) - \frac{1}{2}(\mathbf{z}-\mathbf{z}_0)^T\mathbf{A}(\mathbf{z}-\mathbf{z}_0)$

    where $\mathbf{A} = -\left.\nabla\nabla\ln f(\mathbf{z})\right|_{\mathbf{z}=\mathbf{z}_0}$ ( an $M \times M$ matrix )

  • take the exponential

    $f(\mathbf{z}) \simeq f(\mathbf{z}_0)\exp\left\{-\frac{1}{2}(\mathbf{z}-\mathbf{z}_0)^T\mathbf{A}(\mathbf{z}-\mathbf{z}_0)\right\}$
  • normalize it

    $q(\mathbf{z}) = \frac{|\mathbf{A}|^{1/2}}{(2\pi)^{M/2}}\exp\left\{-\frac{1}{2}(\mathbf{z}-\mathbf{z}_0)^T\mathbf{A}(\mathbf{z}-\mathbf{z}_0)\right\} = \mathcal{N}(\mathbf{z}|\mathbf{z}_0, \mathbf{A}^{-1})$

Summary :

in order to apply the Laplace approximation :

  • step 1) find the mode $\mathbf{z}_0$
  • step 2) evaluate the Hessian matrix at the mode

4-5. Bayesian Logistic regression

Exact Bayesian inference for logistic regression is intractable!

use the Laplace approximation instead

4-5-1. Laplace approximation

prior (Gaussian) : $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_0, \mathbf{S}_0)$

likelihood : $p(\mathbf{t}|\mathbf{w}) = \prod_{n=1}^N y_n^{t_n}\{1-y_n\}^{1-t_n}$

(log) Posterior :

$$\ln p(\mathbf{w}|\mathbf{t}) = -\frac{1}{2}(\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0) + \sum_{n=1}^N\left\{t_n\ln y_n + (1-t_n)\ln(1-y_n)\right\} + \text{const}$$

covariance

  • inverse of the matrix of second derivatives of the negative log posterior
  • $\mathbf{S}_N^{-1} = -\nabla\nabla\ln p(\mathbf{w}|\mathbf{t}) = \mathbf{S}_0^{-1} + \sum_{n=1}^N y_n(1-y_n)\phi_n\phi_n^T$

Result : $q(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{w}_{\text{MAP}}, \mathbf{S}_N)$
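A sketch of the whole approximation on toy data, with an assumed prior $\mathcal{N}(\mathbf{0}, 10\mathbf{I})$ for illustration; $\mathbf{w}_{\text{MAP}}$ is found by Newton's method on the negative log posterior:

```python
import numpy as np

# Sketch: Laplace approximation q(w) = N(w | w_MAP, S_N) for Bayesian
# logistic regression on toy data. Assumed prior: m0 = 0, S0 = 10 I.
rng = np.random.default_rng(4)
Phi = rng.normal(size=(50, 2))
t = (rng.random(50) < 1 / (1 + np.exp(-Phi @ np.array([1.0, -1.0])))).astype(float)
S0_inv = np.eye(2) / 10.0                            # prior precision S0^{-1}

w = np.zeros(2)
for _ in range(25):                                   # Newton's method for w_MAP
    y = 1 / (1 + np.exp(-Phi @ w))
    grad = Phi.T @ (y - t) + S0_inv @ w               # grad of neg log posterior
    H = Phi.T @ ((y * (1 - y))[:, None] * Phi) + S0_inv
    w = w - np.linalg.solve(H, grad)

w_MAP = w
y = 1 / (1 + np.exp(-Phi @ w_MAP))
# S_N^{-1} = S_0^{-1} + sum_n y_n (1 - y_n) phi_n phi_n^T
S_N = np.linalg.inv(S0_inv + Phi.T @ ((y * (1 - y))[:, None] * Phi))
print(w_MAP, S_N)
```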

4-5-2. Predictive Distribution

Predictive distribution for class $C_1$ :

$$p(C_1|\phi,\mathbf{t}) = \int p(C_1|\phi,\mathbf{w})\,p(\mathbf{w}|\mathbf{t})\,d\mathbf{w} \simeq \int\sigma(\mathbf{w}^T\phi)\,q(\mathbf{w})\,d\mathbf{w}$$
  • $p(C_2|\phi,\mathbf{t}) = 1 - p(C_1|\phi,\mathbf{t})$

  • denote $a = \mathbf{w}^T\phi$

    $\sigma(\mathbf{w}^T\phi) = \int\delta(a - \mathbf{w}^T\phi)\,\sigma(a)\,da$

    • Dirac delta function : $\delta(x) = \begin{cases} +\infty, & x = 0 \\ 0, & x \neq 0 \end{cases}$


Therefore, $\int\sigma(\mathbf{w}^T\phi)\,q(\mathbf{w})\,d\mathbf{w} = \int\sigma(a)\,p(a)\,da$

  • $p(a) = \int\delta(a - \mathbf{w}^T\phi)\,q(\mathbf{w})\,d\mathbf{w}$
    • $\mu_a = \mathbb{E}[a] = \int p(a)\,a\,da = \int q(\mathbf{w})\,\mathbf{w}^T\phi\,d\mathbf{w} = \mathbf{w}_{\text{MAP}}^T\phi$
    • $\sigma_a^2 = \operatorname{var}[a] = \int p(a)\left\{a^2 - \mathbb{E}[a]^2\right\}da = \int q(\mathbf{w})\left\{(\mathbf{w}^T\phi)^2 - (\mathbf{m}_N^T\phi)^2\right\}d\mathbf{w} = \phi^T\mathbf{S}_N\phi$


Therefore, $p(C_1|\mathbf{t}) = \int\sigma(a)\,p(a)\,da = \int\sigma(a)\,\mathcal{N}(a|\mu_a,\sigma_a^2)\,da$

  • the integral over $a$ : a convolution of a logistic sigmoid with a Gaussian, which cannot be evaluated analytically

  • approximate logistic sigmoid σ(a) with probit function Φ(λa)

    ( where λ2=π/8)

  • [Tip] why use probit function?

    • its convolution with a Gaussian can be expressed analytically in terms of another probit function
    • $\int\Phi(\lambda a)\,\mathcal{N}(a|\mu,\sigma^2)\,da = \Phi\left(\frac{\mu}{(\lambda^{-2}+\sigma^2)^{1/2}}\right)$


Therefore, $p(C_1|\mathbf{t}) = \int\sigma(a)\,\mathcal{N}(a|\mu_a,\sigma_a^2)\,da \simeq \sigma(\kappa(\sigma^2)\mu)$

  • $\kappa(\sigma^2) = (1+\pi\sigma^2/8)^{-1/2}$


[Result] approximate predictive distribution :

$$p(C_1|\phi,\mathbf{t}) = \sigma(\kappa(\sigma_a^2)\,\mu_a)$$
  • $\mu_a = \mathbf{w}_{\text{MAP}}^T\phi$
  • $\sigma_a^2 = \phi^T\mathbf{S}_N\phi$
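A sketch of this approximate predictive distribution, with illustrative (not fitted) values standing in for $\mathbf{w}_{\text{MAP}}$ and $\mathbf{S}_N$, checked against a Monte Carlo estimate of the exact integral:

```python
import numpy as np

# Sketch of the probit-approximated predictive distribution
# p(C1|phi,t) ~= sigma(kappa(sigma_a^2) mu_a), with illustrative
# w_MAP and S_N, checked against a Monte Carlo estimate.
def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def kappa(var):
    """kappa(sigma^2) = (1 + pi sigma^2 / 8)^{-1/2}"""
    return (1 + np.pi * var / 8) ** -0.5

def predict(phi, w_map, S_N):
    mu_a = w_map @ phi                    # mu_a = w_MAP^T phi
    var_a = phi @ S_N @ phi               # sigma_a^2 = phi^T S_N phi
    return sigmoid(kappa(var_a) * mu_a)

w_map = np.array([1.0, -0.5])             # illustrative posterior mode
S_N = np.array([[0.5, 0.1],
                [0.1, 0.3]])              # illustrative posterior covariance
phi = np.array([1.0, 0.5])

approx = predict(phi, w_map, S_N)

# Monte Carlo estimate of int sigma(a) N(a | mu_a, sigma_a^2) da
rng = np.random.default_rng(5)
mu_a, var_a = w_map @ phi, phi @ S_N @ phi
mc = sigmoid(rng.normal(mu_a, np.sqrt(var_a), size=1_000_000)).mean()
print(approx, mc)   # the approximation and the MC estimate should agree closely
```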