( Basic material and less important content are skipped. )
4. Linear Models for Classification

discriminant function : directly assigns each input vector $\mathbf{x}$ to a specific class

$y(\mathbf{x}) = f(\mathbf{w}^T \mathbf{x} + w_0)$

- $f(\cdot)$ : activation function
- $f^{-1}(\cdot)$ : link function
4-1. Discriminant Functions

linear discriminant function ( for a total of $K$ classes ) :

$y_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k0}$
4-2. Probabilistic Generative Models

( skipped )
4-3. Probabilistic Discriminative Models

4-3-1. Fixed basis functions

fixed nonlinear transformation of the inputs, using a vector of basis functions $\boldsymbol{\phi}(\mathbf{x})$
4-3-2. Logistic Regression

$p(C_1 \mid \boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma(\mathbf{w}^T \boldsymbol{\phi})$ , where $\frac{d\sigma}{da} = \sigma(1-\sigma)$

Likelihood function : $p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}\{1-y_n\}^{1-t_n}$

Cross-entropy error : $E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}) = -\sum_{n=1}^{N}\{t_n \ln y_n + (1-t_n)\ln(1-y_n)\}$

- $y_n = \sigma(a_n)$
- $a_n = \mathbf{w}^T \boldsymbol{\phi}_n$

Gradient of cross-entropy error : $\nabla E(\mathbf{w}) = \sum_{n=1}^{N}(y_n - t_n)\boldsymbol{\phi}_n$
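As a quick check on the gradient formula, here is a minimal gradient-descent fit, assuming hypothetical linearly separable toy data ( the data and all names are illustrative, not from the text ):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_grad(w, Phi, t):
    # grad E(w) = sum_n (y_n - t_n) * phi_n  =  Phi^T (y - t)
    y = sigmoid(Phi @ w)
    return Phi.T @ (y - t)

# hypothetical toy data: two features plus a bias basis function phi_0 = 1
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(float)
Phi = np.hstack([np.ones((100, 1)), X])

w = np.zeros(3)
for _ in range(500):
    w -= 0.1 * cross_entropy_grad(w, Phi, t) / len(t)
```

After a few hundred steps the weights align with the separating direction and the training accuracy approaches 1.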
4-3-3. Iterative Reweighted Least Squares (IRLS)

Introduction

The error function is minimized by the Newton-Raphson iterative optimization scheme :

$\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - \mathbf{H}^{-1}\nabla E(\mathbf{w})$

- $\mathbf{H}$ : Hessian matrix ( whose elements comprise the second derivatives of $E(\mathbf{w})$ w.r.t. the components of $\mathbf{w}$ )

First, applied to the linear regression model with sum-of-squares error :

$\nabla E(\mathbf{w}) = \sum_{n=1}^{N}(\mathbf{w}^T\boldsymbol{\phi}_n - t_n)\boldsymbol{\phi}_n = \boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w} - \boldsymbol{\Phi}^T\mathbf{t}$

$\mathbf{H} = \nabla\nabla E(\mathbf{w}) = \sum_{n=1}^{N}\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T = \boldsymbol{\Phi}^T\boldsymbol{\Phi}$

Therefore, the updating equation becomes :

$\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\{\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w}^{(\text{old})} - \boldsymbol{\Phi}^T\mathbf{t}\} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$

Apply this to the "logistic model"

$\nabla E(\mathbf{w}) = \sum_{n=1}^{N}(y_n - t_n)\boldsymbol{\phi}_n = \boldsymbol{\Phi}^T(\mathbf{y} - \mathbf{t})$

$\mathbf{H} = \nabla\nabla E(\mathbf{w}) = \sum_{n=1}^{N} y_n(1-y_n)\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T = \boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi}$ , where $R_{nn} = y_n(1-y_n)$

Updating equation :

$\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - (\boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T(\mathbf{y}-\mathbf{t}) = (\boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi})^{-1}\{\boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi}\mathbf{w}^{(\text{old})} - \boldsymbol{\Phi}^T(\mathbf{y}-\mathbf{t})\} = (\boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{R}\mathbf{z}$

where $\mathbf{z} = \boldsymbol{\Phi}\mathbf{w}^{(\text{old})} - \mathbf{R}^{-1}(\mathbf{y}-\mathbf{t})$
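The logistic-model update can be sketched directly. A minimal IRLS loop on hypothetical non-separable toy data ( the Newton step is taken in the equivalent form $\mathbf{w} - \mathbf{H}^{-1}\boldsymbol{\Phi}^T(\mathbf{y}-\mathbf{t})$, which avoids forming $\mathbf{R}^{-1}$ explicitly ):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, n_iter=5):
    """IRLS / Newton-Raphson updates for logistic regression (sketch)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        r = y * (1 - y)                    # diagonal of R
        H = Phi.T @ (r[:, None] * Phi)     # Phi^T R Phi
        w = w - np.linalg.solve(H, Phi.T @ (y - t))
    return w

# hypothetical noisy (non-separable) labels keep the Newton step well-conditioned
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
t = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)
Phi = np.hstack([np.ones((200, 1)), X])
w = irls(Phi, t)
```

Note that with perfectly separable data the maximum-likelihood weights diverge, which is why the toy labels here are noisy.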
4-3-4. Multiclass Logistic Regression

$p(C_k \mid \boldsymbol{\phi}) = y_k(\boldsymbol{\phi}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$ , where the activations are $a_k = \mathbf{w}_k^T\boldsymbol{\phi}$

Multiclass logistic regression

- likelihood function : $p(\mathbf{T} \mid \mathbf{w}_1,\ldots,\mathbf{w}_K) = \prod_{n=1}^{N}\prod_{k=1}^{K} p(C_k \mid \boldsymbol{\phi}_n)^{t_{nk}} = \prod_{n=1}^{N}\prod_{k=1}^{K} y_{nk}^{t_{nk}}$
- negative log likelihood (NLL) : $E(\mathbf{w}_1,\ldots,\mathbf{w}_K) = -\ln p(\mathbf{T} \mid \mathbf{w}_1,\ldots,\mathbf{w}_K) = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln y_{nk}$ ( = cross-entropy error )
- derivative of NLL : $\nabla_{\mathbf{w}_j} E(\mathbf{w}_1,\ldots,\mathbf{w}_K) = \sum_{n=1}^{N}(y_{nj} - t_{nj})\boldsymbol{\phi}_n$
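The derivative formula can be verified numerically against finite differences. A sketch with hypothetical random inputs and one-hot targets ( `W` stacks the $\mathbf{w}_j$ as columns ):

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def nll(W, Phi, T):
    # E(w_1, ..., w_K) = -sum_n sum_k t_nk ln y_nk
    return -np.sum(T * np.log(softmax(Phi @ W)))

def nll_grad(W, Phi, T):
    # column j is  sum_n (y_nj - t_nj) phi_n
    return Phi.T @ (softmax(Phi @ W) - T)

# finite-difference check of the gradient on small random inputs
rng = np.random.default_rng(2)
Phi = rng.normal(size=(6, 3))
T = np.eye(4)[rng.integers(0, 4, size=6)]   # one-hot targets, K = 4
W = rng.normal(size=(3, 4))

G = nll_grad(W, Phi, T)
eps = 1e-6
G_num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        G_num[i, j] = (nll(Wp, Phi, T) - nll(Wm, Phi, T)) / (2 * eps)
```

The analytic and numerical gradients agree to within finite-difference error.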
4-3-5. Probit Regression

probit function : $\Phi(a) = \int_{-\infty}^{a} \mathcal{N}(\theta \mid 0, 1)\, d\theta$

- erf function : $\operatorname{erf}(a) = \frac{2}{\sqrt{\pi}}\int_0^a \exp(-\theta^2/2)\, d\theta$ ( note : this convention uses $\exp(-\theta^2/2)$; the standard erf uses $\exp(-\theta^2)$ )
- re-expressed using erf : $\Phi(a) = \frac{1}{2}\left\{1 + \frac{1}{\sqrt{2}}\operatorname{erf}(a)\right\}$
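For computation, the probit function is just the standard normal CDF. A small sketch using the standard-library `math.erf` ( which uses the conventional $\exp(-\theta^2)$ integrand, so the extra $1/\sqrt{2}$ moves into the argument ):

```python
import math

def probit(a):
    # Phi(a) = 1/2 * (1 + erf(a / sqrt(2)))  with the standard erf convention
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

# e.g. probit(0.0) is exactly 0.5, and probit(1.96) is about 0.975
```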
4-4. Laplace Approximation

aims to find a Gaussian approximation to a probability density
( i.e., find a Gaussian approximation $q(z)$ which is centered on a mode of $p(z)$ )
(a) 1-dim

step 1) find a mode $z_0$ of $p(z)$

- $\left.\frac{df(z)}{dz}\right|_{z=z_0} = 0$

step 2) use a Taylor series expansion

- $\ln f(z) \simeq \ln f(z_0) - \frac{1}{2}A(z - z_0)^2$ , where $A = -\left.\frac{d^2}{dz^2}\ln f(z)\right|_{z=z_0}$
- take the exponential : $f(z) \simeq f(z_0)\exp\left\{-\frac{A}{2}(z - z_0)^2\right\}$
- normalize it : $q(z) = \left(\frac{A}{2\pi}\right)^{1/2}\exp\left\{-\frac{A}{2}(z - z_0)^2\right\}$
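A minimal numerical sketch of steps 1)-2), assuming an unnormalized Gamma-like density $f(z) = z^2 e^{-z}$ chosen purely for illustration ( mode $z_0 = 2$, curvature $A = 1/2$ ):

```python
import math

def laplace_1d(dlogf, d2logf, z_init, n_iter=50):
    """Find the mode by Newton's method, then return (z0, A) so q(z) = N(z | z0, 1/A)."""
    z = z_init
    for _ in range(n_iter):
        z = z - dlogf(z) / d2logf(z)     # Newton step on d/dz ln f(z) = 0
    return z, -d2logf(z)

# ln f(z) = 2 ln z - z  for the illustrative f(z) = z^2 e^{-z}
z0, A = laplace_1d(dlogf=lambda z: 2.0 / z - 1.0,
                   d2logf=lambda z: -2.0 / z ** 2,
                   z_init=1.0)
# the Gaussian approximation is q(z) = N(z | z0, 1/A) = N(z | 2, 2)
```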
(b) M-dim

step 1) find a mode $\mathbf{z}_0$ of $p(\mathbf{z})$

step 2) use a Taylor series expansion

- $\ln f(\mathbf{z}) \simeq \ln f(\mathbf{z}_0) - \frac{1}{2}(\mathbf{z} - \mathbf{z}_0)^T\mathbf{A}(\mathbf{z} - \mathbf{z}_0)$ , where $\mathbf{A} = -\left.\nabla\nabla \ln f(\mathbf{z})\right|_{\mathbf{z}=\mathbf{z}_0}$ ( an $M \times M$ Hessian matrix )
- take the exponential : $f(\mathbf{z}) \simeq f(\mathbf{z}_0)\exp\left\{-\frac{1}{2}(\mathbf{z} - \mathbf{z}_0)^T\mathbf{A}(\mathbf{z} - \mathbf{z}_0)\right\}$
- normalize it : $q(\mathbf{z}) = \frac{|\mathbf{A}|^{1/2}}{(2\pi)^{M/2}}\exp\left\{-\frac{1}{2}(\mathbf{z} - \mathbf{z}_0)^T\mathbf{A}(\mathbf{z} - \mathbf{z}_0)\right\} = \mathcal{N}(\mathbf{z} \mid \mathbf{z}_0, \mathbf{A}^{-1})$
Summary : to apply the Laplace approximation,

- step 1) find the mode $\mathbf{z}_0$
- step 2) evaluate the Hessian matrix at the mode
4-5. Bayesian Logistic Regression

Exact Bayesian inference for logistic regression is intractable!
→ use the Laplace approximation
4-5-1. Laplace approximation

prior (Gaussian) : $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$

likelihood : $p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}\{1-y_n\}^{1-t_n}$

(log) posterior :

$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{1}{2}(\mathbf{w} - \mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w} - \mathbf{m}_0) + \sum_{n=1}^{N}\{t_n \ln y_n + (1-t_n)\ln(1-y_n)\} + \text{const}$

covariance :

- inverse of the matrix of second derivatives of the NLL, evaluated at $\mathbf{w}_{\text{MAP}}$
- $\mathbf{S}_N^{-1} = -\nabla\nabla \ln p(\mathbf{w} \mid \mathbf{t}) = \mathbf{S}_0^{-1} + \sum_{n=1}^{N} y_n(1-y_n)\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T$

Result : $q(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{w}_{\text{MAP}}, \mathbf{S}_N)$
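Putting the pieces together: a sketch that finds $\mathbf{w}_{\text{MAP}}$ by Newton steps on the log posterior and then evaluates $\mathbf{S}_N$ there ( the toy data and prior are hypothetical; all names are illustrative ):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, m0, S0_inv, n_iter=20):
    """q(w) = N(w | w_MAP, S_N) via Newton-Raphson on ln p(w | t)."""
    w = m0.copy()
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        grad = S0_inv @ (w - m0) + Phi.T @ (y - t)              # -grad of ln p(w|t)
        SN_inv = S0_inv + Phi.T @ ((y * (1 - y))[:, None] * Phi)
        w = w - np.linalg.solve(SN_inv, grad)
    y = sigmoid(Phi @ w)
    SN = np.linalg.inv(S0_inv + Phi.T @ ((y * (1 - y))[:, None] * Phi))
    return w, SN

# hypothetical toy problem with prior p(w) = N(0, I)
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
t = (X[:, 0] - X[:, 1] > 0).astype(float)
Phi = np.hstack([np.ones((100, 1)), X])
w_map, SN = laplace_posterior(Phi, t, m0=np.zeros(3), S0_inv=np.eye(3))
```

The Gaussian prior makes the log posterior strictly concave, so the Newton iteration converges even when the data are separable.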
4-5-2. Predictive Distribution

Predictive distribution for class $C_1$ :

$p(C_1 \mid \boldsymbol{\phi}, \mathbf{t}) = \int p(C_1 \mid \boldsymbol{\phi}, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{t})\, d\mathbf{w} \simeq \int \sigma(\mathbf{w}^T\boldsymbol{\phi})\, q(\mathbf{w})\, d\mathbf{w}$

- $p(C_2 \mid \boldsymbol{\phi}, \mathbf{t}) = 1 - p(C_1 \mid \boldsymbol{\phi}, \mathbf{t})$

Denote $a = \mathbf{w}^T\boldsymbol{\phi}$ , so that

$\sigma(\mathbf{w}^T\boldsymbol{\phi}) = \int \delta(a - \mathbf{w}^T\boldsymbol{\phi})\,\sigma(a)\, da$

- Dirac delta function : $\delta(x) = \begin{cases} +\infty, & x = 0 \\ 0, & x \neq 0 \end{cases}$

Therefore, $\int \sigma(\mathbf{w}^T\boldsymbol{\phi})\, q(\mathbf{w})\, d\mathbf{w} = \int \sigma(a)\, p(a)\, da$

- $p(a) = \int \delta(a - \mathbf{w}^T\boldsymbol{\phi})\, q(\mathbf{w})\, d\mathbf{w}$
- $\mu_a = \mathbb{E}[a] = \int p(a)\, a\, da = \int q(\mathbf{w})\, \mathbf{w}^T\boldsymbol{\phi}\, d\mathbf{w} = \mathbf{w}_{\text{MAP}}^T\boldsymbol{\phi}$
- $\sigma_a^2 = \operatorname{var}[a] = \int p(a)\{a^2 - \mathbb{E}[a]^2\}\, da = \int q(\mathbf{w})\{(\mathbf{w}^T\boldsymbol{\phi})^2 - (\mathbf{w}_{\text{MAP}}^T\boldsymbol{\phi})^2\}\, d\mathbf{w} = \boldsymbol{\phi}^T\mathbf{S}_N\boldsymbol{\phi}$

Therefore, $p(C_1 \mid \mathbf{t}) = \int \sigma(a)\, p(a)\, da = \int \sigma(a)\,\mathcal{N}(a \mid \mu_a, \sigma_a^2)\, da$
- integral over $a$ : a logistic sigmoid times a Gaussian, which cannot be evaluated analytically
- so, approximate the logistic sigmoid $\sigma(a)$ with the probit function $\Phi(\lambda a)$ ( where $\lambda^2 = \pi/8$ )

[Tip] Why use the probit function?

- its convolution with a Gaussian can be expressed analytically in terms of another probit function :
- $\int \Phi(\lambda a)\,\mathcal{N}(a \mid \mu, \sigma^2)\, da = \Phi\left(\frac{\mu}{(\lambda^{-2} + \sigma^2)^{1/2}}\right)$

Therefore, $p(C_1 \mid \mathbf{t}) = \int \sigma(a)\,\mathcal{N}(a \mid \mu_a, \sigma_a^2)\, da \simeq \sigma(\kappa(\sigma^2)\mu)$

- $\kappa(\sigma^2) = (1 + \pi\sigma^2/8)^{-1/2}$
[Result] approximate predictive distribution :

$p(C_1 \mid \boldsymbol{\phi}, \mathbf{t}) = \sigma(\kappa(\sigma_a^2)\mu_a)$

- $\mu_a = \mathbf{w}_{\text{MAP}}^T\boldsymbol{\phi}$
- $\sigma_a^2 = \boldsymbol{\phi}^T\mathbf{S}_N\boldsymbol{\phi}$
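The final result reduces to a few lines of code ( a sketch with illustrative numbers: as $\sigma_a^2 \to 0$ it recovers the plain MAP prediction $\sigma(\mathbf{w}_{\text{MAP}}^T\boldsymbol{\phi})$, and larger posterior variance pulls the probability toward 0.5 ):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_prob(phi, w_map, SN):
    """Approximate p(C1 | phi, t) = sigma(kappa(sigma_a^2) * mu_a)."""
    mu_a = w_map @ phi                         # mu_a    = w_MAP^T phi
    var_a = phi @ SN @ phi                     # sigma_a^2 = phi^T S_N phi
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var_a / 8.0)
    return sigmoid(kappa * mu_a)

# zero posterior covariance recovers sigma(mu_a); nonzero covariance moderates it
phi = np.array([1.0, 1.0])
w_map = np.array([1.0, 0.5])
p_certain = predictive_prob(phi, w_map, np.zeros((2, 2)))
p_uncertain = predictive_prob(phi, w_map, np.eye(2))
```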