Semi-Supervised Learning by Entropy Minimization (2004)
Contents
- Abstract
- Introduction
- Derivation of the Criterion
- Likelihood
- When are Unlabeled Examples Informative?
- Entropy Regularization
0. Abstract
Semi-supervised Learning
- motivate minimum entropy regularization
1. Introduction
Supervised learning
- dataset : \(\mathcal{L}_n=\left\{\mathbf{x}_i, y_i\right\}_{i=1}^n\)
- where \(\mathbf{x}_i \in \mathcal{X}\) & \(y_i \in \Omega=\left\{\omega_1, \ldots, \omega_K\right\}\)
Semi-supervised learning
- labels \(y_i \in \Omega=\left\{\omega_1, \ldots, \omega_K\right\}\) are available only for a subset of \(\mathcal{L}_n\).
Semi-supervised learning as a missing-data problem
\(\rightarrow\) can be addressed by generative models
- generative models : \(P(X, Y)\)
- more demanding than discriminative models \(P(Y \mid X)\)
- more parameters are needed, resulting in more uncertainty
- Propose an estimation principle applicable to any probabilistic classifier
2. Derivation of the Criterion
(1) Likelihood
dataset : \(\mathcal{L}_n=\left\{\mathbf{x}_i, \mathbf{z}_i\right\}_{i=1}^n\)
- \(\mathbf{z} \in\{0,1\}^K\) : dummy variable representing the actually available labels ( K classes )
- if \(\mathbf{x}_i\) is labeled \(\omega_k\) :
- \(\mathbf{z}_{i k}=1\)
- \(\mathbf{z}_{i \ell}=0\) for \(\ell \neq k\)
- if \(\mathbf{x}_i\) is unlabeled :
- \(\mathbf{z}_{i \ell}=1\) for all \(\ell\)
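A minimal sketch of this encoding (the helper `make_dummy_labels` and the `None`-for-unlabeled convention are illustration-only assumptions, not from the paper):

```python
import numpy as np

def make_dummy_labels(y, n_classes):
    """Build the dummy matrix z in {0,1}^K: a one-hot row for a labeled
    example, an all-ones row for an unlabeled one (label given as None)."""
    z = np.ones((len(y), n_classes))
    for i, label in enumerate(y):
        if label is not None:          # labeled with class index `label`
            z[i] = 0.0
            z[i, label] = 1.0
    return z

# 3 classes; the second example is unlabeled
z = make_dummy_labels([0, None, 2], n_classes=3)
# -> [[1, 0, 0], [1, 1, 1], [0, 0, 1]]
```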
Assumption : MAR (Missing At Random)
- for all unlabeled examples, \(P\left(\mathbf{z} \mid \mathbf{x}, \omega_k\right)=P\left(\mathbf{z} \mid \mathbf{x}, \omega_{\ell}\right)\), for any \(\left(\omega_k, \omega_{\ell}\right)\) pair
- by Bayes' rule, it implies that \(P\left(\omega_k \mid \mathbf{x}, \mathbf{z}\right)=\frac{z_k P\left(\omega_k \mid \mathbf{x}\right)}{\sum_{\ell=1}^K z_{\ell} P\left(\omega_{\ell} \mid \mathbf{x}\right)}\)
( under iid ) conditional log-likelihood of \((Z \mid X)\) :
- \(L\left(\boldsymbol{\theta} ; \mathcal{L}_n\right)=\sum_{i=1}^n \log \left(\sum_{k=1}^K z_{i k} f_k\left(\mathbf{x}_i ; \boldsymbol{\theta}\right)\right)+h\left(\mathbf{z}_i\right)\),
- \(h(\mathbf{z})\) : does not depend on \(P(X, Y)\)
- \(f_k(\mathbf{x} ; \boldsymbol{\theta})\) : model of \(P(\omega_k \mid \mathbf{x})\)
Goal : maximize \(L\left(\boldsymbol{\theta} ; \mathcal{L}_n\right)\)
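A small sketch of this likelihood, assuming `f` holds the model outputs \(f_k(\mathbf{x}_i ; \boldsymbol{\theta})\) as an \(n \times K\) array and `z` is the dummy matrix built above; the \(h(\mathbf{z}_i)\) term is dropped since it does not depend on the model:

```python
def conditional_log_likelihood(f, z):
    """L(theta; L_n) = sum_i log( sum_k z_ik * f_k(x_i; theta) ),
    up to the h(z_i) term, which does not depend on theta."""
    return float(np.sum(np.log(np.sum(z * f, axis=1))))
```

For an unlabeled example the row of `z` is all ones, so \(\sum_k z_{ik} f_k(\mathbf{x}_i)=1\) and its contribution is \(\log 1 = 0\): the conditional likelihood alone is unaffected by unlabeled data.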
(2) When are Unlabeled Examples Informative?
Information content of unlabeled examples decreases as classes overlap!
Conditional entropy \(H(Y \mid X)\) :
- measure of class overlap
- Invariant to the parameterization of the model
- related to the usefulness of unlabeled data
Measure the conditional entropy of class labels, conditioned on the observed variables
- \(H(Y \mid X, Z)=-E_{X Y Z}[\log P(Y \mid X, Z)]\).
Prior :
- in Bayesian framework, assumptions are encoded using “prior”
- assumption : expect high conditional entropy
- Maximum Entropy Prior : \(P(\boldsymbol{\theta}, \boldsymbol{\psi}) \propto \exp (-\lambda H(Y \mid X, Z))\)
- \((\boldsymbol{\theta}, \boldsymbol{\psi})\) : parameters of model \(P(X,Y,Z)\)
- \(E_{\Theta \Psi}[H(Y \mid X, Z)]=c\), where
- \(c\) : quantifies how small the entropy should be on average
Empirical estimation of entropy
- \(H_{\mathrm{emp}}\left(Y \mid X, Z ; \mathcal{L}_n\right)=-\frac{1}{n} \sum_{i=1}^n \sum_{k=1}^K P\left(\omega_k \mid \mathbf{x}_i, \mathbf{z}_i\right) \log P\left(\omega_k \mid \mathbf{x}_i, \mathbf{z}_i\right)\).
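A sketch of this plug-in estimate, where `g` is an \(n \times K\) array of the modeled posteriors \(P(\omega_k \mid \mathbf{x}_i, \mathbf{z}_i)\) (the small constant `eps` is an assumption, only to avoid \(\log 0\)):

```python
def empirical_conditional_entropy(g, eps=1e-12):
    """H_emp(Y | X, Z; L_n) = -(1/n) * sum_i sum_k g_ik * log(g_ik)."""
    return float(-np.mean(np.sum(g * np.log(g + eps), axis=1)))
```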
(3) Entropy Regularization
Model of \(P(\omega_k \mid \mathbf{x}, \mathbf{z})\) : \(g_k(\mathbf{x}, \mathbf{z} ; \boldsymbol{\theta})=\frac{z_k f_k(\mathbf{x} ; \boldsymbol{\theta})}{\sum_{\ell=1}^K z_{\ell} f_{\ell}(\mathbf{x} ; \boldsymbol{\theta})}\)
- (for labeled data) : \(g_k(\mathbf{x}, \mathbf{z} ; \boldsymbol{\theta})=z_k\)
- (for unlabeled data) : \(g_k(\mathbf{x}, \mathbf{z} ; \boldsymbol{\theta})=f_k(\mathbf{x} ; \boldsymbol{\theta})\)
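A sketch of this model, following the renormalization above (all-ones rows of `z` leave `f` unchanged; one-hot rows pick out the observed label):

```python
def g(f, z):
    """g_k(x, z; theta) = z_k * f_k(x; theta) / sum_l z_l * f_l(x; theta)."""
    num = z * f
    return num / np.sum(num, axis=1, keepdims=True)
```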
MAP estimate :
maximizer of the posterior distribution, which is
- \(\begin{aligned} C\left(\boldsymbol{\theta}, \lambda ; \mathcal{L}_n\right) &=L\left(\boldsymbol{\theta} ; \mathcal{L}_n\right)-\lambda H_{\mathrm{emp}}\left(Y \mid X, Z ; \mathcal{L}_n\right) \\ &=\sum_{i=1}^n \log \left(\sum_{k=1}^K z_{i k} f_k\left(\mathbf{x}_i\right)\right)+\lambda \sum_{i=1}^n \sum_{k=1}^K g_k\left(\mathbf{x}_i, \mathbf{z}_i\right) \log g_k\left(\mathbf{x}_i, \mathbf{z}_i\right) \end{aligned}\) (Eq. 6)
- ( constant terms are dropped )
- (1) \(L\left(\boldsymbol{\theta} ; \mathcal{L}_n\right)\) : only affected by labeled data
- (2) \(H_{\mathrm{emp}}\left(Y \mid X, Z ; \mathcal{L}_n\right)\) : only affected by the values of \(f_k(\mathbf{x})\) on unlabeled data
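Putting the pieces together, a sketch of the criterion, assuming the helper functions above (it follows the first line of Eq. (6); the \(1/n\) factor inside \(H_{\mathrm{emp}}\) can be absorbed into \(\lambda\)):

```python
def entropy_regularized_criterion(f, z, lam):
    """C(theta, lambda; L_n) = L(theta; L_n) - lambda * H_emp(Y | X, Z; L_n).
    Maximizing C fits the labeled data while pushing the classifier toward
    low-entropy (confident) predictions on the unlabeled data."""
    return (conditional_log_likelihood(f, z)
            - lam * empirical_conditional_entropy(g(f, z)))
```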