Semi-Supervised Learning by Entropy Minimization (2004)


Contents

  0. Abstract
  1. Introduction
  2. Derivation of the Criterion
    1. Likelihood
    2. When are Unlabeled Examples Informative?
    3. Entropy Regularization


0. Abstract

Semi-supervised learning

  • motivates minimum entropy regularization, which incorporates unlabeled data into the standard supervised learning of a probabilistic classifier


1. Introduction

Supervised learning

  • dataset : \(\mathcal{L}_n=\left\{\mathbf{x}_i, y_i\right\}_{i=1}^n\)
    • where \(\mathbf{x}_i \in \mathcal{X}\) & \(y_i \in \Omega=\left\{\omega_1, \ldots, \omega_K\right\}\)

Semi-supervised learning

  • labels \(y_i \in \Omega=\left\{\omega_1, \ldots, \omega_K\right\}\) are available only for a subset of \(\mathcal{L}_n\); the remaining examples are unlabeled.


Semi-supervised learning as a missing-data problem

\(\rightarrow\) can be addressed by generative models

  • generative models : \(P(X, Y)\)

    • more demanding than discriminative models \(P(Y \mid X)\)

    • more parameters are needed, resulting in more uncertainty


Propose an estimation principle applicable to any probabilistic classifier


2. Derivation of the Criterion

(1) Likelihood

dataset : \(\mathcal{L}_n=\left\{\mathbf{x}_i, \mathbf{z}_i\right\}_{i=1}^n\)

  • \(\mathbf{z} \in\{0,1\}^K\) : dummy variable representing the actually available labels ( K classes )
    • if \(\mathbf{x}_i\) is labeled \(\omega_k\) :
      • \(\mathbf{z}_{i k}=1\)
      • \(\mathbf{z}_{i \ell}=0\) for \(\ell \neq k\)
    • if \(\mathbf{x}_i\) is unlabeled :
      • \(\mathbf{z}_{i \ell}=1\) for all \(\ell\)
    • ( see the sketch below )
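
A minimal numpy sketch of this encoding ( the helper name `encode_z` is illustrative, not from the paper ):

```python
import numpy as np

def encode_z(label, K):
    """Dummy vector z in {0,1}^K for one example.
    label: class index in {0, ..., K-1}, or None if the example is unlabeled."""
    if label is None:
        return np.ones(K)   # unlabeled: z_l = 1 for all l
    z = np.zeros(K)
    z[label] = 1.0          # labeled omega_k: z_k = 1, z_l = 0 for l != k
    return z

print(encode_z(1, K=3))     # [0. 1. 0.] -> labeled omega_2
print(encode_z(None, K=3))  # [1. 1. 1.] -> unlabeled
```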


Assumption : MAR (Missing At Random)

  • for all unlabeled examples, \(P\left(\mathbf{z} \mid \mathbf{x}, \omega_k\right)=P\left(\mathbf{z} \mid \mathbf{x}, \omega_{\ell}\right)\), for any \(\left(\omega_k, \omega_{\ell}\right)\) pair

  • it implies that \(P\left(\omega_k \mid \mathbf{x}, \mathbf{z}\right)=\frac{z_k P\left(\omega_k \mid \mathbf{x}\right)}{\sum_{\ell=1}^K z_{\ell} P\left(\omega_{\ell} \mid \mathbf{x}\right)} .\)
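
A quick numerical check of this implication, with \(K=3\) and an illustrative posterior \(P(\omega \mid \mathbf{x})=(0.5,0.3,0.2)\) :

  • labeled, \(\mathbf{z}=(0,1,0)\) : \(P\left(\omega_2 \mid \mathbf{x}, \mathbf{z}\right)=\frac{0.3}{0.3}=1\), so the observed label overrides the model
  • unlabeled, \(\mathbf{z}=(1,1,1)\) : the denominator is \(0.5+0.3+0.2=1\), so \(P\left(\omega_k \mid \mathbf{x}, \mathbf{z}\right)=P\left(\omega_k \mid \mathbf{x}\right)\) is unchanged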


( under the i.i.d. assumption ) conditional log-likelihood of \((Z \mid X)\) :

  • \(L\left(\boldsymbol{\theta} ; \mathcal{L}_n\right)=\sum_{i=1}^n\left[\log \left(\sum_{k=1}^K z_{i k} f_k\left(\mathbf{x}_i ; \boldsymbol{\theta}\right)\right)+h\left(\mathbf{z}_i\right)\right]\),
    • \(h(\mathbf{z})\) : does not depend on \(P(X, Y)\)
    • \(f_k(\mathbf{x}_i ; \boldsymbol{\theta})\) : model of \(P(\omega_k \mid \mathbf{x})\)


Goal : maximize \(L\left(\boldsymbol{\theta} ; \mathcal{L}_n\right)\)
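
A numpy sketch of this likelihood, up to the \(h\left(\mathbf{z}_i\right)\) terms ( which do not depend on \(\boldsymbol{\theta}\) ); `F` and `Z` are illustrative names for the stacked model outputs and dummy vectors:

```python
import numpy as np

def conditional_log_likelihood(F, Z):
    """F: (n, K) model posteriors f_k(x_i; theta).
    Z: (n, K) dummy label vectors z_i.
    Returns L(theta; L_n) up to the theta-independent h(z_i) terms."""
    inner = np.sum(Z * F, axis=1)  # sum_k z_ik f_k(x_i) for each example
    return np.sum(np.log(inner))
```

Note that for an unlabeled example, \(\sum_k z_{i k} f_k\left(\mathbf{x}_i\right)=\sum_k f_k\left(\mathbf{x}_i\right)=1\), so its term is \(\log 1=0\) : unlabeled data do not affect this likelihood.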


(2) When are Unlabeled Examples Informative?

Information content of unlabeled examples decreases as classes overlap!


Conditional entropy \(H(Y \mid X)\) :

  • measure of class overlap
  • invariant to the parameterization of the model
  • related to the usefulness of unlabeled data


Measure the conditional entropy of class labels, conditioned on the observed variables

  • \(H(Y \mid X, Z)=-E_{X Y Z}[\log P(Y \mid X, Z)]\).


Prior :

  • in the Bayesian framework, assumptions are encoded using a “prior”
    • assumption : we expect low conditional entropy ( classes are well separated )
  • Maximum Entropy Prior : \(P(\boldsymbol{\theta}, \boldsymbol{\psi}) \propto \exp (-\lambda H(Y \mid X, Z))\)
    • \((\boldsymbol{\theta}, \boldsymbol{\psi})\) : parameters of the model \(P(X, Y, Z)\)
    • this is the maximum entropy prior among all priors satisfying \(E_{\Theta \Psi}[H(Y \mid X, Z)]=c\), where
      • \(c\) : quantifies how small the entropy should be on average
      • \(\lambda\) : the Lagrange multiplier associated with this constraint


Empirical estimation of entropy

  • \(H_{\mathrm{emp}}\left(Y \mid X, Z ; \mathcal{L}_n\right)=-\frac{1}{n} \sum_{i=1}^n \sum_{k=1}^K P\left(\omega_k \mid \mathbf{x}_i, \mathbf{z}_i\right) \log P\left(\omega_k \mid \mathbf{x}_i, \mathbf{z}_i\right)\).
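
A numpy sketch of this plug-in estimate; `P` is an illustrative name for the \((n, K)\) array of posteriors \(P\left(\omega_k \mid \mathbf{x}_i, \mathbf{z}_i\right)\), and the `eps` guard against \(\log 0\) is an implementation detail, not part of the paper:

```python
import numpy as np

def empirical_entropy(P, eps=1e-12):
    """P: (n, K) posteriors P(omega_k | x_i, z_i).
    Returns H_emp(Y | X, Z; L_n), the average per-example entropy."""
    return -np.mean(np.sum(P * np.log(P + eps), axis=1))
```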


(3) Entropy Regularization

Model of \(P(\omega_k \mid \mathbf{x}, \mathbf{z})\) : \(g_k(\mathbf{x}, \mathbf{z} ; \boldsymbol{\theta})=\frac{z_k f_k(\mathbf{x} ; \boldsymbol{\theta})}{\sum_{\ell=1}^K z_{\ell} f_{\ell}(\mathbf{x} ; \boldsymbol{\theta})}\)

  • (for labeled data) : \(g_k(\mathbf{x}, \mathbf{z} ; \boldsymbol{\theta})=z_k\)

  • (for unlabeled data) : \(g_k(\mathbf{x}, \mathbf{z} ; \boldsymbol{\theta})=f_k(\mathbf{x} ; \boldsymbol{\theta})\)
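
A numpy sketch of this model; with `F` and `Z` as before, renormalizing \(f\) over the labels allowed by \(\mathbf{z}\) recovers both special cases:

```python
import numpy as np

def g(F, Z):
    """g_k(x_i, z_i; theta) = z_ik f_k(x_i) / sum_l z_il f_l(x_i), row by row."""
    num = Z * F
    return num / num.sum(axis=1, keepdims=True)
```

For a labeled row, \(\mathbf{z}_i\) has a single nonzero entry, so the row of `g` equals \(\mathbf{z}_i\); for an unlabeled row, \(\mathbf{z}_i\) is all ones and the row of `g` equals the row of `F`.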


MAP estimate :

the maximizer of the posterior distribution, which is

  • \(\begin{aligned} C\left(\boldsymbol{\theta}, \lambda ; \mathcal{L}_n\right) &=L\left(\boldsymbol{\theta} ; \mathcal{L}_n\right)-\lambda H_{\mathrm{emp}}\left(Y \mid X, Z ; \mathcal{L}_n\right) \\ &=\sum_{i=1}^n \log \left(\sum_{k=1}^K z_{i k} f_k\left(\mathbf{x}_i\right)\right)+\lambda \sum_{i=1}^n \sum_{k=1}^K g_k\left(\mathbf{x}_i, \mathbf{z}_i\right) \log g_k\left(\mathbf{x}_i, \mathbf{z}_i\right) \end{aligned}\)  (6)

    ( constant terms are dropped; the \(1 / n\) factor of \(H_{\mathrm{emp}}\) is absorbed into \(\lambda\) )

  • (1) \(L\left(\boldsymbol{\theta} ; \mathcal{L}_n\right)\) : only affected by labeled data

  • (2) \(H_{\mathrm{emp}}\left(Y \mid X, Z ; \mathcal{L}_n\right)\) : only affected by the values of \(f_k(\mathbf{x})\) on unlabeled data
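
Putting the pieces together, a numpy sketch of criterion (6); the `eps` guard handles the \(0 \log 0\) terms of labeled rows and is an implementation detail:

```python
import numpy as np

def criterion(F, Z, lam, eps=1e-12):
    """C(theta, lambda; L_n): log-likelihood on the labeled data
    plus lambda * sum_i sum_k g log g (an entropy penalty on unlabeled data)."""
    ZF = Z * F
    L = np.sum(np.log(ZF.sum(axis=1)))       # likelihood term, eq. (6) above
    G = ZF / ZF.sum(axis=1, keepdims=True)   # g_k(x_i, z_i)
    return L + lam * np.sum(G * np.log(G + eps))
```

Maximizing \(C\) thus fits the labeled examples while pushing \(f\) toward confident ( low-entropy ) predictions on the unlabeled ones.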

