Pseudo-Label : The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks (2013)


  0. Abstract
  1. Introduction
  2. Pseudo-Label Method for DNN
    1. Deep NN
    2. DAE (Denoising Auto-Encoder)
    3. Dropout
    4. Pseudo-Label
  3. Why could Pseudo-Label work?
    1. Low-Density Separation between Classes
    2. Entropy Regularization
    3. Training with Pseudo-Label as Entropy Regularization

0. Abstract

A simple & efficient semi-supervised learning method for DNNs

  • the network is trained in a supervised fashion, with both labeled & unlabeled data
  • pseudo-label of an unlabeled sample : the class with the maximum predicted probability
  • pseudo-labels are treated as if they were true labels

1. Introduction

Pseudo-labels :

  • favors a low-density separation between classes, a commonly assumed prior for semi-supervised learning.

\(\rightarrow\) same effect as Entropy Regularization

Entropy Regularization

  • commonly used prior for semi-supervised learning

  • conditional entropy of the class prob : measure of class overlap

  • minimizing the entropy for unlabeled data

    = reducing the overlap of the class probability distributions

2. Pseudo-Label Method for DNN

(1) Deep NN

  • skip

(2) DAE (Denoising Auto-Encoder)

[ Encoding ]

\(h_i=s\left(\sum_{j=1}^{d_v} W_{i j} \widetilde{x}_j+b_i\right)\).

  • \(\widetilde{x}_j\) : corrupted version of \(j\)th input

[ Decoding ]

\(\widehat{x}_j=s\left(\sum_{i=1}^{d_h} W_{i j} h_i+a_j\right)\).
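
The encoding and decoding steps above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the masking-noise corruption, the sigmoid choice for \(s\), the layer sizes, and the helper names (`corrupt`, `encode`, `decode`) are all assumptions made for the example. The same matrix \(W\) is used in both directions, as in the two formulas.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def corrupt(x, drop_prob, rng):
    # Assumed corruption: masking noise, each input is zeroed with probability drop_prob.
    return [0.0 if rng.random() < drop_prob else xj for xj in x]

def encode(W, b, x_tilde):
    # h_i = s(sum_j W[i][j] * x_tilde[j] + b[i])
    return [sigmoid(sum(wij * xj for wij, xj in zip(Wi, x_tilde)) + bi)
            for Wi, bi in zip(W, b)]

def decode(W, a, h):
    # x_hat_j = s(sum_i W[i][j] * h[i] + a[j]), reusing the same W (tied weights)
    return [sigmoid(sum(W[i][j] * h[i] for i in range(len(h))) + a[j])
            for j in range(len(a))]

rng = random.Random(0)
d_v, d_h = 4, 3  # visible and hidden sizes (arbitrary for the sketch)
W = [[rng.uniform(-0.5, 0.5) for _ in range(d_v)] for _ in range(d_h)]
b = [0.0] * d_h
a = [0.0] * d_v

x = [1.0, 0.0, 1.0, 1.0]
x_tilde = corrupt(x, 0.5, rng)  # corrupted input
h = encode(W, b, x_tilde)       # hidden representation
x_hat = decode(W, a, h)         # reconstruction of the clean input
```

Training would then minimize a reconstruction loss between `x_hat` and the uncorrupted `x`; that step is omitted here.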

(3) Dropout

  • skip

(4) Pseudo-Label

Target classes for unlabeled data

\(y_i^{\prime}= \begin{cases}1 & \text { if } i=\operatorname{argmax}_{i^{\prime}} f_{i^{\prime}}(x) \\ 0 & \text { otherwise }\end{cases}\).
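
As a small sketch of this rule, the function below turns a vector of predicted class probabilities into a one-hot pseudo-label. The name `pseudo_label` is hypothetical, and ties are broken by taking the first maximum, which is an assumption the formula itself does not specify.

```python
def pseudo_label(probs):
    """Return a one-hot list with 1 at the index of the largest predicted probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])  # first index on ties
    return [1 if i == best else 0 for i in range(len(probs))]

print(pseudo_label([0.1, 0.7, 0.2]))  # [0, 1, 0]
```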

use Pseudo-label in fine-tuning phase

  • retrain pre-trained network with both labeled & unlabeled data

The total numbers of labeled & unlabeled data are quite different

\(\rightarrow\) Training balance is important

Overall loss function

\(L=\frac{1}{n} \sum_{m=1}^n \sum_{i=1}^C L\left(y_i^m, f_i^m\right)+\alpha(t) \frac{1}{n^{\prime}} \sum_{m=1}^{n^{\prime}} \sum_{i=1}^C L\left(y_i^{\prime m}, f_i^{\prime m}\right)\).

  • \(n\) : number of labeled samples (per mini-batch under SGD)
  • \(n^{\prime}\) : number of unlabeled samples (per mini-batch under SGD)

  • \(\alpha(t)\) : coefficient balancing the labeled and unlabeled terms
    • too high \(\rightarrow\) disturbs training, even on labeled data
    • too small \(\rightarrow\) no benefit from the unlabeled data
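
The overall loss can be sketched as follows. The notes do not fix the per-unit loss \(L(\cdot,\cdot)\), so this example assumes categorical cross-entropy over softmax-style probabilities; the function names and the tiny inputs are hypothetical, and \(\alpha\) is passed in as a plain number.

```python
import math

def cross_entropy(y, f, eps=1e-12):
    # Assumed loss: categorical cross-entropy between one-hot target y and probabilities f.
    return -sum(yi * math.log(fi + eps) for yi, fi in zip(y, f))

def pseudo_label(f):
    best = max(range(len(f)), key=lambda i: f[i])
    return [1 if i == best else 0 for i in range(len(f))]

def total_loss(labeled, unlabeled_preds, alpha):
    """labeled: list of (one-hot y, predicted f); unlabeled_preds: list of predicted f'."""
    supervised = sum(cross_entropy(y, f) for y, f in labeled) / len(labeled)
    unsupervised = sum(cross_entropy(pseudo_label(f), f)
                       for f in unlabeled_preds) / len(unlabeled_preds)
    return supervised + alpha * unsupervised

labeled = [([1, 0], [0.8, 0.2]), ([0, 1], [0.3, 0.7])]
unlabeled_preds = [[0.6, 0.4], [0.1, 0.9], [0.5, 0.5]]
print(total_loss(labeled, unlabeled_preds, alpha=0.0))  # supervised term only
print(total_loss(labeled, unlabeled_preds, alpha=3.0))  # pseudo-label term weighted by 3
```

Dividing each term by its own batch size keeps the two terms on a comparable scale regardless of how many labeled and unlabeled samples there are, so \(\alpha\) alone controls the balance.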


\(\alpha(t)= \begin{cases}0 & t<T_1 \\ \frac{t-T_1}{T_2-T_1} \alpha_f & T_1 \leq t<T_2 \\ \alpha_f & T_2 \leq t\end{cases}\).

  • with \(\alpha_f=3, T_1=100, T_2=600\) without pre-training
  • \(T_1=200, T_2=800\) with DAE.
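
The schedule is a simple piecewise-linear ramp. A minimal sketch, with the function name `alpha` and the keyword defaults chosen for this example (defaults follow the "without pre-training" setting above):

```python
def alpha(t, alpha_f=3.0, T1=100, T2=600):
    """Pseudo-label weight at step t: 0 before T1, linear ramp on [T1, T2), alpha_f after."""
    if t < T1:
        return 0.0
    if t < T2:
        return (t - T1) / (T2 - T1) * alpha_f
    return alpha_f

print(alpha(50))                    # 0.0
print(alpha(350))                   # 1.5
print(alpha(600))                   # 3.0
print(alpha(500, T1=200, T2=800))   # 1.5  (DAE setting)
```

Starting at zero means the network first fits the labeled data alone, so early, unreliable pseudo-labels do not dominate training.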

3. Why could Pseudo-Label work?

(1) Low-Density Separation between Classes

cluster assumption

  • the decision boundary should lie in low-density regions to improve generalization performance

(2) Entropy Regularization

a means of benefiting from unlabeled data, in the framework of MAP estimation

  • minimizing the conditional entropy of class probabilities of unlabeled data

MAP estimate :

  • \[C(\theta, \lambda)=\sum_{m=1}^n \log P\left(y^m \mid x^m ; \theta\right)-\lambda H\left(y \mid x^{\prime} ; \theta\right)\]
    • where \(H\left(y \mid x^{\prime}\right)=-\frac{1}{n^{\prime}} \sum_{m=1}^{n^{\prime}} \sum_{i=1}^C P\left(y_i^m=1 \mid x^{\prime m}\right) \log P\left(y_i^m=1 \mid x^{\prime m}\right)\)
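
To make the objective concrete, here is a small sketch that evaluates \(C(\theta, \lambda)\) from already-computed class probabilities. The model itself is left out; the function names and the toy inputs are assumptions for illustration only.

```python
import math

def conditional_entropy(unlabeled_probs, eps=1e-12):
    # H(y | x') = -(1/n') * sum_m sum_i P_i^m * log P_i^m over the unlabeled samples
    n_prime = len(unlabeled_probs)
    return -sum(p * math.log(p + eps)
                for probs in unlabeled_probs for p in probs) / n_prime

def objective(labeled, unlabeled_probs, lam):
    """labeled: list of (true class index, predicted probabilities)."""
    log_likelihood = sum(math.log(probs[y]) for y, probs in labeled)
    return log_likelihood - lam * conditional_entropy(unlabeled_probs)

labeled = [(0, [0.9, 0.1]), (1, [0.2, 0.8])]
confident = [[0.99, 0.01], [0.02, 0.98]]   # low entropy on unlabeled data
uncertain = [[0.5, 0.5], [0.6, 0.4]]       # high entropy on unlabeled data
print(objective(labeled, confident, lam=1.0))
print(objective(labeled, uncertain, lam=1.0))
```

With the same labeled predictions, the confident unlabeled predictions give the larger objective, which is the sense in which maximizing \(C\) pushes the classifier toward low-entropy outputs on unlabeled data.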

(3) Training with Pseudo-label as Entropy Regularization

Training with Pseudo-Label is equivalent to Entropy Regularization

  • pseudo label : encourages the predicted class probabilities to be near 1-of-K code

[Entropy Regularization]

\(C(\theta, \lambda)=\sum_{m=1}^n \log P\left(y^m \mid x^m ; \theta\right)-\lambda H\left(y \mid x^{\prime} ; \theta\right)\).

[Pseudo Label]

\[L=\frac{1}{n} \sum_{m=1}^n \sum_{i=1}^C L\left(y_i^m, f_i^m\right)+\alpha(t) \frac{1}{n^{\prime}} \sum_{m=1}^{n^{\prime}} \sum_{i=1}^C L\left(y_i^{\prime m}, f_i^{\prime m}\right)\]


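Term-by-term correspondence (Entropy Regularization \(\leftrightarrow\) Pseudo-Label) :

  • \(\sum_{m=1}^n \log P\left(y^m \mid x^m ; \theta\right)\) \(\leftrightarrow\) \(\frac{1}{n} \sum_{m=1}^n \sum_{i=1}^C L\left(y_i^m, f_i^m\right)\) : supervised term on labeled data
  • \(-\lambda H\left(y \mid x^{\prime} ; \theta\right)\) \(\leftrightarrow\) \(\alpha(t) \frac{1}{n^{\prime}} \sum_{m=1}^{n^{\prime}} \sum_{i=1}^C L\left(y_i^{\prime m}, f_i^{\prime m}\right)\) : term on unlabeled data, with \(\lambda\) corresponding to \(\alpha(t)\)
  • \(C\) is maximized while \(L\) is minimized, so the signs of the matching terms are flipped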
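
A quick numerical check makes the link between the two unlabeled terms tangible. Assuming a categorical cross-entropy loss (an assumption, since the notes leave \(L\) unspecified), the loss against a one-hot pseudo-label reduces to \(-\log \max_i p_i\). That quantity never exceeds the entropy \(-\sum_i p_i \log p_i\), and both reach zero exactly when the prediction is a 1-of-K code. The function names below are hypothetical.

```python
import math

def entropy(p):
    # Shannon entropy of one predicted distribution (0 * log 0 treated as 0)
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def pseudo_label_loss(p):
    # Cross-entropy against the one-hot argmax target reduces to -log(max p)
    return -math.log(max(p))

examples = [
    [1 / 3, 1 / 3, 1 / 3],
    [0.6, 0.3, 0.1],
    [0.9, 0.05, 0.05],
    [1.0, 0.0, 0.0],
]
for p in examples:
    print(p, round(pseudo_label_loss(p), 4), round(entropy(p), 4))
```

Both columns shrink together as the prediction sharpens, so driving the pseudo-label loss down pushes predictions toward the same confident outputs that entropy minimization rewards.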

