Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks (2013)
Contents
- Abstract
- Introduction
- Pseudo-Label Method for DNN
- Deep NN
- DAE (Denoising Auto-Encoder)
- Dropout
- Pseudo-Label
- Why could Pseudo-Label work?
- Low-Density Separation between Classes
- Entropy Regularization
- Training with Pseudo-label as Entropy Regularization
0. Abstract
simple & efficient method of semi-supervised learning for DNNs
- trained in a supervised fashion, with both labeled & unlabeled data
Pseudo-Labels
- labels for unlabeled data, chosen as the class with the maximum predicted probability
- treated as if they were true labels
1. Introduction
Pseudo-labels :
- favor a low-density separation between classes, a commonly assumed prior for semi-supervised learning
- \(\rightarrow\) same effect as Entropy Regularization
Entropy Regularization
- commonly used prior for semi-supervised learning
- conditional entropy of the class probabilities : a measure of class overlap
- minimizing the entropy for unlabeled data = reducing the overlap of class probability density functions
2. Pseudo-Label Method for DNN
(1) Deep NN
- skip
(2) DAE (Denoising Auto-Encoder)
[ Encoding ]
\(h_i=s\left(\sum_{j=1}^{d_v} W_{i j} \widetilde{x}_j+b_i\right)\).
- \(\widetilde{x}_j\) : corrupted version of the \(j\)th input
[ Decoding ]
\(\widehat{x}_j=s\left(\sum_{i=1}^{d_h} W_{i j} h_i+a_j\right)\).
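A minimal NumPy sketch of one DAE pass under these two equations, assuming a sigmoid for \(s(\cdot)\), additive Gaussian corruption, and tied weights (the shared \(W_{ij}\) in both formulas); the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dae_forward(x, W, b, a, rng, noise_std=0.1):
    """One denoising auto-encoder pass with tied weights W of shape (d_h, d_v)."""
    x_tilde = x + noise_std * rng.standard_normal(x.shape)  # corrupted input x~
    h = sigmoid(W @ x_tilde + b)      # encoding: h_i = s(sum_j W_ij x~_j + b_i)
    x_hat = sigmoid(W.T @ h + a)      # decoding: x^_j = s(sum_i W_ij h_i + a_j)
    return x_hat, h

# toy usage: reconstruct an 8-dim input through a 4-unit hidden layer
rng = np.random.default_rng(0)
d_v, d_h = 8, 4
W = 0.1 * rng.standard_normal((d_h, d_v))
x_hat, h = dae_forward(rng.random(d_v), W, np.zeros(d_h), np.zeros(d_v), rng)
```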
(3) Dropout
- skip
(4) Pseudo-Label
Target classes for unlabeled data
\(y_i^{\prime}= \begin{cases}1 & \text { if } i=\operatorname{argmax}_{i^{\prime}} f_{i^{\prime}}(x) \\ 0 & \text { otherwise }\end{cases}\).
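As a sketch, the same rule in NumPy (`f_x` stands for the network's predicted class probabilities on one unlabeled sample; the name is illustrative):

```python
import numpy as np

def pseudo_label(f_x):
    """Hard 1-of-K pseudo-label: pick the class with maximum predicted probability."""
    y_prime = np.zeros_like(f_x)
    y_prime[np.argmax(f_x)] = 1.0
    return y_prime

# e.g. f(x) = [0.2, 0.7, 0.1]  ->  pseudo-label [0., 1., 0.]
print(pseudo_label(np.array([0.2, 0.7, 0.1])))
```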
use Pseudo-Labels in the fine-tuning phase
- retrain the pre-trained network with both labeled & unlabeled data
The total numbers of labeled & unlabeled data are different
\(\rightarrow\) the training balance between them is important
Overall loss function (see the code sketch after the bullets below)
\(L=\frac{1}{n} \sum_{m=1}^n \sum_{i=1}^C L\left(y_i^m, f_i^m\right)+\alpha(t) \frac{1}{n^{\prime}} \sum_{m=1}^{n^{\prime}} \sum_{i=1}^C L\left(y_i^{\prime m}, f_i^{\prime m}\right)\).
- \(n\) : number of labeled data
- \(n^{\prime}\) : number of unlabeled data
- \(\alpha(t)\) : coefficient balancing the two terms
- too high \(\rightarrow\) disturbs training on the labeled data
- too small \(\rightarrow\) no benefit from pseudo-labeling
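A sketch of this objective in NumPy, assuming cross-entropy as the per-example loss \(L\) and the hard pseudo-labels defined above; names such as `pseudo_label_objective` are illustrative.

```python
import numpy as np

def cross_entropy(y, f, eps=1e-12):
    """Per-example loss L(y, f) between a 1-of-K target y and predicted probabilities f."""
    return -np.sum(y * np.log(f + eps))

def pseudo_label_objective(y_lab, f_lab, y_unl, f_unl, alpha_t):
    """Labeled term + alpha(t) * pseudo-labeled term, each averaged over its own set."""
    labeled = np.mean([cross_entropy(y, f) for y, f in zip(y_lab, f_lab)])    # (1/n)  sum_m sum_i L
    unlabeled = np.mean([cross_entropy(y, f) for y, f in zip(y_unl, f_unl)])  # (1/n') sum_m sum_i L
    return labeled + alpha_t * unlabeled
```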
Settings
\(\alpha(t)= \begin{cases}0 & t<T_1 \\ \frac{t-T_1}{T_2-T_1} \alpha_f & T_1 \leq t<T_2 \\ \alpha_f & T_2 \leq t\end{cases}\).
- with \(\alpha_f=3, T_1=100, T_2=600\) without pre-training
- \(T_1=200, T_2=800\) with DAE.
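The schedule translates directly into code (a sketch; the defaults follow the values quoted above for the no-pre-training case):

```python
def alpha_schedule(t, alpha_f=3.0, T1=100, T2=600):
    """Deterministic annealing of the pseudo-label weight alpha(t) over epoch t."""
    if t < T1:
        return 0.0                             # ignore unlabeled data at first
    if t < T2:
        return (t - T1) / (T2 - T1) * alpha_f  # linear ramp-up
    return alpha_f                             # constant afterwards
```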
3. Why could Pseudo-Label work?
(1) Low-Density Separation between Classes
cluster assumption
- the decision boundary should lie in low-density regions to improve generalization performance
(2) Entropy Regularization
a means to benefit from unlabeled data in the framework of MAP estimation
- minimizing the conditional entropy of class probabilities of unlabeled data
MAP estimate :
\[C(\theta, \lambda)=\sum_{m=1}^n \log P\left(y^m \mid x^m ; \theta\right)-\lambda H\left(y \mid x^{\prime} ; \theta\right)\]
- where \(H\left(y \mid x^{\prime} ; \theta\right)=-\frac{1}{n^{\prime}} \sum_{m=1}^{n^{\prime}} \sum_{i=1}^C P\left(y_i^m=1 \mid x^{\prime m}\right) \log P\left(y_i^m=1 \mid x^{\prime m}\right)\)
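A sketch of this conditional-entropy term over the unlabeled predictions (NumPy; `probs` is an \(n^{\prime} \times C\) array of predicted class probabilities, an illustrative name):

```python
import numpy as np

def conditional_entropy(probs, eps=1e-12):
    """H(y|x'): average entropy of the predicted class distributions on unlabeled data."""
    return -np.mean(np.sum(probs * np.log(probs + eps), axis=1))

# low entropy  -> confident (peaked) predictions, little class overlap
# high entropy -> uncertain predictions, decision boundary in a dense region
```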
(3) Training with Pseudo-label as Entropy Regularization
Pseudo Label = Entropy Regularization
- pseudo label : encourages the predicted class probabilities to be near a 1-of-K code
[Entropy Regularization]
\(C(\theta, \lambda)=\sum_{m=1}^n \log P\left(y^m \mid x^m ; \theta\right)-\lambda H\left(y \mid x^{\prime} ; \theta\right)\).
[Pseudo Label]
\[L=\frac{1}{n} \sum_{m=1}^n \sum_{i=1}^C L\left(y_i^m, f_i^m\right)+\alpha(t) \frac{1}{n^{\prime}} \sum_{m=1}^{n^{\prime}} \sum_{i=1}^C L\left(y_i^{\prime m}, f_i^{\prime m}\right)\]
Equivalence
- \(\sum_{m=1}^n \log P\left(y^m \mid x^m ; \theta\right)\) \(\leftrightarrow\) \(\frac{1}{n} \sum_{m=1}^n \sum_{i=1}^C L\left(y_i^m, f_i^m\right)\)
- \(-\lambda H\left(y \mid x^{\prime} ; \theta\right)\) \(\leftrightarrow\) \(\alpha(t) \frac{1}{n^{\prime}} \sum_{m=1}^{n^{\prime}} \sum_{i=1}^C L\left(y_i^{\prime m}, f_i^{\prime m}\right)\)