Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks (2013)
Contents
- Abstract
- Introduction
- Pseudo-Label Method for DNN
- Deep NN
- DAE (Denoising Auto-Encoder)
- Dropout
- Pseudo-Label
- Why could Pseudo-Label work?
- Low-Density Separation between Classes
- Entropy Regularization
- Training with Pseudo-label as Entropy Regularization
0. Abstract
simple & efficient method of semi-supervised learning for DNNs
- trained in a supervised fashion, with both labeled & unlabeled data
Pseudo-Labels
- labels for unlabeled data, chosen as the class with the maximum predicted probability
- treated as if they were true labels
1. Introduction
Pseudo-labels :
- favor a low-density separation between classes, a commonly assumed prior for semi-supervised learning
- \(\rightarrow\) same effect as Entropy Regularization
Entropy Regularization
- commonly used prior for semi-supervised learning
- conditional entropy of the class probabilities : a measure of class overlap
- minimizing the entropy for unlabeled data = reducing the overlap of class probability density functions
2. Pseudo-Label Method for DNN
(1) Deep NN
- skip
(2) DAE (Denoising Auto-Encoder)
[ Encoding ]
\(h_i=s\left(\sum_{j=1}^{d_v} W_{i j} \widetilde{x}_j+b_i\right)\).
- \(\widetilde{x}_j\) : corrupted version of the \(j\)th input
[ Decoding ]
\(\widehat{x}_j=s\left(\sum_{i=1}^{d_h} W_{i j} h_i+a_j\right)\).
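A minimal NumPy sketch of one DAE pass under these two equations, assuming a sigmoid for \(s(\cdot)\), additive Gaussian corruption, and tied weights (the shared \(W_{ij}\) in both formulas); the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dae_forward(x, W, b, a, rng, noise_std=0.1):
    """One denoising auto-encoder pass with tied weights W of shape (d_h, d_v)."""
    x_tilde = x + noise_std * rng.standard_normal(x.shape)  # corrupted input x~
    h = sigmoid(W @ x_tilde + b)      # encoding: h_i = s(sum_j W_ij x~_j + b_i)
    x_hat = sigmoid(W.T @ h + a)      # decoding: x^_j = s(sum_i W_ij h_i + a_j)
    return x_hat, h

# toy usage: reconstruct an 8-dim input through a 4-unit hidden layer
rng = np.random.default_rng(0)
d_v, d_h = 8, 4
W = 0.1 * rng.standard_normal((d_h, d_v))
x_hat, h = dae_forward(rng.random(d_v), W, np.zeros(d_h), np.zeros(d_v), rng)
```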
(3) Dropout
- skip
(4) Pseudo-Label
Target classes for unlabeled data
\(y_i^{\prime}= \begin{cases}1 & \text { if } i=\operatorname{argmax}_{i^{\prime}} f_{i^{\prime}}(x) \\ 0 & \text { otherwise }\end{cases}\).
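As a sketch, the same rule in NumPy (`f_x` stands for the network's predicted class probabilities on one unlabeled sample; the name is illustrative):

```python
import numpy as np

def pseudo_label(f_x):
    """Hard 1-of-K pseudo-label: pick the class with maximum predicted probability."""
    y_prime = np.zeros_like(f_x)
    y_prime[np.argmax(f_x)] = 1.0
    return y_prime

# e.g. f(x) = [0.2, 0.7, 0.1]  ->  pseudo-label [0., 1., 0.]
print(pseudo_label(np.array([0.2, 0.7, 0.1])))
```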
use Pseudo-Labels in the fine-tuning phase
- retrain the pre-trained network with both labeled & unlabeled data
The total numbers of labeled & unlabeled data are different
\(\rightarrow\) the training balance between them is important
Overall loss function (see the code sketch after the bullets below)
\(L=\frac{1}{n} \sum_{m=1}^n \sum_{i=1}^C L\left(y_i^m, f_i^m\right)+\alpha(t) \frac{1}{n^{\prime}} \sum_{m=1}^{n^{\prime}} \sum_{i=1}^C L\left(y_i^{\prime m}, f_i^{\prime m}\right)\).
- \(n\) : number of labeled data
- \(n^{\prime}\) : number of unlabeled data
- \(\alpha(t)\) : coefficient balancing the two terms
- too high \(\rightarrow\) disturbs training on the labeled data
- too small \(\rightarrow\) no benefit from pseudo-labeling
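A sketch of this objective in NumPy, assuming cross-entropy as the per-example loss \(L\) and the hard pseudo-labels defined above; names such as `pseudo_label_objective` are illustrative.

```python
import numpy as np

def cross_entropy(y, f, eps=1e-12):
    """Per-example loss L(y, f) between a 1-of-K target y and predicted probabilities f."""
    return -np.sum(y * np.log(f + eps))

def pseudo_label_objective(y_lab, f_lab, y_unl, f_unl, alpha_t):
    """Labeled term + alpha(t) * pseudo-labeled term, each averaged over its own set."""
    labeled = np.mean([cross_entropy(y, f) for y, f in zip(y_lab, f_lab)])    # (1/n)  sum_m sum_i L
    unlabeled = np.mean([cross_entropy(y, f) for y, f in zip(y_unl, f_unl)])  # (1/n') sum_m sum_i L
    return labeled + alpha_t * unlabeled
```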
Settings
\(\alpha(t)= \begin{cases}0 & t<T_1 \\ \frac{t-T_1}{T_2-T_1} \alpha_f & T_1 \leq t<T_2 \\ \alpha_f & T_2 \leq t\end{cases}\).
- with \(\alpha_f=3, T_1=100, T_2=600\) without pre-training
- \(T_1=200, T_2=800\) with DAE.
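The schedule translates directly into code (a sketch; the defaults follow the values quoted above for the no-pre-training case):

```python
def alpha_schedule(t, alpha_f=3.0, T1=100, T2=600):
    """Deterministic annealing of the pseudo-label weight alpha(t) over epoch t."""
    if t < T1:
        return 0.0                             # ignore unlabeled data at first
    if t < T2:
        return (t - T1) / (T2 - T1) * alpha_f  # linear ramp-up
    return alpha_f                             # constant afterwards
```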
3. Why could Pseudo-Label work?
(1) Low-Density Separation between Classes
cluster assumption
- the decision boundary should lie in low-density regions to improve generalization performance
(2) Entropy Regularization
a means to benefit from unlabeled data in the framework of MAP estimation
- minimizing the conditional entropy of class probabilities of unlabeled data
MAP estimate :
\[C(\theta, \lambda)=\sum_{m=1}^n \log P\left(y^m \mid x^m ; \theta\right)-\lambda H\left(y \mid x^{\prime} ; \theta\right)\]
- where \(H\left(y \mid x^{\prime} ; \theta\right)=-\frac{1}{n^{\prime}} \sum_{m=1}^{n^{\prime}} \sum_{i=1}^C P\left(y_i^m=1 \mid x^{\prime m}\right) \log P\left(y_i^m=1 \mid x^{\prime m}\right)\)
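A sketch of this conditional-entropy term over the unlabeled predictions (NumPy; `probs` is an \(n^{\prime} \times C\) array of predicted class probabilities, an illustrative name):

```python
import numpy as np

def conditional_entropy(probs, eps=1e-12):
    """H(y|x'): average entropy of the predicted class distributions on unlabeled data."""
    return -np.mean(np.sum(probs * np.log(probs + eps), axis=1))

# low entropy  -> confident (peaked) predictions, little class overlap
# high entropy -> uncertain predictions, decision boundary in a dense region
```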
(3) Training with Pseudo-label as Entropy Regularization
Pseudo Label = Entropy Regularization
- pseudo label : encourages the predicted class probabilities to be near a 1-of-K code
[Entropy Regularization]
\(C(\theta, \lambda)=\sum_{m=1}^n \log P\left(y^m \mid x^m ; \theta\right)-\lambda H\left(y \mid x^{\prime} ; \theta\right)\).
[Pseudo Label]
\[L=\frac{1}{n} \sum_{m=1}^n \sum_{i=1}^C L\left(y_i^m, f_i^m\right)+\alpha(t) \frac{1}{n^{\prime}} \sum_{m=1}^{n^{\prime}} \sum_{i=1}^C L\left(y_i^{\prime m}, f_i^{\prime m}\right)\]
Equivalence
- \(\sum_{m=1}^n \log P\left(y^m \mid x^m ; \theta\right)\) \(\leftrightarrow\) \(\frac{1}{n} \sum_{m=1}^n \sum_{i=1}^C L\left(y_i^m, f_i^m\right)\)
- \(-\lambda H\left(y \mid x^{\prime} ; \theta\right)\) \(\leftrightarrow\) \(\alpha(t) \frac{1}{n^{\prime}} \sum_{m=1}^{n^{\prime}} \sum_{i=1}^C L\left(y_i^{\prime m}, f_i^{\prime m}\right)\)