An Overview of Deep Semi-Supervised Learning (2020) - Part 4
Contents
- Abstract
- Introduction
- SSL
- SSL Methods
- Main Assumptions in SSL
- Related Problems
- Consistency Regularization
- Ladder Networks
- Pi-Model
- Temporal Ensembling
- Mean Teachers
- Dual Students
- Fast-SWA
- Virtual Adversarial Training (VAT)
- Adversarial Dropout (AdD)
- Interpolation Consistency Training (ICT)
- Unsupervised Data Augmentation
- Entropy Minimization
- Proxy-label Methods
- Self-training
- Multi-view Training
- Holistic Methods
- MixMatch
- ReMixMatch
- FixMatch
- Generative Models
- VAE for SSL
- GAN for SSL
- Graph-Based SSL
- Graph Construction
- Label Propagation
- Self-Supervision for SSL
7. Graph-Based SSL
Setting : each data point \(x_i\) : labeled or unlabeled
Notation : \(G(V, E)\)
- node : \(V=\left\{x_1, \ldots, x_n\right\}\)
- edge : \(E=\left\{e_{i j}\right\}_{i, j=1}^n\)
- adjacency matrix : \(A\)
- each element as a non-negative weight
- if not connected … \(A_{i j}=0\)
Graph-based tasks
- (1) node classification
- (2) link prediction
- (3) clustering
- (4) visualization
Graph methods
- (1) transductive
- only capable of producing labels assignments of the examples seen during training
- (2) inductive
- more generalizable, and can be transferred and applied to unseen examples
will discuss node classification approaches
can be grouped into…
-
(1) methods which propagate the labels from labeled to unlabeled
- assumption : nearby nodes tend to have the same labels
-
(2) methods which learn node embeddings
- assumption : nearby nodes should have similar embeddings in vector space
& apply classifiers on the learned embeddings
(1) Graph Construction
we first need a graph ! graph can be represented as…
- (1) adjacency matrix \(A\)
- (2) constructed to reflect the similarity of the nodes
In case we have limited domain knowledge…
\(\rightarrow\) create graphs!
-
(1) Fully connected graphs
- fully connected with weighted edges between all pairs of data
- high computational costs
-
(2) Sparse graphs
-
each node is only connected to a few similar nodes
( connections to dissimilar nodes are removed )
- Ex) kNN graphs
- \(i\) & \(j\) is connected, if both are \(k\)-nearest neighbors
- how to obtain weight ?
- ex) Gaussian kernel : \(W_{i j}=\exp \left\{- \mid \mid x_i-x_j \mid \mid ^2 / 2 \sigma^2\right\}\)
- Ex) \(\epsilon\)NN graphs
- \(i\) & \(j\) are connected if \(d(i, j) \leq \epsilon\)
-
(2) Label Propagation
propagates labels of the labeled data \(\rightarrow\) unlabeled data
Assumption :
- data in same manifold \(\rightarrow\) likely to share label
Notation
- \(\hat{Y}\) : \(n \times C\) matrix ( of new classificaiton scores )
- each row : \(\hat{Y}_i\) ( distn over \(C\) classes )
- \(Y\) : \(n \times C\) matrix ( labels for the labeled data )
- each row : \(Y_i\) ( one-hot vector GT )
Loss function for label propagation :
- \(\mathcal{L}=\frac{1}{2} \sum_{i, j=1}^n A_{i j}\left(\hat{y}_i-\hat{y}_j\right)^2=\hat{Y}^T L \hat{Y}\).
- \(L\) = \(D-A\) ( = graph Laplacian matrix )
Probabilistic transition matrix : \(P = D^{-1}A\)
- split into labeld & unlabeled!
\(P=\left(\begin{array}{cc} P_{l l} & P_{l u} \\ P_{u l} & P_{u u} \end{array}\right) \quad Y=\left(\begin{array}{c} Y_l \\ Y_u \end{array}\right) \quad \hat{Y}=\left(\begin{array}{c} \hat{Y}_l \\ \hat{Y}_u \end{array}\right)\).
Optimal solution :
- \(\hat{Y}_l =Y_l\).
- \(\hat{Y}_u =\left(I-P_{u u}\right)^{-1} P_{u l} Y_l\).
Labeling score computation : involves matrix inversion
\(\rightarrow\) computaitonally heavy
\(\rightarrow\) solution : propose an iterative approach to converge to same solution
Iterative Approach
-
procedure
-
step 1) propagate \(\hat{Y} \leftarrow P \hat{Y}\)
-
step 2) preserve labeled data \(\hat{Y}_l=Y_l\)
-
repeat above, until convergence
-
-
loss function :
- \(\mathcal{L}=\frac{1}{2} \sum_{i, j=1}^n A_{i j} \mid \mid \frac{\hat{Y}_i}{\sqrt{D_{i i}}}-\frac{\hat{Y}_i}{\sqrt{D_{j j}}} \mid \mid ^2+(1 / \alpha-1) \sum_{i=1}^n \mid \mid \hat{Y}_i-Y_i \mid \mid ^2\).
- 1st term ) smoothness constraint
- labels that do not change too much between nearby points,
- 2nd term ) fitting constraint
- final labels of the labeled nodes to be similar to their initial value
- 1st term ) smoothness constraint
- \(\mathcal{L}=\frac{1}{2} \sum_{i, j=1}^n A_{i j} \mid \mid \frac{\hat{Y}_i}{\sqrt{D_{i i}}}-\frac{\hat{Y}_i}{\sqrt{D_{j j}}} \mid \mid ^2+(1 / \alpha-1) \sum_{i=1}^n \mid \mid \hat{Y}_i-Y_i \mid \mid ^2\).
-
optimal solution :
-
\(\hat{Y}=(I-\alpha S)^{-1} Y\).
where \(S=D^{-1 / 2} A D^{-1 / 2}\)
-
Iterative Approach (2)
- less computationally expensive
- procedure
- step 1) propagate \(\hat{Y} \leftarrow \alpha S \hat{Y}+(1-\alpha) Y\)
- repeat until convergence
8. Self-Supervision for SSL
trained using pretext task
- exemplar-CNN
- Rotation
- Patches
- Coloriation
- Contrastive Predictive Coding