An Overview of Deep Semi-Supervised Learning (2020) - Part 4


Contents

  1. Abstract
  2. Introduction
    1. SSL
    2. SSL Methods
    3. Main Assumptions in SSL
    4. Related Problems
  3. Consistency Regularization
    1. Ladder Networks
    2. Pi-Model
    3. Temporal Ensembling
    4. Mean Teachers
    5. Dual Students
    6. Fast-SWA
    7. Virtual Adversarial Training (VAT)
    8. Adversarial Dropout (AdD)
    9. Interpolation Consistency Training (ICT)
    10. Unsupervised Data Augmentation
  4. Entropy Minimization
  5. Proxy-label Methods
    1. Self-training
    2. Multi-view Training
  6. Holistic Methods
    1. MixMatch
    2. ReMixMatch
    3. FixMatch
  7. Generative Models
    1. VAE for SSL
    2. GAN for SSL
  8. Graph-Based SSL
    1. Graph Construction
    2. Label Propagation
  9. Self-Supervision for SSL


8. Graph-Based SSL

Setting : each data point \(x_i\) is either labeled or unlabeled

Notation : \(G(V, E)\)

  • node : \(V=\left\{x_1, \ldots, x_n\right\}\)
  • edge : \(E=\left\{e_{i j}\right\}_{i, j=1}^n\)
  • adjacency matrix : \(A\)
    • each element \(A_{i j}\) is a non-negative weight
    • if \(i\) & \(j\) are not connected … \(A_{i j}=0\)


Graph-based tasks

  • (1) node classification
  • (2) link prediction
  • (3) clustering
  • (4) visualization


Graph methods

  • (1) transductive
    • only capable of producing label assignments for the examples seen during training
  • (2) inductive
    • more generalizable, and can be transferred and applied to unseen examples


will discuss node classification approaches

can be grouped into…

  • (1) methods which propagate the labels from labeled to unlabeled

    • assumption : nearby nodes tend to have the same labels
  • (2) methods which learn node embeddings & apply classifiers on the learned embeddings

    • assumption : nearby nodes should have similar embeddings in vector space


(1) Graph Construction

we first need a graph ! the graph is…

  • (1) represented as an adjacency matrix \(A\)
  • (2) constructed to reflect the similarity of the nodes


In case we have limited domain knowledge…

\(\rightarrow\) create the graph from the data itself !

  • (1) Fully connected graphs

    • fully connected with weighted edges between all pairs of data
    • high computational costs
  • (2) Sparse graphs

    • each node is only connected to a few similar nodes

      ( connections to dissimilar nodes are removed )

    • Ex) kNN graphs ( see the construction sketch after this list )
      • \(i\) & \(j\) are connected if one is among the \(k\)-nearest neighbors of the other
      • how to obtain edge weights ?
        • ex) Gaussian kernel : \(W_{i j}=\exp \left(-\left\|x_i-x_j\right\|^2 / 2 \sigma^2\right)\)
    • Ex) \(\epsilon\)NN graphs
      • \(i\) & \(j\) are connected if \(d(i, j) \leq \epsilon\)
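A minimal NumPy sketch of this kNN construction ( `knn_graph` and its default `k`, `sigma` values are illustrative, not from the paper ) :

```python
import numpy as np

def knn_graph(X, k=10, sigma=1.0):
    """Build a symmetric kNN adjacency matrix with Gaussian-kernel weights."""
    n = X.shape[0]
    # pairwise squared Euclidean distances (O(n^2 d) memory: fine for a sketch)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian kernel : W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)               # no self-loops
    A = np.zeros_like(W)
    for i in range(n):                     # keep each node's k nearest neighbors
        nearest = np.argsort(-W[i])[:k]    # largest weights = smallest distances
        A[i, nearest] = W[i, nearest]
    return np.maximum(A, A.T)              # connect i & j if either picked the other
```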


(2) Label Propagation

propagates labels of the labeled data \(\rightarrow\) unlabeled data

Assumption :

  • data points on the same manifold \(\rightarrow\) likely to share the same label


Notation

  • \(\hat{Y}\) : \(n \times C\) matrix ( of new classification scores )
    • each row : \(\hat{Y}_i\) ( distn over \(C\) classes )
  • \(Y\) : \(n \times C\) matrix ( labels for the labeled data )
    • each row : \(Y_i\) ( one-hot vector GT )


Loss function for label propagation :

  • \(\mathcal{L}=\frac{1}{2} \sum_{i, j=1}^n A_{i j}\left\|\hat{Y}_i-\hat{Y}_j\right\|^2=\operatorname{tr}\left(\hat{Y}^{\top} L \hat{Y}\right)\).
    • \(L\) = \(D-A\) ( = graph Laplacian matrix ; a numerical check of this identity follows below )
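A quick numerical check of the identity above ( all values are random & illustrative ) :

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 5))
A = (A + A.T) / 2                            # symmetric non-negative weights
np.fill_diagonal(A, 0.0)                     # no self-loops
D = np.diag(A.sum(axis=1))                   # degree matrix
L = D - A                                    # graph Laplacian
Y_hat = rng.random((5, 3))                   # n x C score matrix

lhs = 0.5 * sum(A[i, j] * np.sum((Y_hat[i] - Y_hat[j]) ** 2)
                for i in range(5) for j in range(5))
rhs = np.trace(Y_hat.T @ L @ Y_hat)
assert np.isclose(lhs, rhs)                  # both sides agree
```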


Probabilistic transition matrix : \(P = D^{-1}A\)

  • split into labeled & unlabeled parts !

\(P=\left(\begin{array}{cc} P_{l l} & P_{l u} \\ P_{u l} & P_{u u} \end{array}\right) \quad Y=\left(\begin{array}{c} Y_l \\ Y_u \end{array}\right) \quad \hat{Y}=\left(\begin{array}{c} \hat{Y}_l \\ \hat{Y}_u \end{array}\right)\).


Optimal solution ( see the sketch below ) :

  • \(\hat{Y}_l =Y_l\).
  • \(\hat{Y}_u =\left(I-P_{u u}\right)^{-1} P_{u l} Y_l\).
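A minimal sketch of this closed-form solution, assuming the first \(l\) rows/columns of \(A\) ( and the rows of \(Y\) ) correspond to the labeled points ; function & variable names are mine :

```python
import numpy as np

def label_propagation_closed_form(A, Y_l):
    """Closed-form label propagation: Y_u = (I - P_uu)^{-1} P_ul Y_l."""
    l = Y_l.shape[0]                        # number of labeled points
    D_inv = np.diag(1.0 / A.sum(axis=1))
    P = D_inv @ A                           # transition matrix P = D^{-1} A
    P_ul = P[l:, :l]                        # unlabeled -> labeled block
    P_uu = P[l:, l:]                        # unlabeled -> unlabeled block
    I = np.eye(P_uu.shape[0])
    # solve (I - P_uu) Y_u = P_ul Y_l instead of inverting explicitly
    Y_u = np.linalg.solve(I - P_uu, P_ul @ Y_l)
    return Y_u                              # soft labels for the unlabeled nodes
```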


Labeling score computation : involves matrix inversion

\(\rightarrow\) computationally heavy

\(\rightarrow\) solution : an iterative approach that converges to the same solution


Iterative Approach

  • procedure

    • step 1) propagate \(\hat{Y} \leftarrow P \hat{Y}\)

    • step 2) preserve labeled data \(\hat{Y}_l=Y_l\)

    • repeat the above until convergence ( see the sketch at the end of this section )

  • loss function :

    • \(\mathcal{L}=\frac{1}{2} \sum_{i, j=1}^n A_{i j}\left\|\frac{\hat{Y}_i}{\sqrt{D_{i i}}}-\frac{\hat{Y}_j}{\sqrt{D_{j j}}}\right\|^2+(1 / \alpha-1) \sum_{i=1}^n\left\|\hat{Y}_i-Y_i\right\|^2\).
      • 1st term ) smoothness constraint
        • encourages labels not to change too much between nearby points
      • 2nd term ) fitting constraint
        • keeps the final labels of the labeled nodes close to their initial values
  • optimal solution :

    • \(\hat{Y}=(I-\alpha S)^{-1} Y\).

      where \(S=D^{-1 / 2} A D^{-1 / 2}\)
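A sketch of the propagate-and-clamp iteration ( steps 1 & 2 above ) ; the stopping criteria `max_iter` & `tol` are illustrative additions :

```python
import numpy as np

def propagate_and_clamp(A, Y, labeled_mask, max_iter=1000, tol=1e-6):
    """Iterative label propagation: propagate with P = D^{-1} A,
    then re-clamp the labeled rows to their ground-truth labels."""
    P = np.diag(1.0 / A.sum(axis=1)) @ A        # probabilistic transition matrix
    Y_hat = Y.astype(float).copy()              # unlabeled rows of Y start at zero
    for _ in range(max_iter):
        Y_prev = Y_hat
        Y_hat = P @ Y_hat                       # step 1) propagate
        Y_hat[labeled_mask] = Y[labeled_mask]   # step 2) preserve labeled data
        if np.abs(Y_hat - Y_prev).max() < tol:  # stop once scores stabilize
            break
    return Y_hat
```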


Iterative Approach (2)

  • less computationally expensive
  • procedure
    • step 1) propagate \(\hat{Y} \leftarrow \alpha S \hat{Y}+(1-\alpha) Y\)
    • repeat until convergence ( sketched below )
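A sketch of this cheaper iteration ( the name `label_spreading` and the stopping criteria are my additions ) ; it converges to \((1-\alpha)(I-\alpha S)^{-1} Y\), i.e. the closed-form solution above up to the constant factor \(1-\alpha\), which does not change the argmax over classes :

```python
import numpy as np

def label_spreading(A, Y, alpha=0.99, max_iter=1000, tol=1e-6):
    """Iterate Y_hat <- alpha * S @ Y_hat + (1 - alpha) * Y,
    with S = D^{-1/2} A D^{-1/2} the normalized adjacency."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    S = d_inv_sqrt @ A @ d_inv_sqrt             # symmetric normalization
    Y_hat = Y.astype(float).copy()
    for _ in range(max_iter):
        Y_prev = Y_hat
        Y_hat = alpha * (S @ Y_hat) + (1.0 - alpha) * Y
        if np.abs(Y_hat - Y_prev).max() < tol:  # stop once scores stabilize
            break
    return Y_hat
```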


9. Self-Supervision for SSL

trained using a pretext task on the unlabeled data, e.g. ( Rotation is sketched after this list ) :

  • exemplar-CNN
  • Rotation
  • Patches
  • Colorization
  • Contrastive Predictive Coding
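As a concrete example, a minimal sketch of data generation for the Rotation pretext task ( the helper name is mine ; assumes square images so all rotations stack into one array ) :

```python
import numpy as np

def rotation_pretext_batch(images):
    """Rotation pretext task: rotate each image by 0/90/180/270 degrees;
    the model must predict which rotation was applied (4-way classification)."""
    rotated, labels = [], []
    for img in images:                     # img : (H, W, C) array with H == W
        for r in range(4):                 # r quarter-turns counter-clockwise
            rotated.append(np.rot90(img, k=r))
            labels.append(r)               # pretext label = rotation index
    return np.stack(rotated), np.array(labels)
```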

