SCAN : Learning to Classify Images without Labels
Contents
- Abstract
- Introduction
- Method
- Representation learning for semantic clustering
- A semantic clustering loss
- Fine-tuning through self-labeling
0. Abstract
Unsupervised Image Classification
- automatically group images into semantically meaningful clusters when ground-truth (GT) labels are absent
- previous works
  - (1) end-to-end
  - (2) two-step approach ( this paper )
    - feature learning & clustering
1. Introduction
Representation Learning
- use self-supervised learning to generate feature representations ( no labels needed )
- use pre-designed tasks, called pretext tasks
- (1) two-stage approach
  - representation learning : mainly used as the first pretraining stage
  - ( second stage = fine-tuning on another task )
- (2) end-to-end learning
  - combine feature learning & clustering
Proposed work, SCAN
( SCAN = Semantic Clustering by Adopting Nearest neighbors )
- two-step approach
- leverages the advantages of both
- (1) representation learning
- (2) end-to-end learning
Procedures of SCAN
- step 1) learn feature representations via a pretext task
  - (naive) cluster the learned representations with K-means
    \(\rightarrow\) may suffer from the cluster degeneracy problem
  - (proposed) mine the nearest neighbors of each image, based on feature similarity
- step 2) integrate the semantically meaningful nearest neighbors as a prior into a learnable approach
2. Method
(1) Representation learning for semantic clustering
Notation
- image dataset : \(\mathcal{D}=\left\{X_1, \ldots, X_{|\mathcal{D}|}\right\}\)
- class labels ( absent ) : \(\mathcal{C}\)
  \(\rightarrow\) however, we do not have access to the class labels !
Representation learning
- pretext task : \(\tau\)
- embedding function : \(\Phi_{\theta}\)
- image & augmented image : \(X_i\) & \(T[X_i]\)
- minimize …
- \(\min _\theta d\left(\Phi_\theta\left(X_i\right), \Phi_\theta\left(T\left[X_i\right]\right)\right)\).
Conclusion : pretext tasks from representation learning can be used to obtain semantically meaningful features
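A minimal sketch of the pretext objective above, assuming a ResNet-18 backbone, standard augmentations, and squared Euclidean distance as \(d(\cdot,\cdot)\) ( illustrative choices, not the authors' code ) :

```python
# Sketch of min_theta d( Phi_theta(X_i), Phi_theta(T[X_i]) ):
# the embedding network is trained so an image and its augmentation map close together.
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

# embedding function Phi_theta : ResNet-18 with the classifier head removed (assumption)
backbone = models.resnet18(weights=None)
backbone.fc = nn.Identity()

# T[.] : a stochastic augmentation (illustrative choice)
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
])

optimizer = torch.optim.SGD(backbone.parameters(), lr=0.1, momentum=0.9)

def pretext_step(images):
    """One step: pull Phi_theta(X) and Phi_theta(T[X]) together (d = squared L2)."""
    z1 = backbone(images)                          # Phi_theta(X_i)
    z2 = backbone(augment(images))                 # Phi_theta(T[X_i])
    loss = (z1 - z2).pow(2).sum(dim=1).mean()      # d(Phi(X), Phi(T[X]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# note : on its own this invariance objective can collapse to a constant embedding;
# contrastive pretext tasks (e.g., instance discrimination) add negatives to avoid this.
```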
(2) A semantic clustering loss
a) Mining nearest neighbors
naively applying K-means to the obtained features \(\rightarrow\) leads to cluster degeneracy
[ Setting ]
- using the pretext-task embeddings, for every sample \(X_i \in \mathcal{D}\), mine its \(K\) nearest neighbors (NN), \(\mathcal{N}_{X_i}\)
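A hedged sketch of this mining step, using cosine similarity over the (frozen) pretext features ; the function name and the value of \(K\) are illustrative :

```python
import torch
import torch.nn.functional as F

def mine_nearest_neighbors(features: torch.Tensor, k: int = 5) -> torch.Tensor:
    """features : [N, D] pretext embeddings of the whole dataset.
    Returns an [N, k] index tensor, i.e. N_{X_i} for every sample X_i."""
    feats = F.normalize(features, dim=1)        # cosine similarity via normalized dot product
    sim = feats @ feats.t()                     # [N, N] pairwise similarities
    sim.fill_diagonal_(-1.0)                    # a sample is not its own neighbor
    _, neighbor_idx = sim.topk(k, dim=1)        # K most similar samples
    return neighbor_idx

# usage : neighbors = mine_nearest_neighbors(all_features, k=5)
# for large datasets the full N x N similarity matrix is too big ;
# an approximate-NN library (e.g., faiss) is typically used instead.
```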
Loss Function
Goal : learn a clustering function \(\Phi_\eta\)
- classifies a sample \(X_i\) and its neighbors \(\mathcal{N}_{X_i}\) together
- soft assignment over clusters \(\mathcal{C}=\{1, \ldots, C\}\), with \(\Phi_\eta\left(X_i\right) \in [0,1]^C\)
- probability of \(X_i\) assigned to \(c\) : \(\Phi_\eta^c\left(X_i\right)\)
Loss Function : \(\Lambda=-\frac{1}{|\mathcal{D}|} \sum_{X \in \mathcal{D}} \sum_{k \in \mathcal{N}_X} \log \left\langle\Phi_\eta(X), \Phi_\eta(k)\right\rangle+\lambda \sum_{c \in \mathcal{C}} \Phi_\eta^{\prime c} \log \Phi_\eta^{\prime c}\)
- with \(\Phi_\eta^{\prime c}=\frac{1}{|\mathcal{D}|} \sum_{X \in \mathcal{D}} \Phi_\eta^c(X)\)
- (1st term) consistency : forces confident & consistent predictions for a sample and its mined neighbors ( maximizes the dot product of their cluster assignments )
- (2nd term) entropy : spreads the predictions uniformly across all clusters, preventing collapse into a single cluster
  ( = can be replaced by a KL-divergence when the cluster prior is known in advance )
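A hedged sketch of the clustering loss \(\Lambda\) for a mini-batch, where each anchor is paired with one sampled neighbor ; tensor names and the value of \(\lambda\) are illustrative :

```python
import torch

EPS = 1e-8

def scan_loss(anchor_probs, neighbor_probs, entropy_weight=5.0):
    """anchor_probs   : [B, C] softmax outputs Phi_eta(X) for a batch of samples
    neighbor_probs : [B, C] softmax outputs Phi_eta(k), one mined neighbor per anchor"""
    # 1st term : -log < Phi_eta(X), Phi_eta(k) > , averaged over the batch
    dot = (anchor_probs * neighbor_probs).sum(dim=1)
    consistency = -torch.log(dot + EPS).mean()

    # 2nd term : sum_c Phi'_c log Phi'_c , with Phi'_c the batch-mean assignment
    # (a mini-batch estimate of the dataset average in the formula above)
    mean_probs = anchor_probs.mean(dim=0)
    neg_entropy = (mean_probs * torch.log(mean_probs + EPS)).sum()

    return consistency + entropy_weight * neg_entropy
```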
(3) Fine-tuning through self-labeling
- each sample is combined with its \(K \geq 1\) neighbors … but some neighbors may be false positives (FP)
- experimentally observed that samples with highly confident predictions ( \(p_{max}\approx1\) ) tend to be assigned to the proper cluster
\(\rightarrow\) regard them as prototypes for each class
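A hedged sketch of one self-labeling update : keep only samples whose maximum cluster probability exceeds a confidence threshold, use their argmax as a pseudo-label, and train with cross-entropy on a strongly augmented view ; the threshold value and the two-view setup are illustrative assumptions :

```python
import torch
import torch.nn.functional as F

def self_label_step(model, weak_images, strong_images, threshold=0.99):
    """weak_images / strong_images : two augmented views of the same batch."""
    with torch.no_grad():
        probs = F.softmax(model(weak_images), dim=1)    # Phi_eta on the weak view
        confidence, pseudo_labels = probs.max(dim=1)    # p_max and its cluster
        mask = confidence > threshold                   # confident samples ~ prototypes

    if mask.sum() == 0:
        return None                                     # nothing confident enough yet

    logits = model(strong_images[mask])                 # predict on the strong view
    return F.cross_entropy(logits, pseudo_labels[mask]) # cross-entropy on pseudo-labels
```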