Mitigating Embedding and Class Assignment Mismatch in Unsupervised Image Classification
Contents
- Abstract
- Introduction
- Model
- Stage 1 : Unsupervised Deep Embedding
- Stage 2 : Unsupervised Class Assignment with Refining Pretraining Embeddings
0. Abstract
Unsupervised Image Classification
- latest approach : end-to-end
- unified losses from (1) embedding & (2) class assignment
- the two have different goals … thus jointly optimizing them may lead to suboptimal solutions
Solution : propose a novel two-stage algorithm
- (1) embedding module for pretraining
- (2) refining module for embedding & class assignment
1. Introduction
Unsupervised Image Classification
- determine the membership of each data point as one of the predefined class labels
- 2 methods are used :
- (1) sequential method
- (2) joint method
This paper : two-stage approach
- stage 1) embedding learning
- gather similar data points
- stage 2) refine embedding & assign class
- minimize 2 kinds of loss
- (1) class assignment loss
- (2) embedding loss
2. Model
Notation
- # of underlying classes : n_c
- set of n images : I = \{ x_1, x_2, \ldots, x_n \}
(1) Stage 1 : Unsupervised Deep Embedding
- [GOAL] extract visually essential features
- adopt Super-AND to initialize the encoder
Super-AND
employs…
- (1) data augmentation
- (2) entropy-based loss
total of 3 losses
- (1) AND-loss ( L_{and} )
- (2) UE-loss ( L_{ue} ) : unification entropy loss
- (3) AUG-loss ( L_{aug} ) : augmentation loss
Details
- considers every data occurrence as an individual class
- groups the data points into small clusters ( by discovering the nearest neighbors )
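A toy NumPy sketch of the neighborhood discovery idea (my own 1-NN simplification for illustration; `discover_neighborhoods` and `select_ratio` are hypothetical names, not from the paper):

```python
import numpy as np

def discover_neighborhoods(embeddings, select_ratio=0.5):
    """Toy AND-style neighborhood discovery: pair each sample with its
    nearest neighbor and keep only the most confident pairs."""
    v = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = v @ v.T                            # cosine similarity
    np.fill_diagonal(sim, -np.inf)           # exclude self-matches
    nn_idx = sim.argmax(axis=1)              # 1-nearest neighbor per sample
    nn_sim = sim[np.arange(len(v)), nn_idx]
    n_keep = int(len(v) * select_ratio)      # selected set N for this round
    selected = np.argsort(-nn_sim)[:n_keep]
    return nn_idx, selected
```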
a) AND-loss
- considers each neighborhood pair & remaining data as a single class to separate
- L_{and} = -\sum_{i \in \mathcal{N}} \log \Big( \sum_{j \in \tilde{\mathcal{N}}(x_i) \cup \{ i \}} p_i^j \Big) - \sum_{i \in \mathcal{N}^c} \log p_i^i.
- \mathcal{N} : selected part of the neighborhood pair sets
- \mathcal{N}^c : complement of \mathcal{N}
- \tilde{\mathcal{N}}(x_i) : neighbors of the i-th image
- p_i^j : probability of the i-th image being identified as the j-th class
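A minimal NumPy sketch of L_{and} under the 1-NN simplification above (illustrative only; assumes `p[i, j]` is a precomputed probability of image i being identified as instance class j):

```python
import numpy as np

def and_loss(p, neighbors, selected):
    """L_and: selected samples pull in their neighborhood; the rest
    (complement set N^c) stay their own single-instance class."""
    n = p.shape[0]
    comp = np.setdiff1d(np.arange(n), selected)    # complement set N^c
    loss = 0.0
    for i in selected:                             # -log( p_i^{nn(i)} + p_i^i )
        loss -= np.log(p[i, neighbors[i]] + p[i, i])
    for i in comp:                                 # -log p_i^i
        loss -= np.log(p[i, i])
    return loss
```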
b) UE-loss
- intensifies the concentration effect
- minimizing UE-loss makes nearby data occurrences attract each other
- L_{ue} = -\sum_i \sum_{j \neq i} \tilde{p}_i^j \log \tilde{p}_i^j.
Jointly optimizing a) & b)
→ enforces overall neighborhoods to be separated, while keeping similar neighbors close
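A matching sketch of L_{ue} (illustrative; \tilde{p} is obtained by re-normalizing p over the non-self entries, and the loss is just their entropy):

```python
import numpy as np

def ue_loss(p):
    """L_ue: entropy of the neighborhood distribution excluding self.
    Minimizing it concentrates mass on the closest neighbors."""
    n = p.shape[0]
    off = p * (1.0 - np.eye(n))                     # drop p_i^i
    p_tilde = off / off.sum(axis=1, keepdims=True)  # re-normalize over j != i
    mask = ~np.eye(n, dtype=bool)
    return -(p_tilde[mask] * np.log(p_tilde[mask])).sum()
```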
c) AUG-loss
- defined to learn invariant image features
- regards augmented images as positive pairs
  → reduces the discrepancy between original & augmented
- L_{aug} = -\sum_i \sum_{j \neq i} \log ( 1 - \bar{p}_i^j ) - \sum_i \log \bar{p}_i^i.
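A sketch of L_{aug}, assuming `p_bar[i, j]` holds the probability of the augmented version of image i being identified as instance j:

```python
import numpy as np

def aug_loss(p_bar):
    """L_aug: an augmented image should match its original instance
    (diagonal) and no other instance (off-diagonal)."""
    n = p_bar.shape[0]
    mask = ~np.eye(n, dtype=bool)
    loss = -np.log(1.0 - p_bar[mask]).sum()   # push down wrong matches
    loss -= np.log(np.diag(p_bar)).sum()      # pull augmented -> original
    return loss
```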
Total Loss :
L_{stage1} = L_{and} + w(t) \times L_{ue} + L_{aug}.
- w(t) : initialized at 0 and increased gradually
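Putting stage 1 together, reusing the loss sketches above (the linear schedule for w(t) is my assumption; the paper only states it starts at 0 and grows gradually):

```python
def stage1_loss(p, p_bar, neighbors, selected, epoch, total_epochs):
    """L_stage1 = L_and + w(t) * L_ue + L_aug."""
    w_t = epoch / float(total_epochs)   # hypothetical linear ramp 0 -> 1
    return (and_loss(p, neighbors, selected)
            + w_t * ue_loss(p)
            + aug_loss(p_bar))
```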
(2) Stage 2 : Unsupervised Class Assignment with Refining Pretraining Embeddings
ideal class assignment : requires …
- (1) not only ideal embedding
- (2) but also dense grouping
→ use 2 kinds of loss in Stage 2
- (1) class assignment loss
- (2) consistency preserving loss
Mutual Information-based Class Assignment
Mutual Information (MI) :
I(x, y) = D_{KL}( p(x, y) \,\|\, p(x) p(y) ) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)} = H(x) - H(x \mid y).
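A small NumPy helper that evaluates this definition on a joint probability table (with the usual 0 log 0 := 0 convention):

```python
import numpy as np

def mutual_information(p_xy):
    """I(x, y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) )."""
    px = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    nz = p_xy > 0                          # skip zero cells (0 log 0 := 0)
    return (p_xy[nz] * np.log(p_xy[nz] / (px * py)[nz])).sum()
```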
IIC (Invariant Information Clustering)
- maximizes MI between samples & augmented samples
- trains the classifier with invariant features from data augmentation
- procedure
  - [input] image set x & augmented image set g(x)
  - mapping : f_\theta
  - classifies images & generates probability vectors
    ( y = f_\theta(x), \hat{y} = f_\theta(g(x)) )
  - finds the optimal f_\theta that maximizes the MI :
    \max_\theta I(f_\theta(x), f_\theta(g(x)))
- by maximizing MI, clustering degeneracy can be prevented
Details of MI : I(y, \hat{y}) = H(y) - H(y \mid \hat{y})
- (1) maximize H(y)
  - maximized when every data point is EVENLY assigned to every cluster
- (2) minimize H(y \mid \hat{y})
  - minimized when cluster assignments are consistent between original & augmented samples
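A quick numeric check of both terms (values in nats; the 2-cluster tables are made up for illustration):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]                            # 0 log 0 := 0
    return -(p * np.log(p)).sum()

# (1) H(y) is maximal for an even cluster marginal ...
print(entropy(np.array([0.5, 0.5])))        # ~0.693 = log 2 (maximum)
print(entropy(np.array([0.9, 0.1])))        # ~0.325 (degenerate clustering)

# (2) ... and H(y|y_hat) = H(y, y_hat) - H(y_hat) hits 0 when y == y_hat always
P = np.array([[0.5, 0.0],
              [0.0, 0.5]])                  # perfectly consistent joint pdf
print(entropy(P.flatten()) - entropy(P.sum(axis=0)))   # 0.0
```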
Loss Function :
- joint pdf of y and \hat{y} : matrix \mathbf{P}
  ( \mathbf{P} = \frac{1}{n} \sum_{i \in \mathcal{B}} f_\theta(x_i) \cdot f_\theta(g(x_i))^T )
- L_{assign} = -\sum_c \sum_{c'} \mathbf{P}_{cc'} \cdot \log \frac{\mathbf{P}_{cc'}}{\mathbf{P}_{c'} \cdot \mathbf{P}_c}.
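A sketch of L_{assign} from batch softmax outputs (symmetrizing P follows IIC's practice; `eps` is only a numerical-safety assumption):

```python
import numpy as np

def assign_loss(y, y_hat, eps=1e-12):
    """L_assign = -MI of the joint pdf matrix P (IIC-style loss).
    y, y_hat: (batch, n_classes) softmax outputs for original / augmented."""
    P = y.T @ y_hat / len(y)            # P = (1/n) sum_i f(x_i) f(g(x_i))^T
    P = 0.5 * (P + P.T)                 # symmetrize, as IIC does
    Pc = P.sum(axis=1, keepdims=True)   # marginal P_c
    Pcp = P.sum(axis=0, keepdims=True)  # marginal P_c'
    P, Pc, Pcp = P + eps, Pc + eps, Pcp + eps
    return -(P * (np.log(P) - np.log(Pc) - np.log(Pcp))).sum()
```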
Consistency Preserving on Embedding
add an extra loss term, L_{cp}
Notation
- image : \mathbf{x}_i
- embedding of \mathbf{x}_i : \mathbf{v}_i
- projected to normalized sphere
- \hat{\mathbf{p}}_i^j \ (i \neq j) : probability of instance i being classified as the j-th instance
- \hat{\mathbf{p}}_i^i : probability of instance i being classified as its own augmented instance
Consistency preserving loss L_{cp} : penalizes mis-classified cases over the batches
- \hat{\mathbf{p}}_i^j = \frac{\exp ( \mathbf{v}_j^{\top} \mathbf{v}_i / \tau )}{\sum_{k=1}^n \exp ( \mathbf{v}_k^{\top} \mathbf{v}_i / \tau )}, \quad \hat{\mathbf{p}}_i^i = \frac{\exp ( \mathbf{v}_i^{\top} \hat{\mathbf{v}}_i / \tau )}{\sum_{k=1}^n \exp ( \mathbf{v}_k^{\top} \hat{\mathbf{v}}_i / \tau )}
- L_{cp} = -\sum_i \sum_{j \neq i} \log ( 1 - \hat{\mathbf{p}}_i^j ) - \sum_i \log \hat{\mathbf{p}}_i^i.
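A NumPy sketch of L_{cp} over a batch of L2-normalized embeddings (tau = 0.07 is an assumed temperature, not necessarily the paper's value):

```python
import numpy as np

def cp_loss(v, v_hat, tau=0.07):
    """L_cp: v / v_hat are (n, d) embeddings of original / augmented images."""
    n = v.shape[0]
    # p_hat_i^j: instance i matched to other instances j (should be small)
    sim = np.exp(v @ v.T / tau)
    p_ij = sim / sim.sum(axis=1, keepdims=True)
    # p_hat_i^i: augmented instance matched back to its own instance
    sim_aug = np.exp(v @ v_hat.T / tau)     # [k, i] = exp(v_k . v_hat_i / tau)
    p_ii = np.diag(sim_aug) / sim_aug.sum(axis=0)
    mask = ~np.eye(n, dtype=bool)
    return -np.log(1.0 - p_ij[mask]).sum() - np.log(p_ii).sum()
```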
Total Unsupervised Classification Loss :
- L_{\text {stage } 2}=L_{\text {assign }}+\lambda \cdot L_{c p}.
Normalized FC classifier
Norm-FC classification heads :
- used for the second stage classifier
Predicted value :
- y_i^j = \frac{\exp \left( \frac{\mathbf{w}_j}{\| \mathbf{w}_j \|} \cdot \mathbf{v}_i / \tau_c \right)}{\sum_k \exp \left( \frac{\mathbf{w}_k}{\| \mathbf{w}_k \|} \cdot \mathbf{v}_i / \tau_c \right)}.
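A sketch of the Norm-FC head (tau_c = 0.1 is an assumed temperature; rows of W are the class weight vectors):

```python
import numpy as np

def norm_fc_head(v, W, tau_c=0.1):
    """Cosine-style logits between embeddings v (n, d) and the
    weight-normalized class vectors w_j / ||w_j|| from W (n_classes, d)."""
    W_norm = W / np.linalg.norm(W, axis=1, keepdims=True)
    logits = v @ W_norm.T / tau_c
    logits -= logits.max(axis=1, keepdims=True)    # numerically stable softmax
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)        # y_i^j
```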