Deep Clustering for Unsupervised Learning of Visual Features
Contents
- Abstract
- Method
- Preliminaries
- Unsupervised Learning by Clustering
- Avoiding Trivial Solutions
0. Abstract
DeepCluster
- end-to-end training of visual features
- jointly learns…
- (1) the parameters of the NN
- (2) the cluster assignments of the resulting features
- iteratively…
- step 1) group the features with a standard clustering algorithm (k-means)
- step 2) use the subsequent assignments as supervision to update the NN
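A minimal sketch of this alternation, assuming hypothetical `convnet`, `classifier`, and `train_step` helpers (illustrative names, not the authors' code) and scikit-learn's k-means:

```python
# Illustrative sketch of one DeepCluster "epoch": cluster, then train on pseudo-labels.
import numpy as np
from sklearn.cluster import KMeans

def deepcluster_epoch(convnet, classifier, images, k, train_step):
    # step 1) compute features for the whole dataset and group them with k-means
    feats = np.stack([convnet(x) for x in images])            # N x d feature matrix
    pseudo_labels = KMeans(n_clusters=k).fit_predict(feats)   # one cluster id per image
    # step 2) use the assignments as supervision for a standard classification update
    for x, y in zip(images, pseudo_labels):
        train_step(convnet, classifier, x, y)
```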
1. Method
(1) Preliminaries
data : \(X=\left\{x_{1}, x_{2}, \ldots, x_{N}\right\}\)
model :
- (1) convnet mapping : \(f_{\theta}\)
- (2) classifier : \(g_{W}\)
Loss function : \(\min _{\theta, W} \frac{1}{N} \sum_{n=1}^{N} \ell\left(g_{W}\left(f_{\theta}\left(x_{n}\right)\right), y_{n}\right)\).
( \(\ell\) : multinomial logistic loss, i.e. negative log-softmax )
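A minimal PyTorch sketch of this supervised objective; the tiny MLP stands in for the convnet \(f_{\theta}\), and all sizes are arbitrary assumptions:

```python
import torch
import torch.nn as nn

# stand-in networks: f_theta maps images to features, g_W is the classifier
f_theta = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
g_W = nn.Linear(256, 10)

x = torch.randn(8, 3, 32, 32)            # a batch of images x_n
y = torch.randint(0, 10, (8,))           # labels y_n

# CrossEntropyLoss is the negative log-softmax of the target class, averaged over the batch
loss = nn.CrossEntropyLoss()(g_W(f_theta(x)), y)
loss.backward()                          # gradients w.r.t. both theta and W
```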
(2) Unsupervised Learning by Clustering
cluster the output of the convnet &
use the subsequent cluster assignments as “pseudo-labels” to optimize \(\min _{\theta, W} \frac{1}{N} \sum_{n=1}^{N} \ell\left(g_{W}\left(f_{\theta}\left(x_{n}\right)\right), y_{n}\right)\)
( iteratively learns the features & groups them )
K-means
- input : \(f_{\theta}\left(x_{n}\right)\)
- output : clusters them into \(k\) distinct groups
- jointly learns…
- (1) \(d \times k\) centroid matrix \(C\)
- (2) cluster assignments \(y_{n}\) of each image \(n\)
- loss function :
- \(\min _{C \in \mathbb{R}^{d \times k}} \frac{1}{N} \sum_{n=1}^{N} \min _{y_{n} \in\{0,1\}^{k}}\left\|f_{\theta}\left(x_{n}\right)-C y_{n}\right\|_{2}^{2} \quad \text { such that } y_{n}^{\top} 1_{k}=1\).
\(\rightarrow\) these assignments are used as pseudo-labels
( the centroid matrix \(C\) itself is not used )
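A sketch of this clustering step with scikit-learn's k-means (feature dimension, \(N\), and \(k\) are arbitrary assumptions); only the assignments are kept:

```python
import numpy as np
from sklearn.cluster import KMeans

feats = np.random.randn(1000, 256)                 # stand-in for the features f_theta(x_n), N x d
kmeans = KMeans(n_clusters=100, n_init=10).fit(feats)

pseudo_labels = kmeans.labels_                     # y_n for each image -> classification targets
C = kmeans.cluster_centers_                        # k x d centroids (the paper's C, transposed) -- discarded
# kmeans.inertia_ equals the summed objective  sum_n || f_theta(x_n) - C y_n ||_2^2
```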
problem : degenerate solutions
\(\rightarrow\) e.g. all data points end up assigned to a single cluster… how to avoid this?
(3) Avoiding Trivial Solutions
solutions are typically based on constraining or penalizing the MINIMAL number of points per cluster
but… these constraints are computed over the whole dataset → not applicable to large-scale datasets
a) empty clusters
when a cluster becomes empty…
randomly select a non-empty cluster & use its centroid with a small random perturbation as the new centroid for the empty cluster
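A hedged sketch of this reassignment rule; the function name and the perturbation scale `eps` are assumptions, and `centroids` is the \(k \times d\) centroid array:

```python
import numpy as np

def fix_empty_clusters(centroids, labels, eps=1e-4, rng=np.random):
    """Reassign each empty cluster's centroid to a perturbed copy of a random non-empty centroid."""
    k = centroids.shape[0]
    counts = np.bincount(labels, minlength=k)          # cluster sizes
    non_empty = np.flatnonzero(counts > 0)
    for j in np.flatnonzero(counts == 0):              # every empty cluster
        src = rng.choice(non_empty)                    # randomly chosen non-empty cluster
        centroids[j] = centroids[src] + eps * rng.randn(*centroids[src].shape)
    return centroids
```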
b) trivial parameterization
minimizing \(\min _{\theta, W} \frac{1}{N} \sum_{n=1}^{N} \ell\left(g_{W}\left(f_{\theta}\left(x_{n}\right)\right), y_{n}\right)\) when most images are assigned to a few clusters
\(\rightarrow\) leads to a trivial parameterization
( = the network predicts the same output, regardless of the input )
solution ?
- sample images based on a uniform distribution over the classes ( or pseudo-labels )
( = a weighted loss, where each image is weighted by the inverse of the size of its assigned cluster )
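One way to realize this sampling, sketched with PyTorch's `WeightedRandomSampler` (the sampler choice and sizes are assumptions, not necessarily the authors' implementation):

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

pseudo_labels = np.random.randint(0, 100, size=10000)   # stand-in cluster assignments y_n
cluster_sizes = np.bincount(pseudo_labels, minlength=100)
weights = 1.0 / cluster_sizes[pseudo_labels]             # weight_n = 1 / |cluster containing x_n|

# drawing with these weights samples the pseudo-labels roughly uniformly
sampler = WeightedRandomSampler(torch.as_tensor(weights, dtype=torch.double),
                                num_samples=len(weights), replacement=True)
# pass `sampler=sampler` to the DataLoader that feeds the training loop
```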