Self-Labeling via Simultaneous Clustering and Representation Learning


Contents

  1. Abstract
  2. Introduction
  3. Method
    1. Self-labeling


0. Abstract

combining (1) clustering + (2) representation learning

\(\rightarrow\) doing it naively…leads to degenerate solutions


solution : propose a method that maximizes the information between labels & input data indices


1. Introduction

self-supervision : mostly done by designing new pretext tasks

But the task of classification is sufficient for pre-training

( of course… provided that labels are given )

\(\rightarrow\) focus on obtaining the labels automatically ( with self-labeling algorithm )


Degeneration problem ?

\(\rightarrow\) solve by adding the constraint that the labels must induce an equipartition of the data ( = maximizes the information between data indices & labels )


2. Method

(1) self-labeling method

(2) interpret the method as optimizing the labels & targets of a CE loss


(1) Self-labeling

Notation :

  • \(x=\Phi(I)\) : DNN
    • map images (\(I\)) to feature vectors (\(x \in \mathbb{R}^D\) )
  • \(I_1, \ldots, I_N\) : Image data
  • \(y_1, \ldots, y_N \in\{1, \ldots, K\}\) : Image labels
  • \(h: \mathbb{R}^D \rightarrow \mathbb{R}^K\) : classification head
  • \(p\left(y=\cdot \mid \boldsymbol{x}_i\right)=\operatorname{softmax}\left(h \circ \Phi\left(I_i\right)\right)\) : class probabilities
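
A minimal sketch of this notation in PyTorch ( the backbone, feature size \(D\) and label count \(K\) below are illustrative stand-ins, not the paper's actual architecture ) :

```python
import torch
import torch.nn as nn

# Stand-ins for the notation above (all sizes are illustrative).
D, K = 128, 10                      # feature dimension D, number of labels K

Phi = nn.Sequential(                # x = Phi(I): maps images I to feature vectors x in R^D
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, D),
)
h = nn.Linear(D, K)                 # classification head h : R^D -> R^K

I = torch.randn(8, 3, 32, 32)       # dummy batch of images I_1, ..., I_8
x = Phi(I)                          # feature vectors x_i
p = torch.softmax(h(x), dim=1)      # p(y = . | x_i) = softmax(h . Phi(I_i)), shape (8, K)
```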


Train the model & head parameters with the average CE loss

  • \(E\left(p \mid y_1, \ldots, y_N\right)=-\frac{1}{N} \sum_{i=1}^N \log p\left(y_i \mid \boldsymbol{x}_i\right)\).

\(\rightarrow\) requires a labelled dataset

( if not, requires a self-labeling mechanism )
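
With labels given, this is just ordinary cross-entropy training ; a minimal sketch ( stand-in model for \(h \circ \Phi\), dummy labels, illustrative optimizer settings ) :

```python
import torch
import torch.nn.functional as F

# Supervised case: E(p | y_1,...,y_N) = -1/N * sum_i log p(y_i | x_i).
N, D, K = 8, 128, 10
model = torch.nn.Sequential(torch.nn.Linear(D, 64), torch.nn.ReLU(), torch.nn.Linear(64, K))
x = torch.randn(N, D)                       # dummy input batch
y = torch.randint(0, K, (N,))               # given ground-truth labels y_i

E = F.cross_entropy(model(x), y)            # averages -log p(y_i | x_i) over the batch

opt = torch.optim.SGD(model.parameters(), lr=0.01)
opt.zero_grad()
E.backward()
opt.step()                                  # one gradient step on model & head parameters
```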


[ Self-labeling mechanism ]

  • achieved by jointly optimizing the CE loss w.r.t.

    • (1) model \(h \circ \Phi\)
    • (2) labels \(y_1, \ldots, y_N\)
  • but if fully unsupervised … leads to a degenerate solution

    ( = trivially minimized by assigning all data points to a single (arbitrary) label, as the sketch below shows )
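
A tiny numeric illustration of that degeneracy ( hypothetical numbers ) : assign every point the same label and let the head always predict it, and the CE loss collapses to ~0 without learning any useful representation.

```python
import torch
import torch.nn.functional as F

# Degenerate solution: every point gets label 0, and the head confidently predicts 0.
N, K = 8, 10
y_degenerate = torch.zeros(N, dtype=torch.long)   # all data points assigned the same label
logits = torch.zeros(N, K)
logits[:, 0] = 20.0                               # "model" always predicts label 0

print(F.cross_entropy(logits, y_degenerate))      # ~0: loss is minimized, nothing useful is learned
```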


Solution?

  • first, encode the labels as a posterior distribution \(q\left(y \mid \boldsymbol{x}_i\right)\)

    • (Before) \(E\left(p \mid y_1, \ldots, y_N\right)=-\frac{1}{N} \sum_{i=1}^N \log p\left(y_i \mid \boldsymbol{x}_i\right)\).
    • (After) \(E(p, q)=-\frac{1}{N} \sum_{i=1}^N \sum_{y=1}^K q\left(y \mid \boldsymbol{x}_i\right) \log p\left(y \mid \boldsymbol{x}_i\right) .\)

    ( optimizing \(q\) = reassigning labels )

  • to avoid degeneracy…

    \(\rightarrow\) add the constraint that the label assignments must partition the data into equally-sized subsets

  • objective function :

    • \(\min _{p, q} E(p, q) \quad \text { subject to } \quad \forall y: q\left(y \mid \boldsymbol{x}_i\right) \in\{0,1\} \text { and } \sum_{i=1}^N q\left(y \mid \boldsymbol{x}_i\right)=\frac{N}{K}\).
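
A small sketch of evaluating this objective ( illustrative values ; how the optimal balanced \(q\) is actually found is a separate step of the method, not shown here ) :

```python
import torch
import torch.nn.functional as F

# Evaluate E(p, q) for a hard, balanced assignment q (illustrative numbers only).
N, K = 12, 3
log_p = torch.log_softmax(torch.randn(N, K), dim=1)   # model's log p(y | x_i)

# A balanced assignment: round-robin labels, i.e. point i gets label i % K.
# (The method searches for the best q under this constraint; here q is just fixed.)
y = torch.arange(N) % K
q = F.one_hot(y, K).float()                            # q(y | x_i) in {0, 1}

assert torch.all(q.sum(dim=0) == N / K)                # equipartition: sum_i q(y|x_i) = N/K for every y

E = -(q * log_p).sum(dim=1).mean()                     # E(p, q) = -1/N sum_i sum_y q(y|x_i) log p(y|x_i)
print(E.item())
```

Without that equipartition check, nothing would rule out the degenerate all-one-label assignment from the earlier sketch.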
