Momentum Contrast for Unsupervised Visual Representation Learning
Contents
- Abstract
- Introduction
- Related Works
    - Loss Functions
    - Pretext Tasks
- Method
    - Contrastive Learning as Dictionary Look-up
    - Momentum Contrast
    - Pretext Task
- Pseudocode
- Experiment
0. Abstract
MoCo ( Momentum Contrast )
- UNsupervised visual representation learning
- dictionary look-up perspective \(\rightarrow\) build a dynamic dictionary ( with a queue = FIFO )
- moving-averaged (momentum) key encoder
1. Introduction
Previous works on unsupervised learning
- successful mostly in NLP
- CV : still dominated by supervised pre-training…
Recent studies on UNSUPERVISED visual representation :
- related to contrastive loss ( = can be thought of as building dynamic dictionaries )
    - keys : sampled from data \(\rightarrow\) passed through an encoder
Desirable to build dictionaries that are..
- (1) large
- (2) consistent
Propose MoCo
- as a way to build large & consistent dictionaries ( dictionary = queue )
- unsupervised learning with a contrastive loss
- 2 encoders
    - (1) key encoder
    - (2) query encoder
- slowly progressing key encoder
    - momentum-based moving average of the query encoder ( to maintain consistency )
2. Related Works
2 aspects of un/semi-supervised learning
- (1) pretext tasks
- (2) loss functions
(1) Pretext tasks
- the task being solved is not of real interest ( it is solved only for the true purpose of learning a good representation )
(2) Loss functions
- can be investigated independently of pretext tasks
(1) Loss Functions
Contrastive Losses
- measure the similarities of sample pairs ( instead of matching predictions to fixed targets )
- core of several un/semi-supervised learning methods
Adversarial Losses
- measure the difference between probability distributions
- widely used in unsupervised data generation
(2) Pretext Tasks
examples
- recovering the input under some corruption
- forming pseudo-labels by..
    - transformations of a single image
    - patch orderings
    - tracking
    - segmenting objects in videos
3. Method

(1) Contrastive Learning as Dictionary Look-up
Contrastive Learning
- can be thought of as training an ENCODER of a DICTIONARY LOOK-UP task
Notation
- encoded query : \(q\)
- keys of a dictionary :
    - set of encoded samples : \(\{ k_0, k_1, k_2, \cdots \}\)
    - positive key : \(k_{+}\)
    - negative keys : \(k_{-}\)
Contrastive Loss : low , when…
- \(q\) is similar to its positive key \(k_{+}\)
- dissimilar to all other keys ( = negative keys )
( similarity : measured by dot product )
\(\rightarrow\) InfoNCE
InfoNCE
\(\mathcal{L}_{q}=-\log \frac{\exp \left(q \cdot k_{+} / \tau\right)}{\sum_{i=0}^{K} \exp \left(q \cdot k_{i} / \tau\right)}\).
- \(\tau\) : temperature
\(\rightarrow\) sum over one positive & K negatives
(= log loss of \((K+1)\) way softmax classifier , that tries to classify \(q\) as \(k_{+}\) )
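A minimal sketch of this loss as a \((K+1)\)-way cross-entropy, assuming PyTorch and tensors `q` (N×C queries), `k_pos` (N×C, one positive key per query) and `k_neg` (K×C, negative keys); the function name and shapes are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, k_neg, tau=0.07):
    """InfoNCE: log loss of a (K+1)-way softmax that classifies q as k_pos.

    q:     (N, C) encoded queries (assumed L2-normalized)
    k_pos: (N, C) positive keys, one per query
    k_neg: (K, C) negative keys (e.g., the dictionary queue)
    """
    # positive logits: (N, 1), one dot product per query
    l_pos = torch.einsum("nc,nc->n", q, k_pos).unsqueeze(-1)
    # negative logits: (N, K), dot products against every negative key
    l_neg = torch.einsum("nc,kc->nk", q, k_neg)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    # the positive key sits at index 0 of every row
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```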
Model Notation
- query : \(q=f_{\mathrm{q}}\left(x^{q}\right)\)
    - \(f_q\) : query encoder network
    - \(x^{q}\) : query sample
- key : \(k=f_{\mathrm{k}}\left(x^{k}\right)\)
    - \(f_k\) : key encoder network
    - \(x^{k}\) : key sample
\(\rightarrow\) networks \(f_{\mathrm{q}}\) and \(f_{\mathrm{k}}\) can be …
- (1) identical
- (2) partially shared
- (3) different
(2) Momentum Contrast
Contrastive Learning
= building a discrete dictionary on high-dim continuous inputs
DYNAMIC dictionary
= keys are RANDOMLY sampled
= key encoder evolves during training
Good features can be learned by….
- (1) a large dictionary that covers a rich set of NEGATIVE samples
- (2) a key encoder that is kept as consistent as possible, despite its evolution
\(\rightarrow\) propose MoCo ( = Momentum Contrast )
a) Dictionary as a queue
maintain dictionary as queue ( = FIFO )
- Current minibatch : enqueued
- Oldest mini-batch : removed
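A minimal sketch of this FIFO update, assuming a pre-allocated feature buffer of shape (C, K) plus a write pointer, in the spirit of the paper's PyTorch-style pseudocode; the names `queue` and `queue_ptr` are illustrative.

```python
import torch

def dequeue_and_enqueue(queue, queue_ptr, keys):
    """Replace the oldest keys in the FIFO dictionary with the current mini-batch.

    queue:     (C, K) buffer holding K encoded keys
    queue_ptr: int, position where the next batch is written
    keys:      (N, C) keys of the current mini-batch (N is assumed to divide K)
    """
    n = keys.shape[0]
    # enqueue the new keys, overwriting the oldest ones ( = dequeue )
    queue[:, queue_ptr:queue_ptr + n] = keys.T
    # advance the pointer, wrapping around the buffer
    return (queue_ptr + n) % queue.shape[1]
```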
b) Momentum update
- naive option : copy the key encoder \(f_k\) from the query encoder \(f_q\) at every step ( ignoring the gradient ) \(\rightarrow\) failure! the rapidly changing encoder reduces the consistency of the key representations
\(\rightarrow\) thus, propose MOMENTUM UPDATE
\(\theta_{\mathrm{k}} \leftarrow m \theta_{\mathrm{k}}+(1-m) \theta_{\mathrm{q}}\).
- \(m \in[0,1)\) : momentum coefficient
- Only the parameters \(\theta_{\mathrm{q}}\) are updated by back-propagation!!
\(\rightarrow\) keys in the queue are encoded by different encoders ( with only small differences )
\(\rightarrow\) a large momentum (e.g., \(m=0.999\), the paper's default) works much better than a smaller value (e.g., \(m=0.9\))
KEY : slowly evolving key encoder
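A minimal sketch of this update, assuming two PyTorch modules `encoder_q` and `encoder_k` with identical architectures; the key encoder receives no gradients and is only nudged toward the query encoder.

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q ; no gradient flows into f_k."""
    for param_q, param_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        param_k.data.mul_(m).add_(param_q.data, alpha=1 - m)
```

In this setup the key encoder would be initialized as a copy of the query encoder and kept out of the optimizer, so only \(\theta_{\mathrm{q}}\) is updated by back-propagation.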
c) Relations to previous mechanisms
MoCo = a general mechanism for using contrastive losses
compare with 2 existing methods
- (1) end-to-end
- (2) memory bank

(3) Pretext Task
use the instance discrimination task ( a query and a key form a positive pair if they are two augmented views of the same image )
4. Pseudocode
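The paper presents its algorithm as PyTorch-like pseudocode; below is a runnable sketch of one training step, assuming `encoder_q` / `encoder_k` are identically-shaped networks, the optimizer is built over `encoder_q.parameters()` only, and `momentum_update` / `dequeue_and_enqueue` from the sketches above are in scope.

```python
import torch
import torch.nn.functional as F

def moco_train_step(x_q, x_k, encoder_q, encoder_k, queue, queue_ptr,
                    optimizer, m=0.999, tau=0.07):
    """One MoCo step: contrastive loss against the queue, SGD on f_q, momentum update of f_k.

    x_q, x_k: two randomly augmented views of the same mini-batch of images
    queue:    (C, K) buffer of keys encoded from previous mini-batches
    """
    q = F.normalize(encoder_q(x_q), dim=1)           # queries: (N, C)
    with torch.no_grad():
        k = F.normalize(encoder_k(x_k), dim=1)       # keys: (N, C), no gradient

    # (K+1)-way classification: the positive key sits at index 0
    l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(-1)      # (N, 1)
    l_neg = torch.einsum("nc,ck->nk", q, queue.detach())      # (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    loss = F.cross_entropy(logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                 # updates theta_q only

    momentum_update(encoder_q, encoder_k, m)         # theta_k <- m*theta_k + (1-m)*theta_q
    queue_ptr = dequeue_and_enqueue(queue, queue_ptr, k)   # FIFO dictionary update
    return loss.item(), queue_ptr
```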

5. Experiment
Dataset :
- ImageNet-1M (IN-1M)
    - ~ 1.28 million images & 1,000 classes ( but does not use the class labels )
    - characteristics : iconic images
- Instagram-1B (IG-1B)
    - ~ 1 billion images from Instagram & ~ 1,500 hashtags
    - characteristics
        - uncurated
        - long-tailed & unbalanced distribution
        - both iconic & scene-level images
Training
- optimizer : SGD ( weight decay = 0.0001 , momentum = 0.9 )
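A short sketch of this optimizer setup in PyTorch; the ResNet-50 backbone and 128-d output follow the paper, but the learning rate here is only illustrative, since it is tuned per dataset and batch size.

```python
import torch
import torchvision

# query encoder (the paper uses a ResNet-50 backbone with a 128-d output head)
encoder_q = torchvision.models.resnet50(num_classes=128)

# SGD with the weight decay and momentum noted above; lr is a placeholder value
optimizer = torch.optim.SGD(encoder_q.parameters(), lr=0.03,
                            weight_decay=1e-4, momentum=0.9)
```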
