CoCa: Contrastive Captioners are Image-Text Foundation Models

https://arxiv.org/pdf/2205.01917


1. Abstract

Contrastive Captioner (CoCa)

  • Minimalist design to pretrain an image-text encoder-decoder foundation model
  • Joint training:
    • (1) Contrastive loss (as in CLIP)
    • (2) Captioning loss (as in SimVLM)
  • CoCa vs. Standard
    • Standard encoder-decoder transformers: All decoder layers attend to encoder outputs
    • CoCa: Omits cross-attention in the first half of decoder layers to encode unimodal text representations
  • Pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text


[Figure 2]


2. CoCa (Contrastive Captioner)

Review of 3 foundation model families

(that utilize natural language supervision differently)

  • (1) Single-encoder classification pretraining
  • (2) Dual-encoder contrastive learning
  • (3) Encoder-decoder image captioning


Contrastive Captioners (CoCa)

  • Both contrastive learning & image-to-caption generation
  • Simple architecture


(1) Natural Language Supervision

a) Single-encoder classification pretraining

\(\mathcal{L}_{\mathrm{Cls}}=-p(y) \log q_\theta(x)\).
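
A minimal PyTorch sketch of this objective; the feature size, class count, and linear head below are illustrative placeholders, not the paper's setup:

```python
import torch
import torch.nn.functional as F

# Pooled image features from some single encoder, plus ground-truth class ids.
image_features = torch.randn(8, 512)       # (batch, feature dim), illustrative sizes
labels = torch.randint(0, 1000, (8,))      # integer class labels

classifier = torch.nn.Linear(512, 1000)    # classification head on top of the encoder
logits = classifier(image_features)

# L_Cls with one-hot p(y) reduces to standard softmax cross-entropy.
loss_cls = F.cross_entropy(logits, labels)
```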


b) Dual-encoder contrastive learning

\(\mathcal{L}_{\mathrm{Con}}=-\frac{1}{N}\Biggl(\underbrace{\sum_{i=1}^{N} \log \frac{\exp \left(x_i^{\top} y_i / \sigma\right)}{\sum_{j=1}^{N} \exp \left(x_i^{\top} y_j / \sigma\right)}}_{\text {image-to-text }}+\underbrace{\sum_{i=1}^{N} \log \frac{\exp \left(y_i^{\top} x_i / \sigma\right)}{\sum_{j=1}^{N} \exp \left(y_i^{\top} x_j / \sigma\right)}}_{\text {text-to-image }}\Biggr)\).
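
A PyTorch sketch of this symmetric objective, assuming x and y are the batch's image and text embeddings (the function name and temperature default are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(x, y, sigma=0.07):
    """Symmetric contrastive loss over N paired embeddings.

    x: (N, D) image embeddings, y: (N, D) text embeddings; the i-th pair is the
    positive, all other in-batch pairs are negatives; sigma is the temperature.
    """
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / sigma                        # (N, N) similarity matrix
    targets = torch.arange(x.size(0), device=x.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image-to-text term (1/N average built in)
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text-to-image term
    return loss_i2t + loss_t2i
```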


c) Encoder-decoder image captioning

\(\mathcal{L}_{\text {Cap }}=-\sum_{t=1}^T \log P_\theta\left(y_t \mid y_{<t}, x\right)\).
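
A sketch of this teacher-forced objective with placeholder logits and targets (shapes and vocabulary size are illustrative):

```python
import torch
import torch.nn.functional as F

B, T, V = 8, 32, 64000                     # batch, caption length, vocab size (illustrative)
logits = torch.randn(B, T, V)              # decoder outputs for P_theta(y_t | y_<t, x)
targets = torch.randint(0, V, (B, T))      # ground-truth caption tokens

# Token-level cross-entropy: -log P(y_t | y_<t, x), averaged over all positions here.
loss_cap = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))
```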


(2) Contrastive Captioners Pretraining

[Figure 2]


Contrastive captioner (CoCa)

  • A simple encoder-decoder approach
  • Combines the (above) 3 training paradigms


Details

  • (First half) Omits cross-attention in the first half of the decoder layers

    \(\rightarrow\) To encode unimodal text representations

  • (Second half) Cascades the remaining decoder layers

    \(\rightarrow\) Cross-attending to the image encoder for multimodal image-text representations.

\(\rightarrow\) CoCa decoder simultaneously produces both unimodal & multimodal text representations!


Loss function:

  • \(\mathcal{L}_{\mathrm{CoCa}}=\lambda_{\mathrm{Con}} \cdot \mathcal{L}_{\mathrm{Con}}+\lambda_{\mathrm{Cap}} \cdot \mathcal{L}_{\mathrm{Cap}}\).
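
Putting the two objectives together; the values below are placeholders standing in for the losses sketched earlier and for the paper's loss weights:

```python
import torch

# Stand-ins for the contrastive and captioning losses computed as sketched above.
loss_con = torch.tensor(1.3)
loss_cap = torch.tensor(2.1)

lambda_con, lambda_cap = 1.0, 2.0          # illustrative weights
loss_coca = lambda_con * loss_con + lambda_cap * loss_cap
```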


a) Decoupled Text Decoder & CoCa Architecture

Captioning approach

  • Optimizes the conditional likelihood of text

Contrastive approach

  • Uses an unconditional text representation

\(\rightarrow\) How to combine these two?


Solution: Propose a simple “decoupled decoder” design

  • How? Split the decoder into unimodal and multimodal components
    • By skipping the cross-attention mechanism in the unimodal decoder layers


Split the decoder into two parts! (A minimal code sketch follows the list below.)

  • (1) Bottom \(n_{\text {uni }}\) unimodal decoder layers:
    • Encode the input text as latent vectors with causally-masked self-attention
  • (2) Top \(n_{\text {multi }}\) multimodal layers:
    • Apply causally-masked self-attention & cross-attention to the output of the visual encoder
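
A minimal PyTorch sketch of this decoupled design using standard transformer layers; the dimensions, layer counts, and the choice of nn.TransformerEncoderLayer / nn.TransformerDecoderLayer are assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class DecoupledTextDecoder(nn.Module):
    """Bottom n_uni layers: causally-masked self-attention only (unimodal).
    Top n_multi layers: self-attention + cross-attention to image tokens (multimodal)."""

    def __init__(self, dim=512, n_heads=8, n_uni=6, n_multi=6):
        super().__init__()
        # Unimodal layers: no cross-attention block at all.
        self.uni_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
             for _ in range(n_uni)])
        # Multimodal layers: cross-attend to the visual encoder outputs.
        self.multi_layers = nn.ModuleList(
            [nn.TransformerDecoderLayer(dim, n_heads, batch_first=True)
             for _ in range(n_multi)])

    def forward(self, text_emb, image_tokens):
        # text_emb: (B, T, dim) embedded caption tokens; image_tokens: (B, M, dim).
        T = text_emb.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(text_emb.device)

        h = text_emb
        for layer in self.uni_layers:                  # bottom half: unimodal
            h = layer(h, src_mask=causal)
        unimodal = h                                   # e.g. a [CLS]-style token feeds the contrastive loss

        for layer in self.multi_layers:                # top half: multimodal
            h = layer(h, memory=image_tokens, tgt_mask=causal)
        multimodal = h                                 # feeds the captioning loss
        return unimodal, multimodal

# Illustrative usage with random tensors.
decoder = DecoupledTextDecoder()
uni, multi = decoder(torch.randn(2, 16, 512), torch.randn(2, 256, 512))
```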


[Figure 2]


b) Attentional Poolers

Two types of losses & embeddings

  • (1) Contrastive loss: Uses a single embedding for each image!

  • (2) Captioning loss: Decoder usually attends to a sequence of image output tokens in an encoder-decoder captioner


Single & Multiple embeddings

  • (Single) Pooled image embedding
    • Helps visual recognition tasks as a global representation
  • (Multiple) More visual tokens (thus more fine-grained)
    • Beneficial for multimodal understanding tasks which require region-level features


Task-specific attentional pooling (for global representation)

  • To be used for different types of training objectives and downstream tasks
  • Pooler = Single multi-head attention layer
    • (Q) \(n_{\text {query }}\) learnable queries
    • (K,V) Encoder outputs

\(\rightarrow\) Can learn to pool embeddings with different lengths for the two training objectives
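
A sketch of such a pooler in PyTorch; the query counts in the usage lines follow the single-vs-many intuition above, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

class AttentionalPooler(nn.Module):
    """n_query learnable queries attend over the encoder's output tokens
    via a single multi-head attention layer."""

    def __init__(self, dim=512, n_heads=8, n_query=1):
        super().__init__()
        self.query = nn.Parameter(torch.randn(n_query, dim))   # (Q) learnable queries
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, encoder_tokens):
        # encoder_tokens: (B, M, dim) image tokens used as keys and values.
        B = encoder_tokens.size(0)
        q = self.query.unsqueeze(0).expand(B, -1, -1)           # (B, n_query, dim)
        pooled, _ = self.attn(q, encoder_tokens, encoder_tokens)
        return pooled                                            # (B, n_query, dim)

# E.g. one query for a global contrastive embedding, many queries for
# finer-grained tokens consumed by the multimodal decoder.
image_tokens = torch.randn(4, 196, 512)
contrastive_emb = AttentionalPooler(n_query=1)(image_tokens)      # (4, 1, 512)
generative_tokens = AttentionalPooler(n_query=256)(image_tokens)  # (4, 256, 512)
```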


(3) CoCa for Downstream Tasks

a) Zero-shot Transfer

Leverage both image and text inputs (a zero-shot classification sketch follows the task list below)

Tasks

  • Zero-shot image classification
  • Zero-shot image-text cross-retrieval
  • Zero-shot video-text cross-retrieval
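
A hypothetical sketch of zero-shot classification on top of the pretrained contrastive embeddings; `text_encoder` is a placeholder callable mapping prompt strings to unimodal text embeddings (not an actual CoCa API), and the prompt template is illustrative:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, class_names, text_encoder,
                       template="a photo of a {}."):
    """image_emb: (B, D) contrastive image embeddings; returns predicted class ids."""
    prompts = [template.format(name) for name in class_names]
    text_emb = text_encoder(prompts)                 # (num_classes, D), assumed interface
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t()
    return sims.argmax(dim=-1)                       # nearest class name per image
```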


b) Frozen-feature Evaluation

CoCa adopts task-specific attentional pooling (pooler) to customize visual representations for different types of downstream tasks

\(\rightarrow\) Enables the model to obtain strong performance as a frozen encoder where we “only learn a new pooler to aggregate features”
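
A sketch of this frozen-encoder setup, reusing the AttentionalPooler sketch above; the encoder, dimensions, and optimizer settings are placeholders:

```python
import torch
import torch.nn as nn

def build_frozen_probe(pretrained_encoder, dim=512, num_classes=1000):
    # Freeze every pretrained encoder parameter; only the new pooler + head train.
    for p in pretrained_encoder.parameters():
        p.requires_grad = False
    pooler = AttentionalPooler(dim=dim, n_query=1)   # new task-specific pooler
    head = nn.Linear(dim, num_classes)               # linear classifier on the pooled feature
    params = list(pooler.parameters()) + list(head.parameters())
    optimizer = torch.optim.AdamW(params, lr=1e-3)   # illustrative optimizer settings
    return pooler, head, optimizer
```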


c) CoCa for Video Action Recognition

[Figure 2]

Input = Multiple frames of a video

Process (sketched in code after the steps below)

  • Step 1) Feed each frame into the shared image encoder individually

  • Step 2) (For frozen feature evaluation or finetuning) Learn an additional pooler on top of the spatial and temporal feature tokens with a softmax cross-entropy loss

    • Note: Pooler has a single query token

      \(\rightarrow\) Computation of pooling over all spatial and temporal tokens is not expensive!
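
A sketch of this process; `image_encoder`, `pooler` (single-query, as sketched above), and `classifier` are placeholder modules, and the tensor shapes are illustrative:

```python
import torch

def video_action_logits(frames, image_encoder, pooler, classifier):
    """frames: (B, num_frames, C, H, W) -> (B, num_actions) logits."""
    B, F_, C, H, W = frames.shape
    # Step 1: run the shared image encoder on every frame independently.
    tokens = image_encoder(frames.reshape(B * F_, C, H, W))   # (B*F, M, D) per-frame tokens
    M, D = tokens.shape[1], tokens.shape[2]
    # Flatten spatial tokens across frames into one spatio-temporal sequence.
    tokens = tokens.reshape(B, F_ * M, D)
    # Step 2: a single learnable query pools over all spatial and temporal tokens.
    pooled = pooler(tokens).squeeze(1)                         # (B, D)
    return classifier(pooled)                                  # trained with softmax cross-entropy
```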