CoCa: Contrastive Captioners are Image-Text Foundation Models

https://arxiv.org/pdf/2205.01917


1. Abstract

Contrastive Captioner (CoCa)

  • Minimalist design to pretrain an image-text encoder-decoder foundation model
  • Joint training:
    • (1) Contrastive loss (as in CLIP)
    • (2) Captioning loss (as in SimVLM)
  • CoCa vs. Standard
    • Standard encoder-decoder transformers: All decoder layers attend to encoder outputs
    • CoCa: Omits cross-attention in the first half of decoder layers to encode unimodal text representations
  • Pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text


[Figure 2]


2. CoCa (Contrastive Captioner)

Review of 3 foundation model families

(that utilize natural language supervision differently)

  • (1) Single-encoder classification pretraining
  • (2) Dual-encoder contrastive learning
  • (3) Encoder-decoder image captioning


Contrastive Captioners (CoCa)

  • Both contrastive learning & image-to-caption generation
  • Simple architecture


(1) Natural Language Supervision

a) Single-encoder classification pretraining

\(\mathcal{L}_{\mathrm{Cls}}=-p(y) \log q_\theta(x)\).
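
A minimal PyTorch sketch of this objective; the feature size, class count, and linear head below are illustrative placeholders, not the paper's setup:

```python
import torch
import torch.nn.functional as F

# Pooled image features from some single encoder, plus ground-truth class ids.
image_features = torch.randn(8, 512)       # (batch, feature dim), illustrative sizes
labels = torch.randint(0, 1000, (8,))      # integer class labels

classifier = torch.nn.Linear(512, 1000)    # classification head on top of the encoder
logits = classifier(image_features)

# L_Cls with one-hot p(y) reduces to standard softmax cross-entropy.
loss_cls = F.cross_entropy(logits, labels)
```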


b) Dual-encoder contrastive learning

\(\mathcal{L}_{\mathrm{Con}}=-\frac{1}{N}\Biggl(\underbrace{\sum_{i=1}^{N} \log \frac{\exp \left(x_i^{\top} y_i / \sigma\right)}{\sum_{j=1}^{N} \exp \left(x_i^{\top} y_j / \sigma\right)}}_{\text {image-to-text }}+\underbrace{\sum_{i=1}^{N} \log \frac{\exp \left(y_i^{\top} x_i / \sigma\right)}{\sum_{j=1}^{N} \exp \left(y_i^{\top} x_j / \sigma\right)}}_{\text {text-to-image }}\Biggr)\).
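
A PyTorch sketch of this symmetric objective, assuming x and y are the batch's image and text embeddings (the function name and temperature default are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(x, y, sigma=0.07):
    """Symmetric contrastive loss over N paired embeddings.

    x: (N, D) image embeddings, y: (N, D) text embeddings; the i-th pair is the
    positive, all other in-batch pairs are negatives; sigma is the temperature.
    """
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / sigma                        # (N, N) similarity matrix
    targets = torch.arange(x.size(0), device=x.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image-to-text term (1/N average built in)
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text-to-image term
    return loss_i2t + loss_t2i
```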


c) Encoder-decoder image captioning

\(\mathcal{L}_{\text {Cap }}=-\sum_{t=1}^T \log P_\theta\left(y_t \mid y_{<t}, x\right)\).
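
A sketch of this teacher-forced objective with placeholder logits and targets (shapes and vocabulary size are illustrative):

```python
import torch
import torch.nn.functional as F

B, T, V = 8, 32, 64000                     # batch, caption length, vocab size (illustrative)
logits = torch.randn(B, T, V)              # decoder outputs for P_theta(y_t | y_<t, x)
targets = torch.randint(0, V, (B, T))      # ground-truth caption tokens

# Token-level cross-entropy: -log P(y_t | y_<t, x), averaged over all positions here.
loss_cap = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))
```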


(2) Contrastive Captioners Pretraining

[Figure 2]


Contrastive captioner (CoCa)

  • A simple encoder-decoder approach
  • Combines the (above) 3 training paradigms


Details

  • (First half) Omits cross-attention in the first half of the decoder layers

    \(\rightarrow\) To encode unimodal text representations

  • (Second half) Cascades the remaining decoder layers

    \(\rightarrow\) Cross-attending to the image encoder for multimodal image-text representations.

\(\rightarrow\) CoCa decoder simultaneously produces both unimodal & multimodal text representations!


Loss function:

  • \(\mathcal{L}_{\mathrm{CoCa}}=\lambda_{\mathrm{Con}} \cdot \mathcal{L}_{\mathrm{Con}}+\lambda_{\mathrm{Cap}} \cdot \mathcal{L}_{\mathrm{Cap}}\).
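
Putting the two objectives together; the values below are placeholders standing in for the losses sketched earlier and for the paper's loss weights:

```python
import torch

# Stand-ins for the contrastive and captioning losses computed as sketched above.
loss_con = torch.tensor(1.3)
loss_cap = torch.tensor(2.1)

lambda_con, lambda_cap = 1.0, 2.0          # illustrative weights
loss_coca = lambda_con * loss_con + lambda_cap * loss_cap
```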


a) Decoupled Text Decoder & CoCa Architecture

Captioning approach

  • Optimizes the conditional likelihood of text

Contrastive approach

  • Uses an unconditional text representation

\(\rightarrow\) How to combine these two?


Solution: Propose a simple “decoupled decoder” design

  • How? Split the decoder into unimodal and multimodal components
    • By skipping the cross-attention mechanism in the unimodal decoder layers


Split the decoder into two parts! (A minimal code sketch follows the list below.)

  • (1) Bottom \(n_{\text {uni }}\) unimodal decoder layers:
    • Encode the input text as latent vectors with causally-masked self-attention
  • (2) Top \(n_{\text {multi }}\) multimodal layers:
    • Apply causally-masked self-attention & cross-attention to the output of the visual encoder
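
A minimal PyTorch sketch of this decoupled design using standard transformer layers; the dimensions, layer counts, and the choice of nn.TransformerEncoderLayer / nn.TransformerDecoderLayer are assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class DecoupledTextDecoder(nn.Module):
    """Bottom n_uni layers: causally-masked self-attention only (unimodal).
    Top n_multi layers: self-attention + cross-attention to image tokens (multimodal)."""

    def __init__(self, dim=512, n_heads=8, n_uni=6, n_multi=6):
        super().__init__()
        # Unimodal layers: no cross-attention block at all.
        self.uni_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
             for _ in range(n_uni)])
        # Multimodal layers: cross-attend to the visual encoder outputs.
        self.multi_layers = nn.ModuleList(
            [nn.TransformerDecoderLayer(dim, n_heads, batch_first=True)
             for _ in range(n_multi)])

    def forward(self, text_emb, image_tokens):
        # text_emb: (B, T, dim) embedded caption tokens; image_tokens: (B, M, dim).
        T = text_emb.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(text_emb.device)

        h = text_emb
        for layer in self.uni_layers:                  # bottom half: unimodal
            h = layer(h, src_mask=causal)
        unimodal = h                                   # e.g. a [CLS]-style token feeds the contrastive loss

        for layer in self.multi_layers:                # top half: multimodal
            h = layer(h, memory=image_tokens, tgt_mask=causal)
        multimodal = h                                 # feeds the captioning loss
        return unimodal, multimodal

# Illustrative usage with random tensors.
decoder = DecoupledTextDecoder()
uni, multi = decoder(torch.randn(2, 16, 512), torch.randn(2, 256, 512))
```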


[Figure 2]


b) Attentional Poolers

Two types of losses & embeddings

  • (1) Contrastive loss: Uses a single embedding for each image!

  • (2) Captioning loss: Decoder usually attends to a sequence of image output tokens in an encoder-decoder captioner


Single & Multiple embeddings

  • (Single) Pooled image embedding
    • Helps visual recognition tasks as a global representation
  • (Multiple) More visual tokens (thus more fine-grained)
    • Beneficial for multimodal understanding tasks which require region-level features


Task-specific attentional pooling (for global representation)

  • To be used for different types of training objectives and downstream tasks
  • Pooler = Single multi-head attention layer
    • (Q) \(n_{\text {query }}\) learnable queries
    • (K,V) Encoder outputs

\(\rightarrow\) Can learn to pool embeddings with different lengths for the two training objectives
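
A sketch of such a pooler in PyTorch; the query counts in the usage lines follow the single-vs-many intuition above, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

class AttentionalPooler(nn.Module):
    """n_query learnable queries attend over the encoder's output tokens
    via a single multi-head attention layer."""

    def __init__(self, dim=512, n_heads=8, n_query=1):
        super().__init__()
        self.query = nn.Parameter(torch.randn(n_query, dim))   # (Q) learnable queries
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, encoder_tokens):
        # encoder_tokens: (B, M, dim) image tokens used as keys and values.
        B = encoder_tokens.size(0)
        q = self.query.unsqueeze(0).expand(B, -1, -1)           # (B, n_query, dim)
        pooled, _ = self.attn(q, encoder_tokens, encoder_tokens)
        return pooled                                            # (B, n_query, dim)

# E.g. one query for a global contrastive embedding, many queries for
# finer-grained tokens consumed by the multimodal decoder.
image_tokens = torch.randn(4, 196, 512)
contrastive_emb = AttentionalPooler(n_query=1)(image_tokens)      # (4, 1, 512)
generative_tokens = AttentionalPooler(n_query=256)(image_tokens)  # (4, 256, 512)
```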


(3) CoCa for Downstream Tasks

a) Zero-shot Transfer

Leverage both image and text inputs (a zero-shot classification sketch follows the task list below)

Tasks

  • Zero-shot image classification
  • Zero-shot image-text cross-retrieval
  • Zero-shot video-text cross-retrieval
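
A hypothetical sketch of zero-shot classification on top of the pretrained contrastive embeddings; `text_encoder` is a placeholder callable mapping prompt strings to unimodal text embeddings (not an actual CoCa API), and the prompt template is illustrative:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, class_names, text_encoder,
                       template="a photo of a {}."):
    """image_emb: (B, D) contrastive image embeddings; returns predicted class ids."""
    prompts = [template.format(name) for name in class_names]
    text_emb = text_encoder(prompts)                 # (num_classes, D), assumed interface
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t()
    return sims.argmax(dim=-1)                       # nearest class name per image
```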


b) Frozen-feature Evaluation

CoCa adopts task-specific attentional pooling (pooler) to customize visual representations for different types of downstream tasks

\(\rightarrow\) Enables the model to obtain strong performance as a frozen encoder where we “only learn a new pooler to aggregate features”
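
A sketch of this frozen-encoder setup, reusing the AttentionalPooler sketch above; the encoder, dimensions, and optimizer settings are placeholders:

```python
import torch
import torch.nn as nn

def build_frozen_probe(pretrained_encoder, dim=512, num_classes=1000):
    # Freeze every pretrained encoder parameter; only the new pooler + head train.
    for p in pretrained_encoder.parameters():
        p.requires_grad = False
    pooler = AttentionalPooler(dim=dim, n_query=1)   # new task-specific pooler
    head = nn.Linear(dim, num_classes)               # linear classifier on the pooled feature
    params = list(pooler.parameters()) + list(head.parameters())
    optimizer = torch.optim.AdamW(params, lr=1e-3)   # illustrative optimizer settings
    return pooler, head, optimizer
```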


c) CoCa for Video Action Recognition

[Figure 2]

Input = Multiple frames of a video

Process (sketched in code after the steps below)

  • Step 1) Feed each frame into the shared image encoder individually

  • Step 2) (For frozen feature evaluation or finetuning) Learn an additional pooler on top of the spatial and temporal feature tokens with a softmax cross-entropy loss

    • Note: Pooler has a single query token

      \(\rightarrow\) Computation of pooling over all spatial and temporal tokens is not expensive!
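
A sketch of this process; `image_encoder`, `pooler` (single-query, as sketched above), and `classifier` are placeholder modules, and the tensor shapes are illustrative:

```python
import torch

def video_action_logits(frames, image_encoder, pooler, classifier):
    """frames: (B, num_frames, C, H, W) -> (B, num_actions) logits."""
    B, F_, C, H, W = frames.shape
    # Step 1: run the shared image encoder on every frame independently.
    tokens = image_encoder(frames.reshape(B * F_, C, H, W))   # (B*F, M, D) per-frame tokens
    M, D = tokens.shape[1], tokens.shape[2]
    # Flatten spatial tokens across frames into one spatio-temporal sequence.
    tokens = tokens.reshape(B, F_ * M, D)
    # Step 2: a single learnable query pools over all spatial and temporal tokens.
    pooled = pooler(tokens).squeeze(1)                         # (B, D)
    return classifier(pooled)                                  # trained with softmax cross-entropy
```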