Vision-Language Models for Vision Tasks: A Survey

https://arxiv.org/pdf/2304.00685


Contents

  • (3) VLM Foundations
    • VLM Network Architectures
      • For image
      • For text
    • VLM Pretraining Objectives
      • Contrastive objectives
      • Generative objectives
      • Alignment objectives
    • VLM Pretraining Frameworks
    • Evaluation Setups & Downstream Tasks
  • (4) Datasets
    • Pretraining
    • Evaluation


3. VLM Foundations

VLM pre-training

  • Aims to pretrain a VLM to learn image-text correlation
  • For effective zero-shot predictions


Procedure

  • Step 1) Text encoder & Image encoder
    • To extract image and text features
  • Step 2) Pre-training objectives
    • Learns the vision-language correlation
  • Step 3) Evaluation
    • On unseen data in a zero-shot manner


This section: Foundations of VLM pre-training

  • a) Network architectures
    • For extracting image and text features
  • b) Pre-training objectives
    • For modelling vision-language correlation
  • c) Frameworks
    • For VLM pre-training
  • d) Downstream tasks
    • For VLM evaluations


(1) Network Architectures

Notation

  • Dataset \(\mathcal{D}=\left\{x_n^I, x_n^T\right\}_{n=1}^N\)
  • Image encoder \(f_{\theta}\)
    • Image embedding \(z_n^I=f_\theta\left(x_n^I\right)\)
  • Text encoder \(f_{\phi}\)
    • Text embedding \(z_n^T=f_\phi\left(x_n^T\right)\)
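
To make the notation concrete, here is a minimal PyTorch sketch with toy stand-in encoders. All shapes, names, and layers are illustrative assumptions, not the architecture of any particular VLM:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyImageEncoder(nn.Module):
    """Stand-in for the image encoder f_theta (e.g., a ResNet or ViT)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim),
        )
    def forward(self, x_img):  # x_img: (B, 3, H, W)
        return F.normalize(self.net(x_img), dim=-1)  # z^I: (B, dim)

class ToyTextEncoder(nn.Module):
    """Stand-in for the text encoder f_phi (e.g., a Transformer)."""
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
    def forward(self, x_txt):  # x_txt: (B, L) token ids
        return F.normalize(self.emb(x_txt).mean(dim=1), dim=-1)  # z^T: (B, dim)

f_theta, f_phi = ToyImageEncoder(), ToyTextEncoder()
z_I = f_theta(torch.randn(4, 3, 64, 64))      # image embeddings z_n^I
z_T = f_phi(torch.randint(0, 1000, (4, 16)))  # text embeddings z_n^T
```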


P1) For Image

  • (1) CNN-based (e.g., ResNet)

  • (2) Transformer-based (e.g., ViT)


P2) For Text

Most VLM studies (e.g., CLIP):

  • Employ a Transformer (with minor modifications)


(2) Pretraining Objectives

Three categories:

  • (1) Contrastive objectives
  • (2) Generative objectives
  • (3) Alignment objectives


P1) Contrastive Objectives

a) Image CL

  • InfoNCE and its variants
  • \(\mathcal{L}_I^{\text{InfoNCE}}=-\frac{1}{B} \sum_{i=1}^B \log \frac{\exp \left(z_i^I \cdot z_{+}^I / \tau\right)}{\sum_{j=1, j \neq i}^{B+1} \exp \left(z_i^I \cdot z_j^I / \tau\right)}\).
    • \(z_i^I\) : Query embedding
    • \(\left\{z_j^I\right\}_{j=1, j \neq i}^{B+1}\) : Key embeddings
    • \(z_{+}^I\) : The positive key of \(z_i^I\) (the rest are negative keys)
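
Below is a minimal sketch of this loss, assuming a MoCo-style setup in which each query has one positive key (e.g., another augmented view) and a bank of negative keys; all tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def image_infonce(z_q, z_pos, z_neg, tau=0.07):
    """InfoNCE with one positive and K negative keys per query.

    z_q:   (B, d) query embeddings z_i^I
    z_pos: (B, d) positive keys z_+^I
    z_neg: (K, d) negative keys
    """
    z_q, z_pos, z_neg = (F.normalize(t, dim=-1) for t in (z_q, z_pos, z_neg))
    l_pos = (z_q * z_pos).sum(dim=-1, keepdim=True) / tau  # (B, 1)
    l_neg = z_q @ z_neg.t() / tau                          # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1)              # positive sits at index 0
    labels = torch.zeros(len(z_q), dtype=torch.long)       # so the target is class 0
    return F.cross_entropy(logits, labels)
```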


b) Image-Text CL

  • Symmetrical image-text InfoNCE loss

    ( i.e., \(\mathcal{L}_{\text{InfoNCE}}^{\prime}=\mathcal{L}_{I \rightarrow T}+\mathcal{L}_{T \rightarrow I}\) )

    • \(\mathcal{L}_{I \rightarrow T}\) : Contrasts the (query image & text keys)
    • \(\mathcal{L}_{T \rightarrow I}\): Contrasts the (query text & image keys)
  • \(\begin{aligned} & \mathcal{L}_{I \rightarrow T}=-\frac{1}{B} \sum_{i=1}^B \log \frac{\exp \left(z_i^I \cdot z_i^T / \tau\right)}{\sum_{j=1}^B \exp \left(z_i^I \cdot z_j^T / \tau\right)}, \\ & \mathcal{L}_{T \rightarrow I}=-\frac{1}{B} \sum_{i=1}^B \log \frac{\exp \left(z_i^T \cdot z_i^I / \tau\right)}{\sum_{j=1}^B \exp \left(z_i^T \cdot z_j^I / \tau\right)} . \end{aligned}\).
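
A minimal sketch of the symmetric loss above (CLIP-style), assuming a batch of B matched image-text pairs so the diagonal of the similarity matrix holds the positives:

```python
import torch
import torch.nn.functional as F

def image_text_infonce(z_I, z_T, tau=0.07):
    """Symmetric image-text InfoNCE: L_{I->T} + L_{T->I}.

    z_I, z_T: (B, d) embeddings of B matched image-text pairs.
    """
    z_I, z_T = F.normalize(z_I, dim=-1), F.normalize(z_T, dim=-1)
    logits = z_I @ z_T.t() / tau          # (B, B) pairwise similarities
    targets = torch.arange(len(z_I))      # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)      # L_{I->T}: rows as queries
    loss_t2i = F.cross_entropy(logits.t(), targets)  # L_{T->I}: columns as queries
    return loss_i2t + loss_t2i
```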


c) Image-Text-Label CL

  • SupCon + Image-text CL

    (i.e., \(\mathcal{L}_{\text{InfoNCE}}^{ITL}=\mathcal{L}_{I \rightarrow T}^{ITL}+\mathcal{L}_{T \rightarrow I}^{ITL}\) )

  • \(\begin{aligned} & \mathcal{L}_{I \rightarrow T}^{I T L}=-\sum_{i=1}^B \frac{1}{ \mid \mathcal{P}(i) \mid } \sum_{k \in \mathcal{P}(i)} \log \frac{\exp \left(z_i^I \cdot z_k^T / \tau\right)}{\sum_{j=1}^B \exp \left(z_i^I \cdot z_j^T / \tau\right)}, \\ & \mathcal{L}_{T \rightarrow I}^{I T L}=-\sum_{i=1}^B \frac{1}{ \mid \mathcal{P}(i) \mid } \sum_{k \in \mathcal{P}(i)} \log \frac{\exp \left(z_i^T \cdot z_k^I / \tau\right)}{\sum_{j=1}^B \exp \left(z_i^T \cdot z_j^I / \tau\right)}, \end{aligned}\).

    • where \(\mathcal{P}(i)=\left\{k \mid k \in\{1, \ldots, B\},\ y_k=y_i\right\}\) indexes the samples in the batch that share the label \(y_i\)
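
A sketch of this supervised variant in the spirit of UniCL: every sample in the batch sharing the query's label counts as a positive. Tensor names are assumptions:

```python
import torch
import torch.nn.functional as F

def image_text_label_infonce(z_I, z_T, y, tau=0.07):
    """Image-text-label contrastive loss: positives are all k with y_k == y_i.

    z_I, z_T: (B, d) embeddings; y: (B,) integer class labels.
    """
    z_I, z_T = F.normalize(z_I, dim=-1), F.normalize(z_T, dim=-1)
    logits = z_I @ z_T.t() / tau                      # (B, B)
    pos = (y.unsqueeze(0) == y.unsqueeze(1)).float()  # pos[i, k] = 1 iff y_k == y_i
    log_p_i2t = logits.log_softmax(dim=1)
    log_p_t2i = logits.t().log_softmax(dim=1)
    # Average log-probabilities over each positive set P(i), then sum over i.
    l_i2t = -((pos * log_p_i2t).sum(1) / pos.sum(1)).sum()
    l_t2i = -((pos * log_p_t2i).sum(1) / pos.sum(1)).sum()
    return l_i2t + l_t2i
```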


P2) Generative Objectives

a) Masked Image Modeling

  • \(\mathcal{L}_{MIM}=-\frac{1}{B} \sum_{i=1}^B \log f_\theta\left(\bar{x}_i^I \mid \hat{x}_i^I\right)\).
    • \(\bar{x}_i^I\) : Masked patches, \(\hat{x}_i^I\) : Remaining unmasked patches
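
The equation above is a negative log-likelihood; a common concrete instantiation (MAE-style, an assumption here) scores reconstruction with an MSE computed only on the masked patches:

```python
import torch
import torch.nn as nn

def masked_image_modeling_loss(patches, model, mask_ratio=0.75):
    """Mask a random subset of patches and reconstruct them from the rest.

    patches: (B, N, D) flattened image patches
    model:   any module mapping (B, N, D) -> (B, N, D) reconstructions
    """
    mask = torch.rand(patches.shape[:2]) < mask_ratio   # True = masked (x-bar)
    corrupted = patches.clone()
    corrupted[mask] = 0.0                               # zero out masked patches
    recon = model(corrupted)                            # condition on unmasked ones
    return ((recon[mask] - patches[mask]) ** 2).mean()  # loss on masked patches only

# Usage with a toy reconstruction head:
model = nn.Sequential(nn.Linear(48, 128), nn.GELU(), nn.Linear(128, 48))
loss = masked_image_modeling_loss(torch.randn(2, 196, 48), model)
```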


b) Masked Language Modeling

  • \(\mathcal{L}_{MLM}=-\frac{1}{B} \sum_{i=1}^B \log f_\phi\left(\bar{x}_i^T \mid \hat{x}_i^T\right)\).
    • \(\bar{x}_i^T\) : Masked tokens, \(\hat{x}_i^T\) : Remaining unmasked tokens
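
A BERT-style sketch of this objective: replace a random subset of tokens with a [MASK] id and predict the original tokens at those positions (the model and ids are toy assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_language_modeling_loss(tokens, model, mask_id, mask_ratio=0.15):
    """tokens: (B, L) token ids; model: maps (B, L) ids -> (B, L, vocab) logits."""
    mask = torch.rand(tokens.shape) < mask_ratio        # True = masked (x-bar)
    corrupted = tokens.masked_fill(mask, mask_id)
    logits = model(corrupted)                           # condition on unmasked tokens
    return F.cross_entropy(logits[mask], tokens[mask])  # predict masked tokens only

# Usage with a toy model (vocab of 1000, id 999 reserved for [MASK]):
model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000))
loss = masked_language_modeling_loss(torch.randint(0, 999, (2, 32)), model, mask_id=999)
```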


c) Masked Cross-Modal Modeling

  • MCM = MIM + MLM
  • \(\mathcal{L}_{M C M}=-\frac{1}{B} \sum_{i=1}^B\left[\log f_\theta\left(\bar{x}_i^I \mid \hat{x}_i^I, \hat{x}_i^T\right)+\log f_\phi\left(\bar{x}_i^T \mid \hat{x}_i^I, \hat{x}_i^T\right)\right]\).
  • Details)
    • Step 1) Given an “image-text pair”
    • Step 2) Randomly masks
      • a subset of “image” patches
      • a subset of “text” tokens
    • Step 3) Learns to reconstruct them conditioned on ..
      • unmasked “image” patches
      • unmasked “text” tokens
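
A sketch combining the two, assuming a hypothetical `fusion_model` that sees both corrupted modalities and returns per-patch reconstructions plus per-token vocabulary logits:

```python
import torch
import torch.nn.functional as F

def masked_cross_modal_loss(patches, tokens, fusion_model, mask_id,
                            img_mask_ratio=0.75, txt_mask_ratio=0.15):
    """Mask both modalities; reconstruct each conditioned on the unmasked
    patches AND tokens via a joint fusion model (hypothetical interface)."""
    img_mask = torch.rand(patches.shape[:2]) < img_mask_ratio
    txt_mask = torch.rand(tokens.shape) < txt_mask_ratio
    corrupted_img = patches.masked_fill(img_mask.unsqueeze(-1), 0.0)
    corrupted_txt = tokens.masked_fill(txt_mask, mask_id)
    img_recon, txt_logits = fusion_model(corrupted_img, corrupted_txt)
    loss_img = ((img_recon[img_mask] - patches[img_mask]) ** 2).mean()
    loss_txt = F.cross_entropy(txt_logits[txt_mask], tokens[txt_mask])
    return loss_img + loss_txt  # MIM term + MLM term, both cross-modally conditioned
```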


d) Image-to-Text Generation

  • Aims to predict text \(x^T\) autoregressively
    • based on the image paired with \(x^T\)
  • \(\mathcal{L}_{ITG}=-\sum_{l=1}^L \log f_\theta\left(x_l^T \mid x_{<l}^T, z^I\right)\).
    • \(L\) : # of tokens to be predicted
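
A teacher-forcing sketch of this loss; `decoder` is a hypothetical captioning head conditioned on the image embedding:

```python
import torch
import torch.nn.functional as F

def image_to_text_generation_loss(tokens, z_img, decoder):
    """Autoregressive captioning loss L_ITG.

    tokens:  (B, L) caption token ids
    z_img:   (B, d) image embedding z^I
    decoder: hypothetical module mapping (prefix ids, z_img) -> (B, L-1, vocab)
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift for next-token prediction
    logits = decoder(inputs, z_img)                  # step l sees x_{<l} and z^I
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```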


P3) Alignment Objectives

Align the image-text pair via ..

  • (1) (Global) Image-text matching
  • (2) (Local) Region-word matching

in the embedding space.


a) Image-Text Matching

  • Models global correlation
    • (between “images” and “texts”)
  • \(\mathcal{L}_{IT}=-\left[p \log \mathcal{S}\left(z^I, z^T\right)+(1-p) \log \left(1-\mathcal{S}\left(z^I, z^T\right)\right)\right]\) ( binary cross-entropy ).
    • Score function \(\mathcal{S}(\cdot)\) : Measures the alignment probability between the image and text
    • \(p=1\) if the pair is matched and \(p=0\) otherwise
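
A sketch of this objective as binary cross-entropy; `score_head` is a hypothetical module that fuses the two embeddings into a match logit:

```python
import torch
import torch.nn.functional as F

def image_text_matching_loss(z_I, z_T, score_head, p):
    """z_I, z_T: (B, d) embeddings; p: (B,) 1.0 if paired, 0.0 otherwise.

    score_head: hypothetical module mapping (B, 2d) -> (B, 1) match logits,
    i.e., a learned version of the score function S(.).
    """
    logit = score_head(torch.cat([z_I, z_T], dim=-1)).squeeze(-1)  # (B,)
    return F.binary_cross_entropy_with_logits(logit, p)

# Negatives are typically built by re-pairing within the batch, e.g.,
# image_text_matching_loss(z_I, z_T.roll(1, dims=0), head, torch.zeros(B)).
```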


b) Region-Word Matching

  • Models local cross-modal correlation
    • (between “image regions” and “words”)
  • For dense visual recognition tasks
    • e.g., Object detection
  • \(\mathcal{L}_{RW}=-\left[p \log \mathcal{S}^r\left(r^I, w^T\right)+(1-p) \log \left(1-\mathcal{S}^r\left(r^I, w^T\right)\right)\right]\) ( binary cross-entropy ).
    • \(\left(r^I, w^T\right)\) : Region-word pair
    • \(p=1\) if the pair is matched and \(p=0\) otherwise
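
A GLIP-flavored sketch (an assumption, not the survey's exact formulation): score every region-word pair by a dot product and supervise with binary cross-entropy:

```python
import torch
import torch.nn.functional as F

def region_word_matching_loss(r_I, w_T, p, tau=0.07):
    """r_I: (R, d) region embeddings; w_T: (W, d) word embeddings;
    p: (R, W) 1.0 where region r is described by word w, else 0.0."""
    r_I, w_T = F.normalize(r_I, dim=-1), F.normalize(w_T, dim=-1)
    logits = r_I @ w_T.t() / tau  # (R, W) alignment scores, a stand-in for S^r
    return F.binary_cross_entropy_with_logits(logits, p)
```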


(3) VLM Pretraining Frameworks

( Figure in the paper: illustration of the two-tower, two-leg, and one-tower pre-training frameworks )

  • Two-tower framework

    • Images and texts are encoded with two separate encoders
  • Two-leg framework

    • Introduces additional multi-modal fusion layers on top of the two encoders
  • One-tower framework

    • Unifies vision and language learning in a single encoder

      \(\rightarrow\) Aiming to facilitate efficient communication across data modalities
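
A structural sketch of the three frameworks; every module is a toy stand-in and the interfaces are assumptions for illustration:

```python
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    """Separate encoders; the modalities meet only in the loss (e.g., CLIP)."""
    def __init__(self, f_img, f_txt):
        super().__init__()
        self.f_img, self.f_txt = f_img, f_txt
    def forward(self, img, txt):
        return self.f_img(img), self.f_txt(txt)

class TwoLeg(nn.Module):
    """Two encoders plus additional multi-modal fusion layers on top."""
    def __init__(self, f_img, f_txt, fusion):
        super().__init__()
        self.f_img, self.f_txt, self.fusion = f_img, f_txt, fusion
    def forward(self, img, txt):
        z_I, z_T = self.f_img(img), self.f_txt(txt)
        return self.fusion(torch.cat([z_I, z_T], dim=-1))

class OneTower(nn.Module):
    """A single shared encoder over the concatenated token sequences."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
    def forward(self, img_tokens, txt_tokens):
        return self.encoder(torch.cat([img_tokens, txt_tokens], dim=1))
```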


(4) Evaluation Setups and Downstream Tasks

P1) Zero-shot Prediction

A common way of evaluating VLMs’ generalization capability: the pretrained VLM is applied directly to unseen tasks without any task-specific fine-tuning.


a) Image Classification

  • What? Aims to classify images
  • How? By comparing the embeddings of images and texts
    • where “prompt engineering” is often employed to generate task-related prompts like “a photo of a [label].”
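
A minimal zero-shot classification sketch (CLIP-style); `f_theta`, `f_phi`, and `tokenize` stand in for a pretrained VLM's encoders and tokenizer:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(images, class_names, f_theta, f_phi, tokenize):
    """Pick, for each image, the class whose prompt embedding is most similar."""
    prompts = [f"a photo of a {c}." for c in class_names]  # prompt engineering
    z_T = F.normalize(f_phi(tokenize(prompts)), dim=-1)    # (C, d) class embeddings
    z_I = F.normalize(f_theta(images), dim=-1)             # (B, d) image embeddings
    sims = z_I @ z_T.t()                                   # (B, C) cosine similarities
    return sims.argmax(dim=-1)                             # predicted class indices
```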


b) Semantic Segmentation

  • What? Assign a category label to each pixel in images
  • How? By comparing the embeddings of the given image pixels and texts


c) Object Detection

  • What? Localize and classify objects in images
  • How? By comparing the embeddings of the given object proposals and texts


d) Image-Text Retrieval

  • What? Retrieve samples of one modality given cues from the other modality
  • Text-to-image retrieval & Image-to-text retrieval
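
A sketch of the text-to-image direction: rank a gallery of image embeddings by cosine similarity to the query text embedding (all inputs assumed precomputed):

```python
import torch
import torch.nn.functional as F

def text_to_image_retrieval(z_query_T, z_gallery_I, topk=5):
    """z_query_T: (d,) text embedding; z_gallery_I: (N, d) image embeddings."""
    z_q = F.normalize(z_query_T, dim=-1)
    z_g = F.normalize(z_gallery_I, dim=-1)
    sims = z_g @ z_q                # (N,) cosine similarities
    return sims.topk(topk).indices  # indices of the best-matching images
```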


P2) Linear Probing

Freezes the pretrained VLM and trains only a linear classifier on top of the frozen features, measuring how good (linearly separable) the learned representations are.
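
A sketch of this standard protocol, assuming a frozen pretrained image encoder and a generic `(images, labels)` data loader:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(encoder, loader, feat_dim, num_classes, epochs=10):
    """Train only a linear head on frozen features to assess representation quality."""
    encoder.eval()                       # backbone stays frozen
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = encoder(images)  # frozen features z^I
            loss = F.cross_entropy(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```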


4. Datasets

(1) Pretraining

( See the paper’s summary of widely used image-text pre-training datasets )


(2) Evaluation

( See the paper’s summary of evaluation datasets for the downstream tasks above )