Vision-Language Models for Vision Tasks: A Survey

https://arxiv.org/pdf/2304.00685


Contents

  • (3) VLM Foundations
    • VLM Network Architectures
      • For image
      • For text
    • VLM Pretraining Objectives
      • Contrastive objectives
      • Generative objectives
      • Alignment objectives
    • VLM Pretraining Frameworks
    • Evaluation Setups & Downstream Tasks
  • (4) Datasets
    • Pretraining
    • Evaluation


3. VLM Foundations

VLM pre-training

  • Aims to pretrain a VLM to learn image-text correlation
  • For effective zero-shot predictions


Procedure

  • Step 1) Text encoder & Image encoder
    • To extract image and text features
  • Step 2) Pre-training objectives
    • Learns the vision-language correlation
  • Step 3) Evaluation
    • On unseen data in a zero-shot manner


This section: Foundations of VLM pre-training

  • a) Network architectures
    • For extracting image and text features
  • b) Pre-training objectives
    • For modelling vision-language correlation
  • c) Frameworks
    • For VLM pre-training
  • d) Downstream tasks
    • For VLM evaluations


(1) Network Architectures

Notation

  • Dataset \(\mathcal{D}=\left\{x_n^I, x_n^T\right\}_{n=1}^N\)
  • Image encoder \(f_{\theta}\)
    • Image embedding \(z_n^I=f_\theta\left(x_n^I\right)\)
  • Text encoder \(f_{\phi}\)
    • Text embedding \(z_n^T=f_\phi\left(x_n^T\right)\)
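
To make the notation concrete, here is a minimal PyTorch sketch with toy stand-in encoders. All shapes, names, and layers are illustrative assumptions, not the architecture of any particular VLM:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyImageEncoder(nn.Module):
    """Stand-in for the image encoder f_theta (e.g., a ResNet or ViT)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim),
        )
    def forward(self, x_img):  # x_img: (B, 3, H, W)
        return F.normalize(self.net(x_img), dim=-1)  # z^I: (B, dim)

class ToyTextEncoder(nn.Module):
    """Stand-in for the text encoder f_phi (e.g., a Transformer)."""
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
    def forward(self, x_txt):  # x_txt: (B, L) token ids
        return F.normalize(self.emb(x_txt).mean(dim=1), dim=-1)  # z^T: (B, dim)

f_theta, f_phi = ToyImageEncoder(), ToyTextEncoder()
z_I = f_theta(torch.randn(4, 3, 64, 64))      # image embeddings z_n^I
z_T = f_phi(torch.randint(0, 1000, (4, 16)))  # text embeddings z_n^T
```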


P1) For Image

  • (1) CNN-based (e.g., ResNet)

  • (2) Transformer-based (e.g., ViT)


P2) For Text

Most VLM studies (e.g., CLIP):

  • Employ a Transformer (with minor modifications)


(2) Pretraining Objectives

Three categories:

  • (1) Contrastive objectives
  • (2) Generative objectives
  • (3) Alignment objectives


P1) Contrastive Objectives

a) Image CL

  • InfoNCE and its variants
  • \(\mathcal{L}_I^{\text{InfoNCE}}=-\frac{1}{B} \sum_{i=1}^B \log \frac{\exp \left(z_i^I \cdot z_{+}^I / \tau\right)}{\sum_{j=1, j \neq i}^{B+1} \exp \left(z_i^I \cdot z_j^I / \tau\right)}\).
    • \(z_i^I\) : Query embedding
    • \(\left\{z_j^I\right\}_{j=1, j \neq i}^{B+1}\) : Key embeddings
    • \(z_{+}^I\) : The positive key of \(z_i^I\) (the rest are negative keys)
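
Below is a minimal sketch of this loss, assuming a MoCo-style setup in which each query has one positive key (e.g., another augmented view) and a bank of negative keys; all tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def image_infonce(z_q, z_pos, z_neg, tau=0.07):
    """InfoNCE with one positive and K negative keys per query.

    z_q:   (B, d) query embeddings z_i^I
    z_pos: (B, d) positive keys z_+^I
    z_neg: (K, d) negative keys
    """
    z_q, z_pos, z_neg = (F.normalize(t, dim=-1) for t in (z_q, z_pos, z_neg))
    l_pos = (z_q * z_pos).sum(dim=-1, keepdim=True) / tau  # (B, 1)
    l_neg = z_q @ z_neg.t() / tau                          # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1)              # positive sits at index 0
    labels = torch.zeros(len(z_q), dtype=torch.long)       # so the target is class 0
    return F.cross_entropy(logits, labels)
```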


b) Image-Text CL

  • Symmetrical image-text InfoNCE loss

    ( i.e., \(\mathcal{L}_{\text{InfoNCE}}^{\prime}=\mathcal{L}_{I \rightarrow T}+\mathcal{L}_{T \rightarrow I}\) )

    • \(\mathcal{L}_{I \rightarrow T}\) : Contrasts the (query image & text keys)
    • \(\mathcal{L}_{T \rightarrow I}\): Contrasts the (query text & image keys)
  • \(\begin{aligned} & \mathcal{L}_{I \rightarrow T}=-\frac{1}{B} \sum_{i=1}^B \log \frac{\exp \left(z_i^I \cdot z_i^T / \tau\right)}{\sum_{j=1}^B \exp \left(z_i^I \cdot z_j^T / \tau\right)}, \\ & \mathcal{L}_{T \rightarrow I}=-\frac{1}{B} \sum_{i=1}^B \log \frac{\exp \left(z_i^T \cdot z_i^I / \tau\right)}{\sum_{j=1}^B \exp \left(z_i^T \cdot z_j^I / \tau\right)} . \end{aligned}\).
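
A minimal sketch of the symmetric loss above (CLIP-style), assuming a batch of B matched image-text pairs so the diagonal of the similarity matrix holds the positives:

```python
import torch
import torch.nn.functional as F

def image_text_infonce(z_I, z_T, tau=0.07):
    """Symmetric image-text InfoNCE: L_{I->T} + L_{T->I}.

    z_I, z_T: (B, d) embeddings of B matched image-text pairs.
    """
    z_I, z_T = F.normalize(z_I, dim=-1), F.normalize(z_T, dim=-1)
    logits = z_I @ z_T.t() / tau          # (B, B) pairwise similarities
    targets = torch.arange(len(z_I))      # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)      # L_{I->T}: rows as queries
    loss_t2i = F.cross_entropy(logits.t(), targets)  # L_{T->I}: columns as queries
    return loss_i2t + loss_t2i
```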


c) Image-Text-Label CL

  • SupCon + Image-text CL

    (i.e., \(\mathcal{L}_{\text{InfoNCE}}^{ITL}=\mathcal{L}_{I \rightarrow T}^{ITL}+\mathcal{L}_{T \rightarrow I}^{ITL}\) )

  • \(\begin{aligned} & \mathcal{L}_{I \rightarrow T}^{I T L}=-\sum_{i=1}^B \frac{1}{ \mid \mathcal{P}(i) \mid } \sum_{k \in \mathcal{P}(i)} \log \frac{\exp \left(z_i^I \cdot z_k^T / \tau\right)}{\sum_{j=1}^B \exp \left(z_i^I \cdot z_j^T / \tau\right)}, \\ & \mathcal{L}_{T \rightarrow I}^{I T L}=-\sum_{i=1}^B \frac{1}{ \mid \mathcal{P}(i) \mid } \sum_{k \in \mathcal{P}(i)} \log \frac{\exp \left(z_i^T \cdot z_k^I / \tau\right)}{\sum_{j=1}^B \exp \left(z_i^T \cdot z_j^I / \tau\right)}, \end{aligned}\).

    • where \(\mathcal{P}(i)=\left\{k \mid k \in\{1, \ldots, B\},\ y_k=y_i\right\}\) indexes the samples in the batch that share the label \(y_i\)
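
A sketch of this supervised variant in the spirit of UniCL: every sample in the batch sharing the query's label counts as a positive. Tensor names are assumptions:

```python
import torch
import torch.nn.functional as F

def image_text_label_infonce(z_I, z_T, y, tau=0.07):
    """Image-text-label contrastive loss: positives are all k with y_k == y_i.

    z_I, z_T: (B, d) embeddings; y: (B,) integer class labels.
    """
    z_I, z_T = F.normalize(z_I, dim=-1), F.normalize(z_T, dim=-1)
    logits = z_I @ z_T.t() / tau                      # (B, B)
    pos = (y.unsqueeze(0) == y.unsqueeze(1)).float()  # pos[i, k] = 1 iff y_k == y_i
    log_p_i2t = logits.log_softmax(dim=1)
    log_p_t2i = logits.t().log_softmax(dim=1)
    # Average log-probabilities over each positive set P(i), then sum over i.
    l_i2t = -((pos * log_p_i2t).sum(1) / pos.sum(1)).sum()
    l_t2i = -((pos * log_p_t2i).sum(1) / pos.sum(1)).sum()
    return l_i2t + l_t2i
```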


P2) Generative Objectives

a) Masked Image Modeling

  • \(\mathcal{L}_{MIM}=-\frac{1}{B} \sum_{i=1}^B \log f_\theta\left(\bar{x}_i^I \mid \hat{x}_i^I\right)\).
    • \(\bar{x}_i^I\) : Masked patches, \(\hat{x}_i^I\) : Remaining unmasked patches
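
The equation above is a negative log-likelihood; a common concrete instantiation (MAE-style, an assumption here) scores reconstruction with an MSE computed only on the masked patches:

```python
import torch
import torch.nn as nn

def masked_image_modeling_loss(patches, model, mask_ratio=0.75):
    """Mask a random subset of patches and reconstruct them from the rest.

    patches: (B, N, D) flattened image patches
    model:   any module mapping (B, N, D) -> (B, N, D) reconstructions
    """
    mask = torch.rand(patches.shape[:2]) < mask_ratio   # True = masked (x-bar)
    corrupted = patches.clone()
    corrupted[mask] = 0.0                               # zero out masked patches
    recon = model(corrupted)                            # condition on unmasked ones
    return ((recon[mask] - patches[mask]) ** 2).mean()  # loss on masked patches only

# Usage with a toy reconstruction head:
model = nn.Sequential(nn.Linear(48, 128), nn.GELU(), nn.Linear(128, 48))
loss = masked_image_modeling_loss(torch.randn(2, 196, 48), model)
```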


b) Masked Language Modeling

  • \(\mathcal{L}_{MLM}=-\frac{1}{B} \sum_{i=1}^B \log f_\phi\left(\bar{x}_i^T \mid \hat{x}_i^T\right)\).
    • \(\bar{x}_i^T\) : Masked tokens, \(\hat{x}_i^T\) : Remaining unmasked tokens
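
A BERT-style sketch of this objective: replace a random subset of tokens with a [MASK] id and predict the original tokens at those positions (the model and ids are toy assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_language_modeling_loss(tokens, model, mask_id, mask_ratio=0.15):
    """tokens: (B, L) token ids; model: maps (B, L) ids -> (B, L, vocab) logits."""
    mask = torch.rand(tokens.shape) < mask_ratio        # True = masked (x-bar)
    corrupted = tokens.masked_fill(mask, mask_id)
    logits = model(corrupted)                           # condition on unmasked tokens
    return F.cross_entropy(logits[mask], tokens[mask])  # predict masked tokens only

# Usage with a toy model (vocab of 1000, id 999 reserved for [MASK]):
model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000))
loss = masked_language_modeling_loss(torch.randint(0, 999, (2, 32)), model, mask_id=999)
```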


c) Masked Cross-Modal Modeling

  • MCM = MIM + MLM
  • \(\mathcal{L}_{M C M}=-\frac{1}{B} \sum_{i=1}^B\left[\log f_\theta\left(\bar{x}_i^I \mid \hat{x}_i^I, \hat{x}_i^T\right)+\log f_\phi\left(\bar{x}_i^T \mid \hat{x}_i^I, \hat{x}_i^T\right)\right]\).
  • Details)
    • Step 1) Given an “image-text pair”
    • Step 2) Randomly masks
      • a subset of “image” patches
      • a subset of “text” tokens
    • Step 3) Learns to reconstruct them conditioned on ..
      • unmasked “image” patches
      • unmasked “text” tokens
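
A sketch combining the two, assuming a hypothetical `fusion_model` that sees both corrupted modalities and returns per-patch reconstructions plus per-token vocabulary logits:

```python
import torch
import torch.nn.functional as F

def masked_cross_modal_loss(patches, tokens, fusion_model, mask_id,
                            img_mask_ratio=0.75, txt_mask_ratio=0.15):
    """Mask both modalities; reconstruct each conditioned on the unmasked
    patches AND tokens via a joint fusion model (hypothetical interface)."""
    img_mask = torch.rand(patches.shape[:2]) < img_mask_ratio
    txt_mask = torch.rand(tokens.shape) < txt_mask_ratio
    corrupted_img = patches.masked_fill(img_mask.unsqueeze(-1), 0.0)
    corrupted_txt = tokens.masked_fill(txt_mask, mask_id)
    img_recon, txt_logits = fusion_model(corrupted_img, corrupted_txt)
    loss_img = ((img_recon[img_mask] - patches[img_mask]) ** 2).mean()
    loss_txt = F.cross_entropy(txt_logits[txt_mask], tokens[txt_mask])
    return loss_img + loss_txt  # MIM term + MLM term, both cross-modally conditioned
```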


d) Image-to-Text Generation

  • Aims to predict text \(x^T\) autoregressively
    • based on the image paired with \(x^T\)
  • \(\mathcal{L}_{ITG}=-\sum_{l=1}^L \log f_\theta\left(x_l^T \mid x_{<l}^T, z^I\right)\).
    • \(L\) : # of tokens to be predicted
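
A teacher-forcing sketch of this loss; `decoder` is a hypothetical captioning head conditioned on the image embedding:

```python
import torch
import torch.nn.functional as F

def image_to_text_generation_loss(tokens, z_img, decoder):
    """Autoregressive captioning loss L_ITG.

    tokens:  (B, L) caption token ids
    z_img:   (B, d) image embedding z^I
    decoder: hypothetical module mapping (prefix ids, z_img) -> (B, L-1, vocab)
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift for next-token prediction
    logits = decoder(inputs, z_img)                  # step l sees x_{<l} and z^I
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```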


P3) Alignment Objectives

Align the image-text pair via ..

  • (1) (Global) Image-text matching
  • (2) (Local) Region-word matching

in the embedding space.


a) Image-Text Matching

  • Models global correlation
    • (between “images” and “texts”)
  • \(\mathcal{L}_{IT}=-\left[p \log \mathcal{S}\left(z^I, z^T\right)+(1-p) \log \left(1-\mathcal{S}\left(z^I, z^T\right)\right)\right]\) ( binary cross-entropy ).
    • Score function \(\mathcal{S}(\cdot)\) : Measures the alignment probability between the image and text
    • \(p=1\) if the pair is matched and \(p=0\) otherwise
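
A sketch of this objective as binary cross-entropy; `score_head` is a hypothetical module that fuses the two embeddings into a match logit:

```python
import torch
import torch.nn.functional as F

def image_text_matching_loss(z_I, z_T, score_head, p):
    """z_I, z_T: (B, d) embeddings; p: (B,) 1.0 if paired, 0.0 otherwise.

    score_head: hypothetical module mapping (B, 2d) -> (B, 1) match logits,
    i.e., a learned version of the score function S(.).
    """
    logit = score_head(torch.cat([z_I, z_T], dim=-1)).squeeze(-1)  # (B,)
    return F.binary_cross_entropy_with_logits(logit, p)

# Negatives are typically built by re-pairing within the batch, e.g.,
# image_text_matching_loss(z_I, z_T.roll(1, dims=0), head, torch.zeros(B)).
```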


b) Region-Word Matching

  • Models local cross-modal correlation
    • (between “image regions” and “words”)
  • For dense visual recognition tasks
    • e.g., Object detection
  • \(\mathcal{L}_{RW}=-\left[p \log \mathcal{S}^r\left(r^I, w^T\right)+(1-p) \log \left(1-\mathcal{S}^r\left(r^I, w^T\right)\right)\right]\) ( binary cross-entropy ).
    • \(\left(r^I, w^T\right)\) : Region-word pair
    • \(p=1\) if the pair is matched and \(p=0\) otherwise
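
A GLIP-flavored sketch (an assumption, not the survey's exact formulation): score every region-word pair by a dot product and supervise with binary cross-entropy:

```python
import torch
import torch.nn.functional as F

def region_word_matching_loss(r_I, w_T, p, tau=0.07):
    """r_I: (R, d) region embeddings; w_T: (W, d) word embeddings;
    p: (R, W) 1.0 where region r is described by word w, else 0.0."""
    r_I, w_T = F.normalize(r_I, dim=-1), F.normalize(w_T, dim=-1)
    logits = r_I @ w_T.t() / tau  # (R, W) alignment scores, a stand-in for S^r
    return F.binary_cross_entropy_with_logits(logits, p)
```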


(3) VLM Pretraining Frameworks

( Figure in the paper: illustration of the two-tower, two-leg, and one-tower pre-training frameworks )

  • Two-tower framework

    • Images and texts are encoded with two separate encoders
  • Two-leg framework

    • Introduces additional multi-modal fusion layers on top of the two encoders
  • One-tower framework

    • Unifies vision and language learning in a single encoder

      \(\rightarrow\) Aiming to facilitate efficient communication across data modalities
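
A structural sketch of the three frameworks; every module is a toy stand-in and the interfaces are assumptions for illustration:

```python
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    """Separate encoders; the modalities meet only in the loss (e.g., CLIP)."""
    def __init__(self, f_img, f_txt):
        super().__init__()
        self.f_img, self.f_txt = f_img, f_txt
    def forward(self, img, txt):
        return self.f_img(img), self.f_txt(txt)

class TwoLeg(nn.Module):
    """Two encoders plus additional multi-modal fusion layers on top."""
    def __init__(self, f_img, f_txt, fusion):
        super().__init__()
        self.f_img, self.f_txt, self.fusion = f_img, f_txt, fusion
    def forward(self, img, txt):
        z_I, z_T = self.f_img(img), self.f_txt(txt)
        return self.fusion(torch.cat([z_I, z_T], dim=-1))

class OneTower(nn.Module):
    """A single shared encoder over the concatenated token sequences."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
    def forward(self, img_tokens, txt_tokens):
        return self.encoder(torch.cat([img_tokens, txt_tokens], dim=1))
```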


(4) Evaluation Setups and Downstream Tasks

P1) Zero-shot Prediction

A common way of evaluating VLMs’ generalization capability: the pretrained VLM is applied directly to unseen tasks without any task-specific fine-tuning.


a) Image Classification

  • What? Aims to classify images
  • How? By comparing the embeddings of images and texts
    • where “prompt engineering” is often employed to generate task-related prompts like “a photo of a [label].”
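
A minimal zero-shot classification sketch (CLIP-style); `f_theta`, `f_phi`, and `tokenize` stand in for a pretrained VLM's encoders and tokenizer:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(images, class_names, f_theta, f_phi, tokenize):
    """Pick, for each image, the class whose prompt embedding is most similar."""
    prompts = [f"a photo of a {c}." for c in class_names]  # prompt engineering
    z_T = F.normalize(f_phi(tokenize(prompts)), dim=-1)    # (C, d) class embeddings
    z_I = F.normalize(f_theta(images), dim=-1)             # (B, d) image embeddings
    sims = z_I @ z_T.t()                                   # (B, C) cosine similarities
    return sims.argmax(dim=-1)                             # predicted class indices
```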


b) Semantic Segmentation

  • What? Assign a category label to each pixel in images
  • How? By comparing the embeddings of the given image pixels and texts


c) Object Detection

  • What? Localize and classify objects in images
  • How? By comparing the embeddings of the given object proposals and texts


d) Image-Text Retrieval

  • What? Retrieve samples of one modality given cues from the other modality
  • Text-to-image retrieval & Image-to-text retrieval
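
A sketch of the text-to-image direction: rank a gallery of image embeddings by cosine similarity to the query text embedding (all inputs assumed precomputed):

```python
import torch
import torch.nn.functional as F

def text_to_image_retrieval(z_query_T, z_gallery_I, topk=5):
    """z_query_T: (d,) text embedding; z_gallery_I: (N, d) image embeddings."""
    z_q = F.normalize(z_query_T, dim=-1)
    z_g = F.normalize(z_gallery_I, dim=-1)
    sims = z_g @ z_q                # (N,) cosine similarities
    return sims.topk(topk).indices  # indices of the best-matching images
```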


P2) Linear Probing

Freezes the pretrained VLM and trains only a linear classifier on top of the frozen features, measuring how good (linearly separable) the learned representations are.
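
A sketch of this standard protocol, assuming a frozen pretrained image encoder and a generic `(images, labels)` data loader:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(encoder, loader, feat_dim, num_classes, epochs=10):
    """Train only a linear head on frozen features to assess representation quality."""
    encoder.eval()                       # backbone stays frozen
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = encoder(images)  # frozen features z^I
            loss = F.cross_entropy(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```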


4. Datasets

(1) Pretraining

( See the paper’s summary of widely used image-text pre-training datasets )


(2) Evaluation

( See the paper’s summary of evaluation datasets for the downstream tasks above )