Vision-Language Models for Vision Tasks: A Survey
https://arxiv.org/pdf/2304.00685
Contents
- (3) VLM Foundations
- VLM Network Architectures
- For image
- For text
- VLM Pretraining Objectives
- Contrastive objectives
- Generative objectives
- Alignment objectives
- VLM Pretraining Frameworks
- Evaluation Setups & Downstream Tasks
- (4) Datasets
- Pretraining
- Evaluation
3. VLM Foundations
VLM pre-training
- Aims to pretrain a VLM to learn image-text correlation
- For effective zero-shot predictions
Procedure
- Step 1) Text encoder & Image encoder
- To extract image and text features
- Step 2) Pre-training objectives
- Learns the vision-language correlation
- Step 3) Evaluation
- On unseen data in a zero-shot manner
This section: Foundations of VLM pre-training
- a) Network architectures
- For extracting image and text features
- b) Pre-training objectives
- For modelling vision-language correlation
- c) Frameworks
- For VLM pre-training
- d) Downstream tasks
- For VLM evaluations
(1) Network Architectures
Notation
- Dataset \(\mathcal{D}=\left\{x_n^I, x_n^T\right\}_{n=1}^N\)
- Image encoder \(f_{\theta}\)
- Image embedding \(z_n^I=f_\theta\left(x_n^I\right)\)
- Text encoder \(f_{\phi}\)
- Text embedding \(z_n^T=f_\phi\left(x_n^T\right)\)
P1) For Image
- (1) CNN-based (e.g., ResNet)
- (2) Transformer-based (e.g., ViT)
P2) For Text
Most VLM studies (e.g., CLIP):
- Employ a Transformer (with minor modifications)
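To make the notation concrete, below is a minimal sketch of a two-encoder setup: a torchvision ResNet-50 standing in for \(f_\theta\) and a small Transformer for \(f_\phi\). The projection dimension, vocabulary size, and "last token as summary" pooling are illustrative assumptions, not the exact design of any specific VLM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class DualEncoder(nn.Module):
    """Hypothetical two-encoder VLM backbone: f_theta for images, f_phi for text."""
    def __init__(self, vocab_size=49408, embed_dim=512, text_width=512):
        super().__init__()
        # Image encoder f_theta: CNN-based (ResNet) here; a ViT would also fit
        self.image_encoder = resnet50(weights=None)
        self.image_encoder.fc = nn.Linear(2048, embed_dim)   # project to the joint space
        # Text encoder f_phi: a plain Transformer (CLIP-style, with minor modifications)
        self.token_embed = nn.Embedding(vocab_size, text_width)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=text_width, nhead=8, batch_first=True),
            num_layers=6)
        self.text_proj = nn.Linear(text_width, embed_dim)

    def forward(self, image, token_ids):
        z_i = F.normalize(self.image_encoder(image), dim=-1)       # z^I
        tokens = self.text_encoder(self.token_embed(token_ids))
        z_t = F.normalize(self.text_proj(tokens[:, -1]), dim=-1)   # z^T (last token as summary)
        return z_i, z_t
```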
(2) Pretraining Objectives
Three categories:
- (1) Contrastive objectives
- (2) Generative objectives
- (3) Alignment objectives
P1) Contrastive Objectives
a) Image CL
- InfoNCE and its variants
- \(\mathcal{L}_I^{\text{InfoNCE}}=-\frac{1}{B} \sum_{i=1}^B \log \frac{\exp \left(z_i^I \cdot z_{+}^I / \tau\right)}{\sum_{j=1, j \neq i}^{B+1} \exp \left(z_i^I \cdot z_j^I / \tau\right)}\).
- \(z_i^I\) : Query embedding
- \(\left\{z_j^I\right\}_{j=1, j \neq i}^{B+1}\) : Key embeddings
- \(z_{+}^I\): \(z_i^I\) ‘s positive key ( rest: negative keys )
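A PyTorch sketch of the image InfoNCE loss above, under the common in-batch setup where `z_key[i]` is the positive key of `z_query[i]` (e.g., another augmented view of the same image) and the remaining keys act as negatives; function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def image_infonce(z_query: torch.Tensor, z_key: torch.Tensor, tau: float = 0.07):
    """Image contrastive (InfoNCE) loss.

    z_query: (B, D) image embeddings of the queries (one augmented view).
    z_key:   (B, D) image embeddings of the keys; z_key[i] is the positive
             key of z_query[i], the remaining keys act as negatives.
    """
    z_query = F.normalize(z_query, dim=-1)
    z_key = F.normalize(z_key, dim=-1)
    logits = z_query @ z_key.t() / tau           # (B, B) similarity matrix
    targets = torch.arange(z_query.size(0), device=z_query.device)
    return F.cross_entropy(logits, targets)      # -log softmax at the positive key
```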
b) Image-Text CL
- Symmetrical image-text InfoNCE loss ( i.e., \(\mathcal{L}_{\text{InfoNCE}}^{\prime}=\mathcal{L}_{I \rightarrow T}+\mathcal{L}_{T \rightarrow I}\) )
- \(\mathcal{L}_{I \rightarrow T}\) : Contrasts the (query image & text keys)
- \(\mathcal{L}_{T \rightarrow I}\): Contrasts the (query text & image keys)
- \(\begin{aligned} & \mathcal{L}_{I \rightarrow T}=-\frac{1}{B} \sum_{i=1}^B \log \frac{\exp \left(z_i^I \cdot z_i^T / \tau\right)}{\sum_{j=1}^B \exp \left(z_i^I \cdot z_j^T / \tau\right)}, \\ & \mathcal{L}_{T \rightarrow I}=-\frac{1}{B} \sum_{i=1}^B \log \frac{\exp \left(z_i^T \cdot z_i^I / \tau\right)}{\sum_{j=1}^B \exp \left(z_i^T \cdot z_j^I / \tau\right)} . \end{aligned}\)
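A minimal PyTorch sketch of the symmetric image-text InfoNCE loss (CLIP-style), assuming `z_img[i]` and `z_txt[i]` form a matched pair; names are illustrative.

```python
import torch
import torch.nn.functional as F

def image_text_infonce(z_img: torch.Tensor, z_txt: torch.Tensor, tau: float = 0.07):
    """Symmetric image-text contrastive loss: L' = L_{I->T} + L_{T->I}.

    z_img, z_txt: (B, D) paired image / text embeddings (row i is a matched pair).
    """
    z_img = F.normalize(z_img, dim=-1)
    z_txt = F.normalize(z_txt, dim=-1)
    logits = z_img @ z_txt.t() / tau                      # (B, B) image-to-text similarities
    targets = torch.arange(z_img.size(0), device=z_img.device)
    loss_i2t = F.cross_entropy(logits, targets)           # contrast query images vs. text keys
    loss_t2i = F.cross_entropy(logits.t(), targets)       # contrast query texts vs. image keys
    return loss_i2t + loss_t2i
```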
c) Image-Text-Label CL
- SupCon + Image-text CL ( i.e., \(\mathcal{L}_{\text{InfoNCE}}^{ITL}=\mathcal{L}_{I \rightarrow T}^{ITL}+\mathcal{L}_{T \rightarrow I}^{ITL}\) )
- \(\begin{aligned} & \mathcal{L}_{I \rightarrow T}^{ITL}=-\sum_{i=1}^B \frac{1}{|\mathcal{P}(i)|} \sum_{k \in \mathcal{P}(i)} \log \frac{\exp \left(z_i^I \cdot z_k^T / \tau\right)}{\sum_{j=1}^B \exp \left(z_i^I \cdot z_j^T / \tau\right)}, \\ & \mathcal{L}_{T \rightarrow I}^{ITL}=-\sum_{i=1}^B \frac{1}{|\mathcal{P}(i)|} \sum_{k \in \mathcal{P}(i)} \log \frac{\exp \left(z_i^T \cdot z_k^I / \tau\right)}{\sum_{j=1}^B \exp \left(z_i^T \cdot z_j^I / \tau\right)} . \end{aligned}\)
- where \(k \in \mathcal{P}(i)=\left\{k \mid k \in B, y_k=y_i\right\}\)
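A hedged sketch of the image-text-label contrastive loss above: \(\mathcal{P}(i)\) gathers every sample in the batch that shares image \(i\)'s class label, and the positive log-probabilities are averaged over \(\mathcal{P}(i)\) before summing over the batch.

```python
import torch
import torch.nn.functional as F

def image_text_label_cl(z_img, z_txt, labels, tau: float = 0.07):
    """Image-text-label contrastive loss (SupCon combined with image-text CL).

    z_img, z_txt: (B, D) paired image / text embeddings.
    labels:       (B,) class labels; P(i) = {k : y_k == y_i} defines the positives.
    """
    z_img = F.normalize(z_img, dim=-1)
    z_txt = F.normalize(z_txt, dim=-1)
    pos_mask = (labels[:, None] == labels[None, :]).float()     # (B, B), 1 where y_k == y_i

    def directional_loss(logits):
        log_prob = F.log_softmax(logits / tau, dim=1)           # log softmax over batch keys
        # average positive log-probabilities over P(i), then sum over i
        return -(pos_mask * log_prob).sum(dim=1).div(pos_mask.sum(dim=1)).sum()

    loss_i2t = directional_loss(z_img @ z_txt.t())              # L^{ITL}_{I->T}
    loss_t2i = directional_loss(z_txt @ z_img.t())              # L^{ITL}_{T->I}
    return loss_i2t + loss_t2i
```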
P2) Generative Objectives
a) Masked Image Modeling
- \(\mathcal{L}_{M I M}=-\frac{1}{B} \sum_{i=1}^B \log f_\theta\left(\bar{x}_i^I \mid \hat{x}_i^I\right)\).
b) Masked Language Modeling
- \(\mathcal{L}_{MLM}=-\frac{1}{B} \sum_{i=1}^B \log f_\phi\left(\bar{x}_i^T \mid \hat{x}_i^T\right)\).
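As one concrete instantiation of the MLM term, the sketch below masks roughly 15% of the tokens and computes the negative log-likelihood only at the masked positions. The `text_encoder` / `lm_head` interfaces, the masking ratio, and the mask token id are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlm_loss(text_encoder: nn.Module, lm_head: nn.Module,
             token_ids: torch.Tensor, mask_prob: float = 0.15, mask_token_id: int = 0):
    """Masked language modeling: reconstruct masked tokens from the unmasked ones.

    token_ids:    (B, L) input text tokens.
    text_encoder: assumed to map (B, L) token ids to (B, L, H) hidden states.
    lm_head:      assumed to map hidden states to vocabulary logits (B, L, V).
    """
    mask = torch.rand_like(token_ids, dtype=torch.float) < mask_prob   # pick ~15% of tokens
    corrupted = token_ids.masked_fill(mask, mask_token_id)             # \hat{x}^T: masked input
    logits = lm_head(text_encoder(corrupted))                          # predict every position
    # negative log-likelihood only on the masked positions (\bar{x}^T)
    return F.cross_entropy(logits[mask], token_ids[mask])
```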
c) Masked Cross-Modal Modeling
- MCM = MIM + MLM
- \(\mathcal{L}_{M C M}=-\frac{1}{B} \sum_{i=1}^B\left[\log f_\theta\left(\bar{x}_i^I \mid \hat{x}_i^I, \hat{x}_i^T\right)+\log f_\phi\left(\bar{x}_i^T \mid \hat{x}_i^I, \hat{x}_i^T\right)\right]\).
- Details)
- Step 1) Given an “image-text pair”
- Step 2) Randomly masks
- a subset of “image” patches
- a subset of “text” tokens
- Step 3) Learns to reconstruct them conditioned on ..
- unmasked “image” patches
- unmasked “text” tokens
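The three steps above, in a hedged sketch: both modalities are masked and an assumed fusion model reconstructs them from the joint unmasked context. Regressing masked patches with MSE is one common choice, not the only one; masked tokens use cross-entropy.

```python
import torch
import torch.nn.functional as F

def mcm_loss(fusion_model, patches, token_ids,
             img_mask_ratio=0.4, txt_mask_prob=0.15, mask_token_id=0):
    """Masked cross-modal modeling = MIM + MLM, conditioned on both modalities.

    patches:   (B, P, D) image patch embeddings; token_ids: (B, L) text tokens.
    fusion_model(img, txt) is assumed to return (patch_preds, token_logits).
    """
    img_mask = torch.rand(patches.shape[:2], device=patches.device) < img_mask_ratio
    txt_mask = torch.rand(token_ids.shape, device=token_ids.device) < txt_mask_prob

    masked_patches = patches.masked_fill(img_mask.unsqueeze(-1), 0.0)    # \hat{x}^I
    masked_tokens = token_ids.masked_fill(txt_mask, mask_token_id)       # \hat{x}^T

    patch_preds, token_logits = fusion_model(masked_patches, masked_tokens)
    mim = F.mse_loss(patch_preds[img_mask], patches[img_mask])           # reconstruct masked patches
    mlm = F.cross_entropy(token_logits[txt_mask], token_ids[txt_mask])   # reconstruct masked tokens
    return mim + mlm
```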
d) Image-to-Text Generation
- Aims to predict text \(x^T\) autoregressively
- based on the image paired with \(x^T\)
- \(\mathcal{L}_{ITG}=-\sum_{l=1}^L \log f_\theta\left(x^T \mid x_{<l}^T, z^I\right)\).
- \(L\) : # of tokens to be predicted
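A hedged sketch of the autoregressive image-to-text (captioning) objective, assuming a decoder that conditions on the image embedding and returns per-position vocabulary logits; teacher forcing shifts the target tokens by one.

```python
import torch
import torch.nn.functional as F

def itg_loss(decoder, z_img: torch.Tensor, token_ids: torch.Tensor):
    """Image-to-text generation: predict caption tokens autoregressively given z^I.

    z_img:     (B, D) image embeddings the decoder is conditioned on.
    token_ids: (B, L) ground-truth caption tokens (position 0 is a BOS token).
    decoder(prefix_tokens, z_img) is an assumed interface returning (B, L-1, V) logits.
    """
    prefix, targets = token_ids[:, :-1], token_ids[:, 1:]     # x^T_{<l}  ->  next token
    logits = decoder(prefix, z_img)                           # per-position vocabulary logits
    # negative log-likelihood of each target token given the image and the text prefix
    # (mean over tokens here; the formula above sums the per-token terms)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```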
P3) Alignment Objectives
Align the image-text pair via ..
- (1) (Global) Image-text matching
- (2) (Local) Region-word matching
on the embedding space.
a) Image-Text Matching
- Models global correlation
- (between “images” and “texts”)
- \(\mathcal{L}_{I T}=p \log \mathcal{S}\left(z^I, z^T\right)+(1-p) \log \left(1-\mathcal{S}\left(z^I, z^T\right)\right)\).
- Score function \(\mathcal{S}(\cdot)\) : Measures the alignment probability between the image and text
- \(p=1\) if paired and 0 otherwise
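A hedged sketch of the (global) image-text matching head: the score function \(\mathcal{S}\) is implemented here as a small MLP over concatenated embeddings and trained with binary cross-entropy against paired / unpaired labels; the concatenation-based fusion is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ITMHead(nn.Module):
    """Score function S(z^I, z^T): probability that an image-text pair matches."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # fuse by concatenation, then predict a single matching logit (illustrative choice)
        self.mlp = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim),
                                 nn.ReLU(),
                                 nn.Linear(embed_dim, 1))

    def forward(self, z_img, z_txt):
        return self.mlp(torch.cat([z_img, z_txt], dim=-1)).squeeze(-1)   # (B,) logits

def itm_loss(head: ITMHead, z_img, z_txt, p):
    """p = 1 for matched pairs, 0 for mismatched (e.g., in-batch shuffled) pairs."""
    logits = head(z_img, z_txt)
    return F.binary_cross_entropy_with_logits(logits, p.float())
```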
b) Region-Word Matching
- Model local cross-modal correlation
- (between “image regions” and “words”)
- For dense visual recognition tasks
- e.g., Object detection
- \(\mathcal{L}_{R W}=p \log \mathcal{S}^r\left(r^I, w^T\right)+(1-p) \log \left(1-\mathcal{S}^r\left(r^I, w^T\right)\right)\).
- \(\left(r^I, w^T\right)\) : Region-word pair
- \(p=1\) if paired and 0 otherwise
(3) VLM Pretraining Frameworks
- Two-tower framework
- Encodes images and texts with two separate encoders
- Two-leg framework
- Introduces additional multi-modal fusion layers on top of the two encoders
- One-tower framework
- Unifies vision and language learning in a single encoder
- \(\rightarrow\) Aiming to facilitate efficient communication across data modalities
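A compressed sketch contrasting the three frameworks; the encoders and the fusion module are left as assumed placeholders, since only the data flow differs.

```python
import torch.nn as nn

class TwoTowerVLM(nn.Module):
    """Images and texts encoded separately; interaction happens only in the loss."""
    def __init__(self, image_encoder, text_encoder):
        super().__init__()
        self.image_encoder, self.text_encoder = image_encoder, text_encoder
    def forward(self, image, text):
        return self.image_encoder(image), self.text_encoder(text)

class TwoLegVLM(nn.Module):
    """Separate encoders plus extra multi-modal fusion layers on top."""
    def __init__(self, image_encoder, text_encoder, fusion):
        super().__init__()
        self.image_encoder, self.text_encoder, self.fusion = image_encoder, text_encoder, fusion
    def forward(self, image, text):
        return self.fusion(self.image_encoder(image), self.text_encoder(text))

class OneTowerVLM(nn.Module):
    """A single unified encoder consumes both modalities (e.g., as one token sequence)."""
    def __init__(self, unified_encoder):
        super().__init__()
        self.unified_encoder = unified_encoder
    def forward(self, image_tokens, text_tokens):
        return self.unified_encoder(image_tokens, text_tokens)
```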
(4) Evaluation Setups and Downstream Tasks
P1) Zero-shot Prediction
Common way of evaluating VLMs’ generalization capability
a) Image Classification
- What? Aims to classify images
- How? By comparing the embeddings of images and texts
- where “prompt engineering” is often employed to generate task-related prompts like “a photo of a [label].”
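A minimal zero-shot classification sketch in the spirit of CLIP: class names are turned into prompts ("a photo of a [label]."), encoded, and each image is assigned to the most similar text embedding. The `image_encoder` / `text_encoder` / `tokenize` callables are placeholders for a pretrained VLM, not a specific library's API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_encoder, text_encoder, tokenize, image, class_names,
                       template="a photo of a {}."):
    """Zero-shot image classification by image-text embedding similarity."""
    prompts = [template.format(name) for name in class_names]       # prompt engineering
    z_txt = F.normalize(text_encoder(tokenize(prompts)), dim=-1)    # (C, D) class embeddings
    z_img = F.normalize(image_encoder(image), dim=-1)               # (B, D) image embeddings
    logits = z_img @ z_txt.t()                                      # (B, C) similarities
    return logits.argmax(dim=-1)                                    # predicted class indices
```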
b) Semantic Segmentation
- What? Assign a category label to each pixel in images
- How? By comparing the embeddings of the given image pixels and texts
c) Object Detection
- What? Localize and classify objects in images
- How? By comparing the embeddings of the given object proposals and texts
d) Image-Text Retrieval
- What? Retrieve samples of one modality given cues from the other modality
- Text-to-image retrieval & Image-to-text retrieval
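A hedged sketch of text-to-image retrieval: rank all gallery image embeddings by cosine similarity with each query text embedding. Swapping the arguments gives image-to-text retrieval; names are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def text_to_image_retrieval(z_query_txt: torch.Tensor, z_gallery_img: torch.Tensor, k: int = 5):
    """Return the indices of the top-k gallery images for each text query.

    z_query_txt:   (Q, D) text embeddings of the queries.
    z_gallery_img: (N, D) image embeddings of the gallery.
    """
    sims = F.normalize(z_query_txt, dim=-1) @ F.normalize(z_gallery_img, dim=-1).t()  # (Q, N)
    return sims.topk(k, dim=-1).indices    # ranked image indices per query
```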
P2) Linear Probing
- Freeze the pre-trained VLM and train only a linear classifier on its embeddings
- Measures how well the frozen representations support the downstream task
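A hedged sketch of linear probing, assuming image features have already been extracted by the frozen VLM: a single linear classifier is trained on top and its test accuracy reflects feature quality.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(train_feats, train_labels, test_feats, test_labels,
                 num_classes: int, epochs: int = 100, lr: float = 1e-2):
    """Train a linear classifier on frozen VLM features and report test accuracy.

    train_feats / test_feats: (N, D) image embeddings extracted by the frozen VLM.
    """
    clf = nn.Linear(train_feats.size(1), num_classes)
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    for _ in range(epochs):                      # simple full-batch training loop
        opt.zero_grad()
        loss = F.cross_entropy(clf(train_feats), train_labels)
        loss.backward()
        opt.step()
    with torch.no_grad():
        acc = (clf(test_feats).argmax(dim=-1) == test_labels).float().mean()
    return acc.item()
```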