Vision-Language Models for Vision Tasks: A Survey
https://arxiv.org/pdf/2304.00685
Contents
- Abstract
- (1) Introduction
- Visual recognition tasks
- VLM training paradigm
- Two lines of research
- Contributions
- (2) Background
- Training Paradigms for Visual Recognition
- Development of VLMs for Visual Recognition
Abstract
Vision-Language Models (VLMs)
- Goal: Learn rich “vision-language” correlation
- Dataset: Web-scale “image-text” pairs
- Results: Enables zero-shot predictions (on various visual recognition tasks)
This paper:
Review of VLMs for various visual recognition tasks
- (1) Background of visual recognition
- (2) Foundations of VLM
- Network architectures
- Pre-training objectives
- Downstream tasks
- (3) Datasets in VLM pre-training and evaluations
- (4) Review & Categorization of existing ….
- VLM “pre-training” methods
- VLM “transfer learning” methods
- VLM “knowledge distillation” methods
- (5) Benchmark
- (6) Research challenges & Future works
1. Introduction
P1) Visual recognition
Visual recognition tasks (IOS)
- (1) Image classification
- (2) Object detection
- (3) Semantic segmentation
Challenges? (Traditional ML \(\rightarrow\) E2E DL)
- (1) Slow convergence of DNN training (when learning from scratch)
- (2) Laborious collection of datasets
P2–3) VLM training paradigm
- Previous) Pretraining - Finetuning - Prediction
- Recent) Pretraining - “Zero-shot” Prediction
- Step 1) VLM is pre-trained with large-scale image-text pairs
- Step 2) Pretrained VLM can be applied to downstream tasks w/o fine-tuning
VLM Pre-training
- By vision-language objectives
\(\rightarrow\) Enables learning image-text correspondences
- e.g., CLIP: Employs contrastive learning
- Pretrained CLIP: Superior “zero-shot” performance (on 36 visual recognition tasks)
P4) Two lines of research
- VLMs with “Transfer learning”
  - Effective adaptation of pre-trained VLMs towards various downstream tasks
  - e.g., prompt tuning, visual adaptation
- VLMs with “Knowledge distillation”
  - Explores how to distill knowledge from VLMs to downstream tasks
P6) Main contributions of this work
- Systematic review of VLMs (for visual recognition tasks)
- Up-to-date progress of VLMs
- Research challenges & Potential research directions
2. Background
- Training paradigm of visual recognition
- VLM Pre-training & Zero-shot Prediction
- Development of VLMs for visual recognition
(1) Training Paradigms for Visual Recognition
P1) Traditional Machine Learning and Prediction
Rely heavily on feature engineering (with hand-crafted features)
\(\rightarrow\) Requires domain experts
P2) Deep Learning from Scratch and Prediction
Enables end-to-end trainable DNNs
Two new challenges:
- (1) Slow convergence of DNN training (when training from scratch)
- (2) Laborious collection of large-scale, task-specific, and crowd-labelled data
P3-6) Change in paradigms
Paradigm 1) Scratch + Prediction
Paradigm 2) Supervised Pre-training + Fine-tuning + Prediction
Paradigm 3) Unsupervised Pre-training + Fine-tuning + Prediction
Paradigm 4) Unsupervised Pre-training + Zero-shot Prediction
- Enables effective use of large-scale web data for pretraining
- Zero-shot predictions w/o fine-tuning
Improve VLMs from 3 perspectives
- (1) Collecting (image-text) data
- (2) Designing high-capacity models
- (3) Designing new pre-training objectives
(2) Development of VLMs for Visual Recognition
Great progress since CLIP
P1) Pre-training Objectives
“single objective” \(\rightarrow\) “multiple hybrid objectives”
- Early VLMs: single objective
- Recent VLMs: multiple objectives
- e.g., contrastive, alignment and generative objectives
P2) Pre-training Frameworks
“multiple separate” networks \(\rightarrow\) “unified” network
- Early VLMs: Two-tower pre-training frameworks
- Recent VLMs: One-tower pre-training framework
\(\rightarrow\) Encodes images and texts with a unified network
( with less GPU memory & more efficient communications across modalities )
P3) Downstream tasks
“simple” tasks \(\rightarrow\) “complex” tasks
- Early VLMs: Image-level visual recognition tasks
- Recent VLMs: General-purpose
  - Also work for dense prediction tasks
  ( that are complex and require localization-related knowledge )
3. VLM Foundations
VLM pre-training
- Aims to pretrain a VLM to learn image-text correlation
- For effective zero-shot predictions
Procedure
- Step 1) Text encoder & Image encoder
- To extract image and text features
- Step 2) Pre-training objectives
- Learns the vision-language correlation
- Step 3) Evaluation
- On unseen data in a zero-shot manner
This section: Foundations of VLM pre-training
- a) Network architectures
- For extracting image and text features
- b) Pre-training objectives
- For modelling vision-language correlation
- c) Frameworks
- For VLM pre-training
- d) Downstream tasks
- For VLM evaluations
(1) Network Architectures
Notation
- Dataset \(\mathcal{D}=\left\{x_n^I, x_n^T\right\}_{n=1}^N\)
- Image encoder \(f_{\theta}\)
- Image embedding \(z_n^I=f_\theta\left(x_n^I\right)\)
- Text encoder \(f_{\phi}\)
- Text embedding \(z_n^T=f_\phi\left(x_n^T\right)\)
P1) For Image
- (1) CNN-based (e.g., ResNet)
- (2) Transformer-based (e.g., ViT)
P2) For Text
Most VLM studies (e.g., CLIP):
- Employ a Transformer (with minor modifications)
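A minimal sketch of this dual-encoder setup, assuming a ResNet-50 image tower from torchvision and a small Transformer text tower; the embedding dimension, vocabulary size, and mean pooling are illustrative assumptions rather than any specific VLM’s design.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ImageEncoder(nn.Module):                 # f_theta: x^I -> z^I
    def __init__(self, embed_dim=512):
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.fc = nn.Identity()            # drop the classification head
        self.backbone = backbone
        self.proj = nn.Linear(2048, embed_dim)

    def forward(self, images):                 # images: [B, 3, H, W]
        return self.proj(self.backbone(images))   # z^I: [B, embed_dim]

class TextEncoder(nn.Module):                  # f_phi: x^T -> z^T
    def __init__(self, vocab_size=49408, embed_dim=512, max_len=77):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Parameter(torch.zeros(max_len, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, tokens):                 # tokens: [B, L] integer ids
        x = self.token_emb(tokens) + self.pos_emb[: tokens.size(1)]
        return self.transformer(x).mean(dim=1) # z^T: [B, embed_dim] (mean pooling)
```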
(2) Pretraining Objectives
Three categories:
- (1) Contrastive objectives
- (2) Generative objectives
- (3) Alignment objectives
P1) Contrastive Objectives
a) Image CL
- InfoNCE and its variants
- \(\mathcal{L}_I^{\text{InfoNCE}}=-\frac{1}{B} \sum_{i=1}^B \log \frac{\exp \left(z_i^I \cdot z_{+}^I / \tau\right)}{\sum_{j=1, j \neq i}^{B+1} \exp \left(z_i^I \cdot z_j^I / \tau\right)}\).
- \(z_i^I\) : Query embedding
- \(\left\{z_j^I\right\}_{j=1, j \neq i}^{B+1}\) : Key embeddings
- \(z_{+}^I\): \(z_i^I\) ‘s positive key ( rest: negative keys )
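A sketch of the image InfoNCE objective above, assuming the query and positive key come from two augmented views of the same image and the negative keys come from a memory bank of other images (a MoCo-style setup used purely for illustration).

```python
import torch
import torch.nn.functional as F

def image_infonce(z_q, z_pos, z_neg, tau=0.07):
    """InfoNCE over image embeddings.
    z_q:   [B, d] query embeddings (one augmented view)
    z_pos: [B, d] positive keys (another view of the same images)
    z_neg: [K, d] negative keys (e.g., a memory bank of other images)"""
    z_q, z_pos, z_neg = (F.normalize(z, dim=-1) for z in (z_q, z_pos, z_neg))
    l_pos = (z_q * z_pos).sum(dim=-1, keepdim=True)     # [B, 1] query-positive similarity
    l_neg = z_q @ z_neg.t()                             # [B, K] query-negative similarities
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(z_q.size(0), dtype=torch.long, device=z_q.device)  # positive is index 0
    return F.cross_entropy(logits, labels)              # -log softmax of the positive pair
```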
b) Image-Text CL
- Symmetrical image-text InfoNCE loss
( i.e., \(\mathcal{L}_{\text{InfoNCE}}^{IT}=\mathcal{L}_{I \rightarrow T}+\mathcal{L}_{T \rightarrow I}\) )
- \(\mathcal{L}_{I \rightarrow T}\) : Contrasts the (query image & text keys)
- \(\mathcal{L}_{T \rightarrow I}\): Contrasts the (query text & image keys)
- \(\begin{aligned} & \mathcal{L}_{I \rightarrow T}=-\frac{1}{B} \sum_{i=1}^B \log \frac{\exp \left(z_i^I \cdot z_i^T / \tau\right)}{\sum_{j=1}^B \exp \left(z_i^I \cdot z_j^T / \tau\right)}, \\ & \mathcal{L}_{T \rightarrow I}=-\frac{1}{B} \sum_{i=1}^B \log \frac{\exp \left(z_i^T \cdot z_i^I / \tau\right)}{\sum_{j=1}^B \exp \left(z_i^T \cdot z_j^I / \tau\right)} \end{aligned}\).
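A minimal sketch of the symmetric image-text InfoNCE loss above (the CLIP-style objective), assuming a batch of B matched image/text embedding pairs.

```python
import torch
import torch.nn.functional as F

def image_text_infonce(z_img, z_txt, tau=0.07):
    """z_img, z_txt: [B, d] embeddings of B matched image-text pairs."""
    z_img = F.normalize(z_img, dim=-1)
    z_txt = F.normalize(z_txt, dim=-1)
    logits = z_img @ z_txt.t() / tau                            # [B, B] pairwise similarities
    targets = torch.arange(z_img.size(0), device=z_img.device)  # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)                 # L_{I->T}: image queries vs. text keys
    loss_t2i = F.cross_entropy(logits.t(), targets)             # L_{T->I}: text queries vs. image keys
    return loss_i2t + loss_t2i
```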
c) Image-Text-Label CL
- SupCon + Image-Text CL
( i.e., \(\mathcal{L}_{\text{InfoNCE}}^{ITL}=\mathcal{L}_{I \rightarrow T}^{ITL}+\mathcal{L}_{T \rightarrow I}^{ITL}\) )
- \(\begin{aligned} & \mathcal{L}_{I \rightarrow T}^{ITL}=-\sum_{i=1}^B \frac{1}{|\mathcal{P}(i)|} \sum_{k \in \mathcal{P}(i)} \log \frac{\exp \left(z_i^I \cdot z_k^T / \tau\right)}{\sum_{j=1}^B \exp \left(z_i^I \cdot z_j^T / \tau\right)}, \\ & \mathcal{L}_{T \rightarrow I}^{ITL}=-\sum_{i=1}^B \frac{1}{|\mathcal{P}(i)|} \sum_{k \in \mathcal{P}(i)} \log \frac{\exp \left(z_i^T \cdot z_k^I / \tau\right)}{\sum_{j=1}^B \exp \left(z_i^T \cdot z_j^I / \tau\right)} \end{aligned}\).
- where \(k \in \mathcal{P}(i)=\left\{k \mid k \in B, y_k=y_i\right\}\)
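A sketch of the image-text-label variant above, assuming class labels are available so that every same-class sample in the batch belongs to the positive set \(\mathcal{P}(i)\).

```python
import torch
import torch.nn.functional as F

def image_text_label_infonce(z_img, z_txt, labels, tau=0.07):
    """z_img, z_txt: [B, d] paired embeddings; labels: [B] integer class ids."""
    z_img = F.normalize(z_img, dim=-1)
    z_txt = F.normalize(z_txt, dim=-1)
    logits = z_img @ z_txt.t() / tau                            # [B, B]
    pos_mask = (labels[:, None] == labels[None, :]).float()     # P(i): same-label pairs

    def one_direction(lgts):
        log_prob = lgts - torch.logsumexp(lgts, dim=1, keepdim=True)
        # average log-likelihood over each query's positive set, summed over queries
        return -((pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1)).sum()

    return one_direction(logits) + one_direction(logits.t())   # L_{I->T}^{ITL} + L_{T->I}^{ITL}
```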
P2) Generative Objectives
a) Masked Image Modeling
- \(\mathcal{L}_{M I M}=-\frac{1}{B} \sum_{i=1}^B \log f_\theta\left(\bar{x}_i^I \mid \hat{x}_i^I\right)\).
b) Masked Language Modeling
- \(\mathcal{L}_{MLM}=-\frac{1}{B} \sum_{i=1}^B \log f_\phi\left(\bar{x}_i^T \mid \hat{x}_i^T\right)\).
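A sketch of the masked language modeling term, assuming a BERT-style model that returns per-token vocabulary logits; the 15% masking ratio and [MASK] id are common-practice assumptions, not taken from the survey.

```python
import torch
import torch.nn.functional as F

def mlm_loss(model, tokens, mask_id, mask_ratio=0.15):
    """tokens: [B, L] token ids; model(masked_tokens) -> [B, L, V] logits (hypothetical signature)."""
    masked = tokens.clone()
    is_masked = torch.rand_like(tokens, dtype=torch.float) < mask_ratio  # choose positions to hide
    masked[is_masked] = mask_id                                          # replace with the [MASK] token
    logits = model(masked)                                               # reconstruct from unmasked context
    return F.cross_entropy(logits[is_masked], tokens[is_masked])         # loss only on masked positions
```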
c) Masked Cross-Modal Modeling
- MCM = MIM + MLM
- \(\mathcal{L}_{M C M}=-\frac{1}{B} \sum_{i=1}^B\left[\log f_\theta\left(\bar{x}_i^I \mid \hat{x}_i^I, \hat{x}_i^T\right)+\log f_\phi\left(\bar{x}_i^T \mid \hat{x}_i^I, \hat{x}_i^T\right)\right]\).
- Details)
- Step 1) Given an “image-text pair”
- Step 2) Randomly masks
- a subset of “image” patches
- a subset of “text” tokens
- Step 3) Learns to reconstruct them conditioned on ..
- unmasked “image” patches
- unmasked “text” tokens
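A sketch of this masked cross-modal step, assuming a hypothetical fusion model that consumes masked patch embeddings plus masked token ids and returns reconstruction outputs for both modalities; the MSE term for image patches is a common simplification of the image log-likelihood.

```python
import torch
import torch.nn.functional as F

def mcm_loss(fusion_model, patches, tokens, patch_mask, token_mask,
             mask_patch, mask_token_id):
    """patches: [B, P, d] patch embeddings, tokens: [B, L] token ids.
    patch_mask [B, P] / token_mask [B, L]: boolean masks of positions to reconstruct.
    fusion_model(img, txt) -> (patch_preds [B, P, d], token_logits [B, L, V]) (hypothetical)."""
    masked_patches = patches.clone()
    masked_patches[patch_mask] = mask_patch                # learnable [MASK] patch embedding
    masked_tokens = tokens.clone()
    masked_tokens[token_mask] = mask_token_id              # [MASK] token id
    # both reconstructions are conditioned on the unmasked parts of *both* modalities
    patch_preds, token_logits = fusion_model(masked_patches, masked_tokens)
    loss_img = F.mse_loss(patch_preds[patch_mask], patches[patch_mask])
    loss_txt = F.cross_entropy(token_logits[token_mask], tokens[token_mask])
    return loss_img + loss_txt
```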
d) Image-to-Text Generation
- Aims to predict text \(x^T\) autoregressively
- based on the image paired with \(x^T\)
- \(\mathcal{L}_{ITG}=-\sum_{l=1}^L \log f_\theta\left(x_l^T \mid x_{<l}^T, z^I\right)\).
- \(L\) : # of tokens to be predicted
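A sketch of the autoregressive captioning loss above, assuming a hypothetical decoder that conditions on the image embedding and uses teacher forcing.

```python
import torch
import torch.nn.functional as F

def itg_loss(decoder, z_img, tokens):
    """z_img: [B, d] image embedding; tokens: [B, L] caption ids with BOS prepended.
    decoder(inputs, z_img) -> [B, L-1, V] next-token logits (hypothetical signature)."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict token l from tokens < l and the image
    logits = decoder(inputs, z_img)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```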
P3) Alignment Objectives
Align the image-text pair via ..
- (1) (Global) Image-text matching
- (2) (Local) Region-word matching
on the embedding space.
a) Image-Text Matching
- Models global correlation
- (between “images” and “texts”)
- \(\mathcal{L}_{I T}=p \log \mathcal{S}\left(z^I, z^T\right)+(1-p) \log \left(1-\mathcal{S}\left(z^I, z^T\right)\right)\).
- Score function \(\mathcal{S}(\cdot)\) : Measures the alignment probability between the image and text
- \(p=1\) if paired and 0 otherwise
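A sketch of the image-text matching loss above, assuming the score function \(\mathcal{S}\) is a small learned head over concatenated embeddings and that unpaired examples are formed by shuffling texts within the batch (hard-negative mining is a common alternative).

```python
import torch
import torch.nn.functional as F

def itm_loss(score_head, z_img, z_txt):
    """score_head: hypothetical module mapping [z^I ; z^T] -> one matching logit.
    z_img, z_txt: [B, d] embeddings of B matched pairs."""
    B = z_img.size(0)
    neg_txt = z_txt[torch.randperm(B, device=z_txt.device)]   # mismatched texts (may rarely hit a true pair)
    pairs = torch.cat([torch.cat([z_img, z_txt], dim=-1),             # p = 1: paired
                       torch.cat([z_img, neg_txt], dim=-1)], dim=0)   # p = 0: unpaired
    labels = torch.cat([torch.ones(B), torch.zeros(B)]).to(z_img.device)
    logits = score_head(pairs).squeeze(-1)                    # [2B] matching logits
    return F.binary_cross_entropy_with_logits(logits, labels)
```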
b) Region-Word Matching
- Models local cross-modal correlation
- (between “image regions” and “words”)
- For dense visual recognition tasks
- e.g., Object detection
- \(\mathcal{L}_{R W}=p \log \mathcal{S}^r\left(r^I, w^T\right)+(1-p) \log \left(1-\mathcal{S}^r\left(r^I, w^T\right)\right)\).
- \(\left(r^I, w^T\right)\) : Region-word pair
- \(p=1\) if paired and 0 otherwise
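A sketch of the region-word variant, assuming region embeddings from object proposals, word embeddings from the text encoder, and a ground-truth alignment matrix marking which words describe which regions (all hypothetical inputs).

```python
import torch
import torch.nn.functional as F

def region_word_matching_loss(region_emb, word_emb, align, tau=0.07):
    """region_emb: [R, d] proposal embeddings; word_emb: [W, d] word embeddings;
    align: [R, W] binary matrix, 1 where a word describes a region (p = 1)."""
    region_emb = F.normalize(region_emb, dim=-1)
    word_emb = F.normalize(word_emb, dim=-1)
    scores = region_emb @ word_emb.t() / tau      # S^r(r^I, w^T) for every region-word pair
    return F.binary_cross_entropy_with_logits(scores, align.float())
```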
(3) VLM Pretraining Frameworks
- Two-tower framework
  - Images and texts are encoded with two separate encoders
- Two-leg framework
  - Introduces additional multi-modal fusion layers
- One-tower framework
  - Unifies vision and language learning in a single encoder
  \(\rightarrow\) Aims to facilitate efficient communications across data modalities
(4) Evaluation Setups and Downstream Tasks
P1) Zero-shot Prediction
Common way of evaluating VLMs’ generalization capability
a) Image Classification
- What? Aims to classify images
- How? By comparing the embeddings of images and texts
- where “prompt engineering” is often employed to generate task-related prompts like “a photo of a [label].”
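A minimal zero-shot classification sketch using the OpenAI CLIP reference package (pip install git+https://github.com/openai/CLIP); the image path, label set, and prompt template are illustrative placeholders.

```python
import torch
import clip                                   # OpenAI CLIP reference implementation
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["cat", "dog", "car"]           # placeholder label set
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)   # placeholder image

with torch.no_grad():
    z_img = model.encode_image(image)
    z_txt = model.encode_text(prompts)
    z_img = z_img / z_img.norm(dim=-1, keepdim=True)
    z_txt = z_txt / z_txt.norm(dim=-1, keepdim=True)
    probs = (100.0 * z_img @ z_txt.t()).softmax(dim=-1)   # compare image vs. prompt embeddings

print(class_names[probs.argmax().item()])     # predicted class, no fine-tuning involved
```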
b) Semantic Segmentation
- What? Assign a category label to each pixel in images
- How? By comparing the embeddings of the given image pixels and texts
c) Object Detection
- What? Localize and classify objects in images
- How? By comparing the embeddings of the given object proposals and texts
d) Image-Text Retrieval
- What? Retrieve samples of one modality given cues from the other modality
- Text-to-image retrieval & Image-to-text retrieval
P2) Linear Probing
- What? Train a linear classifier on top of the frozen pre-trained VLM features
- Why? Assesses the quality of the learned representations
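A sketch of linear probing under the assumption that a frozen CLIP-like image encoder (e.g., the `model` loaded in the earlier sketch) provides the features and a scikit-learn logistic regression serves as the linear classifier; the data loaders are placeholders.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def extract_features(model, loader, device):
    """Encode images with the frozen VLM image encoder; loader yields (images, labels)."""
    feats, labels = [], []
    with torch.no_grad():
        for images, y in loader:
            z = model.encode_image(images.to(device))
            feats.append(z.cpu().numpy())
            labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# train_loader / test_loader: placeholder DataLoaders for the downstream dataset
X_train, y_train = extract_features(model, train_loader, device)
X_test, y_test = extract_features(model, test_loader, device)

probe = LogisticRegression(max_iter=1000)     # the only trainable part: a linear classifier
probe.fit(X_train, y_train)
print("linear-probe accuracy:", probe.score(X_test, y_test))
```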