SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
https://arxiv.org/pdf/2108.10904
1. Abstract
Vision-Language Pretraining (VLP)
- Impressive performance on many multimodal downstream tasks
- Limitation: expensive annotations, including clean image captions and regional labels
Simple Visual Language Model (SimVLM)
Relaxes these constraints & presents a minimalist pretraining framework
- (1) Reduces the training complexity by exploiting large-scale weak supervision
- (2) End-to-end with a single prefix language modeling objective
2. SimVLM
(1) Objective: Prefix LM
Preliminaries: LM loss
- \(\mathcal{L}_{\mathrm{LM}}(\theta)=-\mathbb{E}_{\mathbf{x} \sim D}\left[\log P_\theta(\mathbf{x})\right]=-\mathbb{E}_{\mathbf{x} \sim D}\left[\sum_{t=1}^T \log P_\theta\left(\mathbf{x}_t \mid \mathbf{x}_{<t}\right)\right]\).
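For reference, a minimal PyTorch sketch of this token-level negative log-likelihood (the causal masking is assumed to happen inside the model that produced the logits; tensor and function names are illustrative):

```python
import torch
import torch.nn.functional as F

def lm_loss(logits, tokens):
    """Token-level NLL for a causally masked decoder.
    logits: (batch, T, vocab); tokens: (batch, T) token ids."""
    pred = logits[:, :-1, :]   # predictions for positions 1..T-1
    target = tokens[:, 1:]     # the corresponding next tokens
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```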
Proposal: Prefix Language Modeling (PrefixLM)
- During pretraining, a prefix of (randomly selected) length \(T_p\) is truncated from the input sequence
\(\mathcal{L}_{\text {PrefixLM }}(\theta)=-\mathbb{E}_{\mathbf{x} \sim D}\left[\log P_\theta\left(\mathbf{x}_{\geq T_p} \mid \mathbf{x}_{<T_p}\right)\right]=-\mathbb{E}_{\mathbf{x} \sim D}\left[\sum_{t=T_p}^T \log P_\theta\left(\mathbf{x}_t \mid \mathbf{x}_{\left[T_p, t\right]}, \mathbf{x}_{<T_p}\right)\right]\).
Differences from LM?
- Bi-directional attention on the prefix sequence (i.e. \(\mathbf{x}_{<T_p}\))
- Autoregressive factorization on the remaining tokens (i.e. \(\mathbf{x}_{\geq T_p}\)); see the mask sketch below
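A minimal PyTorch sketch of these two differences (text-only for clarity; the mask convention here is True = may attend, which some libraries invert; names are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def prefix_lm_mask(seq_len, prefix_len):
    """Boolean attention mask: entry (i, j) is True if position i may attend to j.
    Bidirectional within the prefix, causal (autoregressive) elsewhere."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal base
    mask[:prefix_len, :prefix_len] = True  # full attention inside the prefix
    return mask

def prefix_lm_loss(logits, tokens, prefix_len):
    """Negative log-likelihood over the suffix tokens x_{>= T_p} only."""
    pred = logits[:, :-1, :]
    target = tokens[:, 1:]
    nll = F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                          target.reshape(-1),
                          reduction="none").view(target.shape)
    # only positions whose target index is >= prefix_len contribute to the loss
    keep = (torch.arange(1, tokens.size(1), device=tokens.device) >= prefix_len).float()
    return (nll * keep).sum() / (keep.sum() * nll.size(0)).clamp(min=1)
```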
Details:
- Images can be considered as a prefix to their textual descriptions!
- Prepend the image feature sequence of length \(T_i\) to the text sequence
( + enforce the model to sample a prefix of length \(T_p \geq T_i\); see the input-assembly sketch below )
- PrefixLM thereby combines the benefits of (1) and (2):
- (1) Bidirectional contextualized representation as in MLM
- (2) Text generation similar to LM
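Putting these details together, a minimal sketch of how the multimodal input could be assembled (function and tensor names are illustrative; the patch-feature extractor is assumed to exist elsewhere):

```python
import torch

def build_prefix_input(image_feats, text_emb):
    """Assemble the multimodal sequence: image features act as the prefix.
    image_feats: (batch, T_i, d) patch features; text_emb: (batch, T_t, d) token embeddings."""
    T_i, T_t = image_feats.size(1), text_emb.size(1)
    x = torch.cat([image_feats, text_emb], dim=1)   # image tokens come first
    # sample how many leading text tokens also fall inside the bidirectional prefix,
    # so that the prefix always covers the whole image: T_p >= T_i
    T_p = T_i + torch.randint(0, T_t, (1,)).item()
    return x, T_p
```

The PrefixLM mask and loss from the earlier sketch would then use this \(T_p\), so only the remaining text tokens are generated autoregressively.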
(2) Architecture
Backbone: Transformer
- Bidirectional attention within the prefix sequence
- Applicable to both decoder-only & encoder-decoder sequence-to-sequence LMs (see the decoder-only sketch below)
Refer to Figure 1
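A minimal decoder-only sketch along these lines, reusing the helper functions from the earlier sketches; the hyperparameters and the use of nn.TransformerEncoder as a stand-in backbone are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

# Decoder-only variant: one Transformer stack over [image prefix; text] with the
# PrefixLM mask. nn.Transformer* bool masks block positions marked True, so the
# mask from prefix_lm_mask (True = may attend) is inverted here.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=6)

image_feats = torch.randn(2, 49, 512)   # e.g. a 7x7 grid of patch features
text_emb = torch.randn(2, 16, 512)      # embedded caption tokens

x, T_p = build_prefix_input(image_feats, text_emb)   # from the sketch above
attn_mask = ~prefix_lm_mask(x.size(1), T_p)          # True = attention blocked
hidden = backbone(x, mask=attn_mask)                 # (2, 49 + 16, 512)
```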