BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
https://arxiv.org/pdf/2301.12597
1. Abstract
Cost of VLP: Expensive!
- Prohibitive due to end-to-end training of large-scale models!
Proposal: BLIP-2
- Generic and efficient pretraining strategy
- Bootstraps VLP from off-the-shelf frozen pre-trained image encoders and frozen LLM
- Q-Former (Querying Transformer)
- Lightweight model to bridge the modality gap
- Pretrained in two stages
- Stage 1) Bootstraps vision-language representation learning
- from a frozen image encoder
- Stage 2) Bootstraps vision-to-language generative learning
- from a frozen language model
2. Method
New VLP method
- Bootstraps from frozen pre-trained unimodal models
Querying Transformer (Q-Former)
- To bridge the modality gap
- Pre-trained in two stages:
- (1) Vision-language representation learning stage with a frozen image encoder
- (2) Vision-to-language generative learning stage with a frozen LLM
(1) Model Architecture: Q-Former
a) Q-Former
- Trainable module to bridge the gap btw text & image
- Extracts a fixed number of output features from the image encoder
- (independent of input image resolution)
b) Consists of 2 transformer submodules
- Share the same self-attention layers
- (1) Image transformer
- Interacts with the frozen image encoder for visual feature extraction
- Use Cross Attention
- (2) Text transformer
- Functions as both a text encoder and a text decoder
c) Details of Image Transformer
- a) Input = a fixed number of learnable query embeddings
- b) Self-attention
- Queries interact with each other
- c) Cross-attention
- Queries interact with frozen image features
- (optional) Depending on the pretraining task
- Queries can additionally interact with the text through the same self-attention layers.
- Depending on the pre-training task, apply different self-attention masks to control query-text interaction
Initialization
- Q-Former = Pre-trained weights of BERT-base
- Cross-attention layers = Randomly initialized
- 32 queries (each of dimension 768); see the sketch below
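A minimal PyTorch-style sketch of one Q-Former block, assuming illustrative class names and dimensions, and assuming the frozen image features are already projected to the query dimension (this is not the official BLIP-2 implementation):

```python
import torch
import torch.nn as nn

class QFormerBlockSketch(nn.Module):
    """One illustrative Q-Former block: shared self-attention over [queries; text],
    plus cross-attention from the queries to the frozen image features.
    Names and structure are assumptions, not the official BLIP-2 code."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, queries, image_feats, text_embeds=None, attn_mask=None):
        # Self-attention: queries interact with each other and (optionally) with the
        # text, through the same self-attention layers; attn_mask controls which
        # query-text interactions are allowed.
        x = torch.cat([queries, text_embeds], dim=1) if text_embeds is not None else queries
        x = x + self.self_attn(x, x, x, attn_mask=attn_mask)[0]
        q = x[:, :queries.size(1)]                # keep only the query positions
        # Cross-attention: queries interact with the frozen image features.
        q = q + self.cross_attn(q, image_feats, image_feats)[0]
        return q + self.ffn(q)

# 32 learnable queries of dimension 768 -> a fixed-size output,
# independent of the input image resolution / number of patch features.
queries = nn.Parameter(torch.zeros(1, 32, 768))
image_feats = torch.randn(2, 257, 768)            # frozen ViT features (assumed projected to 768)
out = QFormerBlockSketch()(queries.expand(2, -1, -1), image_feats)
print(out.shape)                                  # torch.Size([2, 32, 768])
```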
(2) Bootstrap Vision-Language Representation Learning from a Frozen Image Encoder
a) Dataset: Pre-train using image-text pairs!
b) Goal: Train the Q-Former such that …
\(\rightarrow\) The queries can learn to extract visual representation that is most informative of the text
c) Loss: Jointly optimize 3 pre-training objectives
- Inspired by BLIP (Li et al., 2022)
- c-1) Image-Text Contrastive Learning (ITC)
- c-2) Image-grounded Text Generation (ITG)
- c-3) Image-Text Matching (ITM)
- Each objective employs a different attention masking strategy (btw queries and text); see the masking sketch below
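A small sketch of the three masking strategies over a [queries; text] sequence, assuming True means "attention allowed" (PyTorch's nn.MultiheadAttention uses the opposite convention, so `~mask` would be passed there); the exact masks are reconstructed from the paper's description:

```python
import torch

def build_attention_mask(num_query, num_text, objective):
    """Illustrative self-attention masks over [queries; text].
    True = attention allowed (invert with ~mask for nn.MultiheadAttention)."""
    n = num_query + num_text
    mask = torch.zeros(n, n, dtype=torch.bool)
    q = slice(0, num_query)
    t = slice(num_query, n)
    if objective == "ITC":      # unimodal mask: queries and text cannot see each other
        mask[q, q] = True
        mask[t, t] = True
    elif objective == "ITM":    # bidirectional mask: every position attends to every other
        mask[:, :] = True
    elif objective == "ITG":    # multimodal causal mask
        mask[q, q] = True       # queries attend to each other, but not to the text
        mask[t, q] = True       # text tokens attend to all queries ...
        mask[t, t] = torch.tril(torch.ones(num_text, num_text, dtype=torch.bool))  # ... and causally to previous text
    return mask

print(build_attention_mask(2, 3, "ITG").int())
```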
(3) Bootstrap Vision-to-Language Generative Learning from a Frozen LLM
Generative pre-training stage
- Connect the Q-Former (+ frozen image encoder) to the frozen LLM
- Projection layer: FC layer
- To linearly project the output query embeddings \(Z\) into the same dimension as the text embedding of the LLM.
- Projected query embeddings:
- Prepended to the input text embeddings
- Function as soft visual prompts that condition the LLM on visual representation
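A minimal sketch of the projection step, assuming a Q-Former output dimension of 768 and a hypothetical LLM embedding dimension of 2560 (both numbers are illustrative):

```python
import torch
import torch.nn as nn

proj = nn.Linear(768, 2560)               # the fully connected projection layer

Z = torch.randn(2, 32, 768)               # output query embeddings from the Q-Former
text_embeds = torch.randn(2, 20, 2560)    # input text embeddings of the frozen LLM

soft_visual_prompts = proj(Z)             # project Z into the LLM's embedding space
llm_inputs = torch.cat([soft_visual_prompts, text_embeds], dim=1)  # prepend to the text
print(llm_inputs.shape)                   # torch.Size([2, 52, 2560])
```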
Two types of LLMs
- (1) “Decoder”-based LLMs
- Pre-train with the LM loss
- (2) “Encoder-decoder”-based LLMs
- Pre-train with the prefix LM loss
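A rough sketch of the two generative objectives (function names, shapes, and the prefix split are illustrative assumptions, not the official code):

```python
import torch.nn.functional as F

def lm_loss(logits, target_ids):
    """(1) Decoder-based LLM: standard next-token LM loss on the full text,
    conditioned on the soft visual prompts prepended to the input."""
    # logits: (batch, seq_len, vocab); target_ids: (batch, seq_len)
    return F.cross_entropy(logits[:, :-1].flatten(0, 1), target_ids[:, 1:].flatten())

def prefix_lm_split(text_ids, prefix_len):
    """(2) Encoder-decoder LLM: split the text into a prefix (fed to the encoder
    together with the visual prompts) and a suffix (the decoder's generation
    target under the prefix LM loss)."""
    return text_ids[:, :prefix_len], text_ids[:, prefix_len:]
```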
3. Examples