BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

https://arxiv.org/pdf/2301.12597


1. Abstract

Cost of VLP (Vision-Language Pre-training): Expensive!

  • Prohibitive due to end-to-end training of large-scale models!


Proposal: BLIP-2

  • Generic and efficient pretraining strategy
    • Bootstraps VLP from off-the-shelf frozen pre-trained image encoders and frozen LLMs
  • Q-Former (Querying Transformer)
    • Lightweight model to bridge the modality gap
    • Pretrained in two stages
      • Stage 1) Bootstraps vision-language representation learning
        • from a frozen image encoder
      • Stage 2) Bootstraps vision-to-language generative learning
        • from a frozen language model


figure2


2. Method

New VLP method

  • Bootstraps from frozen pre-trained unimodal models


Querying Transformer (Q-Former)

  • To bridge the modality gap
  • Pre-trained in two stages (a high-level sketch of what is frozen vs. trained follows this list):
    • (1) Vision-language representation learning stage with a frozen image encoder
    • (2) Vision-to-language generative learning stage with a frozen LLM
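
A high-level sketch of the setup under stated assumptions: the four components are reduced to tiny nn.Linear stand-ins (the hidden sizes 1024 for the image encoder and 4096 for the LLM are illustrative), so the only point shown is which parameters stay frozen and which are trained in each stage.

```python
# Minimal sketch (stand-in modules, NOT the official implementation): the frozen
# unimodal models never receive gradients; only the Q-Former is trained in stage 1,
# and the Q-Former plus a projection layer are trained in stage 2.
import torch.nn as nn

image_encoder = nn.Linear(1024, 1024)   # stand-in for a frozen pre-trained image encoder (ViT)
llm = nn.Linear(4096, 4096)             # stand-in for a frozen pre-trained LLM
qformer = nn.Linear(768, 768)           # stand-in for the trainable Q-Former
proj = nn.Linear(768, 4096)             # FC layer connecting Q-Former output to the LLM

for frozen in (image_encoder, llm):
    for p in frozen.parameters():
        p.requires_grad = False         # unimodal models stay frozen in both stages

# Stage 1: vision-language representation learning with the frozen image encoder.
stage1_params = list(qformer.parameters())
# Stage 2: vision-to-language generative learning with the frozen LLM.
stage2_params = list(qformer.parameters()) + list(proj.parameters())
print(len(stage1_params), len(stage2_params))
```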


(1) Model Architecture: Q-Former

figure2

a) Q-Former

  • Trainable module to bridge the gap between text and image

  • Extracts a fixed number of output features from the image encoder

    (independent of the input image resolution)


b) Consists of 2 transformer submodules (a simplified block sketch follows this list)

  • Share the same self-attention layers
  • (1) Image transformer
    • Interacts with the frozen image encoder for visual feature extraction
    • Uses cross-attention
  • (2) Text transformer
    • Functions as both a text encoder and a text decoder
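
A minimal sketch of one Q-Former block under these assumptions: layer norms are dropped, cross-attention appears in every block (the paper inserts it every other block), and the widths (768 / 12 heads from BERT-base, an assumed image-feature width of 1024) are illustrative. The point is the shared self-attention plus the query-only cross-attention.

```python
import torch
import torch.nn as nn

class QFormerBlock(nn.Module):
    """Simplified block: shared self-attention; cross-attention only on the query branch."""
    def __init__(self, dim=768, heads=12, image_dim=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # shared by both branches
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=image_dim,
                                                vdim=image_dim, batch_first=True)
        self.ffn_query = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                       nn.Linear(4 * dim, dim))
        self.ffn_text = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                      nn.Linear(4 * dim, dim))

    def forward(self, queries, text, image_feats, attn_mask=None):
        # Queries and text pass through the SAME self-attention layer;
        # attn_mask decides whether they are allowed to see each other.
        x = torch.cat([queries, text], dim=1)
        x = x + self.self_attn(x, x, x, attn_mask=attn_mask)[0]
        q, t = x[:, :queries.shape[1]], x[:, queries.shape[1]:]
        # Only the query (image) branch cross-attends to the frozen image features.
        q = q + self.cross_attn(q, image_feats, image_feats)[0]
        return q + self.ffn_query(q), t + self.ffn_text(t)

block = QFormerBlock()
q_out, t_out = block(torch.randn(2, 32, 768), torch.randn(2, 12, 768),
                     torch.randn(2, 257, 1024))
print(q_out.shape, t_out.shape)   # (2, 32, 768) and (2, 12, 768)
```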


c) Details of Image Transformer

  • a) Input = a fixed number of learnable query embeddings
  • b) Self-attention
    • Queries interact with each other
  • c) Cross-attention
    • Queries interact with frozen image features
  • (Optional) Query-text interaction, depending on the pre-training task
    • Queries can additionally interact with the text through the same (shared) self-attention layers
    • Different self-attention masks are applied to control the query-text interaction (a shape-level sketch of the query pathway follows this list)
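
A shape-level sketch of the query pathway described above (self-attention among the queries, then cross-attention into the frozen image features): the output is always 32 vectors of dimension 768, no matter how many patch tokens the frozen encoder produces. The image-feature width of 1024 and the patch counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

queries = nn.Parameter(torch.zeros(1, 32, 768))        # learnable query embeddings
self_attn = nn.MultiheadAttention(768, 12, batch_first=True)
cross_attn = nn.MultiheadAttention(768, 12, kdim=1024, vdim=1024, batch_first=True)

for num_patches in (197, 257, 677):                    # e.g. different image resolutions
    image_feats = torch.randn(1, num_patches, 1024)    # frozen image-encoder features
    q, _ = self_attn(queries, queries, queries)        # queries interact with each other
    z, _ = cross_attn(q, image_feats, image_feats)     # queries attend to the image features
    print(num_patches, "->", z.shape)                  # always torch.Size([1, 32, 768])
```

Because the number of output features is fixed by the queries, the cost of feeding visual context into the LLM later does not grow with the image resolution.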


Initialization

  • Q-Former = pre-trained weights of BERT-base
    • Cross-attention layers = Randomly initialized
  • 32 queries (each of dimension 768); a minimal initialization sketch follows
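
A minimal initialization sketch, assuming the HuggingFace transformers library with bert-base-uncased as the BERT-base checkpoint: the self-attention and FFN weights come from the checkpoint, the newly added cross-attention layers have no pre-trained counterpart and are therefore randomly initialized, and the 32 query embeddings (dimension 768) are created as a fresh learnable parameter.

```python
import torch
from transformers import BertConfig, BertModel

config = BertConfig.from_pretrained("bert-base-uncased")
config.add_cross_attention = True      # new cross-attention layers -> randomly initialized
config.is_decoder = True               # required by HF when cross-attention is added
qformer = BertModel.from_pretrained("bert-base-uncased", config=config)

query_tokens = torch.nn.Parameter(
    torch.zeros(1, 32, config.hidden_size))            # 32 queries, hidden_size = 768
query_tokens.data.normal_(mean=0.0, std=config.initializer_range)
print(query_tokens.shape)              # torch.Size([1, 32, 768])
```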


(2) Bootstrap Vision-Language Representation Learning from a Frozen Image Encoder

a) Dataset: Pre-train using image-text pairs!

b) Goal: Train the Q-Former such that …

\(\rightarrow\) The queries can learn to extract the visual representation that is most informative of the text

c) Loss: Jointly optimize 3 pre-training objectives

  • Inspired by BLIP (Li et al., 2022)
    • c-1) Image-Text Contrastive Learning (ITC)
    • c-2) Image-grounded Text Generation (ITG)
    • c-3) Image-Text Matching (ITM)
  • Each objective employs a different attention masking strategy between queries and text (mask sketches follow this list)
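
A sketch of the three masking strategies as described in the paper (True = attention allowed): ITC uses a unimodal mask (queries and text cannot see each other), ITG uses a multimodal causal mask (queries see only queries; each text token sees all queries and the preceding text tokens), and ITM uses a fully bi-directional mask. The 4-token text length is only for illustration.

```python
import torch

n_q, n_t = 32, 4                             # 32 queries, 4 text tokens (illustrative)
n = n_q + n_t

# ITC: unimodal mask - queries and text cannot attend to each other.
itc = torch.zeros(n, n, dtype=torch.bool)
itc[:n_q, :n_q] = True                       # query <-> query
itc[n_q:, n_q:] = True                       # text  <-> text

# ITG: multimodal causal mask - queries attend only to queries; each text token
# attends to all queries and to the preceding text tokens (for generation).
itg = torch.zeros(n, n, dtype=torch.bool)
itg[:n_q, :n_q] = True
itg[n_q:, :n_q] = True                                    # text -> all queries
itg[n_q:, n_q:] = torch.ones(n_t, n_t).tril().bool()      # causal over text

# ITM: bi-directional mask - every query and text token attends to everything.
itm = torch.ones(n, n, dtype=torch.bool)

print(itc.shape, itg.shape, itm.shape)
```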


(3) Bootstrap Vision-to-Language Generative Learning from a Frozen LLM

Generative pre-training stage

  • Connect the Q-Former (+ frozen image encoder) to a frozen LLM
  • Projection layer: a fully-connected (FC) layer
    • Linearly projects the output query embeddings \(Z\) into the same dimension as the text embeddings of the LLM
  • Projected query embeddings:
    • Prepended to the input text embeddings
    • Function as soft visual prompts that condition the LLM on the visual representation (see the sketch after this list)
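
A minimal sketch of the projection-and-prepend step; the LLM hidden size of 4096, the batch size, and the text length are illustrative assumptions.

```python
import torch
import torch.nn as nn

qformer_dim, llm_dim = 768, 4096            # 4096 is an assumed LLM hidden size
proj = nn.Linear(qformer_dim, llm_dim)      # the FC projection layer

Z = torch.randn(2, 32, qformer_dim)         # output query embeddings from the Q-Former
text_emb = torch.randn(2, 10, llm_dim)      # token embeddings of the input text

visual_prompts = proj(Z)                                    # (2, 32, llm_dim)
llm_inputs = torch.cat([visual_prompts, text_emb], dim=1)   # prepend -> (2, 42, llm_dim)
print(llm_inputs.shape)
```

The LLM itself is not modified; the projected query embeddings simply occupy the first positions of its input sequence as soft visual prompts.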


Two types of LLMs

  • (1) Decoder-based LLMs (e.g., OPT)
    • Pre-train with the language modeling (LM) loss
  • (2) Encoder-decoder-based LLMs (e.g., FlanT5)
    • Pre-train with the prefix LM loss (a label-construction sketch follows this list)
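
A rough sketch of how the two losses differ in target construction, using hypothetical token ids, a fixed halfway prefix split, and the common -100 ignore index as assumptions; the paper's actual prefix/suffix split and loss implementation may differ.

```python
# (1) Decoder-based LLM, LM loss: generate the full caption conditioned on the
#     soft visual prompts; the prompt positions are ignored in the loss (-100).
# (2) Encoder-decoder LLM, prefix LM loss: the caption is split into a prefix
#     (fed to the encoder together with the visual prompts) and a suffix
#     (the generation target for the decoder).
import torch

caption = torch.tensor([101, 2023, 2003, 1037, 4937, 102])   # hypothetical token ids
n_visual = 32                                                # 32 soft visual prompts

# (1) LM loss labels: ignore the visual-prompt positions, predict every text token.
lm_labels = torch.cat([torch.full((n_visual,), -100), caption])

# (2) Prefix LM: the first part of the caption is the prefix, the rest is the target.
split = len(caption) // 2
prefix, target = caption[:split], caption[split:]

print(lm_labels.shape, prefix.tolist(), target.tolist())
```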


3. Examples

figure2