VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

https://arxiv.org/pdf/2111.02358


VLMo

Unified Vision-Language pretrained Model (VLMo)

  • Dual encoder & Fusion encoder
  • Mixture-of-Modality-Experts (MOME) Transformer
    • Each block contains a pool of “modality-specific” experts & a shared self-attention layer (see the MOME block sketch after this list).
  • Can be fine-tuned as … (both modes are sketched in code after this list)
    • Fusion encoder: For vision-language classification tasks
    • Dual encoder: For efficient image-text retrieval
  • Stagewise pre-training strategy (outlined in the last sketch below)
    • Not only image-text pairs
    • But also image-only & text-only data
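
A minimal PyTorch sketch of one MOME Transformer block, assuming hard routing of the whole token sequence to the expert that matches its modality; the class name `MoMEBlock`, the dimensions, and the pre-norm layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MoMEBlock(nn.Module):
    """One MOME block: shared self-attention + a pool of modality experts."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Self-attention parameters are shared across all modalities.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)

        def ffn():
            return nn.Sequential(
                nn.Linear(dim, dim * mlp_ratio),
                nn.GELU(),
                nn.Linear(dim * mlp_ratio, dim),
            )

        # Pool of modality-specific feed-forward experts.
        self.experts = nn.ModuleDict({
            "vision": ffn(),           # V-FFN: image tokens
            "language": ffn(),         # L-FFN: text tokens
            "vision_language": ffn(),  # VL-FFN: fused image-text input
        })

    def forward(self, x, modality):
        # Shared self-attention over the token sequence.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Route to the expert matching the input modality.
        x = x + self.experts[modality](self.norm2(x))
        return x
```

Per the paper, the vision-language expert is used only in the top layers of the fusion encoder; the sketch leaves that per-layer choice to the caller.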

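A hypothetical usage sketch of the two fine-tuning modes above; `encode_image`, `encode_text`, and `encode_fused` are placeholder methods standing in for running the shared MOME blocks with the corresponding experts, not a real VLMo API.

```python
import torch
import torch.nn.functional as F

def retrieval_scores(model, image_tokens, text_tokens):
    """Dual-encoder mode: encode each modality separately, then match by
    cosine similarity, so embeddings can be precomputed for fast retrieval."""
    img = F.normalize(model.encode_image(image_tokens), dim=-1)  # vision expert path
    txt = F.normalize(model.encode_text(text_tokens), dim=-1)    # language expert path
    return img @ txt.t()  # similarity matrix used to rank image-text pairs

def classification_logits(model, image_tokens, text_tokens, head):
    """Fusion-encoder mode: run image and text tokens jointly so cross-modal
    attention (and the vision-language expert) can model their interactions."""
    fused = model.encode_fused(torch.cat([image_tokens, text_tokens], dim=1))
    return head(fused[:, 0])  # classify from the first ([CLS]) token
```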

[Figure 2 of the paper]
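
A schematic outline of the stagewise pre-training strategy; `train_stage` and the freezing details are placeholder assumptions sketched from the paper's description, not the released training code.

```python
def train_stage(model, data, experts, objectives, frozen=()):
    """Placeholder for one pre-training stage: optimize `objectives` on `data`
    with the given experts active and the listed submodules kept frozen."""
    ...

def stagewise_pretrain(model, image_only, text_only, image_text_pairs):
    # Stage 1: vision pre-training on image-only data (masked image modeling);
    # learns the shared self-attention and the vision experts.
    train_stage(model, image_only, experts=("vision",),
                objectives=("masked_image_modeling",))

    # Stage 2: language pre-training on text-only data (masked language
    # modeling); the self-attention and vision experts from stage 1 stay frozen.
    train_stage(model, text_only, experts=("language",),
                objectives=("masked_language_modeling",),
                frozen=("self_attention", "vision_experts"))

    # Stage 3: vision-language pre-training on image-text pairs with image-text
    # contrastive, image-text matching, and masked language modeling losses.
    train_stage(model, image_text_pairs,
                experts=("vision", "language", "vision_language"),
                objectives=("itc", "itm", "mlm"))
```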