Flamingo: a Visual Language Model for Few-Shot Learning

NeurIPS 2022

Seunghan Lee

Deep Learning, Data Science, Statistics

Flamingo: a Visual Language Model for Few-Shot Learning

https://arxiv.org/pdf/2204.14198

1. Abstract

Flamingo

A family of Visual Language Models (VLMs)
Rapidly adapts to novel tasks using only a handful of annotated examples
Key architectural innovations
- (1) Bridge powerful pretrained vision & language models
- (2) Handle sequences of arbitrarily interleaved visual and textual data
- (3) Ingest images or videos as inputs
Trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images

2. Approach

Flamingo = VLM with…

Input: Text interleaved with images/videos
Output: Free-form text
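
To make "text interleaved with images/videos" concrete, here is a minimal sketch of one way to represent such an input and to work out which visual inputs precede each text token. The `("image", ...)` / `("text", ...)` tuple format, the file names, and the function `preceding_image_counts` are assumptions made for illustration, not the paper's actual data format.

```python
from typing import List, Tuple

# Hypothetical representation: an interleaved prompt is a list of items,
# each either ("image", payload) or ("text", token).
Item = Tuple[str, str]


def preceding_image_counts(sequence: List[Item]) -> List[int]:
    """For every text token, count how many visual inputs appear before it."""
    counts: List[int] = []
    seen_images = 0
    for kind, _payload in sequence:
        if kind == "image":
            seen_images += 1
        elif kind == "text":
            counts.append(seen_images)
        else:
            raise ValueError(f"unknown item kind: {kind!r}")
    return counts


# Illustrative two-shot style prompt (file names are placeholders).
prompt: List[Item] = [
    ("image", "cat.jpg"),
    ("text", "A"), ("text", "cat"),
    ("image", "dog.jpg"),
    ("text", "A"), ("text", "dog"),
]

print(preceding_image_counts(prompt))  # [1, 1, 2, 2]
```

Each text token can only be conditioned on the images that come before it in the sequence, which is why the position of an image within the text matters.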

Key architectural components

Two components leverage the pretrained vision and language models and bridge them effectively:
(1) Perceiver Resampler
- Input: Spatio-temporal features from the Vision Encoder
- Output: A fixed number of visual tokens
(2) Cross-attention layers
- Visual tokens are used to condition the frozen LM!
- Offer an expressive way for the LM to incorporate visual information for the next-token prediction task!
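- The model defines the likelihood of text \(y\) conditioned on the interleaved images/videos \(x\): \(p(y \mid x)=\prod_{\ell=1}^L p\left(y_{\ell} \mid y_{<\ell}, x_{\leq \ell}\right)\), where \(y_{\ell}\) is the \(\ell\)-th text token, \(y_{<\ell}\) are the preceding text tokens, and \(x_{\leq \ell}\) are the images/videos that precede \(y_{\ell}\) in the sequence.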
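
The "fixed number of visual tokens" property of the Perceiver Resampler can be illustrated with a small sketch: a fixed set of latent query vectors cross-attends over however many visual features the encoder produces, so the output size depends only on the number of latents. This is a deliberately simplified, single-head, single-layer version in pure Python. The names `resample`, `DIM`, and `NUM_LATENTS`, the random initialisation, and the feature counts are assumptions for illustration; learned projections, feed-forward layers, and layer stacking are omitted.

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def resample(latents: List[Vector], features: List[Vector]) -> List[Vector]:
    """Single-head cross-attention: each latent query attends over all features.

    Returns one output vector per latent, regardless of len(features).
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    outputs: List[Vector] = []
    for query in latents:
        weights = softmax([dot(query, f) * scale for f in features])
        # Weighted average of the feature vectors.
        outputs.append(
            [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
        )
    return outputs


random.seed(0)
DIM, NUM_LATENTS = 4, 3  # toy sizes, chosen only for this example


def random_vectors(n: int) -> List[Vector]:
    return [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(n)]


latents = random_vectors(NUM_LATENTS)   # stand-in for learned latent queries
image_features = random_vectors(9)      # e.g. a 3x3 grid from one image
video_features = random_vectors(40)     # e.g. several frames' worth of features

image_tokens = resample(latents, image_features)
video_tokens = resample(latents, video_features)
print(len(image_tokens), len(video_tokens))  # 3 3
```

Because the output length is fixed, the cross-attention layers in the language model always see the same number of visual tokens per image or video, whatever the input resolution or frame count.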
