Vision-Language Models (VLMs)

Reference: https://encord.com/blog/vision-language-models-guide/


Contents

  • Overview
  • (1) VLM architectures
  • (2) VLM evaluation strategies
  • (3) VLM mainstream datasets
  • (4) Key challenges, primary applications, and future trends


Overview

Vision-language model (VLM)

  • Input: Images & their respective textual descriptions
  • Goal: Learn to associate knowledge from the two modalities
    • Vision model: Captures spatial features from the images
    • Language model: Encodes information from the text

$\rightarrow$ Learns to understand images and transform that knowledge into text (and vice versa)


Training VLMs

  • (1) Pre-training foundation models
    • Contrastive Learning
    • Masked language-image modeling
  • (2) Zero-shot learning & Transfer Learning (w/ fine-tuning)


1. VLM Architectures

Mainstream models: CLIP, Flamingo, and VisualBERT


(1) CLIP

Contrastive learning in CLIP

  • Learns a joint embedding space by maximizing the similarity between matched image-text pairs (and minimizing it for mismatched pairs)
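
A minimal sketch of the symmetric contrastive (InfoNCE-style) objective, assuming batched, already-computed `image_embeds` and `text_embeds` where row i of each corresponds to the same image-text pair (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs."""
    # Normalize so dot products become cosine similarities
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the matched pairs
    logits = image_embeds @ text_embeds.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```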


3-step process (to enable zero-shot predictions)

  • Step 1) Pre-train
    • Jointly train a text encoder & an image encoder
  • Step 2) Convert the target dataset's class labels into captions (e.g., “a photo of a {class}”)
  • Step 3) Zero-shot prediction
    • Estimate the best-matching caption for a given input image

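A rough sketch of steps 2-3, assuming hypothetical `image_encoder` / `text_encoder` callables that return embedding tensors; the prompt template follows CLIP's common “a photo of a {class}” pattern.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, image_encoder, text_encoder):
    # Step 2) Convert class labels into natural-language captions
    captions = [f"a photo of a {name}" for name in class_names]

    # Encode and normalize both modalities (encoders are hypothetical stand-ins)
    img_emb = F.normalize(image_encoder(image), dim=-1)    # (1, d)
    txt_emb = F.normalize(text_encoder(captions), dim=-1)  # (num_classes, d)

    # Step 3) The caption most similar to the image wins
    scores = img_emb @ txt_emb.t()                         # (1, num_classes)
    return class_names[scores.argmax(dim=-1).item()]
```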


ALIGN: Also uses an image encoder and a text encoder, trained with contrastive learning to minimize the distance between embeddings of matched image-text pairs


(2) SimVLM & VirTex & Frozen

PrefixLM

An NLP technique for language-model pre-training

  • Input: Part of the text (= prefix)
  • Goal: Predict the next word in the sequence


PrefixLM in VLMs

Enables the model to predict the next sequence of words based on …

$\rightarrow$ an image & its respective prefix text
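
Below is a minimal, teacher-forced sketch of this objective, assuming placeholder `decoder` and `lm_head` modules (not any specific library's API): the model conditions on image patch embeddings plus the prefix tokens and is penalized only on the continuation tokens.

```python
import torch
import torch.nn.functional as F

def prefix_lm_loss(visual_embeds, prefix_embeds, target_embeds, target_ids,
                   decoder, lm_head):
    """Teacher-forced PrefixLM objective (a sketch, not any model's exact code).
    visual_embeds: (B, V, d) patch embeddings
    prefix_embeds: (B, P, d) embeddings of the text prefix
    target_embeds: (B, T, d) embeddings of the continuation tokens
    target_ids:    (B, T)    their ids; decoder / lm_head are hypothetical."""
    # Condition on [image patches ; text prefix] and teacher-force the continuation
    inputs = torch.cat([visual_embeds, prefix_embeds, target_embeds[:, :-1]], dim=1)
    hidden = decoder(inputs)                               # (B, V+P+T-1, d)

    # Loss is computed only on the positions that predict the continuation
    start = visual_embeds.size(1) + prefix_embeds.size(1) - 1
    logits = lm_head(hidden[:, start:])                    # (B, T, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           target_ids.reshape(-1))
```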


Model: Vision Transformer (ViT)

  • (1) Vision part

    • Divides an image into a 1D sequence of patches

    • Applies a convolution or linear projection over the patches

      $\rightarrow$ Generates contextualized visual embeddings

  • (2) Text part
    • Converts the text prefix paired with the image into token embeddings
  • Transformer’s encoder-decoder blocks receive both “visual” and “token” embeddings
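
A minimal sketch of the vision part's patch embedding, using the common trick of a strided convolution (kernel = stride = patch size), which is equivalent to flattening non-overlapping patches and applying a linear projection; the sizes are illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Splits an image into patches and linearly projects each one."""
    def __init__(self, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        # Strided conv = per-patch linear projection
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, 3, H, W)
        x = self.proj(x)                      # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, dim) -- a 1D patch sequence

patches = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(patches.shape)                          # torch.Size([1, 196, 768])
```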


a) SimVLM

  • A popular architecture that uses the PrefixLM objective
  • Simple Transformer architecture
    • Encoder: to learn image-prefix pairs
    • Decoder: to generate an output sequence
  • Good generalization and zero-shot learning capabilities



b) VirTex

  • (1) Image: **CNN**

  • (2) Text: Textual head with transformers

  • Train the model end-to-end to predict the image captions

    ( by feeding image-text pairs to the textual head )

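A high-level sketch of one end-to-end training step in this style; `cnn_backbone`, `textual_head`, and the loss interface are hypothetical stand-ins, not VirTex's released modules.

```python
import torch

def train_step(images, caption_ids, cnn_backbone, textual_head, optimizer):
    """One end-to-end captioning step: CNN features condition a transformer
    textual head trained to predict the caption tokens (all modules here are
    hypothetical stand-ins)."""
    visual_features = cnn_backbone(images)             # (B, regions, d)
    loss = textual_head(visual_features, caption_ids)  # captioning LM loss
    optimizer.zero_grad()
    loss.backward()                                    # gradients also update the CNN
    optimizer.step()
    return loss.item()
```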


c) Frozen

  • PrefixLM vs. Frozen PrefixLM
    • (1) PrefixLM: Train the visual and textual encoders from scratch

    • (2) Frozen PrefixLM: Use pre-trained networks

      • Only update the parameters of the image encoder
  • Encoders
    • Text encoder: Any LLM
    • Visual encoder: Any pre-trained visual foundation model
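
A minimal sketch of this frozen setup, assuming generic `vision_encoder` and `language_model` modules (hypothetical arguments, not any paper's code): the pre-trained LM stays fixed and only the vision encoder is optimized.

```python
import torch

def build_frozen_prefix_lm(vision_encoder, language_model):
    """Frozen PrefixLM setup (a sketch): the pre-trained LM is kept fixed and
    only the vision encoder learns to produce prefix embeddings the LM can read."""
    for p in language_model.parameters():
        p.requires_grad = False               # LM weights stay frozen
    language_model.eval()

    trainable = [p for p in vision_encoder.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=3e-4)   # illustrative hyperparameter
    return optimizer
```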



(3) Flamingo

Architecture

  • Frozen, pre-trained vision encoder (trained with a CLIP-style contrastive objective)
  • Perceiver Resampler: Converts a variable number of visual features into a fixed number of visual tokens
  • Frozen, pre-trained LM with newly inserted gated cross-attention layers that attend to the visual tokens
    • Only the Resampler and the new cross-attention layers are trained


(4) Multimodal Fusing with Cross-Attention

Adapt a pre-trained LLM for visual representation learning

$\rightarrow$ By adding cross-attention layers
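
A generic illustration of this idea (not any particular model's implementation): a cross-attention block in which the LM's text hidden states attend to visual features, with a residual connection so the pre-trained text pathway is preserved; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class VisualCrossAttention(nn.Module):
    """Text hidden states (queries) attend to visual features (keys/values)."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_hidden, visual_feats):
        # Residual connection keeps the original (pre-trained) text pathway intact
        attended, _ = self.attn(query=text_hidden, key=visual_feats, value=visual_feats)
        return self.norm(text_hidden + attended)

layer = VisualCrossAttention()
out = layer(torch.randn(2, 12, 768), torch.randn(2, 49, 768))
print(out.shape)   # torch.Size([2, 12, 768])
```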


VisualGPT

  • Key: Adaptation of a pre-trained LLM for visual tasks

  • How: Employs a novel self-resurrecting encoder-decoder attention mechanism

    • To quickly adapt the LLM with a small amount of in-domain image-text data
  • Self-resurrecting activation unit (SRAU): produces sparse activations

    $\rightarrow$ Prevents accidental overwriting of the LM's linguistic knowledge

    ( and avoids the issue of vanishing gradients )


Procedure

  • Step 1) Extract relevant objects from an image input

  • Step 2) Feed them to a visual encoder

    $\rightarrow$ Obtain visual representations

  • Step 3) Feed the representations to a decoder

    • Decoder: Initialized with the weights of a pre-trained LLM

    • Self-resurrecting activation unit (SRAU)

      $\rightarrow$ Balances the visual and textual information
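
The sketch below is one possible reading of the SRAU gating described above, with complementary gates and a sparsity threshold `tau`; it is an assumption-laden illustration, not VisualGPT's actual equations.

```python
import torch

def srau_gate(visual_branch, language_branch, gate_logits, tau=0.2):
    """Sketch of a self-resurrecting-style gate: complementary gates mix the
    visual and linguistic branches, and values below tau are zeroed so the
    activations stay sparse (protecting the LM's pre-trained linguistic
    knowledge). Illustrative only; not the paper's exact formulation."""
    g = torch.sigmoid(gate_logits)                    # in (0, 1)
    gate_vis = g * (g > tau).float()                  # sparse visual gate
    gate_lan = (1 - g) * ((1 - g) > tau).float()      # complementary sparse language gate
    return gate_vis * visual_branch + gate_lan * language_branch
```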




(5) Masked Language Modeling (MLM) & Image-Text Matching (ITM)

Adapt the MLM and ITM techniques for visual tasks!


VisualBERT

(Trained on the COCO dataset)


  • A simple and flexible framework for modeling vision-and-language tasks
  • Stack of Transformer layers: Implicitly align elements of …
    • (1) An input text
    • (2) Regions in an associated input image
  • Propose two visually-grounded language model objectives for pre-training: MLM & ITM
    • ITM: Whether or not a caption matches the image
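
A sketch of how the two objectives are combined during pre-training, assuming a hypothetical VisualBERT-style `model` that returns token logits and an image-text match logit; the `-100` convention for unmasked positions follows common MLM practice, not necessarily the original code.

```python
import torch
import torch.nn.functional as F

def pretraining_loss(model, region_feats, masked_token_ids, mlm_labels, match_labels):
    """Sum of the two visually-grounded objectives (a sketch; `model` is a
    hypothetical stand-in). mlm_labels uses -100 at unmasked positions,
    match_labels is 1 if the caption matches the image and 0 otherwise."""
    token_logits, match_logit = model(region_feats, masked_token_ids)

    # (1) Masked language modeling: recover the masked words given image + text
    mlm_loss = F.cross_entropy(token_logits.reshape(-1, token_logits.size(-1)),
                               mlm_labels.reshape(-1), ignore_index=-100)

    # (2) Image-text matching: does this caption describe this image?
    itm_loss = F.binary_cross_entropy_with_logits(match_logit, match_labels.float())
    return mlm_loss + itm_loss
```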


(6) No Training

Directly use large-scale, pre-trained VLMs without any fine-tuning

  • e.g.) MAGIC and ASIF: Training-free frameworks
    • Predict text descriptions for an input image


MAGIC

  • Uses a “specialized score” based on CLIP-generated image embeddings to guide the LLM's output


ASIF

  • Key idea: similar images have similar captions
  • Step 1) Computes the similarities between the query image embedding and the embeddings of a set of candidate images
  • Step 2) Compares the following:
    • Single image) Query image embedding
    • Multiple texts) Text embeddings of the corresponding candidate images
  • Step 3) Predicts the description whose embeddings are most similar to those of the query image
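
A simplified sketch of this retrieval idea (the actual ASIF method adds further details such as sparsification); the embeddings are assumed to come from independently pre-trained unimodal encoders, with candidate image and caption embeddings aligned by index.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def asif_style_caption(query_img_emb, cand_img_embs, cand_txt_embs, cand_captions):
    """Simplified ASIF-style retrieval (a sketch).
    query_img_emb: (1, d), cand_img_embs / cand_txt_embs: (N, d) embeddings of
    the i-th candidate's image and caption, cand_captions: list of N strings."""
    # Step 1) Similarities between the query image and the candidate images
    rel_img = F.normalize(query_img_emb, dim=-1) @ F.normalize(cand_img_embs, dim=-1).t()

    # Step 2) Each candidate caption's similarities to all candidate captions
    rel_txt = F.normalize(cand_txt_embs, dim=-1) @ F.normalize(cand_txt_embs, dim=-1).t()

    # Step 3) Pick the caption whose similarity profile best matches the query's
    scores = F.normalize(rel_img, dim=-1) @ F.normalize(rel_txt, dim=-1).t()
    return cand_captions[scores.argmax().item()]
```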



(7) Knowledge Distillation

Train smaller VLMs by distilling knowledge from larger, pre-trained models.


ViLD

  • Teacher: Pre-trained open-vocabulary image classification model
  • Student: Two-stage detector

$\rightarrow$ Matches the teacher's text embeddings of category names with the student's region (image) embeddings
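
A rough sketch of the two distillation-style objectives this implies, with illustrative shapes and hyperparameters rather than the paper's exact formulation: regions are classified against category text embeddings, and region embeddings are pulled toward the teacher's image embeddings of the cropped proposals.

```python
import torch
import torch.nn.functional as F

def vild_style_losses(region_embs, class_txt_embs, gt_classes, teacher_img_embs,
                      temperature=0.01):
    """Sketch of ViLD-style objectives (illustrative only).
    region_embs:      (R, d) student detector region embeddings
    class_txt_embs:   (C, d) text embeddings of category names
    gt_classes:       (R,)   ground-truth class index per region
    teacher_img_embs: (R, d) teacher image embeddings of cropped proposals"""
    region_embs = F.normalize(region_embs, dim=-1)

    # Text branch: classify each region by similarity to category text embeddings
    logits = region_embs @ F.normalize(class_txt_embs, dim=-1).t() / temperature
    text_loss = F.cross_entropy(logits, gt_classes)

    # Image branch: distill the teacher's image embeddings into the region embeddings
    distill_loss = F.l1_loss(region_embs, F.normalize(teacher_img_embs, dim=-1))
    return text_loss + distill_loss
```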



2. Evaluating VLMs

Assess the quality of the relationships between the image and text

Example) Image captioning model

  • Comparing the generated captions to the ground-truth captions


Various automated n-gram-based **evaluation strategies** compare the predicted labels to the ground truth

  • in terms of accuracy, semantics, and information precision.


Examples:

  • BLEU (Bilingual Evaluation Understudy)

    • Originally proposed to evaluate machine translation tasks

    • How? “Precision” of the target text vs. reference (ground truth)

      by considering how many words in the **candidate sentence** appear in the reference.

  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

    • How? “Recall” by considering how many words in the **reference sentence** appear in the candidate.
  • METEOR (Metric for Evaluation of Translation with Explicit Ordering)

    • How? “Harmonic mean” of precision and recall
      • Weighted toward recall and multiplied by a penalty term
  • CIDEr (Consensus-based Image Description Evaluation)

    • How? “TF-IDF-weighted n-grams”: compares a target sentence to a set of human reference sentences by computing the average similarity between the reference and target sentences
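
As a toy illustration of the precision/recall intuition behind BLEU-1 and ROUGE-1 (real implementations add higher-order n-grams, clipping rules, brevity/fragmentation penalties, stemming, etc.):

```python
from collections import Counter

def unigram_precision_recall(candidate, reference):
    """Toy BLEU-1-style precision and ROUGE-1-style recall (illustration only)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())  # clipped word overlap
    precision = overlap / len(cand)   # share of candidate words found in the reference
    recall = overlap / len(ref)       # share of reference words found in the candidate
    return precision, recall

p, r = unigram_precision_recall("a dog runs in the park",
                                "a dog is running in the park")
print(f"precision={p:.2f}, recall={r:.2f}")   # precision=0.83, recall=0.71
```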
