Vision-Language Models (VLMs)

Reference: https://encord.com/blog/vision-language-models-guide/


Contents

  • Overview
  • (1) VLM architectures
  • (2) VLM evaluation strategies
  • (3) VLM mainstream datasets
  • (4) Key challenges, primary applications, and future trends


Overview

Vision-language model (VLM)

  • Input: Images & their respective textual descriptions
  • Goal: Learn to associate knowledge from the two modalities
    • Vision model: Captures spatial features from the images
    • Language model: Encodes information from the text

$\rightarrow$ Learns to understand images and transform that knowledge into text (and vice versa)


Training VLMs

  • (1) Pre-training foundation models
    • Contrastive Learning
    • Masked language-image modeling
  • (2) Zero-shot learning & Transfer Learning (w/ fine-tuning)


1. VLM Architectures

Mainstream models: CLIP, Flamingo, and VisualBERT


(1) CLIP

Contrastive learning in CLIP

  • Learns a joint embedding space by maximizing the similarity between matched image-text pairs (and minimizing it for mismatched pairs)
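
A minimal sketch of the symmetric contrastive (InfoNCE-style) objective, assuming batched, already-computed `image_embeds` and `text_embeds` where row i of each corresponds to the same image-text pair (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs."""
    # Normalize so dot products become cosine similarities
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the matched pairs
    logits = image_embeds @ text_embeds.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```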


3-step process (to enable zero-shot predictions)

  • Step 1) Pre-train
    • Jointly train a text encoder & an image encoder
  • Step 2) Convert the target dataset's class labels into captions (e.g., “a photo of a {class}”)
  • Step 3) Zero-shot prediction
    • Estimate the best-matching caption for a given input image

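A rough sketch of steps 2-3, assuming hypothetical `image_encoder` / `text_encoder` callables that return embedding tensors; the prompt template follows CLIP's common “a photo of a {class}” pattern.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, image_encoder, text_encoder):
    # Step 2) Convert class labels into natural-language captions
    captions = [f"a photo of a {name}" for name in class_names]

    # Encode and normalize both modalities (encoders are hypothetical stand-ins)
    img_emb = F.normalize(image_encoder(image), dim=-1)    # (1, d)
    txt_emb = F.normalize(text_encoder(captions), dim=-1)  # (num_classes, d)

    # Step 3) The caption most similar to the image wins
    scores = img_emb @ txt_emb.t()                         # (1, num_classes)
    return class_names[scores.argmax(dim=-1).item()]
```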


ALIGN: Also uses an image encoder and a text encoder, trained with contrastive learning to minimize the distance between embeddings of matched image-text pairs


(2) SimVLM & VirTex & Frozen

PrefixLM

An NLP technique for language-model pre-training

  • Input: Part of the text (= prefix)
  • Goal: Predict the next word in the sequence


PrefixLM in VLMs

Enables the model to predict the next sequence of words based on …

$\rightarrow$ an image & its respective prefix text
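
Below is a minimal, teacher-forced sketch of this objective, assuming placeholder `decoder` and `lm_head` modules (not any specific library's API): the model conditions on image patch embeddings plus the prefix tokens and is penalized only on the continuation tokens.

```python
import torch
import torch.nn.functional as F

def prefix_lm_loss(visual_embeds, prefix_embeds, target_embeds, target_ids,
                   decoder, lm_head):
    """Teacher-forced PrefixLM objective (a sketch, not any model's exact code).
    visual_embeds: (B, V, d) patch embeddings
    prefix_embeds: (B, P, d) embeddings of the text prefix
    target_embeds: (B, T, d) embeddings of the continuation tokens
    target_ids:    (B, T)    their ids; decoder / lm_head are hypothetical."""
    # Condition on [image patches ; text prefix] and teacher-force the continuation
    inputs = torch.cat([visual_embeds, prefix_embeds, target_embeds[:, :-1]], dim=1)
    hidden = decoder(inputs)                               # (B, V+P+T-1, d)

    # Loss is computed only on the positions that predict the continuation
    start = visual_embeds.size(1) + prefix_embeds.size(1) - 1
    logits = lm_head(hidden[:, start:])                    # (B, T, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           target_ids.reshape(-1))
```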


Model: Vision Transformer (ViT)

  • (1) Vision part

    • Divides an image into a 1D sequence of patches

    • Applies a convolution or linear projection over the patches

      $\rightarrow$ Generates contextualized visual embeddings

  • (2) Text part
    • Converts the text prefix paired with the image into token embeddings
  • Transformer’s encoder-decoder blocks receive both “visual” and “token” embeddings
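
A minimal sketch of the vision part's patch embedding, using the common trick of a strided convolution (kernel = stride = patch size), which is equivalent to flattening non-overlapping patches and applying a linear projection; the sizes are illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Splits an image into patches and linearly projects each one."""
    def __init__(self, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        # Strided conv = per-patch linear projection
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, 3, H, W)
        x = self.proj(x)                      # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, dim) -- a 1D patch sequence

patches = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(patches.shape)                          # torch.Size([1, 196, 768])
```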


a) SimVLM

  • A popular architecture that uses the PrefixLM objective
  • Simple Transformer architecture
    • Encoder: to learn image-prefix pairs
    • Decoder: to generate an output sequence
  • Good generalization and zero-shot learning capabilities



b) VirTex

  • (1) Image: **CNN**

  • (2) Text: Textual head with transformers

  • Train the model end-to-end to predict the image captions

    ( by feeding image-text pairs to the textual head )

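A high-level sketch of one end-to-end training step in this style; `cnn_backbone`, `textual_head`, and the loss interface are hypothetical stand-ins, not VirTex's released modules.

```python
import torch

def train_step(images, caption_ids, cnn_backbone, textual_head, optimizer):
    """One end-to-end captioning step: CNN features condition a transformer
    textual head trained to predict the caption tokens (all modules here are
    hypothetical stand-ins)."""
    visual_features = cnn_backbone(images)             # (B, regions, d)
    loss = textual_head(visual_features, caption_ids)  # captioning LM loss
    optimizer.zero_grad()
    loss.backward()                                    # gradients also update the CNN
    optimizer.step()
    return loss.item()
```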


c) Frozen

  • PrefixLM vs. Frozen PrefixLM
    • (1) PrefixLM: Train the visual and textual encoders from scratch

    • (2) Frozen PrefixLM: Use pre-trained networks

      • Only update the parameters of the image encoder
  • Encoders
    • Text encoder: Any LLM
    • Visual encoder: Any pre-trained visual foundation model
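
A minimal sketch of this frozen setup, assuming generic `vision_encoder` and `language_model` modules (hypothetical arguments, not any paper's code): the pre-trained LM stays fixed and only the vision encoder is optimized.

```python
import torch

def build_frozen_prefix_lm(vision_encoder, language_model):
    """Frozen PrefixLM setup (a sketch): the pre-trained LM is kept fixed and
    only the vision encoder learns to produce prefix embeddings the LM can read."""
    for p in language_model.parameters():
        p.requires_grad = False               # LM weights stay frozen
    language_model.eval()

    trainable = [p for p in vision_encoder.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=3e-4)   # illustrative hyperparameter
    return optimizer
```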



(3) Flamingo

Architecture

  • Frozen, pre-trained vision encoder (trained with a CLIP-style contrastive objective)
  • Perceiver Resampler: Converts a variable number of visual features into a fixed number of visual tokens
  • Frozen, pre-trained LM with newly inserted gated cross-attention layers that attend to the visual tokens
    • Only the Resampler and the new cross-attention layers are trained


(4) Multimodal Fusing with Cross-Attention

Adapt a pre-trained LLM for visual representation learning

$\rightarrow$ By adding cross-attention layers
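
A generic illustration of this idea (not any particular model's implementation): a cross-attention block in which the LM's text hidden states attend to visual features, with a residual connection so the pre-trained text pathway is preserved; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class VisualCrossAttention(nn.Module):
    """Text hidden states (queries) attend to visual features (keys/values)."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_hidden, visual_feats):
        # Residual connection keeps the original (pre-trained) text pathway intact
        attended, _ = self.attn(query=text_hidden, key=visual_feats, value=visual_feats)
        return self.norm(text_hidden + attended)

layer = VisualCrossAttention()
out = layer(torch.randn(2, 12, 768), torch.randn(2, 49, 768))
print(out.shape)   # torch.Size([2, 12, 768])
```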


VisualGPT

  • Key: Adaptation of a pre-trained LLM for visual tasks

  • How: Employs a novel self-resurrecting encoder-decoder attention mechanism

    • To quickly adapt the LLM with a small amount of in-domain image-text data
  • Self-resurrecting activation unit (SRAU): produces sparse activations

    $\rightarrow$ Prevents accidental overwriting of the LM's linguistic knowledge

    ( and avoids the issue of vanishing gradients )


Procedure

  • Step 1) Extract relevant objects from an image input

  • Step 2) Feed them to a visual encoder

    $\rightarrow$ Obtain visual representations

  • Step 3) Feed the representations to a decoder

    • Decoder: Initialized with the weights of a pre-trained LLM

    • Self-resurrecting activation unit (SRAU)

      $\rightarrow$ Balances the visual and textual information
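
The sketch below is one possible reading of the SRAU gating described above, with complementary gates and a sparsity threshold `tau`; it is an assumption-laden illustration, not VisualGPT's actual equations.

```python
import torch

def srau_gate(visual_branch, language_branch, gate_logits, tau=0.2):
    """Sketch of a self-resurrecting-style gate: complementary gates mix the
    visual and linguistic branches, and values below tau are zeroed so the
    activations stay sparse (protecting the LM's pre-trained linguistic
    knowledge). Illustrative only; not the paper's exact formulation."""
    g = torch.sigmoid(gate_logits)                    # in (0, 1)
    gate_vis = g * (g > tau).float()                  # sparse visual gate
    gate_lan = (1 - g) * ((1 - g) > tau).float()      # complementary sparse language gate
    return gate_vis * visual_branch + gate_lan * language_branch
```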




(5) Masked Language Modeling (MLM) & Image-Text Matching (ITM)

Adapt the MLM and ITM techniques for visual tasks!


VisualBERT

(Trained on the COCO dataset)


  • A simple and flexible framework for modeling vision-and-language tasks
  • Stack of Transformer layers: Implicitly align elements of …
    • (1) An input text
    • (2) Regions in an associated input image
  • Propose two visually-grounded language model objectives for pre-training: MLM & ITM
    • ITM: Whether or not a caption matches the image
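
A sketch of how the two objectives are combined during pre-training, assuming a hypothetical VisualBERT-style `model` that returns token logits and an image-text match logit; the `-100` convention for unmasked positions follows common MLM practice, not necessarily the original code.

```python
import torch
import torch.nn.functional as F

def pretraining_loss(model, region_feats, masked_token_ids, mlm_labels, match_labels):
    """Sum of the two visually-grounded objectives (a sketch; `model` is a
    hypothetical stand-in). mlm_labels uses -100 at unmasked positions,
    match_labels is 1 if the caption matches the image and 0 otherwise."""
    token_logits, match_logit = model(region_feats, masked_token_ids)

    # (1) Masked language modeling: recover the masked words given image + text
    mlm_loss = F.cross_entropy(token_logits.reshape(-1, token_logits.size(-1)),
                               mlm_labels.reshape(-1), ignore_index=-100)

    # (2) Image-text matching: does this caption describe this image?
    itm_loss = F.binary_cross_entropy_with_logits(match_logit, match_labels.float())
    return mlm_loss + itm_loss
```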


(6) No Training

Directly use large-scale, pre-trained VLMs without any fine-tuning

  • e.g.) MAGIC and ASIF: Training-free frameworks
    • Predict text descriptions for an input image


MAGIC

  • Uses a “specialized score” based on CLIP-generated image embeddings to guide the LLM's output


ASIF

  • Key idea: similar images have similar captions
  • Step 1) Computes the similarities between the query image embedding and the embeddings of a set of candidate images
  • Step 2) Compares the following:
    • Single image) Query image embedding
    • Multiple texts) Text embeddings of the corresponding candidate images
  • Step 3) Predicts the description whose embeddings are most similar to those of the query image
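
A simplified sketch of this retrieval idea (the actual ASIF method adds further details such as sparsification); the embeddings are assumed to come from independently pre-trained unimodal encoders, with candidate image and caption embeddings aligned by index.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def asif_style_caption(query_img_emb, cand_img_embs, cand_txt_embs, cand_captions):
    """Simplified ASIF-style retrieval (a sketch).
    query_img_emb: (1, d), cand_img_embs / cand_txt_embs: (N, d) embeddings of
    the i-th candidate's image and caption, cand_captions: list of N strings."""
    # Step 1) Similarities between the query image and the candidate images
    rel_img = F.normalize(query_img_emb, dim=-1) @ F.normalize(cand_img_embs, dim=-1).t()

    # Step 2) Each candidate caption's similarities to all candidate captions
    rel_txt = F.normalize(cand_txt_embs, dim=-1) @ F.normalize(cand_txt_embs, dim=-1).t()

    # Step 3) Pick the caption whose similarity profile best matches the query's
    scores = F.normalize(rel_img, dim=-1) @ F.normalize(rel_txt, dim=-1).t()
    return cand_captions[scores.argmax().item()]
```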



(7) Knowledge Distillation

Train smaller VLMs by distilling knowledge from larger, pre-trained models.


ViLD

  • Teacher: Pre-trained open-vocabulary image classification model
  • Student: Two-stage detector

$\rightarrow$ Matches the teacher's text embeddings of category names with the student's region (image) embeddings
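
A rough sketch of the two distillation-style objectives this implies, with illustrative shapes and hyperparameters rather than the paper's exact formulation: regions are classified against category text embeddings, and region embeddings are pulled toward the teacher's image embeddings of the cropped proposals.

```python
import torch
import torch.nn.functional as F

def vild_style_losses(region_embs, class_txt_embs, gt_classes, teacher_img_embs,
                      temperature=0.01):
    """Sketch of ViLD-style objectives (illustrative only).
    region_embs:      (R, d) student detector region embeddings
    class_txt_embs:   (C, d) text embeddings of category names
    gt_classes:       (R,)   ground-truth class index per region
    teacher_img_embs: (R, d) teacher image embeddings of cropped proposals"""
    region_embs = F.normalize(region_embs, dim=-1)

    # Text branch: classify each region by similarity to category text embeddings
    logits = region_embs @ F.normalize(class_txt_embs, dim=-1).t() / temperature
    text_loss = F.cross_entropy(logits, gt_classes)

    # Image branch: distill the teacher's image embeddings into the region embeddings
    distill_loss = F.l1_loss(region_embs, F.normalize(teacher_img_embs, dim=-1))
    return text_loss + distill_loss
```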



2. Evaluating VLMs

Assess the quality of the relationships between the image and text

Example) Image captioning model

  • Comparing the generated captions to the ground-truth captions


Various automated n-gram-based **evaluation strategies** compare the predicted labels to the ground truth

  • in terms of accuracy, semantics, and information precision.


Examples:

  • BLEU (Bilingual Evaluation Understudy)

    • Originally proposed to evaluate machine translation tasks

    • How? “Precision” of the target text vs. reference (ground truth)

      by considering how many words in the **candidate sentence** appear in the reference.

  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

    • How? “Recall” by considering how many words in the **reference sentence** appear in the candidate.
  • METEOR (Metric for Evaluation of Translation with Explicit Ordering)

    • How? “Harmonic mean” of precision and recall
      • Weighted toward recall and multiplied by a penalty term
  • CIDEr (Consensus-based Image Description Evaluation)

    • How? “TF-IDF-weighted n-grams”: compares a target sentence to a set of human reference sentences by computing the average similarity between the reference and target sentences
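
As a toy illustration of the precision/recall intuition behind BLEU-1 and ROUGE-1 (real implementations add higher-order n-grams, clipping rules, brevity/fragmentation penalties, stemming, etc.):

```python
from collections import Counter

def unigram_precision_recall(candidate, reference):
    """Toy BLEU-1-style precision and ROUGE-1-style recall (illustration only)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())  # clipped word overlap
    precision = overlap / len(cand)   # share of candidate words found in the reference
    recall = overlap / len(ref)       # share of reference words found in the candidate
    return precision, recall

p, r = unigram_precision_recall("a dog runs in the park",
                                "a dog is running in the park")
print(f"precision={p:.2f}, recall={r:.2f}")   # precision=0.83, recall=0.71
```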
