MMRL: Multi-Modal Representation Learning for Vision-Language Models
Contents
- Abstract
- MMRL
- Previous Works vs. MMRL
- Training & Inference of MMRL
- Introduction
- VLMs
- Prompt-based approaches
- Adapter-style learning methods
- Proposal: MMRL
- Related Work
- VLMs
- Efficient Transfer Learning
0. Abstract
Large-scale pre-trained VLMs
- Essential for transfer learning
- Adapting with “limited” few-shot data \(\rightarrow\) Overfitting!
(1) MMRL
Proposal: Multi-Modal Representation Learning (MMRL)
- (1) Shared + (2) Learnable + (3) Modality-agnostic representation space
- Projects tokens from this space \(\rightarrow\) “text & image” representation tokens
\(\rightarrow\) Facilitating more effective multi-modal interactions (see the sketch below)
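The following is a minimal sketch of this idea in a PyTorch setting; the class name `RepresentationSpace`, the token count, and the embedding dimensions are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RepresentationSpace(nn.Module):
    """Shared, modality-agnostic learnable tokens projected into each modality."""
    def __init__(self, num_tokens=5, space_dim=512, visual_dim=768, text_dim=512):
        super().__init__()
        # Learnable tokens living in a modality-agnostic space (shapes are assumptions)
        self.tokens = nn.Parameter(torch.randn(num_tokens, space_dim) * 0.02)
        # Separate projections map the shared tokens into the image / text token spaces
        self.to_visual = nn.Linear(space_dim, visual_dim)
        self.to_text = nn.Linear(space_dim, text_dim)

    def forward(self):
        # Produce visual and textual representation tokens from the same shared space
        return self.to_visual(self.tokens), self.to_text(self.tokens)

vis_tokens, txt_tokens = RepresentationSpace()()
print(vis_tokens.shape, txt_tokens.shape)  # torch.Size([5, 768]) torch.Size([5, 512])
```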
(2) Previous works vs. MMRL
[Previous] Solely optimize class token features
[MMRL] Integrates representation tokens at higher layers of the encoders
- Higher layers = Dataset-specific features
- Lower layers = Generalized knowledge
(3) Training & Inference of MMRL
a) Training
Both (1) representation features & (2) class features are optimized
- (1) Representation tokens: with “trainable” projection layer
- (2) Class token: with “frozen” projection layer
Regularization term
- To align the class features & text features with the zero-shot features from the frozen VLM
- Safeguarding the model’s generalization capacity
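As a rough sketch of how these pieces could combine into a single objective (the function name, the cosine distance, and the weight `lam` are assumptions for illustration, not the paper's exact loss):

```python
import torch.nn.functional as F

def mmrl_style_loss(logits_rep, logits_cls, labels,
                    img_cls_feat, img_cls_feat_zs,
                    txt_feat, txt_feat_zs, lam=1.0):
    # (1) Task loss on representation-token features (trainable projection branch)
    loss_rep = F.cross_entropy(logits_rep, labels)
    # (2) Task loss on class-token features (frozen, original projection branch)
    loss_cls = F.cross_entropy(logits_cls, labels)
    # Regularization: keep class and text features close to the frozen VLM's
    # zero-shot features, safeguarding generalization
    reg = (1 - F.cosine_similarity(img_cls_feat, img_cls_feat_zs, dim=-1)).mean() \
        + (1 - F.cosine_similarity(txt_feat, txt_feat_zs, dim=-1)).mean()
    return loss_rep + loss_cls + lam * reg
```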
b) Inference
**Decoupling strategy**
- [For “base” class] Both representation & class features
- [For “new” classes] Only the class features
- which retain more generalized knowledge
https://github.com/yunncheng/MMRL
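A small sketch of what the decoupling strategy above could look like at inference time; the fusion weight `alpha` and the simple weighted sum are assumptions, not the paper's exact scheme.

```python
def decoupled_logits(logits_rep, logits_cls, is_base_class, alpha=0.5):
    """Combine both branches for base classes; fall back to the class branch otherwise."""
    if is_base_class:
        # Base classes: fuse representation-token and class-token predictions
        return alpha * logits_rep + (1 - alpha) * logits_cls
    # New / unseen classes: rely only on the class-token features,
    # which retain more generalized knowledge from the frozen VLM
    return logits_cls
```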
1. Introduction
(1) VLMs
CLIP-based models
- “Distinct” encoders for images & text
- Employ contrastive learning (CL) on over 400 million image-text pairs
Limitation of VLMs: Adapting to new tasks
\(\rightarrow\) \(\because\) Fine-tuning requires considerable computational resources
Efficient adaptation of VLMs
- a) Prompt engineering
- Involves crafting dataset-specific prompts
- e.g., “A photo of a [CLASS], a type of pet.”
- b) Ensembling
- Integrate multiple zero-shot classifiers by varying context prompts (see the sketch below)
- e.g., “A photo of a big [CLASS].” and “A photo of a small [CLASS].”
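A hedged sketch of prompt ensembling in a CLIP-like setup: average the text embeddings produced by several context templates into one zero-shot classifier per class. `encode_text` is a stand-in for a CLIP-style text encoder, not a specific API, and the templates are illustrative.

```python
import torch
import torch.nn.functional as F

TEMPLATES = ["A photo of a {}.", "A photo of a big {}.", "A photo of a small {}."]

def build_zero_shot_classifier(classnames, encode_text):
    weights = []
    for name in classnames:
        prompts = [t.format(name) for t in TEMPLATES]
        emb = F.normalize(encode_text(prompts), dim=-1)        # (num_templates, d)
        weights.append(F.normalize(emb.mean(dim=0), dim=-1))   # ensemble by averaging
    return torch.stack(weights)                                # (num_classes, d)

# Toy usage with a dummy encoder standing in for CLIP's text encoder
dummy_encode = lambda prompts: torch.randn(len(prompts), 512)
classifier = build_zero_shot_classifier(["cat", "dog"], dummy_encode)
print(classifier.shape)  # torch.Size([2, 512])
```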
(2) Prompt-based approaches
(1) CoOp
- [Background] Manual prompt design is time-consuming!!
(+ Does not guarantee the discovery of optimal prompts)
- [Proposal] “Prompt learning” (see the sketch below)
- Prompts are modeled as continuous learnable vectors
- Optimized during training while keeping VLM parameters fixed
\(\rightarrow\) Enables efficient dataset adaptation
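A minimal sketch of the CoOp idea under these assumptions: a small set of learnable context vectors replaces the hand-crafted prompt, is prepended to each class-name embedding, and is the only trained component (context length and embedding dimension are illustrative).

```python
import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    """Learnable prompt context prepended to class-name token embeddings."""
    def __init__(self, n_ctx=4, ctx_dim=512):
        super().__init__()
        # The only trainable parameters; the VLM itself stays frozen
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)

    def forward(self, class_token_embeddings):
        # class_token_embeddings: (num_classes, n_name_tokens, ctx_dim)
        n_cls = class_token_embeddings.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # Prepend the shared learnable context to every class-name embedding
        return torch.cat([ctx, class_token_embeddings], dim=1)

prompts = LearnableContext()(torch.randn(10, 8, 512))  # 10 classes, 8 name tokens each
print(prompts.shape)  # torch.Size([10, 12, 512])
```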
(2) MaPLe
- [Background] Identified that prompt learning solely within the TEXT modality may be sub-optimal
- [Proposal] Proposes a “Multi-modal prompt learning” approach
- Embed deep prompts into the lower layers of both VLM encoders
- Via a coupling function to enhance alignment between visual and textual representations.
(3) Adapter-style learning methods
Lightweight modules (e.g., MLPs) are integrated within VLMs
\(\rightarrow\) To adjust extracted features for downstream datasets
(1) CLIP-Adapter
- Freezes the VLM
- Fine-tunes features via an MLP adapter added to the image encoder
(Incorporates residual connections for feature fusion; see the sketch below)
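A short sketch of a CLIP-Adapter-style module; the bottleneck reduction factor and the residual ratio are illustrative assumptions. The residual blend keeps most of the frozen CLIP feature, so the adapter only adds a small task-specific correction.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck MLP adapter with residual fusion over frozen CLIP features."""
    def __init__(self, dim=512, reduction=4, ratio=0.2):
        super().__init__()
        self.ratio = ratio
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.ReLU(inplace=True),
        )

    def forward(self, image_features):
        # Residual fusion: mostly the original frozen feature, plus a small adapted part
        return self.ratio * self.mlp(image_features) + (1 - self.ratio) * image_features

print(Adapter()(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```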
(2) MMA
- Multimodal adapter
- Refines the alignment between text & vision
- By aggregating features from diverse branches into a unified feature space
- Reveals that different layers within VLM encoders capture varying characteristics
- “Higher” layers = Discriminative, dataset-specific information
- “Lower” layers = Generalizable features
Current multimodal deep prompt learning methods
- Apply prompt concatenation at shallow layers
\(\rightarrow\) May compromise generalizable knowledge
Limitation of previous works
- [Prompting] Map visual prompts from text prompts
- Incorporate visual information via gradient propagation, but ultimately remain text-centric
\(\rightarrow\) Updates focused mainly on text prompts
- [Prompting & Adapter-style] Solely optimize class token features using task-specific objectives
\(\rightarrow\) Vulnerable to overfitting to specific data distributions or task categories
- Especially when training data is scarce (e.g., few-shot setting)
(4) Proposal: MMRL
- Novel multimodal representation learning framework
- Distinguishes itself from prompt learning and adapter-style methods
- Shared, learnable representation space
- Independent of any modality, residing within the higher layers of the encoders
- Serves as a bridge for multimodal interaction
- Mapping tokens from this space \(\rightarrow\) image and text tokens
- Concatenated with the original encoder tokens (see the sketch below)
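A hedged sketch of this injection: below an assumed start layer the encoder runs unchanged, and from that layer onward the projected representation tokens are concatenated with the encoder tokens. The layer index, shapes, and single concatenation point are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

def encode_with_representation_tokens(blocks, tokens, rep_tokens, start_layer=6):
    """blocks: list of transformer blocks; tokens: (B, N, D); rep_tokens: (K, D)."""
    for i, block in enumerate(blocks):
        if i == start_layer:
            # Inject representation tokens only at the higher layers
            rep = rep_tokens.unsqueeze(0).expand(tokens.shape[0], -1, -1)
            tokens = torch.cat([tokens, rep], dim=1)
        tokens = block(tokens)
    return tokens

# Toy usage with identity blocks standing in for transformer layers
blocks = [nn.Identity() for _ in range(12)]
out = encode_with_representation_tokens(blocks, torch.randn(2, 197, 768), torch.randn(5, 768))
print(out.shape)  # torch.Size([2, 202, 768])
```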
- Two types of tokens
- (1) Representation tokens
- Designed to learn dataset-specific knowledge from downstream tasks
- (2) (Original) Classification token
- Regularized to retain a significant amount of generalizable knowledge
- Three key advantages
- (1) “Unbiased shared” representation space
- (2) Preservation of original “VLM generalization”
- By avoiding prompt integration at shallow encoder layers
- (3) “Decoupled inference” across classes
- Prompt learning & adapter-style methods: Refine only the class token features through learnable prompts or adapters
- Prioritizes optimizing the “representation token” features
- Representation tokens: trainable projection layer
- Class token: frozen (original) projection layer
- Regularization term
- Goal: To further preserve the generalizability of the class token
- Aligns its features with the zero-shot features from the frozen VLM
- Inference
- [Base classes] Use both representation and class token features
- [Unseen classes] Use only the class token features
Main Contributions
- Multi-Modal Representation Learning (MMRL) framework
- Incorporates a shared, unbiased, learnable space that bridges image and text modalities
- Facilitates multimodal interaction at the higher layers of the original encoders
- **Decoupling strategy**
- Preserves VLM generalization by adapting representation tokens for downstream tasks
- While regularizing the original class token for new tasks
- Extensive experiments
- Substantially improves downstream adaptation and generalization
2. Related Work
(1) VLMs
VLMs
- Typically learn joint image-language representations via SSL
- Leverage large-scale architectures & massive collections of image-text pairs
- e.g., CLIP [34], ALIGN [16], FILIP [50], KOSMOS [15, 33], and VILA [24]
Examples
- CLIP: Trained on a collection of 400 million image-text pairs
- ALIGN: Leverages an impressive 1.8 billion pairs
Pros & Cons
- Pros: Excel at learning generalized representations
- Cons: Efficiently adapting them to specific downstream tasks remains a challenge
(2) Efficient Transfer Learning
a) Prompt Learning
Effective for adapting VLMs
Examples
- (1) CoOp: Prompt learning by replacing fixed templates with learnable continuous vectors
\(\rightarrow\) But compromises CLIP’s zero-shot and generalization capabilities
- (2) CoCoOp: Incorporates visual cues to generate instance-specific prompts
\(\rightarrow\) Improves generalization to class distribution shifts
- (3) ProDA: Learns prompt distributions to enhance adaptability
- (4) PLOT: Optimal transport to align the vision and text modalities
- (5) KgCoOp: Retains general textual knowledge by minimizing divergence between learned & crafted prompts (see the sketch at the end of this list)
- (6) ProGrad: Selectively updates gradients aligned with general knowledge
- (7) RPO: Mitigates internal representation shifts using masked attention
Moving beyond text-focused approaches
- (8) MaPLe: Integrates visual prompts mapped from text prompts through a coupling function
- (9) ProVP: Employs single-modal visual prompts with contrastive feature re-formation
- To align prompted visual features with CLIP’s distribution
- (10) PromptSRC: Employs a self-regularization strategy to mitigate overfitting
- (11) MetaPrompt: Meta-learning-based prompt tuning algorithm that encourages task-specific prompts to generalize across various domains or classes
- (12) TCP: Adapts textual knowledge into class-aware tokens
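For item (5), a hedged sketch of the knowledge-guided idea: penalize the gap between text features produced by learned prompts and by hand-crafted prompts, so the learned prompts stay close to general textual knowledge. The 1 − cosine distance used here is an illustrative choice, not necessarily the method's exact metric.

```python
import torch
import torch.nn.functional as F

def knowledge_guided_reg(text_feat_learned, text_feat_handcrafted):
    # Both inputs: (num_classes, d) text embeddings from the frozen text encoder,
    # one set from learned prompts, one from hand-crafted prompts
    sim = F.cosine_similarity(text_feat_learned, text_feat_handcrafted, dim=-1)
    return (1 - sim).mean()

print(knowledge_guided_reg(torch.randn(10, 512), torch.randn(10, 512)))
```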