Retrieval-Enhanced Contrastive Vision-Text Models
https://arxiv.org/pdf/2306.07196
Contents
- Abstract
- Introduction
- Related Works
- Visual-text pretraining
- Knowledge-based vision-text models
- Retrieval-based methods
- Method
- Retrieving Cross-modal External Knowledge
- Learning how to fuse the retrieved knowledge
- Experiments
- Experimental Setup
- Zero-shot Transfer
Abstract
(1) Contrastive image-text models (e.g., CLIP)
- Building blocks of many SOTA systems
- Excel at recognizing common generic concepts
\(\rightarrow\) Limitation: Struggle on fine-grained entities
(2) Proposal: Retrieval-enhanced contrastive (RECO) training
- Instead of encoding fine-grained knowledge directly into the model's parameters, train the model to retrieve this knowledge from an external memory
- Equip existing vision-text models with the ability to refine their embeddings with cross-modal information retrieved from a memory at inference time
(3) Effect
- Greatly improves their zero-shot predictions
- Can be done with a light-weight, single-layer fusion transformer on top of a frozen CLIP
(4) Experiments
- Improves CLIP performance substantially on several challenging fine-grained tasks
- +10.9 on Stanford Cars
- +10.2 on CUB-2011
- +7.3 on the recent OVEN benchmark
1. Introduction
P1) Recent VLMs & Limitation
Development of VLM
\(\rightarrow\) Highly adaptable to various downstream tasks
Two parallel encoders trained with contrastive learning (i.e., two-tower models)
- Encode images and texts into an aligned latent space
- Enables appealing capabilities such as zero-shot transfer to different downstream applications
- e.g. image classification, image-text retrieval, open-world recognition
Limitation: Struggle on tasks requiring a more fine-grained understanding
P2) Two approaches
Approach 1) Scale and curate the pre-training dataset
- To cover more and more image-text associations
Approach 2) Memory or knowledge-based approaches
- Rely on access to an external source of knowledge
- K-LITE: Improves vision-text models by enhancing the text captions with more comprehensive text definitions retrieved from an external dictionary
- Limitation? Initial captions are augmented within their modality only
\(\rightarrow\) Limiting the potential added-value brought by the retrieved items!
P3) Proposal
Retrieval-augmented approach
- Critical observation: (a) is simpler than (b)
- (a) Matching representations within the same modality
- (b) Matching representations across different modalities
\(\rightarrow\) Proposal: Utilize the inherent strength of learned image and text representations within their respective modalities to aid the alignment across modalities
Details
- Convert these unimodal representations into a multi-modal format
- To improve their compatibility
- Utilizing a web-scale corpus of image-text pairs for retrieval…
- Use image representation as a query
- To identify the top-\(k\) most similar images
- Incorporate the associated text to create a multi-modal representation.
- Use a text representation as a query
- To identify the top-\(k\) most similar texts
- Integrate the associated images to create a multi-modal representation
2. Related Works
(1) Visual-text pretraining
CLIP, ALIGN
- Potential of contrastive image-text pre-training
- Two parallel uni-modal encoders
- Cross-modal contrastive objective
Vision-text contrastive models
= Basic building blocks of more powerful foundational models
- e.g., CoCa (Yu et al., 2022), Flamingo (Alayrac et al., 2022), FLAVA (Singh et al., 2022), and PaLI (Chen et al., 2023)
Proposed work
- Enhances the capabilities of CLIP (but not specific to CLIP)
- How? By adding a light-weight retrieval module.
(2) Knowledge-based vision-text models
Improving upon different aspects of the contrastive vision-text models
- (1) Training objectives
- (2) Scaling
\(\rightarrow\) Little exploration has been done on their combination with memory or knowledge-based techniques
Knowledge-based vision-text models
[1] REACT (Liu et al., 2023)
- Retrieves image-text pairs from an external memory
\(\rightarrow\) Builds a training dataset specialized for a specific downstream task.
- Proposed (vs. REACT)
- (1) Does not require any prior knowledge about the nature of the downstream task
\(\rightarrow\) \(\therefore\) Applicable in a fully zero-shot transfer setting
- (2) Leverages items from the memory at inference time
( REACT: uses retrieved items to automatically generate a training set to finetune the model )
[2] K-LITE (Shen et al., 2022)
- Learns vision-text models by leveraging external sources of knowledge to complete captions with more descriptive content
- e.g., WordNet (Miller, 1995) or Wiktionary (Meyer & Gurevych, 2012)
- Proposed (vs. K-LITE)
- K-LITE: Retrieved knowledge is uni-modal (text) & External memory is not used for the image tower
[3] NNCLR (Dwibedi et al., 2021)
- Image-only representation learning (not a VLM)
- Finds the visual nearest-neighbor of each training image from a memory
[4] LGSimCLR (Banani et al., 2023)
- Uses language guidance to find the most similar visual nearest neighbor
NNCLR & LGSimCLR
- (a) Only learn visual representations
- (b) Use retrieval to enhance their supervision during training but not at inference
(3) Retrieval-based methods
Main argument of the retrieval-based methods
Not all the world knowledge can be compiled into a model’s parameters
\(\rightarrow\) \(\therefore\) Should also learn to rely on items retrieved from an external memory at inference
Retrieval-based methods
- (Originally) Showed their promise in various NLP tasks
- (Recently) Increasing interest in retrieval-based methods within computer vision
[1] SuS-X (Udandarao et al., 2023)
- “Cross-modal search and cross-modal fusion”
- Retrieve similar samples to the query sample from a large data-bank
- Improve zero-shot classification performance
[2] RA-CLIP (Xie et al., 2023)
- Enriches the CLIP visual representation
- Retrieves both images and texts from a memory
- Limitation: Their attempt to also enrich the text representation degrades performance
3. Method
Goal: Equip powerful pre-trained VLMs with the ability to complement their representations with cross-modal knowledge retrieved from an external memory
Details
- Do not retrain from scratch
- Learn a light-weight retrieval fusion module on top of them
- Does not propose a new model or loss
( Rather a new way of adapting pre-trained models )
Preliminaries
Notation:
- \(\mathbf{v}=f_{\text{image}}(I)\): image embedding of image \(I\)
- \(\mathbf{t}=f_{\text{text}}(T)\): text embedding of caption \(T\)
InfoNCE loss between embeddings of different modalities:
- \(\mathcal{L}_{\mathrm{NCE}}(\mathbf{V}, \mathbf{T})=-\sum_{i=1}^n\left[\log \frac{e^{\mathbf{v}_i^{\top} \mathbf{t}_i / \tau}}{\sum_j e^{\mathbf{v}_i^{\top} \mathbf{t}_j / \tau}}+\log \frac{e^{\mathbf{v}_i^{\top} \mathbf{t}_i / \tau}}{\sum_j e^{\mathbf{v}_j^{\top} \mathbf{t}_i / \tau}}\right]\)
- where \(\mathbf{V}\) (resp. \(\mathbf{T}\)) is the matrix composed of the \(n\) visual (resp. text) embeddings
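A minimal PyTorch sketch of this symmetric InfoNCE loss (the function name `info_nce` and the default temperature are illustrative, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def info_nce(V: torch.Tensor, T: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between n image embeddings V and n text embeddings T.

    V, T: (n, d) L2-normalized embeddings; tau: temperature.
    """
    logits = V @ T.t() / tau                           # (n, n) pairwise similarities
    labels = torch.arange(V.size(0), device=V.device)  # matching pairs lie on the diagonal
    # image-to-text term + text-to-image term, summed over the batch as in the formula above
    return (F.cross_entropy(logits, labels, reduction="sum")
            + F.cross_entropy(logits.t(), labels, reduction="sum"))
```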
Propose to augment the text and visual embeddings ( i.e. \(\mathbf{t}\) and \(\mathbf{v}\) ) with external cross-modal knowledge
- To enhance both their expressiveness and their cross-modality alignment
Following section
- 3-1) How to retrieve relevant cross-modal knowledge
- based on within-modality search
- 3-2) How to fuse the retrieved information into the original embeddings
(1) Retrieving Cross-modal External Knowledge
a) Memory
External source of knowledge by a memory
- \(\mathcal{M}=\left\{\left(I_i, T_i\right)\right\}_{i=1}^M\) of \(M\) image-text pairs.
- Assume that \(\mathcal{M}\) is very large and covers a broad range of concepts
(In practice) Only a small subset of \(\mathcal{M}\) is relevant for a given input query
\(\rightarrow\) Only consider the \(k\) most relevant items from \(\mathcal{M}\) for each input (via \(k\)-NN search)
- \(\mathrm{KNN}(\mathbf{v}, \mathcal{M})\) and \(\mathrm{KNN}(\mathbf{t}, \mathcal{M})\)
b) Cross-modal fusion
Goal: Augment the original text and visual embeddings with cross-modal knowledge
For a given text or image input…
- Retrieval module \(\mathrm{KNN}(., \mathcal{M})\)
- \(\mathrm{KNN}_t(\mathbf{v}, \mathcal{M})\): Returns text embeddings for an image input
- \(\mathrm{KNN}_v(\mathbf{t}, \mathcal{M})\): Returns image embeddings for a text input
Also evaluate uni-modal fusion in our experiments!
c) Uni-modal search
Search relevant items in the memory \(\mathcal{M}\) based on within-modality similarities
- Text-to-text similarity \((t \rightarrow t)\)
- Image-to-image similarity \((v \rightarrow v)\)
Notation
- \(\mathbf{V}^{\mathcal{M}}\) and \(\mathbf{T}^{\mathcal{M}}\) all the image and text embeddings from \(\mathcal{M}\)
- \(\mathbf{V}^{\mathcal{M}}=\left[f_{\text {image }}\left(I_1\right), \ldots, f_{\text {image }}\left(I_M\right)\right]\).
- \(\mathbf{T}^{\mathcal{M}}=\left[f_{\text {text }}\left(T_1\right), \ldots, f_{\text {text }}\left(T_M\right)\right]\).
- \(\mathrm{KNN}_t^{v \rightarrow v}(\mathbf{v}, \mathcal{M})=\mathbf{T}^{\mathcal{M}}_{\mathrm{NN}(\mathbf{v}; \mathbf{V}^{\mathcal{M}})}\)
- For an input image embedding \(\mathbf{v}\),
- The KNN search is done between \(\mathbf{v}\) and \(\mathbf{V}^{\mathcal{M}}\)
- But the corresponding \(k\)-NN indices from the text embeddings \(\mathbf{T}^{\mathcal{M}}\) are selected.
- \(\mathrm{KNN}_v^{t \rightarrow t}(\mathbf{t}, \mathcal{M})=\mathbf{V}^{\mathcal{M}}_{\mathrm{NN}(\mathbf{t}; \mathbf{T}^{\mathcal{M}})}\)
- vice versa for an input text embedding \(\mathbf{t}\) (see the code sketch below)
Also evaluate cross-modal search
\(\rightarrow\) Leads to much poorer performance!
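A sketch of this within-modality search with cross-modal return, using brute-force dot products over pre-computed memory embeddings (at billion scale the search would use an approximate nearest-neighbor index; the function names and the value of `k` are illustrative):

```python
import torch

def knn_t_v2v(v: torch.Tensor, V_mem: torch.Tensor, T_mem: torch.Tensor, k: int = 16) -> torch.Tensor:
    """KNN_t^{v->v}: search the memory with image-to-image similarity,
    but return the *text* embeddings of the k best-matching image-text pairs."""
    sims = V_mem @ v              # (M,) image-to-image similarities
    idx = sims.topk(k).indices    # indices of the k nearest memory images
    return T_mem[idx]             # (k, d) text embeddings paired with those images

def knn_v_t2t(t: torch.Tensor, V_mem: torch.Tensor, T_mem: torch.Tensor, k: int = 16) -> torch.Tensor:
    """KNN_v^{t->t}: text-to-text search, returning the paired image embeddings."""
    sims = T_mem @ t
    idx = sims.topk(k).indices
    return V_mem[idx]
```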
(2) Learning how to fuse the retrieved knowledge
Goal: Refine the original image and text embeddings \(\mathbf{v}\) and \(\mathbf{t}\) with the cross-modal knowledge gathered from \(\mathcal{M}\).
Notation
- Refined image and text embeddings: \(\overline{\mathbf{v}}\) and \(\overline{\mathbf{t}}\)
- \(\overline{\mathbf{v}}=\phi_{\text{image}}\left(\mathbf{v}, \mathrm{KNN}_t^{v \rightarrow v}(\mathbf{v}, \mathcal{M})\right)\)
- \(\overline{\mathbf{t}}=\phi_{\text{text}}\left(\mathbf{t}, \mathrm{KNN}_v^{t \rightarrow t}(\mathbf{t}, \mathcal{M})\right)\)
- where \(\phi\) is the fusion model
a) Transformer fusion
\(\phi_{\text {image }}\) and \(\phi_{\text {text }}\)
= One-layer multi-head self-attention transformer encoders
\(\rightarrow\) Allows the original embedding to attend to all the retrieved elements in the fusion process.
Also tried mean fusion \(\rightarrow\) Performs poorly
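A minimal sketch of such a fusion module, assuming the original embedding is prepended to its \(k\) retrieved cross-modal embeddings and read back out after one self-attention encoder layer (the dimension, head count, and read-out convention are assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class FusionTransformer(nn.Module):
    """One-layer self-attention fusion of an embedding with its retrieved neighbors."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
        # x: (n, d) original embeddings; retrieved: (n, k, d) retrieved cross-modal embeddings
        seq = torch.cat([x.unsqueeze(1), retrieved], dim=1)  # (n, 1 + k, d) token sequence
        fused = self.layer(seq)                              # original token attends to all retrieved ones
        return fused[:, 0]                                   # refined embedding at the query position
```

Mean fusion would instead simply average the retrieved embeddings with `x`; as noted above, the paper reports that this performs poorly.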
b) Learning
Train the fusion model \(\phi\) on \(\mathcal{D}=\left\{\left(I_i, T_i\right)\right\}_{i=1}^N\)
How? By performing retrieval at training time from the memory \(\mathcal{M}\).
- Encoders \(f_{\text{image}}\) and \(f_{\text{text}}\) are kept frozen.
Loss: Minimize the alignment loss
\(\mathcal{L}=\mathcal{L}_{\mathrm{NCE}}(\overline{\mathbf{V}}, \overline{\mathbf{T}})+\mathcal{L}_{\mathrm{NCE}}(\overline{\mathbf{V}}, \mathbf{T})+\mathcal{L}_{\mathrm{NCE}}(\mathbf{V}, \overline{\mathbf{T}})\).
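A sketch of one training step under these definitions, reusing the hypothetical `info_nce`, `knn_*`, and `FusionTransformer` helpers sketched above (only the two fusion modules receive gradients, since the CLIP encoders stay frozen):

```python
import torch
import torch.nn.functional as F

# V, T: (n, d) frozen CLIP embeddings of the current batch from D
# V_mem, T_mem: (M, d) pre-computed embeddings of the memory M
# phi_image, phi_text: two FusionTransformer instances (hypothetical names)
neighbors_for_images = torch.stack([knn_t_v2v(v, V_mem, T_mem) for v in V])  # (n, k, d) texts
neighbors_for_texts = torch.stack([knn_v_t2t(t, V_mem, T_mem) for t in T])   # (n, k, d) images

V_bar = F.normalize(phi_image(V, neighbors_for_images), dim=-1)  # refined image embeddings
T_bar = F.normalize(phi_text(T, neighbors_for_texts), dim=-1)    # refined text embeddings

loss = info_nce(V_bar, T_bar) + info_nce(V_bar, T) + info_nce(V, T_bar)
loss.backward()  # gradients reach only phi_image / phi_text; f_image and f_text stay frozen
```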
4. Experiments
(1) Experimental Setup
a) Training details
(1) Model
- Train the fusion model on top of a frozen CLIP
- Also present a variant of RECO on top of a frozen LiT-L16L
(2) Datasets: Conceptual Captions 12M
- Image-text dataset containing about 10 M pairs.
(3) Details
- Batch size of 4096
- Learning rate of \(1 e^{-3}\)
- 10 epochs
(4) Memory: Subset of WebLI
- Containing 1B image-text pairs
( Remove the near-duplicates of the test images from the memory )
- Appendix: LAION-400M dataset used as the memory
b) Evaluation datasets
Six image classification datasets
- Stanford Cars (“Cars”)
- CUB-200-2011 (“CUB”)
- Oxford Flowers (“Flowers”)
- ImageNet-1k (“Im1k”)
- Places 365 (“Pl365”)
- Stanford Dogs (“Dogs”)
Open-domain visual entity recognition (OVEN) benchmark
- Containing 729 K test images
- Belonging to 6 M entity candidates
Text-to-image & Image-to-text retrieval on
- Flickr30k (“Flickr”)
- MS COCO (“COCO”)
c) Evaluation protocol
Zero-shot setting for all
- No adaptation is done to the downstream task
As is common in the literature…
- Add prompts to the text of the downstream tasks
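For illustration, a minimal sketch of this prompting step (the class names and the single "a photo of a ..." template are assumptions; the standard CLIP-style prompt sets would be used in practice):

```python
# e.g., for Stanford Dogs: wrap each class name in a prompt before embedding it with f_text
classnames = ["Chihuahua", "Golden Retriever", "Siberian Husky"]
prompts = [f"a photo of a {name}." for name in classnames]
# Each prompt is embedded, refined by the fusion module, and the test image is assigned
# to the class whose refined text embedding has the highest cosine similarity.
```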
(2) Zero-shot Transfer
a) Image classification
Zero-shot performance of CLIP & LiT on Image classification
- Large improvements especially on the fine-grained datasets
- e.g., Improvement of original CLIP-B/32 accuracy by …
- +10.9 on Cars
- +10.2 on CUB
- +5.8 on Flowers
- Also improved on less fine-grained benchmarks
- e.g., ImageNet or Places
- Performance gains are consistent across all vision-text backbones
- e.g., CLIP-R-50, CLIP-B/32, CLIP-L/14, and LiT-L16L