Retrieval-Enhanced Contrastive Vision-Text Models

https://arxiv.org/pdf/2306.07196


Contents

  1. Abstract
  2. Introduction
  3. Related Works
    1. Visual-text pretraining
    2. Knowledge-based vision-text models
    3. Retrieval-based methods
  4. Method
    1. Retrieving Cross-modal External Knowledge
    2. Learning how to fuse the retrieved knowledge
  5. Experiments
    1. Experimental Setup
    2. Zero-shot Transfer


Abstract

(1) Contrastive image-text models (e.g., CLIP)

  • Building blocks of many SOTA systems
  • Excel at recognizing common generic concepts

\(\rightarrow\) Limitation: Struggle on fine-grained entities


(2) Proposal: Retrieval-enhanced contrastive (RECO) training

  • Instead of encoding fine-grained knowledge directly into the model’s parameters…

    • Train the model to retrieve this knowledge from an external memory

    • Propose to equip existing vision-text models with the ability to refine their embeddings with cross-modal information retrieved from a memory at inference time


(3) Effect

  • Greatly improves their zero-shot predictions

  • Can be done with a light-weight, single-layer fusion transformer on top of a frozen CLIP


(4) Experiments

  • Improves CLIP performance substantially on several challenging fine-grained tasks

    • +10.9 on Stanford Cars

    • +10.2 on CUB-2011

    • +7.3 on the recent OVEN benchmark


1. Introduction

P1) Recent VLMs & Limitation

Development of VLMs

\(\rightarrow\) Highly adaptable to various downstream tasks


Two parallel encoders trained with contrastive learning (i.e., two-tower models)

  • Encode images and texts into an aligned latent space
  • Enables appealing capabilities such as zero-shot transfer to different downstream applications
    • e.g. image classification, image-text retrieval, open-world recognition


Limitation: Struggle on tasks requiring a more fine-grained understanding


P2) Two approaches

Approach 1) Scale and curate the pre-training dataset

  • To cover more and more image-text associations


Approach 2) Memory or knowledge-based approaches

  • Propose to rely on the access to an external source of knowledge

    • K-Lite: improves vision-text models by enhancing text captions with more comprehensive definitions retrieved from an external dictionary
  • Limitation? Initial captions are augmented within their modality only

    \(\rightarrow\) Limiting the potential added-value brought by the retrieved items!


P3) Proposal

Retrieval-augmented approach

  • Critical observation: (a) is simpler than (b)
    • (a) Matching representations within the same modality
    • (b) Matching representations across different modalities

\(\rightarrow\) Proposal: Utilize the inherent strength of learned image and text representations within their respective modalities to aid the alignment across modalities


Details

  • Convert these unimodal representations into a multi-modal format
    • To improve their compatibility
  • Utilizing a web-scale corpus of image-text pairs for retrieval…
    • Use image representation as a query
      • To identify the top-\(k\) most similar images
      • Incorporate the associated text to create a multi-modal representation.
    • Use a text representation as a query
      • To identify the top-\(k\) most similar texts
      • Integrate the associated images to create a multi-modal representation



2. Related Works

(1) Visual-text pretraining

CLIP, ALIGN

  • Potential of contrastive image-text pre-training
  • Two parallel uni-modal encoders
  • Cross-modal contrastive objective


Vision-text contrastive models

= Basic building blocks of more powerful foundational models

  • e.g., CoCa (Yu et al., 2022), Flamingo (Alayrac et al., 2022), FLAVA (Singh et al., 2022), and PaLI (Chen et al., 2023)


Proposed work

  • Enhance the capabilities of CLIP

    ( but not specific to CLIP )

  • How? By adding a light-weight retrieval module.


(2) Knowledge-based vision-text models

Improving upon different aspects of the contrastive vision-text models

  • (1) Training objectives
  • (2) Scaling

\(\rightarrow\) Little exploration has been done on their combination with memory or knowledge-based techniques


Knowledge-based vision-text models

[1] REACT (Liu et al., 2023)

  • Retrieves image-text pairs from an external memory

    \(\rightarrow\) Build a training dataset specialized for a specific downstream task.

  • Proposed (vs. REACT)

    • (1) Does not require any pre-knowledge about the nature of the downstream task

      \(\rightarrow\) \(\therefore\) Applicable in a full zero-shot transfer

    • (2) Leverage items from the memory at inference time

      ( REACT: uses retrieved items to automatically generate a training set to finetune their model )


[2] K-LITE (Shen et al., 2022)

  • Learns vision-text models by leveraging external sources of knowledge

    • e.g., WordNet (Miller, 1998) or Wiktionary (Meyer & Gurevych, 2012)

    to complete captions with more descriptive content.

  • Proposed (vs. K-LITE)

    • K-LITE: Retrieved knowledge is uni-modal (text) & External memory is not used for the image tower


[3] NNCLR (Dwibedi et al., 2021)

  • Image-only representation learning (i.e., not a VLM)
  • Finds the visual nearest-neighbor of each training image from a memory


[4] LGSimCLR (Banani et al., 2023)

  • Uses language guidance to find the most similar visual nearest neighbor


NNCLR & LGSimCLR

  • (a) Only learn visual representations
  • (b) Use retrieval to enhance their supervision during training but not at inference


(3) Retrieval-based methods

Main argument of the retrieval-based methods

Not all the world knowledge can be compiled into a model’s parameters

\(\rightarrow\) \(\therefore\) Should also learn to rely on items retrieved from an external memory at inference


Retrieval-based methods

  • (Originally) Showed their promise in various NLP tasks
  • (Recently) Growing interest in computer vision for retrieval-based methods


[1] SuS-X (Udandarao et al., 2023)

  • “Cross-modal search and cross-modal fusion”
  • Retrieve similar samples to the query sample from a large data-bank
  • Improve zero-shot classification performance


[2] RA-CLIP (Xie et al., 2023)

  • Enriches the CLIP visual representation
  • Retrieves both images and texts from a memory
  • Limitation: Their attempt to enrich the text representation degrades performance


3. Method

Goal: Equip powerful pre-trained VLMs with the ability to complement their representations with cross-modal knowledge retrieved from an external memory


Details

  • Do not retrain from scratch

  • Learn a light-weight retrieval fusion module on top of them

  • Does not propose a new model or loss

    ( Rather a new way of adapting pre-trained models )


Preliminaries

Notation:

  • \(\mathbf{v}=f_{\text{image}}(I)\): embedding of image \(I\)
  • \(\mathbf{t}=f_{\text{text}}(T)\): embedding of text \(T\)


InfoNCE loss between embeddings of different modalities:

  • \(\mathcal{L}_{\mathrm{NCE}}(\mathbf{V}, \mathbf{T})=-\sum_{i=1}^n\left[\log \frac{e^{\mathbf{v}_i^{\top} \mathbf{t}_i / \tau}}{\sum_j e^{\mathbf{v}_i^{\top} \mathbf{t}_j / \tau}}+\log \frac{e^{\mathbf{v}_i^{\top} \mathbf{t}_i / \tau}}{\sum_j e^{\mathbf{v}_j^{\top} \mathbf{t}_i / \tau}}\right]\).

  • where \(\mathbf{V}\) (resp. \(\mathbf{T}\) ) is the matrix composed of the \(n\) visual (resp. text) embeddings
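A minimal PyTorch sketch of this symmetric InfoNCE loss (an illustration assuming L2-normalized embeddings, not the authors' code; `cross_entropy` averages over the batch where the formula above sums, which only changes a constant factor):

```python
import torch
import torch.nn.functional as F

def info_nce(V: torch.Tensor, T: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between n image embeddings V (n, d) and n text embeddings T (n, d)."""
    logits = V @ T.t() / tau                            # (n, n) matrix of v_i^T t_j / tau
    targets = torch.arange(V.size(0), device=V.device)  # matching pairs are on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)         # image-to-text term (softmax over rows)
    loss_t2i = F.cross_entropy(logits.t(), targets)     # text-to-image term (softmax over columns)
    return loss_i2t + loss_t2i
```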


Propose to augment the text and visual embeddings ( i.e. \(\mathbf{t}\) and \(\mathbf{v}\) ) with external cross-modal knowledge

  • To enhance both their expressiveness and their cross-modality alignment


Following section

  • 3-1) How to retrieve relevant cross-modal knowledge
    • based on within-modality search
  • 3-2) How to fuse the retrieved information into the original embeddings


(1) Retrieving Cross-modal External Knowledge

a) Memory

External source of knowledge by a memory

  • \(\mathcal{M}=\left\{\left(I_i, T_i\right)\right\}_{i=1}^M\) of \(M\) image-text pairs.
    • Assume that \(\mathcal{M}\) is very large and provides broad coverage of concepts


(In practice) Only a small subset of \(\mathcal{M}\) is relevant for a given input query

\(\rightarrow\) Only consider the \(k\) most relevant items from \(\mathcal{M}\) for each input (via \(k\)-nearest-neighbor search)

  • \(\mathrm{KNN}(\mathbf{v}, \mathcal{M})\) and \(\mathrm{KNN}(\mathbf{t}, \mathcal{M})\)
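A minimal sketch (an assumption, not the released implementation) of precomputing the memory embeddings and selecting the \(k\) most relevant items by cosine similarity; `f_image` and `f_text` stand for the frozen towers, and `memory` is a list of (image, text) pairs:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_memory(f_image, f_text, memory):
    """Precompute V^M and T^M for all (I_i, T_i) pairs in the memory."""
    V_M = torch.stack([f_image(I) for I, _ in memory])  # (M, d) image embeddings
    T_M = torch.stack([f_text(T) for _, T in memory])   # (M, d) text embeddings
    # L2-normalize so that a dot product equals cosine similarity
    return F.normalize(V_M, dim=-1), F.normalize(T_M, dim=-1)

def knn_indices(query, keys, k=10):
    """Indices of the k memory items whose `keys` embeddings are most similar to `query`."""
    sims = query @ keys.t()             # (n, M) cosine similarities
    return sims.topk(k, dim=-1).indices
```

In practice the memory used in the paper holds on the order of 1B pairs, so the exhaustive matrix product above would be replaced by an approximate nearest-neighbor index; the sketch only illustrates the interface.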


b) Cross-modal fusion

Goal: Augment the original text and visual embeddings with cross-modal knowledge


For a given text or image input…

  • Retrieval module \(\mathrm{KNN}(., \mathcal{M})\)
    • \(\mathrm{KNN}_t(\mathbf{v}, \mathcal{M})\): Returns text embeddings for an image input
    • \(\mathrm{KNN}_v(\mathbf{t}, \mathcal{M})\): Returns image embeddings for a text input


Also evaluate uni-modal fusion in our experiments!



Search relevant items in the memory \(\mathcal{M}\) based on within-modality similarities

  • Text-to-text similarity \((t \rightarrow t)\)
  • Image-to-image similarity \((v \rightarrow v)\)


Notation

  • \(\mathbf{V}^{\mathcal{M}}\) and \(\mathbf{T}^{\mathcal{M}}\) all the image and text embeddings from \(\mathcal{M}\)
    • \(\mathbf{V}^{\mathcal{M}}=\left[f_{\text {image }}\left(I_1\right), \ldots, f_{\text {image }}\left(I_M\right)\right]\).
    • \(\mathbf{T}^{\mathcal{M}}=\left[f_{\text {text }}\left(T_1\right), \ldots, f_{\text {text }}\left(T_M\right)\right]\).
  • \[\mathrm{KNN}_t^{v \rightarrow v}(\mathbf{v}, \mathcal{M})=\mathbf{T}^{\mathcal{M}}_{\mathrm{NN}\left(\mathbf{v} ; \mathbf{V}^{\mathcal{M}}\right)}\]
    • For input image embedding \(\mathbf{v}\),
    • The KNN search is done between \(\mathbf{v}\) and \(\mathbf{V}^{\mathcal{M}}\)
    • But the corresponding \(k\)-NN indices from the text embeddings \(\mathbf{T}^{\mathcal{M}}\) are selected.
  • \[\mathrm{KNN}_v^{t \rightarrow t}(\mathbf{t}, \mathcal{M})=\mathbf{V}^{\mathcal{M}}_{\mathrm{NN}\left(\mathbf{t} ; \mathbf{T}^{\mathcal{M}}\right)}\]
    • Vice versa: for an input text embedding \(\mathbf{t}\), the search is done against \(\mathbf{T}^{\mathcal{M}}\), but the corresponding image embeddings from \(\mathbf{V}^{\mathcal{M}}\) are selected (see the sketch below)
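Building on the sketch above, a hedged illustration of within-modality search with cross-modal read-out: the query is matched against its own modality, but the embeddings of the paired items from the other modality are returned (helper names are my own):

```python
# knn_indices comes from the previous sketch
def knn_t_v2v(v, V_M, T_M, k=10):
    """Image query v -> nearest images in V^M -> return their paired *text* embeddings."""
    return T_M[knn_indices(v, V_M, k)]

def knn_v_t2t(t, V_M, T_M, k=10):
    """Text query t -> nearest texts in T^M -> return their paired *image* embeddings."""
    return V_M[knn_indices(t, T_M, k)]
```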


Also evaluate cross-modal search

\(\rightarrow\) Leads to much poorer performance!



(2) Learning how to fuse the retrieved knowledge

Goal: Refine the original image and text embeddings \(\mathbf{v}\) and \(\mathbf{t}\) with the cross-modal knowledge gathered from \(\mathcal{M}\).


Notation

  • Refined image and text embeddings: \(\overline{\mathbf{v}}\) and \(\overline{\mathbf{t}}\)

    • \(\overline{\mathbf{v}}=\phi_{\text {image }}\left(\mathbf{v}, \mathrm{KNN}_t^{v \rightarrow v}(\mathbf{v}, \mathcal{M})\right)\).
    • \(\overline{\mathbf{t}}=\phi_{\text {text }}\left(\mathbf{t}, \mathrm{KNN}_v^{t \rightarrow t}(\mathbf{t}, \mathcal{M})\right)\),

    where \(\phi\) is the fusion model


a) Transformer fusion

\(\phi_{\text {image }}\) and \(\phi_{\text {text }}\)

= One-layer multi-head self-attention transformer encoders

\(\rightarrow\) Allows the original embedding to attend to all the retrieved elements in the fusion process.


Also tried mean fusion \(\rightarrow\) Performs poorly
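A minimal sketch of what such a one-layer self-attention fusion module \(\phi\) could look like, under the assumption that the original embedding and the \(k\) retrieved cross-modal embeddings are concatenated into a short token sequence and the refined embedding is read out at the query position (an illustration consistent with the description above, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionTransformer(nn.Module):
    """One-layer multi-head self-attention fusion of a query embedding with k retrieved embeddings."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)

    def forward(self, query_emb: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
        # query_emb: (n, d) original embedding; retrieved: (n, k, d) cross-modal k-NN embeddings
        tokens = torch.cat([query_emb.unsqueeze(1), retrieved], dim=1)  # (n, 1 + k, d)
        fused = self.layer(tokens)               # every token attends to all retrieved elements
        return F.normalize(fused[:, 0], dim=-1)  # refined embedding read out at the query position
```

Mean fusion would amount to replacing the attention layer with a simple average of `tokens`, which, as noted above, performs poorly.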


b) Learning

Train the fusion model \(\phi\) on \(\mathcal{D}=\left\{\left(I_i, T_i\right)\right\}_{i=1}^N\)

How? By performing retrieval at training time from the memory \(\mathcal{M}\).

  • Encoders \(f_{\text{image}}\) and \(f_{\text{text}}\) are kept frozen.


Loss: Minimize the alignment loss

\(\mathcal{L}=\mathcal{L}_{\mathrm{NCE}}(\overline{\mathbf{V}}, \overline{\mathbf{T}})+\mathcal{L}_{\mathrm{NCE}}(\overline{\mathbf{V}}, \mathbf{T})+\mathcal{L}_{\mathrm{NCE}}(\mathbf{V}, \overline{\mathbf{T}})\).
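A hedged sketch of one training step under this objective, reusing the `info_nce`, retrieval, and `FusionTransformer` sketches above (the wiring and names are my own assumptions):

```python
import torch
import torch.nn.functional as F
# info_nce, knn_t_v2v and knn_v_t2t come from the earlier sketches

def reco_loss(V, T, V_bar, T_bar, tau=0.07):
    """L_NCE(V_bar, T_bar) + L_NCE(V_bar, T) + L_NCE(V, T_bar)."""
    return info_nce(V_bar, T_bar, tau) + info_nce(V_bar, T, tau) + info_nce(V, T_bar, tau)

def training_step(images, texts, f_image, f_text, phi_image, phi_text, V_M, T_M, k=10):
    with torch.no_grad():                             # the pre-trained towers stay frozen
        V = F.normalize(f_image(images), dim=-1)      # (n, d) image embeddings
        T = F.normalize(f_text(texts), dim=-1)        # (n, d) text embeddings
    V_bar = phi_image(V, knn_t_v2v(V, V_M, T_M, k))   # fuse images with retrieved captions
    T_bar = phi_text(T, knn_v_t2t(T, V_M, T_M, k))    # fuse texts with retrieved images
    return reco_loss(V, T, V_bar, T_bar)              # only phi_image / phi_text receive gradients
```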


4. Experiments

(1) Experimental Setup

a) Training details

(1) Model

  • Train the fusion model on top of a frozen CLIP

  • Also present a variant of RECO on top of a frozen LiT-L16L

(2) Datasets: Conceptual Captions 12M

  • Image-text dataset containing about 10 M pairs.

(3) Details

  • Batch size of 4096
  • Learning rate of \(10^{-3}\)
  • 10 epochs

(4) Memory: Subset of WebLI

  • Containing 1B image-text pairs

    ( Remove the near-duplicates of the test images from the memory )

  • Appendix: results with the LAION-400M dataset as the memory


b) Evaluation datasets

Six image classification datasets

  • Stanford Cars (“Cars”)
  • CUB-200-2011 (“CUB”)
  • Oxford Flowers (“Flowers”)
  • ImageNet-1k (“Im1k”)
  • Places365 (“Pl365”)
  • Stanford Dogs (“Dogs”)


Open-domain visual entity recognition (OVEN) benchmark

  • Containing 729K test images
  • Covering 6M entity candidates


Text-to-image & Image-to-text retrieval on

  • Flickr30k (“Flickr”)
  • MS COCO (“COCO”)


c) Evaluation protocol

Zero-shot setting for all

  • No adaptation is done to the downstream task


As is common in the literature…

  • Add prompts to the texts (e.g., class names) of the downstream tasks
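A hedged sketch of what zero-shot classification with the refined embeddings could look like, reusing the retrieval and fusion sketches from the Method section (the prompt template and all helper names are assumptions, and `f_image` / `f_text` are assumed to include preprocessing / tokenization):

```python
import torch
import torch.nn.functional as F
# build_memory, knn_t_v2v, knn_v_t2t and the fusion modules come from the earlier sketches

@torch.no_grad()
def zero_shot_predict(image, class_names, f_image, f_text, phi_image, phi_text, V_M, T_M, k=10):
    prompts = [f"a photo of a {c}" for c in class_names]
    T = F.normalize(torch.stack([f_text(p) for p in prompts]), dim=-1)  # (C, d) class prompts
    v = F.normalize(f_image(image), dim=-1).unsqueeze(0)                # (1, d) query image
    v_bar = phi_image(v, knn_t_v2v(v, V_M, T_M, k))   # refine the image with retrieved captions
    t_bar = phi_text(T, knn_v_t2t(T, V_M, T_M, k))    # refine each class prompt with retrieved images
    return (v_bar @ t_bar.t()).argmax(dim=-1)         # index of the most similar refined class text
```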


(2) Zero-shot Transfer

a) Image classification

Zero-shot performance of CLIP & LiT on Image classification

  • Large improvements especially on the fine-grained datasets
    • e.g., Improvement of original CLIP-B/32 accuracy by …
      • +10.9 on Cars
      • +10.2 on CUB
      • +5.8 on Flowers
  • Also improved on less fine-grained benchmarks
    • e.g., ImageNet or Places
  • Performance gains are consistent across all vision-text backbones
    • e.g., CLIP-R-50, CLIP-B/32, CLIP-L/14, and LiT-L16L




b) Open-domain visual entity recognition (OVEN)

  • As noted in the abstract, RECO improves CLIP by +7.3 on the OVEN benchmark