CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

https://arxiv.org/pdf/2411.16828


Contents

  1. Abstract
  2. Introduction
  3. Related Works
  4. Methodology
    1. Preliminaries
    2. Inverse Effect with Synthetic Captions
    3. Encoder: Learning with Short Captions
    4. Decoder: Predicting Full Synthetic Captions
  5. Experiments


Abstract

Noisy, web-crawled image-text pairs

\(\rightarrow\) Limit vision-language pretraining like CLIP

\(\rightarrow\) Solution: Learning with synthetic captions


Proposal: CLIPS

Two simple designs for synthetic captions

  • (1) Short synthetic captions \(\rightarrow\) Higher performance gain
    • Feed only partial synthetic captions to the text encoder
  • (2) Autoregressive captioner to mimic the recaptioning process
    • Data = (image, web-crawled text)
    • Target = full-length synthetic caption


Experiments

  • Zero-shot performance in cross-modal retrieval tasks
  • Dataset: MSCOCO, Flickr30K
  • Enhance the visual capability of LLaVA


1. Introduction

Large-scale image-text datasets

  • e.g., LAION, DataComp [17]

\(\rightarrow\) Key driver of the rapid development of VLMs


Limitation: Web-crawled datasets are generally noisy

  • e.g., image-text pairs could be mismatched

\(\rightarrow\) Limit further performance improvements!


Solution: Improve the dataset “quality”

  • e.g., Re-generating paired textual descriptions using MLLM (synthetic captions)


Synthetic captions

  • A straightforward approach

  • Replace the raw, web-crawled captions!

    • e.g., VeCLIP, Recap-DataComp-1B: (Partially) substitute the original captions with generated ones

      \(\rightarrow\) Enhance the models’ capabilities (especially in cross-modal retrieval tasks)


Proposal: CLIPS

  • Synthetic captions = Typically highly descriptive

    • Longer & more detailed than (original) web-crawled captions
  • Two simple yet effective designs

    • (1) Randomly sampling a portion of the synthetic caption to serve as input to the text encoder.

      • Also leads to a reduction in computation
    • (2) Predict full synthetic caption

      • Synthetic caption is only partially used in CL

        \(\rightarrow\) Why not make full use of it in an auxiliary task?

      • Follow CoCa by incorporating an autoregressive decoder to predict captions. Difference?

        • a) CoCa: symmetric design
          • Input text = Output text (i.e., the web-crawled caption)
        • b) CLIPS: asymmetric design
          • Input = web-crawled caption
          • Target = full-length synthetic caption




Experiments

  • Significantly enhances zero-shot performance in cross-modal retrieval

  • Results (ViT-L backbone)

    • Substantially outperforms SigLIP
      • by 4.7 (from 70.8 to 75.5) on MSCOCO’s R@1 text retrieval
      • by 3.3 (from 52.3 to 55.6) on MSCOCO’s R@1 image retrieval
    • With increased computational resources and scaling, the best model achieves:
      • 76.4% R@1 text retrieval performance on MSCOCO
      • 96.6% R@1 text retrieval performance on Flickr30K
  • CLIPS framework contributes to building stronger MLLMs!

    • Replacing the visual encoder from OpenAI-CLIP with our CLIPS in LLaVA:

      \(\rightarrow\) Leads to strong performance gains across a range of MLLM benchmarks!


2. Related Works

(1) VL Pretraining

Existing frameworks predominantly adopt either…

  • (1) Single-stream architecture
    • Jointly represents different modalities using a shared encoder
  • (2) Two-stream architectures
    • Employ two independent encoders to process visual and textual inputs separately.
    • Proposed CLIPS = two-stream


Further enhancements to CLIP: Two main directions

  • (1) Extending CLIP’s capabilities into generative tasks

    ( e.g., Image captioning, visual question answering, and image grounding )

    • CoCa and BLIP: Enable the unification of image-text understanding and generation tasks by transitioning from an encoder-only architecture to an encoder-decoder architecture.
  • (2) Optimizing vision-language CL

    • FILIP: Mitigates the fine-grained alignment issue in CLIP by modifying the contrastive loss.
    • SigLIP: Replaces the contrastive loss with a sigmoid loss
    • Llip: Models caption diversity
      • By associating multiple captions with a single image representation
    • CLOC: Region-text contrastive loss
      • Enhance ability to focus on specific image regions

\(\rightarrow\) Proposed CLIPS = Aims to improve CLIP but focuses on enhancing the leverage of richly described synthetic captions in training


(2) Learning from Synthetic Captions

Web-crawled image-text datasets are often noisy!!

Solution: Synthetic caption!

  • (1) ALIP: Uses the OFA model to generate synthetic captions & Introduces a bi-path model to integrate supervision from two types of text.
  • (2) LaCLIP: Rewrites captions using LLMs and randomly selects one of these
  • (3) Nguyen et al: Use BLIP-2 to rewrite captions for image-text pairs with low matching degrees in the original dataset.
  • (4) VeCLIP: Uses LLaVA to generate synthetic captions with rich visual details, then uses an LLM to fuse the raw and synthetic captions
  • (5) Liu et al.: Use multiple MLLMs to rewrite captions & Apply text shearing to improve caption diversity
  • (6) ShareGPT4V: Feeds carefully designed prompts and images to GPT-4V to create a high-quality dataset
  • (7) Li et al.: Rewrites captions using a more advanced LLaMA-3-based LLaVA in the much larger-scale DataComp-1B dataset
  • (8) SynthCLIP: Trains entirely on synthetic datasets by generating image-text pairs using text-to-image models and LLMs.
  • (9) DreamLIP: Builds a short caption set for each image & computes a multi-positive contrastive loss and a sub-caption-specific grouping loss


3. Methodology

(1) Preliminaries

a) CLIP

Image-text CL

  • \(L_{\text{image}}=-\frac{1}{N} \sum_{i=1}^N \log \frac{\exp \left(\operatorname{sim}\left(f\left(x_i\right), g\left(y_i\right)\right) / \tau\right)}{\sum_{j=1}^N \exp \left(\operatorname{sim}\left(f\left(x_i\right), g\left(y_j\right)\right) / \tau\right)}\) ( \(L_{\text{text}}\) is defined symmetrically ).

  • \(L_{\text{contrast}}=\frac{1}{2}\left(L_{\text{image}}+L_{\text{text}}\right)\).
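
A minimal PyTorch sketch of this symmetric contrastive objective (function and variable names are illustrative, not from any released CLIP code):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, tau=0.07):
    """Symmetric InfoNCE loss over a batch of N paired embeddings.

    image_feats, text_feats: (N, D) L2-normalized features f(x_i), g(y_i).
    """
    logits = image_feats @ text_feats.t() / tau          # (N, N) cosine similarities / tau
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_image = F.cross_entropy(logits, targets)        # image-to-text direction, L_image
    loss_text = F.cross_entropy(logits.t(), targets)     # text-to-image direction, L_text
    return 0.5 * (loss_image + loss_text)                # L_contrast
```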


b) CoCa

Generative loss

  • By autoregressively predicting the next token
    • Conditioned on image features and previously generated tokens
  • \(L_{\text{gen}}=-\sum_{t=1}^T \log P\left(y_t \mid y_{<t}, E(x)\right)\).
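
A hedged sketch of this next-token objective, assuming the decoder has already produced logits conditioned on the image features \(E(x)\) and the previous tokens (names are illustrative):

```python
import torch.nn.functional as F

def generative_loss(decoder_logits, caption_tokens, pad_id=0):
    """Autoregressive caption loss L_gen.

    decoder_logits: (N, T, V) per-position predictions, conditioned on E(x) and y_{<t}.
    caption_tokens: (N, T) ground-truth token ids y_t.
    """
    return F.cross_entropy(
        decoder_logits.reshape(-1, decoder_logits.size(-1)),
        caption_tokens.reshape(-1),
        ignore_index=pad_id,  # ignore padding positions
    )
```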


(2) Inverse Effect with Synthetic Captions


How do CLIP models behave when learning with shorter synthetic captions?

  • Four strategies (Figure 2)


Notation

  • Synthetic caption sequence of length \(K\)
  • Target token length \(L\)


a) Truncation

  • Directly selects the first \(L\) tokens

b) Random mask

  • Obtains \(L\) tokens through random sampling

c) Block mask

  • Randomly selects a starting point in the original sequence and takes the subsequent \(L\) tokens.

d) Sub-caption mask

  • Step 1) Split the synthetic caption at periods \(\rightarrow\) Segments \(\left\{S_1, S_2, \ldots, S_n\right\}\),

    • \(n\) = Number of sub-captions
  • Step 2) Randomly select a sub-caption

  • Step 3) Check its length ( a sketch of all four strategies follows the equation below )

    • If the length meets or exceeds the predefined limit \(L\): Truncate
    • Otherwise: Randomly select another sub-caption from the remaining ones, concatenate it with the previous segment, and then check the length again

    \(\operatorname{Subcaption}(S, L)= \begin{cases}\operatorname{Truncate}\left(S_i\right), & \text{if } \left|S_i\right| \geq L \\ \operatorname{Concat}\left(\left\{S_i, S_j, \ldots\right\}\right), & \text{if } \left|S_i\right| < L\end{cases}\).
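
A minimal plain-Python sketch of the four shortening strategies, assuming the caption is already tokenized and, for d), pre-split into sub-captions at periods (names and details are assumptions, not the paper's code):

```python
import random

def truncation(tokens, L):
    """a) Keep the first L tokens."""
    return tokens[:L]

def random_mask(tokens, L):
    """b) Sample L token positions uniformly at random, keeping their original order."""
    idx = sorted(random.sample(range(len(tokens)), min(L, len(tokens))))
    return [tokens[i] for i in idx]

def block_mask(tokens, L):
    """c) Take a contiguous block of L tokens starting at a random position."""
    start = random.randint(0, max(0, len(tokens) - L))
    return tokens[start:start + L]

def subcaption_mask(sub_captions, L):
    """d) Randomly pick sub-captions (without replacement) and concatenate them
    until the target length L is reached, then truncate to exactly L tokens."""
    pool = list(sub_captions)
    random.shuffle(pool)
    selected = []
    while pool and len(selected) < L:
        selected.extend(pool.pop())
    return selected[:L]
```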


(3) Encoder: Learning with Short Captions

\(L_{\text{contrast}}=-\frac{1}{2 N} \sum_{i=1}^N\left(L_{\text{orig}}^i+L_{\text{syn-short}}^i\right)\).

  • \(L_{\text{orig}}^i =\log \frac{\exp \left(S_{i, \text{orig}}^i\right)}{\sum_{k=1}^N \exp \left(S_{i, \text{orig}}^k\right)}\).
  • \(L_{\text{syn-short}}^i =\log \frac{\exp \left(S_{i, \text{syn-short}}^i\right)}{\sum_{k=1}^N \exp \left(S_{i, \text{syn-short}}^k\right)}\).
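
A hedged sketch of this combined objective, reusing the InfoNCE form above with two text views per image, one web-crawled caption and one shortened synthetic caption (names are illustrative):

```python
import torch
import torch.nn.functional as F

def clips_contrastive_loss(image_feats, orig_text_feats, syn_short_text_feats, tau=0.07):
    """Average of the image-to-text losses against the original captions (S_{i,orig})
    and against the shortened synthetic captions (S_{i,syn-short}).

    All inputs: (N, D) L2-normalized embeddings.
    """
    targets = torch.arange(image_feats.size(0), device=image_feats.device)
    logits_orig = image_feats @ orig_text_feats.t() / tau        # S_{i, orig}
    logits_syn = image_feats @ syn_short_text_feats.t() / tau    # S_{i, syn-short}
    loss_orig = F.cross_entropy(logits_orig, targets)            # -(1/N) sum_i L_orig^i
    loss_syn = F.cross_entropy(logits_syn, targets)              # -(1/N) sum_i L_syn-short^i
    return 0.5 * (loss_orig + loss_syn)                          # L_contrast
```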




(4) Decoder: Predicting Full Synthetic Captions

a) Naive version

\(L_{\text{gen}}=-\sum_{t=1}^T \log P\left(y_t \mid \mathbf{I}_{\text{image}}, \mathbf{C}_{\text{web}}, \mathbf{L}_{\text{learnable},<t}\right)\).


b) w/ Self-attention

Experiments compare two ways of conditioning the decoder:

  • (a) Simply concatenating information from different modalities
  • (b) Employing modality fusion with cross-attention mechanisms

\(\rightarrow\) (a) outperforms (b), so CLIPS utilizes a self-attention mechanism ( + combination mask; a sketch follows the mask definition below )

  • Combination mask \(M\): Ensures that tokens within the condition can attend to each other, that target tokens can attend to the condition, and that attention among the target tokens remains causal


\(L_{\text{gen}}=-\sum_{t=1}^T \log P_\theta\left(y_t \mid \mathbf{I}_{\text{image}} \oplus \mathbf{C}_{\text{web}} \oplus \mathbf{L}_{\text{learnable},<t}, M\right)\).

\(M[i, j]= \begin{cases}1 & \text { if } i, j \leq L_{\text {cond }} \\ 1 & \text { if } i>L_{\text {cond }} \text { and } j \leq L_{\text {cond }} \\ 1 & \text { if } i, j>L_{\text {cond }} \text { and } i \geq j \\ 0 & \text { otherwise }\end{cases}\).
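
A minimal sketch of building \(M\) as defined above, where the first \(L_{\text{cond}}\) positions hold the condition (image and web-caption tokens) and the rest hold the learnable target tokens (0-indexed):

```python
import torch

def combination_mask(L_cond, L_target):
    """Boolean attention mask M of shape (L_cond + L_target, L_cond + L_target);
    M[i, j] = True means position i may attend to position j."""
    T = L_cond + L_target
    M = torch.zeros(T, T, dtype=torch.bool)
    M[:L_cond, :L_cond] = True                                   # condition attends to condition
    M[L_cond:, :L_cond] = True                                   # target attends to condition
    causal = torch.tril(torch.ones(L_target, L_target)).bool()   # lower-triangular block
    M[L_cond:, L_cond:] = causal                                 # causal attention within target
    return M
```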


Final Loss: \(L_{\text{total}}=\alpha \cdot L_{\text{contrast}}+\beta \cdot L_{\text{gen}}\).


4. Experiments

(1) Pretraining Details

(Pretraining dataset) Recap-DataComp-1B dataset

(PT Epochs) 2,000 ImageNet-equivalent epochs

  • ~ 2.6B samples seen
  • Images at a low resolution (e.g., resizing to \(112 \times 112\) by default)

(FT Epochs) 100 ImageNet-equivalent epochs

  • Resolution of \(224 \times 224\).

(Text branch)

  • Input token length = 80

  • Output token length = 128

    ( = Matching the number of learnable tokens in the decoder )

(Batch size)

  • PT: 32768
  • FT: 16384
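
For reference, the default settings above gathered into a hypothetical config dict (key names are illustrative, not from the released code):

```python
# Illustrative summary of the default CLIPS training recipe described above.
clips_default_config = {
    "pretrain": {
        "dataset": "Recap-DataComp-1B",
        "epochs_imagenet_equivalent": 2000,   # ~2.6B samples seen
        "image_resolution": 112,
        "batch_size": 32768,
    },
    "finetune": {
        "epochs_imagenet_equivalent": 100,
        "image_resolution": 224,
        "batch_size": 16384,
    },
    "text": {
        "encoder_input_token_length": 80,
        "decoder_output_token_length": 128,   # matches the number of learnable tokens
    },
}
```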


To further enhance training … (for comparison with SOTA):

  • (PT Epochs) 10,000 ImageNet-equivalent epochs
    • ~ 13B samples seen
  • (FT Epochs) 400 ImageNet-equivalent epochs
  • Resolution
    • PT: 84 (for accelerating training)
    • FT: 224


(2) Evaluation

a) Zero-shot cross-modal retrieval

  • Varying model size

  • Varying datasets: MSCOCO & Flickr30K

  • Baselines: CLIPA and CoCa
  • Train with a mixture of web-crawled captions (80%) and synthetic captions (20%).
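
For reference, a hedged sketch of how zero-shot R@K is typically computed from image-text similarities (not the paper's evaluation code; MSCOCO and Flickr30K actually provide several captions per image, which this simplified version ignores):

```python
import torch

def text_retrieval_recall_at_k(image_feats, text_feats, k=1):
    """Fraction of images whose paired caption appears in the top-k retrieved texts.

    image_feats, text_feats: (N, D) L2-normalized embeddings with matching row order.
    """
    sims = image_feats @ text_feats.t()                              # (N, N) similarity matrix
    topk = sims.topk(k, dim=1).indices                               # top-k text indices per image
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()        # R@k
```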


Result


  • Consistently achieves superior performance across all benchmarks and model sizes
  • Enables our smaller models to match or even surpass the performance of larger models


b) Comparison with SOTA methods

Metric:

  • ImageNet-1K: Top-1 accuracy
  • MSCOCO, Flickr30K: Zero-shot recall rates (for image and text retrieval tasks)


Result


  • Despite its strong cross-modal retrieval performance, CLIPS is less competitive on ImageNet zero-shot classification accuracy


c) CLIPS in LLaVA

Integrate the CLIPS visual encoder into LLaVA-1.5 for evaluation

Details

  • (1) Visual encoder
    • Original) OpenAI-CLIP-L/14 visual encoder
    • Replacement) CLIPS-L/14
  • (2) Language model (LLM): LLaMA-3
  • Finetune CLIPS-L/14 at a resolution of \(336 \times 336\) to match the configuration of OpenAI-CLIP-L/14.


We then evaluate LLaVA’s performance on multiple MLLM benchmarks

  • MME [16], MMMU [53], GQA [19], ChartQA [39], POPE [32], NoCaps [2], and TextVQA [47].
