Unveiling Encoder-Free Vision-Language Models

https://arxiv.org/pdf/2406.11832


Contents

  1. Abstract

  2. Introduction
    1. Drawbacks of Encoder-based VLMs
    2. Research Question
    3. EVE-7B
  3. Related Work
    1. Encoder-based VLM
    2. Encoder-free VLM
  4. Methodology
    1. Model Architecture
    2. Training Procedure
  5. Experiments


1. Abstract

a) Limitation of existing VLMs

Rely on “vision encoders”

\(\rightarrow\) Set a strong inductive bias in abstracting visual representation

  • e.g., resolution, aspect ratio

\(\rightarrow\) Could impede the flexibility and efficiency of the VLMs.


b) Pure VLMs

Accept vision & language inputs seamlessly (w/o vision encoders)

\(\rightarrow\) Challenging & underexplored!


Previous works) Direct training “without encoders” (feat. Fuyu-8B)

\(\rightarrow\) Slow convergence & Large performance gaps


c) Proposal

  • Bridge the gap between encoder-based & encoder-free models

  • Simple yet effective training towards pure VLMs
  • Key aspects of training encoder-free VLMs efficiently via …
    • (1) Bridging V-L representation inside “UNIFIED” decoder
    • (2) Enhancing visual recognition capability via extra SUPERVISION


d) EVE (encoder-free vision-language model)

  • Only 35M publicly accessible data
  • Rival the encoder-based VLMs across multiple VL benchmarks.
  • vs. encoder-free VLMs:
    • Significantly outperforms Fuyu-8B
      • Fuyu-8B: Mysterious training procedures and undisclosed training data


2. Introduction

(1) Drawbacks of Encoder-based VLMs

(Figure 2)


a) Image Resolution / Aspect Ratio

(Existing LVMs) Pre-trained with square and “FIXED-size” images

\(\rightarrow\) Forces VLMs to resize, pad, or partition images of varying shapes

\(\rightarrow\) Large layout distortion!
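
A rough worked example (illustrative numbers, not from the paper): squeezing a \(1344 \times 672\) document image into a fixed \(336 \times 336\) input shrinks the width by \(1344/336 = 4\times\) but the height by only \(672/336 = 2\times\), so the original 2:1 layout is compressed twice as much horizontally as vertically.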


b) Deployment Overhead

Undermines computational efficiency (in real-world deployment)

  • Especially when high-resolution images are divided


c) Model Capacity btw LVMs & LLMs

Scale of LLMs: From 1.3B to more than 540B

\(\rightarrow\) How to choose vision encoders of matching capacity so that LVMs and LLMs can each exercise their full abilities?


(2) Research Question

Is it possible to bypass the constraints of vision encoders and integrate perception and reasoning capabilities into a SINGLE UNIFIED architecture?


Previous work in “encoder-free” VLM

  • Suffer from notably “slow convergence” & “large performance gaps”! (vs. encoder-based VLMs)
    • e.g., Fuyu-8B vs. LLaVA-1.5


Essential problems of constructing encoder-free VLMs from scratch?

  • (1) Representation Unity and Emergent Ability

    • Lack of high-quality image-text data!

    • But plenty of language data

      \(\rightarrow\) \(\therefore\) Position LLMs as a central pivot

      + Compel LLMs per se to develop visual perception

      ( while preserving original linguistic proficiency )

    • Findings: Before scaling up pre-training data…

      \(\rightarrow\) VL pre-aligning from an LLM-centric perspective is important!

      ( Prevents model collapse and optimization interference )

  • (2) Visual Recognition Capability

    • CL, MIM, NTP tasks:

      • Pros) Attempt to prompt visual backbones to produce highly compressed holistic semantics
      • Cons) But frequently neglect fine-grained visual clues!
    • Proposal: Transmit visual signals almost losslessly into encoder-free VLMs!

      \(\rightarrow\) Allow VLMs to autonomously acquire the necessary visual-semantic information

      + Also sidesteps the expensive re-training process of visual encoders for arbitrary image shapes inside encoder-based VLMs!


(3) EVE-7B

Encoder-free VLM (Decoder-only VLM)

  • Arch: Vicuna-7B
  • Trained with two 8-A100 (40G) nodes in ~9 days


Properties

  • (1) Naturally supports high-resolution images with arbitrary aspect ratios
  • (2) 35M publicly accessible data
  • (3) Rival the encoder-based VLMs of similar capacities across multiple vision-language benchmarks
    • Significantly outperforms the counterpart Fuyu-8B


3. Related Work

(1) Encoder-based VLM

In terms of open-source VLMs, existing methods

  • BLIP series [42, 43, 12]
  • LLaVA series [50, 49, 51]
  • Emu series [72, 70]
  • Intern-VL [8, 9]

\(\rightarrow\) Employ simple intermediate layers to bridge the gap between LVMs and LLMs.
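
As a rough illustration of such an intermediate layer, the sketch below shows a LLaVA-style two-layer MLP connector that maps frozen vision-encoder patch features into the LLM embedding space (the dimensions and token count are placeholder assumptions, not any specific model's configuration):

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative LLaVA-style connector: maps vision-encoder patch features
    into the LLM token-embedding space (dimensions are placeholders)."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats):            # (B, N_patches, vision_dim)
        return self.mlp(patch_feats)           # (B, N_patches, llm_dim)

# e.g., 576 patch tokens from a ViT-style encoder -> LLM input embeddings
vision_tokens = torch.randn(1, 576, 1024)
llm_tokens = VisionProjector()(vision_tokens)  # shape (1, 576, 4096)
```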


Recent studies [48, 49, 20, 28]

  • Recognized the significance of input image resolution & aspect ratio

    • For visual perception and cognition,
      • e.g., document, chart, table, and infographic data.
  • However, limited by pre-trained resolution, vision encoders are …

    • Compelled to partition images into multiple slices
    • Or explore a dual-path architecture for low-resolution and high-resolution images respectively

    \(\rightarrow\) Resulting in significant image distortion, fragmented relationship between image slices, and additional computational consumption.


+ As the capacity of vision encoders scales up…

\(\rightarrow\) Deployment efficiency of vision models \(\downarrow\)


No definitive conclusion! (1) vs. (2)

  • (1) Some studies [49, 51] highlight the notable benefits of substituting CLIP-ViT-B with the stronger CLIP-ViT-L-336px when enhancing multimodal models alongside Vicuna-7B [10].
  • (2) Other findings [65] indicate that larger vision encoders may not be necessary, as features of multi-scale smaller ones can approximate their performance.


This paper:

Explore a pure decoder-only VLM excluding vision encoders

+ Integrate VL understanding and reasoning capabilities into one unified architecture


Effect: Bypass the inherent problems inside encoder-based VLMs

  • ex 1) Input constraints of pre-trained vision encoders
  • ex 2) Inefficiency issues of application deployment
  • ex 3) Tricky capacity trade-offs between LVMs and LLMs


(2) Encoder-free VLM

Fuyu-8B

  • (1) Decoder-only network

    • Processes image inputs without relying on an image encoder
  • (2) Handles high-resolution images with arbitrary aspect ratios

    ( \(\because\) Image patches are fed directly into the model through a simple linear projection layer )
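
A minimal sketch of this patch-as-token idea (the patch size, crop handling, and hidden dimension are assumptions for illustration, not Fuyu-8B's exact configuration):

```python
import torch
import torch.nn as nn

def patchify(image, patch=30):
    """Split an image tensor (C, H, W) into flattened raster-order patches."""
    c, h, w = image.shape
    image = image[:, : h - h % patch, : w - w % patch]       # drop ragged edges
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)

# one linear layer maps raw patches straight into the decoder's embedding space
proj = nn.Linear(3 * 30 * 30, 4096)
img = torch.randn(3, 480, 640)                # arbitrary resolution / aspect ratio
img_tokens = proj(patchify(img))              # (num_patches, 4096)
```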


Limitation of Fuyu-8B

  • Achieves only average performance across VL benchmarks
  • Lacks transparency in training strategies and data sources


Effect of Fuyu-8B

This straightforward architecture has inspired further research

  • which focuses on developing powerful supervised instruction datasets to further enhance application capabilities.


Proposal

Developing pure VLMs

+ Breaking down the barriers between encoder-based and encoder-free VLMs.


Two crucial lessons

  • (1) Before scaling up pre-training data, it is essential to prioritize VL pre-alignment from an LLM-centric perspective.
    • Stabilizes the training process
    • Alleviates optimization interference for integrating visual and linguistic information
  • (2) Enhancing image recognition capability via visual representation supervision and language conceptual alignment generates stronger visual representations


4. Methodology

(1) Model Architecture

(Figure 2)

  • (1) Decoder-only EVE: built on Vicuna-7B
  • (2) Lightweight patch embedding layer


Two losses

  • (1) Attempt to align patch features with pair-wise ones from the vision encoder (VE)
    • Through a hierarchical patch aligning layer.
  • (2) EVE predicts next-word labels
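
A minimal sketch of how the two objectives above could be combined during training (shapes and the loss weight are assumptions, not the paper's exact settings):

```python
import torch
import torch.nn.functional as F

def eve_style_loss(text_logits, text_labels, patch_feats, ve_feats, mse_weight=1.0):
    """Illustrative combination of the two supervisions described above:
    (1) cross-entropy on next-word prediction,
    (2) MSE aligning EVE's patch features with frozen vision-encoder (VE) features."""
    ce = F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)),
                         text_labels.reshape(-1), ignore_index=-100)
    mse = F.mse_loss(patch_feats, ve_feats)
    return ce + mse_weight * mse

# dummy shapes: batch 2, 16 text tokens, 32k vocab, 64 patch tokens, dim 1024
loss = eve_style_loss(torch.randn(2, 16, 32000),
                      torch.randint(0, 32000, (2, 16)),
                      torch.randn(2, 64, 1024),
                      torch.randn(2, 64, 1024))
```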



a) Patch Embedding Layer (PEL)

[Goal] To transmit images almost losslessly

  • Rather than using deep encoders or tokenizers


[Input] Image with (H, W) resolution


[Procedure]

  • Step 1) Convolution layer
    • To obtain a 2-D feature map with (h, w)
  • Step 2) Average pooling layer
  • Step 3) Cross-Attention (CA1) layer
  • Step 4) Cross-Attention (CA2) layer
    • Btw a special token and all patch features
    • Output: Serves as the starting symbol of the image & provides holistic information for patch features
  • Step 5) Learnable newline token
    • Considering the varying aspect ratios of image inputs, we insert a learnable newline token at the end of each row of patch features.
    • Helps the network understand the 2-D spatial structure and dependencies of the image.
  • Step 6) Flatten & NN
    • Flatten these features
    • Pass them through a two-layer NN
  • Step 7) Concat with text
    • Concatenate with text embeddings into one unified decoder-only architecture.
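
A minimal PyTorch sketch of Steps 1–7 above, assuming a single image input; the kernel size, pooling factor, attention configuration, and dimensions are illustrative placeholders rather than the paper's exact PEL design:

```python
import torch
import torch.nn as nn

class PatchEmbeddingLayer(nn.Module):
    """Rough sketch of the PEL steps listed above (all hyper-parameters are
    illustrative assumptions)."""
    def __init__(self, dim=1024, llm_dim=4096, patch=14, pool=2):
        super().__init__()
        self.conv = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)    # Step 1
        self.pool = nn.AvgPool2d(pool)                                    # Step 2
        self.ca1 = nn.MultiheadAttention(dim, 8, batch_first=True)        # Step 3
        self.ca2 = nn.MultiheadAttention(dim, 8, batch_first=True)        # Step 4
        self.cls = nn.Parameter(torch.randn(1, 1, dim))      # special start token
        self.newline = nn.Parameter(torch.randn(1, 1, dim))  # Step 5: row separator
        self.mlp = nn.Sequential(nn.Linear(dim, llm_dim), nn.GELU(),      # Step 6
                                 nn.Linear(llm_dim, llm_dim))

    def forward(self, image, text_embeds):
        feat = self.conv(image.unsqueeze(0))                  # (1, dim, h, w)
        pooled = self.pool(feat)                              # (1, dim, h', w')
        b, d, h, w = pooled.shape
        q = pooled.flatten(2).transpose(1, 2)                 # (1, h'*w', dim)
        kv = feat.flatten(2).transpose(1, 2)                  # (1, h*w, dim)
        patches, _ = self.ca1(q, kv, kv)                      # Step 3: refine patches
        start, _ = self.ca2(self.cls.expand(b, -1, -1),       # Step 4: holistic
                            patches, patches)                 #   start-of-image token
        rows = patches.reshape(b, h, w, d)
        rows = torch.cat([rows, self.newline.expand(b, h, 1, d)], dim=2)  # Step 5
        seq = self.mlp(torch.cat([start, rows.flatten(1, 2)], dim=1))     # Step 6
        return torch.cat([seq, text_embeds], dim=1)           # Step 7: concat w/ text

# usage: a non-square image plus 8 (dummy) text-token embeddings
pel = PatchEmbeddingLayer()
fused = pel(torch.randn(3, 336, 448), torch.randn(1, 8, 4096))
```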


b) Patch Aligning Layer (PAL)

[Goal] To facilitate fine-grained representations
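
A hedged sketch of what such a layer could look like: hidden states at the image-token positions are gathered from several decoder layers, fused, projected to the vision encoder's dimension, and supervised with an MSE loss against the frozen VE features (the layer selection, fusion scheme, and dimensions are assumptions for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchAligningLayer(nn.Module):
    """Illustrative PAL: aggregate image-token hidden states from several
    decoder layers and align them with frozen vision-encoder (VE) features."""
    def __init__(self, llm_dim=4096, ve_dim=1024, num_layers=4):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.ones(num_layers) / num_layers)
        self.proj = nn.Linear(llm_dim, ve_dim)

    def forward(self, hidden_states, ve_feats):
        # hidden_states: list of (B, N_img, llm_dim) tensors taken at the image
        # positions of several decoder layers; ve_feats: (B, N_img, ve_dim)
        w = torch.softmax(self.layer_weights, dim=0)
        fused = sum(wi * h for wi, h in zip(w, hidden_states))
        return F.mse_loss(self.proj(fused), ve_feats)         # alignment loss

# dummy usage: 4 decoder layers, 64 image tokens
pal = PatchAligningLayer()
loss = pal([torch.randn(2, 64, 4096) for _ in range(4)], torch.randn(2, 64, 1024))
```

This loss only acts as training-time supervision; as noted below, PAL is removed at inference.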


(2) Training Procedure

(Figure 2)

Three successive stages: Train EVE with …

  • (1) Publicly available image data captioned by existing VLMs
  • (2) Diverse QA data
  • (3) Multi-modality dialogue datasets

(Remove PAL supervision during inference)


Step 1) LLM-guided Pre-training

[Goal] Initial connection between V&L modalities

[Dataset] Publicly available web-scale data (EVE-cap33M)

(Figure 2)


EVE-cap33M: Remove noisy text captions

\(\rightarrow\) Reproduce 33M high-quality descriptions

  • via Emu2 (17B) and LLaVA-1.5 (13B)


[Trainable layers]

  • Patch Embedding Layer
  • Patch Aligning Layer
  • (LLM: Vicuna-7B is frozen)
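
A minimal sketch of this Stage-1 freezing scheme; the module attribute names (`patch_embedding`, `patch_aligning`, `llm`) are hypothetical stand-ins, not the released code's identifiers:

```python
import torch.nn as nn

class DummyEVE(nn.Module):
    """Toy stand-in with the three module groups referenced in the notes
    (hypothetical attribute names, just to make the freezing logic concrete)."""
    def __init__(self):
        super().__init__()
        self.patch_embedding = nn.Linear(8, 8)
        self.patch_aligning = nn.Linear(8, 8)
        self.llm = nn.Linear(8, 8)          # stands in for frozen Vicuna-7B

def set_stage1_trainable(model):
    for p in model.parameters():
        p.requires_grad = False                       # freeze everything first
    for m in (model.patch_embedding, model.patch_aligning):
        for p in m.parameters():
            p.requires_grad = True                    # train only PEL and PAL

model = DummyEVE()
set_stage1_trainable(model)
assert all(not p.requires_grad for p in model.llm.parameters())
```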


Details:

  • Only adopt 16M of 33M image-text data (EVE-cap16/33M) in this stage.
  • Use both (text) CE loss and (image) MSE loss


Findings:

  • Stage 1 does count for efficient training!!

    ( \(\because\) Prevents collapse and accelerates convergence throughout the entire process )


Step 2) Generative Pre-training

[Goal] Train all modules!

Details:

  • Use of all 33M image-text pairs (EVE-cap33M)
  • Keep both (text) CE loss and (image) MSE loss


Findings:

  • Multi-modality performance gradually increases
  • However, language capability declines significantly


Step 3) SFT

[Goal] Train all modules with multi-modality dialogue datasets to strengthen instruction-following ability


5. Experiments

Public visual-language benchmarks

  • (1) Academic-task-oriented benchmarks (VQA-v2 [25], GQA [29], VizWiz [26], and TextVQA [67])
  • (2) Hallucination benchmarks (POPE [47])
  • (3) Open-world multi-modal understanding benchmarks (MME [23], MMBench [52], SEED-Bench [41], and MM-Vet [89])
  • (4) Scientific problem benchmarks (ScienceQA-IMG [54]).