Unveiling Encoder-Free Vision-Language Models

https://arxiv.org/pdf/2406.11832


Contents

  1. Abstract

  2. Introduction
    1. Drawbacks of Encoder-based VLMs
    2. Research Question
    3. EVE-7B
  3. Related Work
    1. Encoder-based VLM
    2. Encoder-free VLM
  4. Methodology
    1. Model Architecture
    2. Training Procedure
  5. Experiments


1. Abstract

a) Limitation of existing VLMs

Rely on “vision encoders”

\(\rightarrow\) Set a strong inductive bias in abstracting visual representation

  • e.g., resolution, aspect ratio

\(\rightarrow\) Could impede the flexibility and efficiency of the VLMs.


b) Pure VLMs

Accept vision & language inputs seamlessly (w/o vision encoders)

\(\rightarrow\) Challenging & underexplored!


Previous works) Direct training “without encoders” (feat. Fuyu-8B)

\(\rightarrow\) Slow convergence & Large performance gaps


c) Proposal

  • Bridge the gap between encoder-based & encoder-free models

  • Simple yet effective training towards pure VLMs
  • Key aspects of training encoder-free VLMs efficiently via …
    • (1) Bridging V-L representation inside “UNIFIED” decoder
    • (2) Enhancing visual recognition capability via extra SUPERVISION


d) EVE (encoder-free vision-language model)

  • Only 35M publicly accessible data
  • Rival the encoder-based VLMs across multiple VL benchmarks.
  • vs. encoder-free VLMs:
    • Significantly outperforms Fuyu-8B
      • Fuyu-8B: Mysterious training procedures and undisclosed training data


2. Introduction

(1) Drawbacks of Encoder-based VLMs

(Figure 2)


a) Image Resolution / Aspect Ratio

(Existing LVMs) Pre-trained with square and “FIXED-size” images

\(\rightarrow\) Forces VLMs to resize, pad, or partition images of varying shapes

\(\rightarrow\) Large layout distortion!
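
A rough worked example (illustrative numbers, not from the paper): squeezing a \(1344 \times 672\) document image into a fixed \(336 \times 336\) input shrinks the width by \(1344/336 = 4\times\) but the height by only \(672/336 = 2\times\), so the original 2:1 layout is compressed twice as much horizontally as vertically.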


b) Deployment Overhead

Undermines computational efficiency (in real-world deployment)

  • Especially when high-resolution images are divided


c) Model Capacity btw LVMs & LLMs

Scale of LLMs: From 1.3B to more than 540B

\(\rightarrow\) How to choose vision encoders of matching capacity so that LVMs and LLMs can each exercise their full abilities?


(2) Research Question

Is it possible to bypass the constraints of vision encoders and integrate perception and reasoning capabilities into a SINGLE UNIFIED architecture?


Previous work in “encoder-free” VLM

  • Suffer from notably “slow convergence” & “large performance gaps”! (vs. encoder-based VLMs)
    • e.g., Fuyu-8B vs. LLaVA-1.5


Essential problems of constructing encoder-free VLMs from scratch?

  • (1) Representation Unity and Emergent Ability

    • Lack of high-quality image-text data!

    • But plenty of language data

      \(\rightarrow\) \(\therefore\) Position LLMs as a central pivot

      + Compel LLMs per se to develop visual perception

      ( while preserving original linguistic proficiency )

    • Findings: Before scaling up pre-training data…

      \(\rightarrow\) VL pre-aligning from an LLM-centric perspective is important!

      ( Prevents model collapse and optimization interference )

  • (2) Visual Recognition Capability

    • CL, MIM, NTP tasks:

      • Pros) Attempt to prompt visual backbones to produce highly compressed holistic semantics
      • Cons) But frequently neglect fine-grained visual clues!
    • Proposal: Transmit visual signals almost losslessly into encoder-free VLMs!

      \(\rightarrow\) Allow VLMs to autonomously acquire the necessary visual-semantic information

      + Also sidesteps the expensive re-training process of visual encoders for arbitrary image shapes inside encoder-based VLMs!


(3) EVE-7B

Encoder-free VLM (Decoder-only VLM)

  • Arch: Vicuna-7B
  • Trained with two 8-A100 (40G) nodes in ~9 days


Properties

  • (1) Naturally supports high-resolution images with arbitrary aspect ratios
  • (2) 35M publicly accessible data
  • (3) Rival the encoder-based VLMs of similar capacities across multiple vision-language benchmarks
    • Significantly outperforms the counterpart Fuyu-8B


3. Related Work

(1) Encoder-based VLM

In terms of open-source VLMs, existing methods

  • BLIP series [42, 43, 12]
  • LLaVA series [50, 49, 51]
  • Emu series [72, 70]
  • Intern-VL [8, 9]

\(\rightarrow\) Employ simple intermediate layers to bridge the gap between LVMs and LLMs.
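
As a rough illustration of such an intermediate layer, the sketch below shows a LLaVA-style two-layer MLP connector that maps frozen vision-encoder patch features into the LLM embedding space (the dimensions and token count are placeholder assumptions, not any specific model's configuration):

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative LLaVA-style connector: maps vision-encoder patch features
    into the LLM token-embedding space (dimensions are placeholders)."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats):            # (B, N_patches, vision_dim)
        return self.mlp(patch_feats)           # (B, N_patches, llm_dim)

# e.g., 576 patch tokens from a ViT-style encoder -> LLM input embeddings
vision_tokens = torch.randn(1, 576, 1024)
llm_tokens = VisionProjector()(vision_tokens)  # shape (1, 576, 4096)
```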


Recent studies [48, 49, 20, 28]

  • Recognized the significance of input image resolution & aspect ratio

    • For visual perception and cognition,
      • e.g., document, chart, table, and infographic data.
  • However, limited by pre-trained resolution, vision encoders are …

    • Compelled to partition images into multiple slices
    • Or explore a dual-path architecture for low-resolution and high-resolution images respectively

    \(\rightarrow\) Resulting in significant image distortion, fragmented relationship between image slices, and additional computational consumption.


+ As the capacity of vision encoders scales up…

\(\rightarrow\) Deployment efficiency of vision models \(\downarrow\)


No definitive conclusion! (1) vs. (2)

  • (1) Some studies [49, 51] highlight the notable benefits of substituting CLIP-ViT-B with the stronger CLIP-ViT-L-336px when enhancing multimodal models alongside Vicuna-7B [10].
  • (2) Other findings [65] indicate that larger vision encoders may not be necessary, as features of multi-scale smaller ones can approximate their performance.


This paper:

Explore a pure decoder-only VLM excluding vision encoders

+ Integrate VL understanding and reasoning capabilities into one unified architecture


Effect: Bypass the inherent problems inside encoder-based VLMs

  • ex 1) Input constraints of pre-trained vision encoders
  • ex 2) Inefficiency issues of application deployment
  • ex 3) Tricky capacity trade-offs between LVMs and LLMs


(2) Encoder-free VLM

Fuyu-8B

  • (1) Decoder-only network

    • Processes image inputs without relying on an image encoder
  • (2) Handles high-resolution images with arbitrary aspect ratios

    ( \(\because\) Image patches are fed directly into the model through a simple linear projection layer )
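
A minimal sketch of this patch-as-token idea (the patch size, crop handling, and hidden dimension are assumptions for illustration, not Fuyu-8B's exact configuration):

```python
import torch
import torch.nn as nn

def patchify(image, patch=30):
    """Split an image tensor (C, H, W) into flattened raster-order patches."""
    c, h, w = image.shape
    image = image[:, : h - h % patch, : w - w % patch]       # drop ragged edges
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)

# one linear layer maps raw patches straight into the decoder's embedding space
proj = nn.Linear(3 * 30 * 30, 4096)
img = torch.randn(3, 480, 640)                # arbitrary resolution / aspect ratio
img_tokens = proj(patchify(img))              # (num_patches, 4096)
```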


Limitation of Fuyu-8B

  • Achieves only average performance across VL benchmarks
  • Lacks transparency in training strategies and data sources


Effect of Fuyu-8B

This straightforward architecture has inspired further research

  • which focuses on developing powerful supervised instruction datasets to further enhance application capabilities.


Proposal

Developing pure VLMs

+ Breaking down the barriers between encoder-based and encoder-free VLMs.


Two crucial lessons

  • (1) Before scaling up pre-training data, it is essential to prioritize VL pre-alignment from an LLM-centric perspective.
    • Stabilizes the training process
    • Alleviates optimization interference for integrating visual and linguistic information
  • (2) Enhancing image recognition capability via visual representation supervision and language conceptual alignment generates stronger visual representations


4. Methodology

(1) Model Architecture

(Figure 2)

  • (1) Decoder-only EVE: built on Vicuna-7B
  • (2) Lightweight patch embedding layer


Two losses

  • (1) Attempt to align patch features with pair-wise ones from the vision encoder (VE)
    • Through a hierarchical patch aligning layer.
  • (2) EVE predicts next-word labels
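
A minimal sketch of how the two objectives above could be combined during training (shapes and the loss weight are assumptions, not the paper's exact settings):

```python
import torch
import torch.nn.functional as F

def eve_style_loss(text_logits, text_labels, patch_feats, ve_feats, mse_weight=1.0):
    """Illustrative combination of the two supervisions described above:
    (1) cross-entropy on next-word prediction,
    (2) MSE aligning EVE's patch features with frozen vision-encoder (VE) features."""
    ce = F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)),
                         text_labels.reshape(-1), ignore_index=-100)
    mse = F.mse_loss(patch_feats, ve_feats)
    return ce + mse_weight * mse

# dummy shapes: batch 2, 16 text tokens, 32k vocab, 64 patch tokens, dim 1024
loss = eve_style_loss(torch.randn(2, 16, 32000),
                      torch.randint(0, 32000, (2, 16)),
                      torch.randn(2, 64, 1024),
                      torch.randn(2, 64, 1024))
```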



a) Patch Embedding Layer (PEL)

[Goal] To transmit images almost losslessly

  • Rather than using deep encoders or tokenizers


[Input] Image with (H, W) resolution


[Procedure]

  • Step 1) Convolution layer
    • To obtain a 2-D feature map with (h, w)
  • Step 2) Average pooling layer
  • Step 3) Cross-Attention (CA1) layer
  • Step 4) Cross-Attention (CA2) layer
    • Btw a special token and all patch features
    • Output: Serves as the starting symbol of the image & provides holistic information for patch features
  • Step 5) Learnable newline token
    • Considering the varying aspect ratios of image inputs, we insert a learnable newline token at the end of each row of patch features.
    • Helps the network understand the 2-D spatial structure and dependencies of the image.
  • Step 6) Flatten & NN
    • Flatten these features
    • Pass them through a two-layer NN
  • Step 7) Concat with text
    • Concatenate with text embeddings into one unified decoder-only architecture.
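
A minimal PyTorch sketch of Steps 1–7 above, assuming a single image input; the kernel size, pooling factor, attention configuration, and dimensions are illustrative placeholders rather than the paper's exact PEL design:

```python
import torch
import torch.nn as nn

class PatchEmbeddingLayer(nn.Module):
    """Rough sketch of the PEL steps listed above (all hyper-parameters are
    illustrative assumptions)."""
    def __init__(self, dim=1024, llm_dim=4096, patch=14, pool=2):
        super().__init__()
        self.conv = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)    # Step 1
        self.pool = nn.AvgPool2d(pool)                                    # Step 2
        self.ca1 = nn.MultiheadAttention(dim, 8, batch_first=True)        # Step 3
        self.ca2 = nn.MultiheadAttention(dim, 8, batch_first=True)        # Step 4
        self.cls = nn.Parameter(torch.randn(1, 1, dim))      # special start token
        self.newline = nn.Parameter(torch.randn(1, 1, dim))  # Step 5: row separator
        self.mlp = nn.Sequential(nn.Linear(dim, llm_dim), nn.GELU(),      # Step 6
                                 nn.Linear(llm_dim, llm_dim))

    def forward(self, image, text_embeds):
        feat = self.conv(image.unsqueeze(0))                  # (1, dim, h, w)
        pooled = self.pool(feat)                              # (1, dim, h', w')
        b, d, h, w = pooled.shape
        q = pooled.flatten(2).transpose(1, 2)                 # (1, h'*w', dim)
        kv = feat.flatten(2).transpose(1, 2)                  # (1, h*w, dim)
        patches, _ = self.ca1(q, kv, kv)                      # Step 3: refine patches
        start, _ = self.ca2(self.cls.expand(b, -1, -1),       # Step 4: holistic
                            patches, patches)                 #   start-of-image token
        rows = patches.reshape(b, h, w, d)
        rows = torch.cat([rows, self.newline.expand(b, h, 1, d)], dim=2)  # Step 5
        seq = self.mlp(torch.cat([start, rows.flatten(1, 2)], dim=1))     # Step 6
        return torch.cat([seq, text_embeds], dim=1)           # Step 7: concat w/ text

# usage: a non-square image plus 8 (dummy) text-token embeddings
pel = PatchEmbeddingLayer()
fused = pel(torch.randn(3, 336, 448), torch.randn(1, 8, 4096))
```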


b) Patch Aligning Layer (PAL)

[Goal] To facilitate fine-grained representations
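
A hedged sketch of what such a layer could look like: hidden states at the image-token positions are gathered from several decoder layers, fused, projected to the vision encoder's dimension, and supervised with an MSE loss against the frozen VE features (the layer selection, fusion scheme, and dimensions are assumptions for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchAligningLayer(nn.Module):
    """Illustrative PAL: aggregate image-token hidden states from several
    decoder layers and align them with frozen vision-encoder (VE) features."""
    def __init__(self, llm_dim=4096, ve_dim=1024, num_layers=4):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.ones(num_layers) / num_layers)
        self.proj = nn.Linear(llm_dim, ve_dim)

    def forward(self, hidden_states, ve_feats):
        # hidden_states: list of (B, N_img, llm_dim) tensors taken at the image
        # positions of several decoder layers; ve_feats: (B, N_img, ve_dim)
        w = torch.softmax(self.layer_weights, dim=0)
        fused = sum(wi * h for wi, h in zip(w, hidden_states))
        return F.mse_loss(self.proj(fused), ve_feats)         # alignment loss

# dummy usage: 4 decoder layers, 64 image tokens
pal = PatchAligningLayer()
loss = pal([torch.randn(2, 64, 4096) for _ in range(4)], torch.randn(2, 64, 1024))
```

This loss only acts as training-time supervision; as noted below, PAL is removed at inference.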


(2) Training Procedure

(Figure 2)

Three successive stages: Train EVE with …

  • (1) Publicly available image data captioned by existing VLMs
  • (2) Diverse QA data
  • (3) Multi-modality dialogue datasets

(Remove PAL supervision during inference)


Step 1) LLM-guided Pre-training

[Goal] Initial connection between V&L modalities

[Dataset] Publicly available web-scale data (EVE-cap33M)

(Figure 2)


EVE-cap33M: Remove noisy text captions

\(\rightarrow\) Reproduce 33M high-quality descriptions

  • via Emu2 (17B) and LLaVA-1.5 (13B)


[Trainable layers]

  • Patch Embedding Layer
  • Patch Aligning Layer
  • (LLM: Vicuna-7B is frozen)
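
A minimal sketch of this Stage-1 freezing scheme; the module attribute names (`patch_embedding`, `patch_aligning`, `llm`) are hypothetical stand-ins, not the released code's identifiers:

```python
import torch.nn as nn

class DummyEVE(nn.Module):
    """Toy stand-in with the three module groups referenced in the notes
    (hypothetical attribute names, just to make the freezing logic concrete)."""
    def __init__(self):
        super().__init__()
        self.patch_embedding = nn.Linear(8, 8)
        self.patch_aligning = nn.Linear(8, 8)
        self.llm = nn.Linear(8, 8)          # stands in for frozen Vicuna-7B

def set_stage1_trainable(model):
    for p in model.parameters():
        p.requires_grad = False                       # freeze everything first
    for m in (model.patch_embedding, model.patch_aligning):
        for p in m.parameters():
            p.requires_grad = True                    # train only PEL and PAL

model = DummyEVE()
set_stage1_trainable(model)
assert all(not p.requires_grad for p in model.llm.parameters())
```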


Details:

  • Only adopt 16M of 33M image-text data (EVE-cap16/33M) in this stage.
  • Use both (text) CE loss and (image) MSE loss


Findings:

  • Stage 1 does count for efficient training!!

    ( \(\because\) Prevents collapse and accelerates convergence throughout the entire process )


Step 2) Generative Pre-training

[Goal] Train all modules!

Details:

  • Use of all 33M image-text pairs (EVE-cap33M)
  • Keep both (text) CE loss and (image) MSE loss


Findings:

  • Multi-modality performance gradually increases
  • However, language capability declines significantly


Step 3) SFT

[Goal] Train all modules with multi-modality dialogue datasets to strengthen instruction-following ability


5. Experiments

Public visual-language benchmarks

  • (1) Academic-task-oriented benchmarks (VQA-v2 [25], GQA [29], VizWiz [26], and TextVQA [67])
  • (2) Hallucination benchmarks (POPE [47])
  • (3) Open-world multi-modal understanding benchmarks (MME [23], MMBench [52], SEED-Bench [41], and MM-Vet [89])
  • (4) Scientific problem benchmarks (ScienceQA-IMG [54]).