Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

https://arxiv.org/pdf/2308.12966


Contents

  1. Abstract
  2. Introduction
  3. Methodology
    1. Architecture
    2. Inputs & Outputs
  4. Training
    1. Stage 1: Pre-training
    2. Stage 2: Multi-task Pre-training
    3. SFT


1. Abstract

Qwen-VL series

  • Set of large-scale VLMs
  • Endow Qwen-LM with visual capacity using
    • (1) Visual receptor
    • (2) Input-output interface
    • (3) 3-stage training pipeline
    • (4) Multilingual multimodal cleaned corpus.
  • Conventional Task
    • Image description
    • Question-answering
  • New Task
    • Grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples
  • Proposed model
    • Qwen-VL
    • Qwen-VL-Chat


2. Introduction

P1. Trend of LLM & VLMs

LLMs

  • Further aligned with user intent through instruction tuning

Limitation of LLMs

  • Lacking the ability to handle other common modalities

Solution: Large Vision Language Models (VLMs)

  • Enhance LLMs with the ability to perceive and understand visual signals


P2. Limitation of VLMs

(Current open-source) VLMs

  • (1) Suffer from inadequate training and optimization

  • (2) Real-world visual scenarios: Complicated

    \(\rightarrow\) Fine-grained visual understanding plays a crucial role for VLMs

    \(\rightarrow\) But only a few attempts have been made in this direction!

    ( Most existing models remain at a coarse-grained level )


P3. Proposal: Qwen-VL series

VLMs based on Qwen-7B

  • Empower Qwen with visual capacity
  • (1) Visual receptor
    • a) Language-aligned visual encoder
    • b) Position-aware adapter
  • (2) Concise input-output interface
  • (3) 3-Stage training pipeline


P4. Qwen-VL

Qwen-VL

  • Pretrained checkpoint = called Qwen-VL
  • Capable of perceiving and understanding visual inputs
  • Diverse tasks:
    • Image captioning
    • Question answering
    • Text-oriented question answering
    • Visual grounding.


Qwen-VL-Chat

  • Instruction-tuned VL chatbot based on Qwen-VL



P5. Features of the Qwen-VL series models

  1. Leading performance
    • Top-tier accuracy
      • On a wide range of vision-centric understanding benchmarks
      • Compared to counterparts with similar scales.
    • Outperform in both (a) & (b)
      • (a) Conventional benchmarks
        • e.g., captioning, question-answering, grounding
      • (b) Recently introduced dialogue benchmarks
  2. Multi-lingual
    • (Similar to Qwen-LM) Trained upon multilingual image-text data
      • Support English, Chinese, and multilingual instructions
  3. Multi-image
    • Input = Arbitrarily interleaved image-text data
    • Compare, understand, and analyze the context when multiple images are given
  4. Fine-grained visual understanding
    • Higher-resolution input size & Fine-grained corpus
    • Highly competitive fine-grained visual understanding ability


3. Methodology

(1) Architecture

Three components


  • (1) LLM: Qwen-7B

  • (2) Visual Encoder: ViT

    • Initialized with pre-trained weights from OpenCLIP’s ViT-bigG
    • Input images are resized to a specific resolution before entering the ViT
  • (3) Position-aware Vision-Language Adapter

    • To alleviate the efficiency issues caused by long image feature sequences

    • Compresses the image features

    • With a single-layer cross-attention module (see the sketch after this list)

      • Query: Group of trainable vectors
      • Keys: Image features from the visual encoder

      ( + 2D absolute positional encodings added to the query-key pairs )
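
A minimal PyTorch sketch of the adapter described above: a fixed set of learnable queries cross-attends to the ViT patch features and compresses them into a fixed-length sequence. The 256-query compression follows the paper; the feature dimensions, head count, use of `nn.MultiheadAttention`, and the output projection to the LLM width are illustrative assumptions.

```python
# Sketch of the position-aware adapter. Only the "256 learnable queries +
# single-layer cross-attention + 2D positional encodings on query-key pairs"
# structure follows the paper; the concrete dimensions are assumptions.
import torch
import torch.nn as nn

class PositionAwareAdapter(nn.Module):
    def __init__(self, vit_dim=1664, lm_dim=4096, num_queries=256, num_heads=16):
        super().__init__()
        # Group of trainable query vectors the image is compressed into
        self.queries = nn.Parameter(torch.randn(num_queries, vit_dim) * 0.02)
        # Single-layer cross-attention: queries attend to the ViT patch features
        self.cross_attn = nn.MultiheadAttention(vit_dim, num_heads, batch_first=True)
        # Project compressed features into the LLM embedding space
        self.proj = nn.Linear(vit_dim, lm_dim)

    def forward(self, patch_feats, query_pos, patch_pos):
        # patch_feats: (B, N_patches, vit_dim) from the visual encoder
        # query_pos:   (num_queries, vit_dim) 2D absolute positional encoding for queries
        # patch_pos:   (N_patches, vit_dim)   2D absolute positional encoding for keys
        B = patch_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1) + query_pos
        k = patch_feats + patch_pos          # positions added to the query-key pairs
        out, _ = self.cross_attn(query=q, key=k, value=patch_feats)
        return self.proj(out)                # (B, num_queries, lm_dim): fixed-length image sequence
```

For reference, a 448 × 448 input split into 14 × 14 patches yields 1024 patch features, which the adapter compresses to a fixed 256 regardless of input size.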


(2) Inputs & Outputs

a) Image (Input)

Processing Images:

  • Model: Visual encoder and adapter

  • Output: Fixed-length sequences of image features


How to differentiate image & text features?

  • Two special tokens (<img> & </img>) are added at the beginning & end of the image feature sequence (see the sketch below)
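
A hedged sketch of how an input sequence could be assembled around these markers. Only the <img> / </img> delimiters come from the paper; the `tokenizer`/`embed` interfaces and the helper itself are hypothetical.

```python
# Hypothetical helper showing where <img> ... </img> sit relative to the
# compressed image features; tokenizer/embedding interfaces are assumptions.
import torch

def build_inputs(text_before, image_feats, text_after, tokenizer, embed):
    """image_feats: (num_queries, lm_dim) tensor produced by the adapter."""
    def emb(s):
        ids = torch.tensor(tokenizer.encode(s), dtype=torch.long)
        return embed(ids)                                   # (len, lm_dim)

    # The image block is delimited by the two special tokens so the LLM can
    # tell image features apart from ordinary text embeddings.
    return torch.cat(
        [emb(text_before), emb("<img>"), image_feats, emb("</img>"), emb(text_after)],
        dim=0,
    )
```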


b) Bounding Box (Input and Output)

[Goal] Enhance the model’s capacity for fine-grained visual understanding and grounding


[How] Use data in the form of region descriptions, questions, and detections

\(\rightarrow\) Necessitates the model’s accurate understanding and generation of region descriptions in a designated format!

  • a) Normalization: Bounding box coordinates are scaled into the range [0, 1000)
  • b) Transformation: Into a specified string format: \((X_{\text{topleft}}, Y_{\text{topleft}}), (X_{\text{bottomright}}, Y_{\text{bottomright}})\)
  • c) Tokenization: Tokenize string as text
  • d) New tokens: How to distinguish a detection string from an ordinary text string?
    • Two special tokens (<box> and </box>) enclose the box string
    • Another set of special tokens (<ref> and </ref>)
      • Associates a bounding box with the descriptive words it refers to, i.e., marks the content referred to by the box (see the sketch after this list)
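
A minimal sketch of steps a)–d). The [0, 1000) normalization, the coordinate string format, and the <box>/<ref> markers follow the paper; the rounding and the helper's signature are assumptions.

```python
def format_box(box, img_w, img_h, phrase=None):
    """box = (x1, y1, x2, y2) in pixels; returns the detection string fed to the tokenizer."""
    x1, y1, x2, y2 = box
    # a) normalize coordinates into the range [0, 1000)
    nx1, nx2 = int(x1 / img_w * 1000), int(x2 / img_w * 1000)
    ny1, ny2 = int(y1 / img_h * 1000), int(y2 / img_h * 1000)
    # b) + c) serialize as "(X_topleft,Y_topleft),(X_bottomright,Y_bottomright)",
    #    wrap in <box>...</box>, and later tokenize the string as ordinary text
    box_str = f"<box>({nx1},{ny1}),({nx2},{ny2})</box>"
    # d) optionally mark the content the box refers to with <ref>...</ref>
    if phrase is not None:
        box_str = f"<ref>{phrase}</ref>" + box_str
    return box_str

# format_box((120, 40, 480, 360), 640, 480, "a dog")
#   -> '<ref>a dog</ref><box>(187,83),(750,750)</box>'
```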


4. Training


The training process of the Qwen-VL model consists of 3 stages

  • (1) Two stages of pre-training
  • (2) Final stage of instruction fine-tuning


(1) Stage 1: Pre-training

a) Dataset

Large-scale, weakly labeled, web-crawled set of image-text pairs

  • Several publicly accessible sources
  • Some in-house data.


Clean the dataset by filtering out pairs that match certain noisy patterns (see the sketch below)

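
A rough, assumption-laden sketch of what this kind of rule-based filtering might look like; every threshold and criterion below is illustrative, not the paper's exact value.

```python
import re

def keep_pair(img_w, img_h, caption, clip_score):
    """Return True if a web-crawled image-text pair survives the filters (all thresholds assumed)."""
    aspect = max(img_w, img_h) / max(1, min(img_w, img_h))
    if aspect > 3:                                  # drop extreme aspect ratios
        return False
    if min(img_w, img_h) < 112:                     # drop images that are too small
        return False
    if clip_score < 0.3:                            # drop weakly aligned image-text pairs
        return False
    if not (2 <= len(caption.split()) <= 128):      # drop too-short / too-long captions
        return False
    if re.search(r"<[^>]+>", caption):              # drop captions with leftover HTML markup
        return False
    return True
```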


b) Details

  • (1) Freeze & Train

    • Freeze: LLM

    • Train: Vision encoder & VL adapter

  • (2) Image size = Resized to 224 × 224

  • (3) Objective = Cross-entropy over the text tokens (language modeling loss); see the sketch below
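
A hedged sketch of one Stage-1 training step under these settings. The freeze/train split and the text-token cross-entropy follow the list above; the `model` interface, label-masking convention, and optimizer handling are assumptions.

```python
import torch.nn.functional as F

def stage1_step(model, batch, optimizer):
    """model bundles visual encoder (ViT), adapter, and LLM (interface assumed)."""
    # (1) Freeze the LLM; only the vision encoder and VL adapter receive gradients
    for p in model.llm.parameters():
        p.requires_grad = False

    # (2) batch["pixels"] holds images already resized to 224 x 224
    logits = model(pixels=batch["pixels"], input_ids=batch["input_ids"])

    # (3) Language-modeling objective: cross-entropy over text tokens only
    #     (positions covering image features are masked with label -100)
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["labels"][:, 1:].reshape(-1),
        ignore_index=-100,
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```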


(2) Stage 2: Multi-task Pre-training

a) Dataset

High-quality and fine-grained VL annotation data

  • With a larger input resolution

Format: Interleaved image-text data.




b) Multi-task

Train Qwen-VL on 7 tasks simultaneously


c) Details

  • (1) Freeze & Train

    • Freeze: none

    • Train: All

  • (2) Image size = Increase the resolution from 224 × 224 to 448 × 448

  • (3) Objective = Cross-entropy over the text tokens (language modeling loss); the differences from Stage 1 are sketched below
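
Relative to Stage 1, only the trainability and the input resolution change while the objective stays the same. A small illustrative helper (module names `vit` / `adapter` / `llm` are assumptions) makes the switch explicit.

```python
# Illustrative stage configs; only the freeze/train split and the 224 -> 448
# resolution change come from the notes above.
STAGE1 = dict(trainable={"vit", "adapter"}, image_size=224)
STAGE2 = dict(trainable={"vit", "adapter", "llm"}, image_size=448)

def configure_stage(model, cfg):
    for name in ("vit", "adapter", "llm"):
        trainable = name in cfg["trainable"]
        for p in getattr(model, name).parameters():
            p.requires_grad = trainable
    model.image_size = cfg["image_size"]   # 224 -> 448 when moving from Stage 1 to Stage 2
```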


(3) SFT

Fine-tune the Qwen-VL pre-trained model through instruction fine-tuning

\(\rightarrow\) To enhance its instruction-following and dialogue capabilities

\(\rightarrow\) Result: The interactive Qwen-VL-Chat model (dialogue format sketched below)
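
A hedged sketch of how a multimodal instruction-tuning dialogue might be serialized at this stage. The ChatML-style role markers follow the Qwen convention, but the exact token spellings, the inline "Picture N:" image reference, and the helper itself are assumptions.

```python
def build_chat_prompt(turns):
    """turns: list of (role, content) pairs; images appear inline as <img>...</img> blocks."""
    parts = [f"<|im_start|>{role}\n{content}<|im_end|>" for role, content in turns]
    parts.append("<|im_start|>assistant\n")   # generation continues from here
    return "\n".join(parts)

# Example (file name purely illustrative):
# build_chat_prompt([
#     ("system", "You are a helpful assistant."),
#     ("user", "Picture 1: <img>demo.jpeg</img>\nWhat is shown in the image?"),
# ])
```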