Visual Instruction Tuning

https://arxiv.org/pdf/2304.08485


1. Abstract

Previous works)

  • Instruction tuning with “machine-generated” instruction-following data
  • Improve zero-shot capabilities on new tasks

\(\rightarrow\) Limitation: Less explored in the multimodal field


Proposal: LLaVA (Large Language and Vision Assistant)

  • (1) Goal: Generate a multimodal “instruction-following” dataset
  • (2) How: Use language-only GPT-4 to generate “multimodal” language-image instruction-following data
  • (3) LLaVA
    • Instruction tuned on the above generated data
    • End-to-end trained large multimodal model
    • Connects a vision encoder & LLM for general-purpose visual and language understanding


2. Contributions

  1. Multimodal instruction-following data
    • Previous) Lack of vision-language instruction-following data
    • Proposal: Construct such instruction-following dataset
      • With language-only ChatGPT/GPT-4.
      • Via a data reformation perspective
  2. Large multimodal models
    • Large multimodal model (LMM)
    • Connecting (1) & (2)
      • (1) Open-set visual encoder of CLIP
      • (2) Language decoder Vicuna
    • Fine-tuning end-to-end on our generated instructional vision-language data.
  3. Multimodal instruction-following benchmark
    • Present LLaVA-Bench with two challenging benchmarks: LLaVA-Bench (COCO) and LLaVA-Bench (In-the-Wild)
  4. Open-source
    • Release the generated multimodal instruction data, codebase, model checkpoints, and a visual chat demo


3. GPT-assisted Visual Instruction Data Generation

(1) Lack of multimodal instruction-following data

Public multimodal data (e.g., image-text pairs): CC, LAION

\(\rightarrow\) But multimodal instruction-following data remains limited!

  • Creating such data via human crowd-sourcing is time-consuming and less well-defined!


Proposal:

Leverage ChatGPT/GPT-4 for multimodal instruction-following data collection


(2) Naive way

  • (1) Image \(\mathrm{X}_{\mathrm{v}}\) & Caption \(\mathrm{X}_{\mathrm{c}}\)

  • (2) Set of questions \(\mathrm{X}_{\mathrm{q}}\)

    • To instruct the assistant to describe the image content

    \(\rightarrow\) Prompt GPT-4 to curate such a list of questions!


Simple way to expand an image-caption pair into an instruction-following sample:

  • \(\text{Human}: \mathbf{X}_{\mathrm{q}}\ \mathbf{X}_{\mathrm{v}}\,\texttt{<STOP>}\;\; \text{Assistant}: \mathbf{X}_{\mathrm{c}}\,\texttt{<STOP>}\)
  • Pros & Cons
    • Pros) Cheap to construct
    • Cons) Lacks diversity and in-depth reasoning!

[Figure 2]
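
A minimal Python sketch of how this naive single-turn format could be assembled from an image-caption pair; the question list, helper name, and image placeholder token are illustrative assumptions, not from the paper:

```python
import random

# Hypothetical list of "describe the image" questions, of the kind the paper
# prompts GPT-4 to curate (these exact strings are illustrative).
DESCRIBE_QUESTIONS = [
    "Describe the image concisely.",
    "What is happening in this picture?",
    "Give a detailed account of the visual content.",
]

def naive_sample(image_token: str, caption: str) -> str:
    """Build one naive single-turn sample: Human: X_q X_v <STOP> Assistant: X_c <STOP>."""
    question = random.choice(DESCRIBE_QUESTIONS)
    return f"Human: {question} {image_token}<STOP> Assistant: {caption}<STOP>"

print(naive_sample("<image>", "A dog catching a frisbee in a park."))
```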


(3) Proposal

Leverage language-only GPT-4 or ChatGPT as the strong teacher

\(\rightarrow\) But these models accept only text as input, so how can they create instruction-following data involving visual content?


How to encode an image so that its visual content can be fed to a text-only GPT?

\(\rightarrow\) Use 2 types of symbolic representations

  • (1) Captions
    • Typically describe the visual scene from various perspectives
  • (2) Bounding boxes
    • Usually localize the objects in the scene
    • Each box encodes the object concept and its spatial location

\(\rightarrow\) This enables encoding the image as a sequence that an LLM can recognize

[Figure 2]
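
A minimal sketch of serializing the two symbolic representations (captions and bounding boxes) into a text context for a language-only model; the exact wording and the helper below are assumptions:

```python
def symbolic_context(captions, boxes):
    """captions: list[str]; boxes: list of (category, (x1, y1, x2, y2)) in normalized coordinates."""
    lines = ["Captions:"]
    lines += [f"- {c}" for c in captions]
    lines.append("Objects (category, normalized bounding box):")
    lines += [
        f"- {cat}: ({x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f})"
        for cat, (x1, y1, x2, y2) in boxes
    ]
    return "\n".join(lines)

context = symbolic_context(
    captions=["A man riding a bicycle down a city street."],
    boxes=[("person", (0.31, 0.20, 0.58, 0.85)), ("bicycle", (0.28, 0.45, 0.62, 0.95))],
)
# `context` is then placed into a prompt asking ChatGPT/GPT-4 to generate
# conversation, detailed-description, or complex-reasoning data for the image.
print(context)
```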


(Generated) instruction tuning dataset

Dataset used: COCO images

Datasets generated: Three types of instruction-following data (158K unique samples in total)!

  • (1) Conversation (58K)
  • (2) Detailed description (23K)
  • (3) Complex reasoning (77K)

[Figure 2]
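
For concreteness, an illustrative (made-up) sample showing the shape of the three response types for one image; the field names and contents below are assumptions, not taken from the released dataset:

```python
example = {
    # (1) Conversation: multi-turn QA, answered as if the assistant were seeing the image
    "conversation": [
        {"from": "human", "value": "What is the man in the image doing?"},
        {"from": "gpt", "value": "He is riding a bicycle down a busy city street."},
    ],
    # (2) Detailed description: a comprehensive account of the scene
    "detailed_description": (
        "A man rides a bicycle along a city street lined with parked cars, "
        "with storefronts and pedestrians visible in the background."
    ),
    # (3) Complex reasoning: a question requiring step-by-step inference about the scene
    "complex_reasoning": {
        "question": "What challenges might the cyclist face on this street?",
        "answer": (
            "Parked cars and pedestrians narrow the usable lane, so the cyclist "
            "must stay alert for opening car doors and people stepping off the curb."
        ),
    },
}
```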


4. Visual Instruction Tuning

(1) Architecture

[Figure 2]

Goal: Leverage the capabilities of both the (1) pre-trained LLM and (2) visual model

Architecture

  • (1) LLM (Vicuna): \(f_\phi(\cdot)\)
    • Has the best instruction-following capabilities in language tasks among publicly available checkpoints
  • (2) Vision encoder (CLIP ViT-L/14): \(g(\cdot)\)
    • Provides the visual feature \(\mathbf{Z}_{\mathrm{v}}=g\left(\mathbf{X}_{\mathrm{v}}\right)\).
  • (3) Simple linear layer
    • To connect image features into the word embedding space
    • Trainable projection matrix \(\mathbf{W}\)
    • \(\mathbf{H}_{\mathrm{v}}=\mathbf{W} \cdot \mathbf{Z}_v, \text { with } \mathbf{Z}_v=g\left(\mathbf{X}_v\right)\).


Regarding (3) Simple linear layer…

  • LLaVA: a simple, lightweight linear projection
  • Flamingo (gated cross-attention) & BLIP-2 (Q-Former): more complex connectors
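
A minimal PyTorch sketch of the projection \(\mathbf{H}_{\mathrm{v}}=\mathbf{W}\cdot\mathbf{Z}_{\mathrm{v}}\); the class name and the feature dimensions (1024-d CLIP ViT-L/14 features, 4096-d LLM embeddings, 256 visual tokens) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Trainable projection W that maps frozen CLIP features Z_v into the LLM word-embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # a single linear layer, as in LLaVA

    def forward(self, z_v: torch.Tensor) -> torch.Tensor:
        # z_v: (batch, num_visual_tokens, vision_dim), produced by the frozen vision encoder g(X_v)
        return self.proj(z_v)  # H_v: (batch, num_visual_tokens, llm_dim)

z_v = torch.randn(2, 256, 1024)   # stand-in for Z_v = g(X_v)
h_v = VisionProjector()(z_v)      # visual tokens ready to be concatenated with word embeddings
print(h_v.shape)                  # torch.Size([2, 256, 4096])
```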


(2) Training

a) Dataset format

For each image \(\mathbf{X}_{\mathrm{v}}\)…

\(\rightarrow\) Multi-turn conversation data \(\left(\mathbf{X}_q^1, \mathbf{X}_{\mathrm{a}}^1, \ldots, \mathbf{X}_q^T, \mathbf{X}_{\mathrm{a}}^T\right)\)


Instruction \(\mathbf{X}_{\text{instruct}}^t\) at the \(t\)-th turn:

  • \(\mathbf{X}_{\text{instruct}}^t=\left\{\begin{array}{ll}\text{Randomly choose }\left[\mathbf{X}_{\mathrm{q}}^1, \mathbf{X}_{\mathrm{v}}\right]\text{ or }\left[\mathbf{X}_{\mathrm{v}}, \mathbf{X}_{\mathrm{q}}^1\right], & \text{the first turn } t=1 \\ \mathbf{X}_{\mathrm{q}}^t, & \text{the remaining turns } t>1\end{array}\right.\).


Result: Unified format for the multimodal instruction-following sequence

[Figure 2]
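
A minimal sketch of assembling one such unified multi-turn sequence; the system-message text is the commonly used Vicuna-style prompt and, like the image placeholder token and the example turn, is an assumption here:

```python
import random

SYSTEM = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)

def build_sequence(turns, image_token="<image>"):
    """turns: list of (X_q^t, X_a^t) pairs for one image X_v."""
    parts = [SYSTEM + "<STOP>"]
    for t, (q, a) in enumerate(turns):
        if t == 0:
            # First turn: the image is randomly placed before or after the question.
            q = f"{image_token}\n{q}" if random.random() < 0.5 else f"{q}\n{image_token}"
        parts.append(f"Human: {q}<STOP> Assistant: {a}<STOP>")
    return " ".join(parts)

seq = build_sequence([("What is unusual about this image?",
                       "A man is ironing clothes on the back of a moving taxi.")])
print(seq)
```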


b) Loss function

  • \(p\left(\mathbf{X}_{\mathrm{a}} \mid \mathbf{X}_{\mathrm{v}}, \mathbf{X}_{\text{instruct}}\right)=\prod_{i=1}^{L} p_{\theta}\left(x_i \mid \mathbf{X}_{\mathrm{v}}, \mathbf{X}_{\text{instruct},<i}, \mathbf{X}_{\mathrm{a},<i}\right)\).
    • where \(\theta\) denotes the trainable parameters and \(L\) the sequence length
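
A minimal PyTorch sketch of this objective: standard next-token prediction in which system/instruction tokens are masked out, so the loss is computed only over the answer tokens \(\mathbf{X}_{\mathrm{a}}\). Tensor shapes and helper names are assumptions:

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross_entropy

def answer_only_labels(input_ids: torch.Tensor, answer_mask: torch.Tensor) -> torch.Tensor:
    """answer_mask: 1 where the token belongs to X_a, 0 for system/instruction tokens."""
    labels = input_ids.clone()
    labels[answer_mask == 0] = IGNORE_INDEX
    return labels

def lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Shift so position i predicts token i+1, as in standard causal LM training.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )

logits = torch.randn(1, 10, 32000)                      # (batch, seq_len, vocab) from the LMM
input_ids = torch.randint(0, 32000, (1, 10))
answer_mask = torch.tensor([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1]])  # last 5 tokens are the answer
loss = lm_loss(logits, answer_only_labels(input_ids, answer_mask))
print(loss.item())
```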


c) [Stage 1] Pre-training for Feature Alignment

  • Filter CC3M to 595K image-text pairs
  • Freeze & Train
    • Freeze: LLM & Visual Encoder
    • Train: Projection matrix


d) [Stage 2] Fine-tuning End-to-End

  • Freeze & Train
    • Freeze: Visual Encoder
    • Train: LLM & Projection matrix
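
A minimal sketch of this two-stage freeze/train schedule via `requires_grad`; the submodule names (`vision_encoder`, `projector`, `llm`) are placeholders, not the names used in the released code:

```python
def set_trainable(module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage: int) -> None:
    if stage == 1:
        # [Stage 1] Pre-training for feature alignment on the filtered CC3M 595K pairs:
        # only the projection matrix W is updated.
        set_trainable(model.vision_encoder, False)
        set_trainable(model.llm, False)
        set_trainable(model.projector, True)
    elif stage == 2:
        # [Stage 2] End-to-end fine-tuning on the generated instruction data:
        # the visual encoder stays frozen; the LLM and the projection are updated.
        set_trainable(model.vision_encoder, False)
        set_trainable(model.llm, True)
        set_trainable(model.projector, True)
```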