NExT-GPT: Any-to-Any Multimodal LLM
Wu, Shengqiong, et al. "Next-gpt: Any-to-any multimodal llm." arXiv preprint arXiv:2309.05519 (2023).
References:
- https://aipapersacademy.com/next-gpt/
- https://arxiv.org/pdf/2309.05519
Contents
- Introduction
- Framework
- Three Stages
- Multimodal Encoding Stage
- LLM Understanding and Reasoning Stage
- Multimodal Generation Stage
- Efficient Training
- Examples
- Training NExT-GPT
- Lightweight Multimodal Alignment Learning
- Modality-switching Instruction Tuning (MosIT)
- Experiments
1. Introduction
Multimodal large language model (MM-LLM) = an LLM extended to handle non-text modalities; NExT-GPT targets "any-to-any", i.e., both understanding and generating text, image, audio, and video.
2. Framework
Framework
- (1) Input: possible input modalities: text, image, audio, video, ...
- (2) LLM ( = core of the framework )
- Goal of the LLM
  - Process the input from “all modalities”
  - Guide the generation of outputs for “all modalities”
- But current LLMs can only understand and generate text.
  \(\rightarrow\) How to solve?
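To make the any-to-any idea concrete, here is a minimal toy sketch of the control flow implied by the framework; every component is a stub with made-up names (the real ones are ImageBind encoders, projection layers, a Vicuna LLM, and diffusion decoders):

```python
# Toy control-flow sketch of the any-to-any framework (all components are stubs).
encode   = {"image": lambda x: f"enc({x})", "audio": lambda x: f"enc({x})", "video": lambda x: f"enc({x})"}
project  = lambda emb: f"proj({emb})"                     # input projection -> LLM space
llm      = lambda prompt: ("some text answer", ["IMG"])   # returns text + requested output modalities
generate = {"IMG": lambda cond: "generated image", "AUD": lambda cond: "generated audio"}

def any_to_any(inputs):
    # 1) Encode and project every non-text input so the text-only LLM can consume it
    prompt = " ".join(x if m == "text" else project(encode[m](x)) for m, x in inputs)
    # 2) The LLM understands / reasons and decides which modalities to produce
    text, requested = llm(prompt)
    # 3) Modality-specific decoders generate the non-text outputs
    return text, {m: generate[m](text) for m in requested}

print(any_to_any([("text", "draw a cat"), ("image", "cat.jpg")]))
```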
3. Three Stages
3 main tiers
- (1) Multimodal Encoding Stage
- (2) LLM Understanding and Reasoning Stage
- (3) Multimodal Generation Stage
(1) Multimodal Encoding Stage
Goal: Convert (non-text) inputs \(\rightarrow\) language-like representations that the LLM can understand
Two stages
- Stage 1) Multimodal input encoding
  - Each modality is passed through an encoder ( ImageBind in the paper )
- Stage 2) LLM-centric alignment
  - An input projection maps the encoder embeddings into the LLM's representation space
    ( so that the LLM can understand them )
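A minimal sketch of these two stages, assuming illustrative dimensions and a simple linear projection; the encoder here is just a placeholder for the frozen ImageBind encoder:

```python
import torch
import torch.nn as nn

D_ENC, D_LLM = 1024, 4096            # illustrative sizes, not the paper's exact values

encoder = nn.Identity()              # placeholder for the frozen modality encoder
input_proj = nn.Linear(D_ENC, D_LLM) # trainable input projection ( LLM-centric alignment )

image_feats = encoder(torch.randn(1, 16, D_ENC))  # Stage 1) 16 feature vectors from the encoder
llm_ready = input_proj(image_feats)               # Stage 2) tokens the frozen LLM can consume
print(llm_ready.shape)                            # torch.Size([1, 16, 4096])
```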
(2) LLM Understanding and Reasoning Stage
The LLM yields:
- a) A text response
- b) Signal tokens that instruct the generation of the other modalities
\(\rightarrow\) The LLM output can therefore contain multiple parts, e.g.:
....<IMG1>....<AUD1>....<VID1>....
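As an illustration of how such an output could be consumed downstream, here is a minimal sketch that separates the text response from the modality requests; the token names and regex follow the illustration above, not the paper's exact token vocabulary:

```python
import re

# Hypothetical signal-token pattern matching the illustration above (<IMG1>, <AUD1>, <VID1>, ...).
SIGNAL = re.compile(r"<(IMG|AUD|VID)(\d+)>")

def split_llm_output(llm_output: str):
    """Separate the plain text response from the modality-generation requests."""
    requests = [(m.group(1), int(m.group(2))) for m in SIGNAL.finditer(llm_output)]
    text = SIGNAL.sub("", llm_output)
    return text, requests

text, requests = split_llm_output("Here is a sunset photo <IMG1> with ocean sounds <AUD1>.")
print(text)      # "Here is a sunset photo  with ocean sounds ."
print(requests)  # [('IMG', 1), ('AUD', 1)]
```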
(3) Multimodal Generation Stage
Goal: Generate the final output for each requested modality ( based on the LLM response )
Two stages
- Stage 1) Instruction-following alignment
  - (For non-text modality outputs) The LLM output is passed through small transformer-based output projection models
  - Why?
    - To convert the LLM outputs
    - into representations that the modality-specific decoders can process
- Stage 2) Multimodal output generation
  - Generate the output for each modality
  - Model: modality-specific diffusion decoders ( Stable Diffusion for image, AudioLDM for audio, Zeroscope for video )
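A minimal sketch of this generation-side projection, with illustrative sizes and a toy two-layer transformer standing in for the output projection; the frozen diffusion decoder itself is not instantiated:

```python
import torch
import torch.nn as nn

D_LLM, D_COND = 4096, 768   # illustrative: LLM hidden size -> diffusion conditioning size

# Stage 1) Instruction-following alignment: small trainable transformer-based output projection.
output_proj = nn.Sequential(
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=D_LLM, nhead=8, batch_first=True), num_layers=2
    ),
    nn.Linear(D_LLM, D_COND),
)

# Hidden states corresponding to the LLM's image signal tokens ( e.g. <IMG1> ... ).
signal_hidden = torch.randn(1, 4, D_LLM)
conditioning = output_proj(signal_hidden)

# Stage 2) Multimodal output generation: `conditioning` would condition a frozen
# modality-specific diffusion decoder ( not instantiated here ).
print(conditioning.shape)   # torch.Size([1, 4, 768])
```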
4. Efficient Training
Encoders, the LLM, and the diffusion decoders all stay frozen; only the input / output projection layers ( plus LoRA weights during instruction tuning ) are trained \(\rightarrow\) only a very small portion ( around 1% ) of the total weights is updated.
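A minimal PyTorch sketch of this freezing scheme, using toy linear layers with made-up sizes in place of the real encoder, LLM, and decoder:

```python
import torch.nn as nn

# Toy stand-ins with illustrative sizes.
encoder     = nn.Linear(1024, 1024)   # frozen modality encoder
llm         = nn.Linear(4096, 4096)   # frozen LLM backbone
decoder     = nn.Linear(768, 768)     # frozen diffusion decoder
input_proj  = nn.Linear(1024, 4096)   # trainable
output_proj = nn.Linear(4096, 768)    # trainable

for frozen in (encoder, llm, decoder):
    for p in frozen.parameters():
        p.requires_grad = False

total = trainable = 0
for m in (encoder, llm, decoder, input_proj, output_proj):
    for p in m.parameters():
        total += p.numel()
        trainable += p.numel() if p.requires_grad else 0
print(f"trainable fraction: {trainable / total:.2%}")
```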
5. Examples
6. Training NExT-GPT
(1) Lightweight Multimodal Alignment Learning
How are the input and output projection models trained?
- a) Input projection (= LLM-centric alignment)
- b) Output projection (= Instruction-following alignment)
a) Encoding-side LLM-centric Multimodal Alignment
(1) Model: input projection
(2) Dataset:
- (image, text caption)
- (audio, text caption)
- (video, text caption)
(3) Procedure
- Step 1) Feed the non-text input to its encoder \(\rightarrow\) representations
- Step 2) The representations are fed to the input projection model
  - Yields representations aligned with the LLM's input space
- Step 3) Loss function: compare the LLM's generated text with the ground-truth “text caption”
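A minimal sketch of this objective, where a single frozen linear layer stands in for the LLM and its LM head, and all sizes / data are toy placeholders:

```python
import torch
import torch.nn as nn

D_ENC, D_LLM, VOCAB = 1024, 4096, 32000              # illustrative sizes

input_proj = nn.Linear(D_ENC, D_LLM)                 # trainable input projection
llm_head   = nn.Linear(D_LLM, VOCAB)                 # frozen stand-in for the LLM + LM head
for p in llm_head.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(input_proj.parameters(), lr=1e-4)

# One toy (image, caption) pair: 16 encoder features, 16 caption token ids.
image_feats = torch.randn(1, 16, D_ENC)              # Step 1) output of the frozen encoder
caption_ids = torch.randint(0, VOCAB, (1, 16))

aligned = input_proj(image_feats)                    # Step 2) LLM-aligned representations
logits  = llm_head(aligned)                          # stand-in for LLM caption generation
loss = nn.functional.cross_entropy(logits.view(-1, VOCAB), caption_ids.view(-1))  # Step 3)
loss.backward()                                      # gradients reach only the input projection
optimizer.step()
```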
b) Decoding-side Instruction-following Alignment
The same ( modality, text caption ) pairs are used as in the encoding-side step. Notably, no actual image, audio, or video generation is performed here.
Procedure
- Step 1) The LLM outputs a response containing modality signal tokens
- Step 2) The signal-token outputs are passed through the output projection model
- Step 3) Loss function: compare against the encoding obtained by feeding the caption to the text encoder ( of the diffusion model )
\(\rightarrow\) Only the text encoder of the diffusion model is needed
( no need to run the diffusion process )
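A minimal sketch of this loss, where an embedding table stands in for the frozen diffusion text encoder ( e.g. the CLIP text encoder of Stable Diffusion ) and an MSE loss stands in for the paper's alignment loss; all sizes are illustrative:

```python
import torch
import torch.nn as nn

D_LLM, D_COND, VOCAB = 4096, 768, 32000       # illustrative sizes

output_proj  = nn.Linear(D_LLM, D_COND)       # trainable ( the real one is a small transformer )
text_encoder = nn.Embedding(VOCAB, D_COND)    # frozen stand-in for the diffusion text encoder
for p in text_encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(output_proj.parameters(), lr=1e-4)

signal_hidden = torch.randn(1, 4, D_LLM)      # Steps 1-2) LLM outputs for the signal tokens
caption_ids   = torch.randint(0, VOCAB, (1, 4))  # the ground-truth text caption

pred   = output_proj(signal_hidden)           # projected conditioning representation
target = text_encoder(caption_ids)            # frozen text-encoder encoding of the caption
loss = nn.functional.mse_loss(pred, target)   # Step 3) alignment loss ( no diffusion run )
loss.backward()
optimizer.step()
```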
(2) Modality-switching Instruction Tuning (MosIT)
Goal: To follow instructions that include inputs from multiple modalities
Input: Dialogue data whose turns mix multiple modalities
How? Continue training:
- (1) The trainable components mentioned before ( input / output projections )
- (2) LoRA weights added to the LLM
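A from-scratch sketch of a LoRA-augmented linear layer to illustrate point (2); ranks and sizes are illustrative, and in practice a library such as PEFT would be applied to the actual LLM:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update ( W + B @ A )."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # the original LLM weight stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096))                # e.g. one projection inside the LLM
out = layer(torch.randn(1, 16, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)                              # only the low-rank A and B are trainable
```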
7. Experiments
- Performance: rated by human evaluators ( on a scale of 1 to 10 )