NExT-GPT: Any-to-Any Multimodal LLM

Wu, Shengqiong, et al. "Next-gpt: Any-to-any multimodal llm." arXiv preprint arXiv:2309.05519 (2023).

참고:

  • https://aipapersacademy.com/next-gpt/
  • https://arxiv.org/pdf/2309.05519


Contents

  1. Introduction
  2. Framework
  3. Three Stages
    1. Multimodal Encoding Stage
    2. LLM Understanding and Reasoning Stage
    3. Multimodal Generation Stage
  4. Efficient Training
  5. Examples
  6. Training NExT-GPT
    1. Lightweight Multimodal Alignment Learning
    2. Modality-switching Instruction Tuning (MosIT)
  7. Experiments


1. Introduction

Multimodal large language model (MM-LLM)

figure2


2. Framework

figure2

Framework

  • (1) Input: Possible input modalities: text, image, audio, video ….

  • (2) LLM ( = core of the framework )

  • Goal of LLM

    • Process the input from “all modalities”
    • Guide the generation of outputs for “all modalities”
  • But, current LLM can only understand and generate text.

    \(\rightarrow\) How to solve?


3. Three Stages

3 main tiers

  • (1) Multimodal Encoding Stage
  • (2) LLM Understanding and Reasoning Stage
  • (3) Multimodal Generation Stage


(1) Multimodal Encoding Stage

Goal: Convert (non-text) inputs \(\rightarrow\) text prompts

Two stage

  • Stage 1) Multimodal input encoding

    • Each modality: through each encoder
  • Stage 2) LLM-centric alignment

    • Input projection to generate text from the embeddings

      ( for LLM to understand )


(2) LLM Understanding and Reasoning Stage

LLM yields ..

  • a) Text response
  • b) Instructions for the other modalities generation

\(\rightarrow\) LLM output can contain multiple parts!

....<IMG1>....<AUD1>....<VID1>....


(3) Multimodal Generation Stage

Goal: Generating the final output for all modalities (based on the LLM response)

Two stages

  • Stage 1) Instruction-following alignment
    • (For non-text modality output) Output is passed via small transformer-based models
    • Why?
      • To convert the LLM outputs
      • into representations that can be processed by the modalities decoders
  • Stage 2) Multimodal output generation
    • Generate the output for each modality
    • Model: Modality-specific diffusion decoder


4. Efficient Training

Only need to train a very small portion of the weights.

figure2

figure2


5. Examples

figure2

figure2


6. Training NExT-GPT

(1) Lightweight Multimodal Alignment Learning

How the input and output projection models are trained,?

  • a) Input projection (= LLM-centric alignment)
  • b) Output projection (= Instruction-following alignment)


a) Encoding-side LLM-centric Multimodal Alignment

figure2

(1) Model: input projection

(2) Dataset:

  • (image, text caption)
  • (audio, text caption)
  • (video, text caption)

(3) Procedure

  • Step 1) Feed the non-text input to the encoder \(\rightarrow\) representations
  • Step 2) Representation is fed to input projection model
    • Yields an aligned representation for the LLM
  • Step 3) Loss function: Comparison with the “text caption”


b) Decoding-side Instruction-following Alignment

figure2

Next we review the decoding-side instruction-following alignment step, where we still use similar captioned inputs as before. We do not use any generation of image, audio and video here which is very interesting. The way it works is that the

Procedure

  • Step 1) LLM outputs a response with a signal token

  • Step 2) Output is passed via the output projection model

  • Step 3) Loss function: Comparison with the encoding obtained by feeding the caption to the text encoder (of the diffusion model)

    \(\rightarrow\) Only need the text encoder from the diffusion model

    ( No need to run the diffusion process )


(2) Modality-switching Instruction Tuning (MosIT)

figure2

Goal: To follow instructions that include inputs from multiple modalities

Input: Ddialogue inputs with multiple modalities

How? Keep on training …

  • (1) Trainable components mentioned before
  • (2) LoRA weights


7. Experiments

figure2

  • Performance: Given by human evaluators ( Scale of 1 to 10 )