TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

Yuan, Zhengqing, et al. "Tinygpt-v: Efficient multimodal large language model via small backbones." arXiv preprint arXiv:2312.16862 (2023).

References:

  • https://aipapersacademy.com/tinygpt-v/
  • https://arxiv.org/pdf/2312.16862


Contents

  1. Introduction
  2. Model Architecture
    1. LLM Backbone
    2. Processing Images with Phi-2
    3. Trainable Params.
  3. Training Process

  4. Experiments


1. Introduction

TinyGPT-V

  • A new multimodal large language model (MLLM)
  • with Small Backbones


Motivation

  • Tremendous progress with LLMs
    • e.g., GPT-4, LLaMA2, …
  • Vision-language models (VLMs)

    • LLMs extended to understand images

    • e.g., GPT-4V, LLaVA, MiniGPT-4

\(\rightarrow\) Limitation: these models require a substantial amount of compute resources to run


2. Model Architecture

(Figure 2 of the paper: TinyGPT-V model architecture)


(1) LLM Backbone

LLM backbone = Phi-2 model

  • 2.7B params (yet it outperforms much larger models)
  • Phi-2 (2.7B) accounts for most of TinyGPT-V's parameters (2.8B total)
  • Naturally, it can handle text inputs out of the box (a minimal loading sketch follows)
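
For reference, a minimal sketch of loading the Phi-2 backbone with Hugging Face Transformers; the model id `microsoft/phi-2` is an assumption of this sketch, not something stated in the post:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
phi2 = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")  # ~2.7B parameters

# Plain text works out of the box; images need the visual pipeline described next.
prompt = tokenizer("Describe a cat in one sentence.", return_tensors="pt")
output_ids = phi2.generate(**prompt, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```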

\(\rightarrow\) But how can it handle image inputs?


(2) Processing Images with Phi-2

Two stages are required to handle image inputs (a minimal code sketch follows the list below)

  • Stage 1) Extract visual features
    • 1-1) Pass through visual encoder (EVA ViT)
    • 1-2) Pass through pre-trained Q-Former (from BLIP-2)
      • Q-Former: A component that is trained to align the visual features from the ViT with the text instruction!
  • Stage 2) Projection
    • 2-1) MiniGPT-4 Projection
    • 2-2) Linear Projection
      • Converts the output dimension of the MiniGPT-4 projection to Phi-2's hidden size
    • 2-3) Feed the resulting visual tokens to Phi-2
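
A minimal PyTorch sketch of this two-stage image path. The frozen EVA ViT, the BLIP-2 Q-Former, and Phi-2 are replaced by stand-in modules, and the feature dimensions (1408 / 768 / 4096 / 2560) are illustrative assumptions rather than values taken from this post:

```python
# Hedged sketch of the two-stage image path (not the authors' implementation).
import torch
import torch.nn as nn

class VisualPath(nn.Module):
    def __init__(self, vit_dim=1408, qformer_dim=768, minigpt4_dim=4096, phi2_dim=2560):
        super().__init__()
        self.vit = nn.Identity()                                    # stand-in for the frozen EVA ViT
        self.q_former = nn.Linear(vit_dim, qformer_dim)             # stand-in for the BLIP-2 Q-Former
        self.minigpt4_proj = nn.Linear(qformer_dim, minigpt4_dim)   # MiniGPT-4 projection
        self.phi2_proj = nn.Linear(minigpt4_dim, phi2_dim)          # linear projection to Phi-2's hidden size

    def forward(self, patch_features):
        v = self.vit(patch_features)       # Stage 1-1: visual encoder
        v = self.q_former(v)               # Stage 1-2: align visual features
        v = self.minigpt4_proj(v)          # Stage 2-1: MiniGPT-4 projection
        return self.phi2_proj(v)           # Stage 2-2: match Phi-2's embedding size

# A dummy batch of 32 visual feature vectors -> soft "visual tokens" for Phi-2 (Stage 2-3)
visual_tokens = VisualPath()(torch.randn(1, 32, 1408))
print(visual_tokens.shape)  # torch.Size([1, 32, 2560])
```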


(3) Trainable Params.

(Figure 2 of the paper: the trainable parameters of TinyGPT-V)
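
A hedged sketch of selecting the small set of trainable parameters: freeze the pretrained backbones and keep only the new projection layers (and, from Stage 2 on, the LoRA weights) trainable. The keyword-based helper and the module names are illustrative, not the authors' code:

```python
import torch.nn as nn

def mark_trainable(model: nn.Module, keywords=("proj", "lora")) -> None:
    """Freeze every parameter except those whose name contains one of the keywords."""
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in keywords)

# Toy example: only the projection layer stays trainable, the "backbone" is frozen.
toy = nn.ModuleDict({"backbone": nn.Linear(8, 8), "proj": nn.Linear(8, 4)})
mark_trainable(toy)
print([n for n, p in toy.named_parameters() if p.requires_grad])  # ['proj.weight', 'proj.bias']
```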


3. Training Process

(Figure 2 of the paper: the multi-stage training process)

Step 1) Warm-up stage

  • Data: image-text pairs
  • Goal: enable Phi-2 to process images (a toy sketch of the objective follows)
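
A toy sketch of the warm-up objective, using dummy dimensions: projected image features are prepended to the caption embeddings and the usual next-token (cross-entropy) loss is applied, so only the projection receives gradients while everything else stays frozen. None of this is the authors' code:

```python
import torch
import torch.nn as nn

phi2_dim, vocab = 2560, 51200           # assumed Phi-2 hidden / vocab sizes
proj = nn.Linear(768, phi2_dim)         # trainable projection (stand-in)
lm_head = nn.Linear(phi2_dim, vocab)    # stand-in for the frozen Phi-2 + LM head
for p in lm_head.parameters():
    p.requires_grad = False

visual = proj(torch.randn(1, 32, 768))             # projected image features
caption_emb = torch.randn(1, 8, phi2_dim)          # embedded caption tokens (dummy)
hidden = torch.cat([visual, caption_emb], dim=1)   # [visual tokens | text tokens]
logits = lm_head(hidden)

targets = torch.randint(0, vocab, (1, hidden.size(1)))
loss = nn.functional.cross_entropy(logits.transpose(1, 2), targets)
loss.backward()                                    # gradients only reach `proj`
print(loss.item())
```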


Step 2) Pre-training

  • Data: same as Step 1
  • Difference with step 1:
    • LoRA weights are added (and updated); a `peft`-based sketch follows
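
A hedged sketch of Stage 2's difference, using the `peft` library to attach LoRA adapters to the frozen Phi-2 backbone; the rank, alpha, and `target_modules` names below are illustrative assumptions, not values reported in the paper:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

phi2 = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
lora_cfg = LoraConfig(
    r=16,                                            # low-rank dimension (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],   # assumed attention projection names
    task_type="CAUSAL_LM",
)
phi2 = get_peft_model(phi2, lora_cfg)
phi2.print_trainable_parameters()                    # only the LoRA weights are updated
```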


Step 3) Instruction Learning

  • Dataset: Instructions
    • Examples from MiniGPT-4 data
  • Learnable params: same as Step 2 (an example prompt format is sketched below)
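
One plausible way to turn an (image, instruction, answer) triple into a training prompt, loosely following the MiniGPT-4 prompt style; the template string and placeholder tokens are assumptions, not taken from the paper:

```python
def build_instruction_sample(instruction: str, answer: str) -> str:
    """Format one instruction-tuning sample; <ImageHere> marks where the visual tokens go."""
    prompt = f"###Human: <Img><ImageHere></Img> {instruction} ###Assistant: "
    return prompt + answer

print(build_instruction_sample("Describe this image in detail.",
                               "A black cat is sleeping on a red sofa."))
```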


Step 4) Multi-task Learning

  • Trained on multiple datasets covering various vision-language tasks (a toy dataset-mixing sketch follows)
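
A toy sketch of Stage 4's dataset mixing: batches are sampled from several task-specific datasets according to fixed weights. The dataset names and weights below are purely illustrative:

```python
import random

datasets = {
    "captioning": ["cap_sample_1", "cap_sample_2"],
    "vqa": ["vqa_sample_1", "vqa_sample_2"],
    "grounding": ["ground_sample_1"],
}
weights = {"captioning": 0.4, "vqa": 0.4, "grounding": 0.2}

def sample_batch(batch_size: int = 4) -> list:
    """Draw a mixed batch by first picking a task, then a sample from that task."""
    names = list(datasets)
    probs = [weights[n] for n in names]
    return [random.choice(datasets[random.choices(names, probs)[0]])
            for _ in range(batch_size)]

print(sample_batch())
```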


4. Experiments

(See the results figures/tables in the paper for benchmark comparisons)