TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

Yuan, Zhengqing, et al. "Tinygpt-v: Efficient multimodal large language model via small backbones." arXiv preprint arXiv:2312.16862 (2023).

References:

  • https://aipapersacademy.com/tinygpt-v/
  • https://arxiv.org/pdf/2312.16862


Contents

  1. Introduction
  2. Model Architecture
    1. LLM Backbone
    2. Processing Images with Phi-2
    3. Trainable Params.
  3. Training Process

  4. Experiments


1. Introduction

TinyGPT-V

  • A new multimodal large language model (MLLM)
  • with Small Backbones


Motivation

  • Tremendous progress with LLMs
    • e.g., GPT-4, LLaMA2, …
  • Vision-language models (VLMs)

    • LLMs extended to understand images

    • e.g., GPT-4V, LLaVA, MiniGPT-4

\(\rightarrow\) Limitation: these models require a substantial amount of compute resources to run


2. Model Architecture

(Figure 2 of the paper: TinyGPT-V model architecture)


(1) LLM Backbone

LLM backbone = Phi-2 model

  • 2.7B params (yet it outperforms much larger models)
  • Phi-2 (2.7B) accounts for most of TinyGPT-V's parameters (2.8B total)
  • Naturally, it can handle text inputs out of the box (a minimal loading sketch follows)
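
For reference, a minimal sketch of loading the Phi-2 backbone with Hugging Face Transformers; the model id `microsoft/phi-2` is an assumption of this sketch, not something stated in the post:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
phi2 = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")  # ~2.7B parameters

# Plain text works out of the box; images need the visual pipeline described next.
prompt = tokenizer("Describe a cat in one sentence.", return_tensors="pt")
output_ids = phi2.generate(**prompt, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```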

\(\rightarrow\) But how can it handle image inputs?


(2) Processing Images with Phi-2

Two stages are required to handle image inputs (a minimal code sketch follows the list below)

  • Stage 1) Extract visual features
    • 1-1) Pass through visual encoder (EVA ViT)
    • 1-2) Pass through pre-trained Q-Former (from BLIP-2)
      • Q-Former: A component that is trained to align the visual features from the ViT with the text instruction!
  • Stage 2) Projection
    • 2-1) MiniGPT-4 Projection
    • 2-2) Linear Projection
      • Converts the output dimension of the MiniGPT-4 projection to Phi-2's hidden size
    • 2-3) Feed the resulting visual tokens to Phi-2
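
A minimal PyTorch sketch of this two-stage image path. The frozen EVA ViT, the BLIP-2 Q-Former, and Phi-2 are replaced by stand-in modules, and the feature dimensions (1408 / 768 / 4096 / 2560) are illustrative assumptions rather than values taken from this post:

```python
# Hedged sketch of the two-stage image path (not the authors' implementation).
import torch
import torch.nn as nn

class VisualPath(nn.Module):
    def __init__(self, vit_dim=1408, qformer_dim=768, minigpt4_dim=4096, phi2_dim=2560):
        super().__init__()
        self.vit = nn.Identity()                                    # stand-in for the frozen EVA ViT
        self.q_former = nn.Linear(vit_dim, qformer_dim)             # stand-in for the BLIP-2 Q-Former
        self.minigpt4_proj = nn.Linear(qformer_dim, minigpt4_dim)   # MiniGPT-4 projection
        self.phi2_proj = nn.Linear(minigpt4_dim, phi2_dim)          # linear projection to Phi-2's hidden size

    def forward(self, patch_features):
        v = self.vit(patch_features)       # Stage 1-1: visual encoder
        v = self.q_former(v)               # Stage 1-2: align visual features
        v = self.minigpt4_proj(v)          # Stage 2-1: MiniGPT-4 projection
        return self.phi2_proj(v)           # Stage 2-2: match Phi-2's embedding size

# A dummy batch of 32 visual feature vectors -> soft "visual tokens" for Phi-2 (Stage 2-3)
visual_tokens = VisualPath()(torch.randn(1, 32, 1408))
print(visual_tokens.shape)  # torch.Size([1, 32, 2560])
```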


(3) Trainable Params.

(Figure 2 of the paper: the trainable parameters of TinyGPT-V)
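
A hedged sketch of selecting the small set of trainable parameters: freeze the pretrained backbones and keep only the new projection layers (and, from Stage 2 on, the LoRA weights) trainable. The keyword-based helper and the module names are illustrative, not the authors' code:

```python
import torch.nn as nn

def mark_trainable(model: nn.Module, keywords=("proj", "lora")) -> None:
    """Freeze every parameter except those whose name contains one of the keywords."""
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in keywords)

# Toy example: only the projection layer stays trainable, the "backbone" is frozen.
toy = nn.ModuleDict({"backbone": nn.Linear(8, 8), "proj": nn.Linear(8, 4)})
mark_trainable(toy)
print([n for n, p in toy.named_parameters() if p.requires_grad])  # ['proj.weight', 'proj.bias']
```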


3. Training Process

(Figure 2 of the paper: the multi-stage training process)

Step 1) Warm-up stage

  • Data: image-text pairs
  • Goal: enable Phi-2 to process images (a toy sketch of the objective follows)
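
A toy sketch of the warm-up objective, using dummy dimensions: projected image features are prepended to the caption embeddings and the usual next-token (cross-entropy) loss is applied, so only the projection receives gradients while everything else stays frozen. None of this is the authors' code:

```python
import torch
import torch.nn as nn

phi2_dim, vocab = 2560, 51200           # assumed Phi-2 hidden / vocab sizes
proj = nn.Linear(768, phi2_dim)         # trainable projection (stand-in)
lm_head = nn.Linear(phi2_dim, vocab)    # stand-in for the frozen Phi-2 + LM head
for p in lm_head.parameters():
    p.requires_grad = False

visual = proj(torch.randn(1, 32, 768))             # projected image features
caption_emb = torch.randn(1, 8, phi2_dim)          # embedded caption tokens (dummy)
hidden = torch.cat([visual, caption_emb], dim=1)   # [visual tokens | text tokens]
logits = lm_head(hidden)

targets = torch.randint(0, vocab, (1, hidden.size(1)))
loss = nn.functional.cross_entropy(logits.transpose(1, 2), targets)
loss.backward()                                    # gradients only reach `proj`
print(loss.item())
```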


Step 2) Pre-training

  • Data: same as Step 1
  • Difference with step 1:
    • LoRA weights are added (and updated); a `peft`-based sketch follows
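
A hedged sketch of Stage 2's difference, using the `peft` library to attach LoRA adapters to the frozen Phi-2 backbone; the rank, alpha, and `target_modules` names below are illustrative assumptions, not values reported in the paper:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

phi2 = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
lora_cfg = LoraConfig(
    r=16,                                            # low-rank dimension (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],   # assumed attention projection names
    task_type="CAUSAL_LM",
)
phi2 = get_peft_model(phi2, lora_cfg)
phi2.print_trainable_parameters()                    # only the LoRA weights are updated
```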


Step 3) Instruction Learning

  • Dataset: Instructions
    • Examples from MiniGPT-4 data
  • Learnable params: same as Step 2 (an example prompt format is sketched below)
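
One plausible way to turn an (image, instruction, answer) triple into a training prompt, loosely following the MiniGPT-4 prompt style; the template string and placeholder tokens are assumptions, not taken from the paper:

```python
def build_instruction_sample(instruction: str, answer: str) -> str:
    """Format one instruction-tuning sample; <ImageHere> marks where the visual tokens go."""
    prompt = f"###Human: <Img><ImageHere></Img> {instruction} ###Assistant: "
    return prompt + answer

print(build_instruction_sample("Describe this image in detail.",
                               "A black cat is sleeping on a red sofa."))
```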


Step 4) Multi-task Learning

  • Trained on multiple datasets covering various vision-language tasks (a toy dataset-mixing sketch follows)
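
A toy sketch of Stage 4's dataset mixing: batches are sampled from several task-specific datasets according to fixed weights. The dataset names and weights below are purely illustrative:

```python
import random

datasets = {
    "captioning": ["cap_sample_1", "cap_sample_2"],
    "vqa": ["vqa_sample_1", "vqa_sample_2"],
    "grounding": ["ground_sample_1"],
}
weights = {"captioning": 0.4, "vqa": 0.4, "grounding": 0.2}

def sample_batch(batch_size: int = 4) -> list:
    """Draw a mixed batch by first picking a task, then a sample from that task."""
    names = list(datasets)
    probs = [weights[n] for n in names]
    return [random.choice(datasets[random.choices(names, probs)[0]])
            for _ in range(batch_size)]

print(sample_batch())
```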


4. Experiments

(See the results figures/tables in the paper for benchmark comparisons)