Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Chen, Xiaokang, et al. "Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling." arXiv preprint arXiv:2501.17811 (2025).
References:
- https://aipapersacademy.com/janus-pro/
- https://arxiv.org/pdf/2411.07975
- https://arxiv.org/pdf/2501.17811
- https://github.com/deepseek-ai/Janus/blob/main/janus_pro_tech_report.pdf
Contents
- Introduction
- UNIFIED multimodal understanding & generation
- Two types of tasks
- Unifying two tasks
- Architecture (Janus & Janus Pro)
- Main Design Principle
- Image Encoders
- LLM Processing & Outputs
- Details: Rectified Flow
- Training Process
- Stage 1: Adaptation
- Stage 2: Unified Pre-Training
- Stage 3: Supervised Fine-Tuning
- Experiments
1. Introduction
DeepSeek
- DeepSeek-R1: LLM
- DeepSeek Janus-Pro: Multimodal AI model
Two versions
- v1) “JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation.”
- v2) “Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling.”
2. UNIFIED multimodal understanding & generation
Large Language Models (LLMs)
Multimodal Large Language Models (MLLMs)
- e.g., LLaVA
- Can feed the models both a (1) text prompt and an (2) image
(1) Two types of task
- Task 1) Image understanding
- Task 2) Image generation
(2) Unifying two tasks
a) Unified model
Unifying these two tasks into a single model!
\(\rightarrow\) Janus model
- Not the first attempt to achieve this unification…
- But why is it more successful?
b) Ex) Image understanding (by Janus Pro)
- Q) [Image] Asked about the background story of a cake
- A)
- Detects that the cake theme is Tom and Jerry
- Provides its background story
- Key point: Not only (1), but also (2)
- (1) Understand the image
- (2) Leverages its backbone to provide information beyond the image’s scope
- with general-purpose knowledge embedded in the LLM
3. Architecture (Janus & Janus Pro)
Core of the model = LLM (Autoregressive Transformer)
(1) Main Design Principle
(Previous) unified models
- Use a single image encoder (regardless of the task)
Janus Pro
- Finding: The encodings needed for each type of task are different
  ( \(\because\) task interference )
- Solution: Decouple visual encoding for ..
  - (a) understanding
  - (b) generation
\(\rightarrow\) Use a different encoder for each type of task!
(2) Image Encoders
a) Encoder 1: for image understanding
- Step 1) Encoder: SigLIP
- A CLIP-style image-text encoder from Google that improves on OpenAI’s CLIP by replacing the softmax contrastive loss with a sigmoid loss
- Extracts image representations with SigLIP
- Step 2) Mapping
- Linearly mapped to the input space of the LLM
b) Encoder 2: for image generation
- Step 1) Encoder: the VQ tokenizer from LlamaGen
- LlamaGen: an autoregressive image generation model
- Its vector-quantization (VQ) tokenizer converts an image into a list of discrete token IDs
- Step 2) Mapping
- Mapped to the input space of the LLM using a trained module
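To make the two decoupled visual paths above concrete, here is a minimal PyTorch-style sketch of how each encoder’s output could be mapped into the LLM’s input space. All names and dimensions (`UnderstandingPath`, `GenerationPath`, `D_MODEL`, the codebook size) are illustrative assumptions, not the actual Janus-Pro code.

```python
import torch
import torch.nn as nn

D_MODEL = 2048  # hypothetical LLM hidden size

class UnderstandingPath(nn.Module):
    """Image -> SigLIP features -> linear adaptor -> LLM input embeddings."""
    def __init__(self, siglip: nn.Module, siglip_dim: int = 1024):
        super().__init__()
        self.siglip = siglip                      # semantic vision encoder
        self.adaptor = nn.Linear(siglip_dim, D_MODEL)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = self.siglip(image)                # (B, N_patches, siglip_dim)
        return self.adaptor(feats)                # (B, N_patches, D_MODEL)

class GenerationPath(nn.Module):
    """Image -> VQ token IDs -> learned embedding table -> LLM input embeddings."""
    def __init__(self, vq_tokenizer: nn.Module, codebook_size: int = 16384):
        super().__init__()
        self.vq_tokenizer = vq_tokenizer          # converts an image to discrete IDs
        self.gen_embed = nn.Embedding(codebook_size, D_MODEL)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        ids = self.vq_tokenizer(image)            # (B, N_tokens) integer IDs
        return self.gen_embed(ids)                # (B, N_tokens, D_MODEL)
```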
(3) LLM Processing & Outputs
LLM Processing
- Concatenate “text” & “image” embeddings
  \(\rightarrow\) Becomes the input sequence to the LLM
Outputs for ..
- (1) Image understanding
- Generated using the LLM’s built-in prediction head
- (2) Image generation
- Another head is added to the LLM to consume its last hidden state
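A hedged sketch of the two output heads described above: text tokens come from the LLM’s built-in language-modeling head, while image tokens come from a separate head over the VQ codebook applied to the last hidden states. The vocabulary and codebook sizes below are placeholder assumptions.

```python
import torch
import torch.nn as nn

class DualHeadLLM(nn.Module):
    """Wraps an autoregressive transformer with a text head and an image-token head."""
    def __init__(self, transformer: nn.Module, d_model: int = 2048,
                 text_vocab: int = 102400, image_codebook: int = 16384):
        super().__init__()
        self.transformer = transformer
        self.text_head = nn.Linear(d_model, text_vocab)       # built-in LM head
        self.image_head = nn.Linear(d_model, image_codebook)  # added generation head

    def forward(self, input_embeds: torch.Tensor, mode: str = "understand"):
        hidden = self.transformer(input_embeds)   # (B, T, d_model) last hidden states
        if mode == "understand":
            return self.text_head(hidden)         # logits over the text vocabulary
        return self.image_head(hidden)            # logits over the VQ image codebook
```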
(4) Details: Rectified Flow
How is image generation performed (in JanusFlow, v1)?
\(\rightarrow\) Rectified flow method
- Similar to diffusion models
- Learns (near-)straight “shortcut” paths from noise to a clean image, which significantly reduces the number of steps needed to reach a clear image
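For intuition, here is a generic rectified-flow sampling loop (not JanusFlow’s actual sampler): a network \(v_\theta(x_t, t)\) predicts the velocity along a near-straight path from noise (\(t=0\)) to data (\(t=1\)), and a few Euler steps integrate it. The step count and tensor shapes are placeholders.

```python
import torch

@torch.no_grad()
def rectified_flow_sample(velocity_net, shape, num_steps=20, device="cpu"):
    """Integrate dx/dt = v_theta(x, t) from t=0 (noise) to t=1 (data) with Euler steps."""
    x = torch.randn(shape, device=device)          # start from pure Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        v = velocity_net(x, t)                     # predicted velocity at (x_t, t)
        x = x + v * dt                             # straight-line Euler update
    return x                                       # approximate clean latent/image
```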
4. Training Process
(Corresponds to both Janus & Janus Pro)
Both are trained in 3 stages!
(1) Stage 1: Adaptation
Goal: Adapt the new modules to work properly with the pre-trained components
Details:
- (1) Freeze: LLM & image encoders
- (2) Tune: Newly introduced components
  - (a) Linear layers: map the encoded images to the LLM input space
  - (b) Image generation head (the generation encoder \(g_{\text{enc}}\) in JanusFlow)
- (3) Dataset: ImageNet
- (4) Janus vs. Janus Pro
  - Training steps on ImageNet are increased for Janus Pro
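A minimal sketch of the Stage 1 setup under the description above: freeze the LLM and both image encoders, and optimize only the newly introduced modules. The function name, arguments, and learning rate are hypothetical.

```python
import torch

def configure_stage1(llm, und_encoder, gen_encoder, new_modules, lr=1e-4):
    """Freeze pre-trained parts; optimize only the newly introduced components."""
    for module in (llm, und_encoder, gen_encoder):
        for p in module.parameters():
            p.requires_grad = False               # keep pre-trained weights fixed

    trainable = [p for m in new_modules for p in m.parameters()]
    return torch.optim.AdamW(trainable, lr=lr)    # lr is an illustrative value
```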
(2) Stage 2: Unified Pre-Training
Goal: Continue to train the new modules, but now we also train the LLM and its built-in text prediction head
Details
- (1) Janus vs. Janus Pro
  - Janus Pro: ImageNet is removed from this stage; text-to-image data is used directly
  - Janus: Starts with ImageNet data and gradually increases the ratio of text-to-image data
- (2) Others
  - The image-encoder representations are aligned during training with the image-generation latent output (sketched below)
    \(\rightarrow\) To strengthen semantic coherence in the generation process!
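The alignment note above can be read as a REPA-style regularizer (as used in JanusFlow): intermediate features of the generation pathway are pushed toward the frozen understanding encoder’s semantic features of the same image. Below is a hedged sketch of such a loss; the projection layer and stop-gradient choice are assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def representation_alignment_loss(gen_hidden, und_feats, proj):
    """Encourage generation-path hidden states to match semantic encoder features.

    gen_hidden: (B, N, d_gen)  intermediate features from the generation pathway
    und_feats:  (B, N, d_und)  features from the (frozen) understanding encoder
    proj:       nn.Linear(d_gen, d_und) trainable projection
    """
    pred = proj(gen_hidden)                              # project into semantic space
    target = und_feats.detach()                          # no gradient into the frozen encoder
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
```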
(3) Stage 3: Supervised Fine-Tuning
Goal: Finetune on the instruction tuning dataset
Details
- (1) Dataset: instruction tuning dataset
- Comprises dialogues and high-quality text-to-image samples
- (2) Others
- Image understanding encoder is also trained
- (3) Janus vs. Janus Pro
- No difference
[LLM size] Janus vs. Janus Pro
- Janus: 1.5B
- Janus Pro: 7B