Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Chen, Xiaokang, et al. "Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling." arXiv preprint arXiv:2501.17811 (2025).

References:

  • https://aipapersacademy.com/janus-pro/
  • https://arxiv.org/pdf/2411.07975
  • https://arxiv.org/pdf/2501.17811
  • https://github.com/deepseek-ai/Janus/blob/main/janus_pro_tech_report.pdf


Contents

  1. Introduction
  2. UNIFIED multimodal understanding & generation
    1. Two types of tasks
    2. Unifying two tasks
  3. Architecture (Janus & Janus Pro)
    1. Main Design Principle
    2. Image Encoders
    3. LLM Processing & Outputs
    4. Details: Rectified Flow
  4. Training Process
    1. Stage 1: Adaptation
    2. Stage 2: Unified Pre-Training
    3. Stage 3: Supervised Fine-Tuning
  5. Experiments


1. Introduction

DeepSeek

  • DeepSeek-R1: LLM
  • DeepSeek Janus-Pro: Multimodal AI model


Two versions

  • v1) “JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation.”
  • v2) “Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling.”


2. UNIFIED multimodal understanding & generation

Large Language Models (LLMs)

Multimodal Large Language Models (MLLMs)

  • e.g., LLaVA
  • Can be fed both a (1) text prompt and (2) an image


(1) Two types of tasks

  • Task 1) Image understanding
  • Task 2) Image generation

[figure]


(2) Unifying two tasks

a) Unified model

[figure]

Unifying these two tasks into a single model!

\(\rightarrow\) Janus model

  • Not the first attempt to achieve this unification…
  • But why is it more successful?


b) Ex) Image understanding (by Janus Pro)

[figure]

  • Q) [Image] Asked about the background story of a cake
  • A)
    • Detects that the cake theme is Tom and Jerry
    • Provides its background story
  • Key point: Not only (1), but also (2)
    • (1) Understand the image
    • (2) Leverages its backbone to provide information beyond the image’s scope
      • with general-purpose knowledge embedded in the LLM


3. Architecture (Janus & Janus Pro)

[figure]

Core of the model = LLM (Autoregressive Transformer)


(1) Main Design Principle

(Previous) unified models

  • Use a single image encoder (regardless of the task)


Janus Pro

  • Finding: The encodings needed for the two task types are different

    ( \(\because\) sharing one encoder causes task interference )

  • Solution: Decouple visual encoding for ..

    • (a) understanding
    • (b) generation

    \(\rightarrow\) Use different encoders for each type of task!


(2) Image Encoders

[figure]

a) Encoder 1: for image understanding

  • Step 1) Encoder: SigLIP
    • An improved CLIP-style encoder (trained with a sigmoid loss instead of CLIP’s softmax contrastive loss)
    • Extracts semantic representations with SigLIP
  • Step 2) Mapping
    • Linearly mapped to the input space of the LLM


b) Encoder 2: for image generation

  • Step 1) Encoder: VQ tokenizer from LlamaGen
    • LlamaGen is an autoregressive image generation model
    • Its vector-quantization (VQ) tokenizer converts an image into a list of discrete token IDs
  • Step 2) Mapping
    • Mapped to the input space of the LLM using a trained module
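
To make the decoupling concrete, here is a minimal PyTorch sketch of the two encoding paths. `siglip` and `vq_tokenizer` are stand-in stubs for the real SigLIP encoder and the LlamaGen VQ tokenizer, and all module names and dimensions are assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class DualVisualEncoding(nn.Module):
    """Sketch of Janus-style decoupled visual encoding (hypothetical names)."""

    def __init__(self, siglip, vq_tokenizer, llm_dim=2048, siglip_dim=1024, codebook_size=16384):
        super().__init__()
        self.siglip = siglip                                    # understanding encoder
        self.vq_tokenizer = vq_tokenizer                        # generation tokenizer: image -> discrete IDs
        self.und_adaptor = nn.Linear(siglip_dim, llm_dim)       # linear map into the LLM input space
        self.gen_embed = nn.Embedding(codebook_size, llm_dim)   # trained mapping for VQ token IDs

    def encode_for_understanding(self, image):
        feats = self.siglip(image)                # (B, N, siglip_dim) patch features
        return self.und_adaptor(feats)            # (B, N, llm_dim)

    def encode_for_generation(self, image):
        ids = self.vq_tokenizer(image)            # (B, M) discrete token IDs
        return self.gen_embed(ids)                # (B, M, llm_dim)
```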


(3) LLM Processing & Outputs

LLM Processing

  • Concatenate “text” & “image” embeddings

    \(\rightarrow\) Becomes the input sequence to the LLM


Outputs for ..

  • (1) Image understanding
    • Generated using the LLM’s built-in prediction head
  • (2) Image generation
    • A separate head is added on top of the LLM’s last hidden states to predict image tokens
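
Continuing the assumptions above, a minimal sketch of how the unified sequence and the two heads could fit together (the `llm` callable and head names are hypothetical):

```python
import torch
import torch.nn as nn

class UnifiedHeads(nn.Module):
    """Sketch of the unified input sequence + two output heads."""

    def __init__(self, llm, llm_dim=2048, text_vocab=102400, image_vocab=16384):
        super().__init__()
        self.llm = llm                                      # autoregressive transformer; assumed to
                                                            # return last hidden states (B, T, llm_dim)
        self.text_head = nn.Linear(llm_dim, text_vocab)     # built-in next-text-token head
        self.image_head = nn.Linear(llm_dim, image_vocab)   # added next-image-token head

    def forward(self, text_emb, image_emb, mode="understand"):
        seq = torch.cat([text_emb, image_emb], dim=1)       # one concatenated input sequence
        hidden = self.llm(inputs_embeds=seq)                # (B, T_text + T_img, llm_dim)
        head = self.text_head if mode == "understand" else self.image_head
        return head(hidden)                                 # logits over the chosen vocabulary
```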


(4) Details: Rectified Flow

How is image generation performed?

\(\rightarrow\) Rectified flow method

  • Used by JanusFlow (v1); Janus-Pro (v2) instead predicts discrete VQ image tokens autoregressively

  • Similar in spirit to diffusion models

  • Learns (near-)straight paths from noise to data,

    which significantly reduces the number of sampling steps needed to reach a clear image
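
As a refresher, the standard rectified-flow formulation (generic notation, not necessarily the paper’s): with noise \(x_0 \sim \mathcal{N}(0, I)\) and a data sample \(x_1\), a velocity field \(v_\theta\) is regressed onto the constant displacement along straight interpolation paths:

\[
x_t = (1 - t)\,x_0 + t\,x_1, \qquad t \sim \mathcal{U}[0, 1]
\]

\[
\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2
\]

Sampling integrates \(\mathrm{d}x_t = v_\theta(x_t, t)\,\mathrm{d}t\) from \(t = 0\) to \(t = 1\) with a few Euler steps \(x_{t+\Delta t} = x_t + \Delta t\,v_\theta(x_t, t)\); because the learned paths are nearly straight, far fewer steps suffice than in typical diffusion sampling.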


4. Training Process

[figure]

(Applies to both Janus & Janus Pro)

Both are trained in 3 stages!


(1) Stage 1: Adaptation

Goal: Adapt the new modules to work properly with the pre-trained components

Details:

  • (1) Freeze: LLM & Image encoders
  • (2) Tune: Newly introduced components
    • (1) Linear adaptors: Map the encoded images into the LLM input space
    • (2) Image generation head (in JanusFlow: the generation encoder \(g_{\text{enc}}\) and decoder)
  • (3) Dataset: ImageNet
  • (4) Janus vs. Janus Pro
    • Training steps on ImageNet are increased for Janus Pro
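
A minimal sketch of what such a freeze/tune split looks like in PyTorch, reusing the hypothetical module names from the earlier sketches (the real codebase differs):

```python
def configure_stage1(model):
    # Freeze everything: the LLM backbone and both image encoders stay fixed.
    for p in model.parameters():
        p.requires_grad = False
    # Unfreeze only the newly introduced components.
    for module in (model.und_adaptor, model.gen_embed, model.image_head):
        for p in module.parameters():
            p.requires_grad = True
    # Return only the trainable parameters for the optimizer.
    return [p for p in model.parameters() if p.requires_grad]

# e.g., optimizer = torch.optim.AdamW(configure_stage1(model), lr=1e-4)
```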


(2) Stage 2: Unified Pre-Training

Goal: Continue to train the new modules, but now we also train the LLM and its built-in text prediction head

Details

  • (1) Janus vs. Janus Pro

    • Removal of ImageNet data from this stage (in Janus Pro)
      • Janus Pro: Text-to-image data is utilized directly
      • Janus: Starts with ImageNet data and gradually increases the ratio of text-to-image data
  • (2) Others

    • Note that the understanding encoder’s representations are aligned during training with intermediate features of the generation pathway (representation alignment)

      \(\rightarrow\) To strengthen semantic coherence in the generation process!
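
The report describes this alignment only at a high level; the sketch below shows one generic way such a representation-alignment regularizer is commonly implemented (a cosine loss between projected generation features and detached understanding features). All names and shapes are assumptions:

```python
import torch.nn.functional as F

def representation_alignment_loss(gen_hidden, und_feats, proj):
    """Alignment sketch (hypothetical shapes and projection):
    gen_hidden: intermediate features from the generation pathway, (B, N, D_g)
    und_feats:  understanding-encoder features of the same image,  (B, N, D_u)
    proj:       trainable linear projection from D_g to D_u."""
    pred = proj(gen_hidden)                                       # (B, N, D_u)
    cos = F.cosine_similarity(pred, und_feats.detach(), dim=-1)   # per-token agreement
    return (1.0 - cos).mean()                                     # pull representations together
```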


(3) Stage 3: Supervised Fine-Tuning

Goal: Fine-tune on the instruction-tuning dataset

Details

  • (1) Dataset: instruction tuning dataset
    • Comprises dialogues and high-quality text-to-image samples
  • (2) Others
    • Image understanding encoder is also trained
  • (3) Janus vs. Janus Pro
    • Largely the same procedure (Janus Pro mainly adjusts the data-mixture ratio)
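
For concreteness, here is what the two kinds of Stage-3 samples might look like as records; the field names below are purely illustrative assumptions, not the paper’s actual schema:

```python
# Dialogue sample for image understanding.
understanding_sample = {
    "type": "dialogue",
    "image": "cake.jpg",
    "conversation": [
        {"role": "user", "content": "<image> What is the story behind this cake?"},
        {"role": "assistant", "content": "The cake is themed after Tom and Jerry, ..."},
    ],
}

# High-quality text-to-image sample.
generation_sample = {
    "type": "text-to-image",
    "prompt": "A birthday cake decorated with Tom and Jerry figurines",
    "image": "cake_target.jpg",   # target image, encoded to VQ IDs for the loss
}
```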


[LLM size] Janus vs. Janus Pro

  • Janus: 1.5B
  • Janus Pro: 7B


5. Experiments

[figures]