DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

https://arxiv.org/pdf/2412.10302


Contents

  0. Abstract
  1. Introduction
  2. Model Architecture
    (1) Dynamic Tiling Strategy
    (2) Vision-Language Adaptor
    (3) DeepSeekMoE LLM


0. Abstract

DeepSeek-VL2

  • Advanced series of large MoE VLMs
  • DeepSeek-VL + two key upgrades
    • (1) [Vision] Dynamic tiling vision encoding strategy
      • To process high-resolution images with different aspect ratios
    • (2) [Language] DeepSeekMoE models with the MLA mechanism (feat. DeepSeek-V2)
      • Efficient inference and high throughput.

https://github.com/deepseek-ai/DeepSeek-VL2


figure2


1. Introduction

P1. Large VLMs

Extend the remarkable capabilities of LLMs to seamlessly process both visual and textual information


P2. Proposal: DeepSeek-VL2

Open-source VLMs

  • (1) Mixture-of-Experts (MoE) architecture
    • Improvements in both performance and efficiency (vs. DeepSeek-VL)
  • (2) Three key aspects
    • a) Dynamic, high-resolution vision encoding strategy
      • Enhances visual understanding
    • b) Optimized language model architecture
      • Significantly improves both training and inference efficiency
    • c) Refined vision-language data construction pipeline


P3. Component 1: Vision

Dynamic tiling vision encoding strategy

  • Efficiently processes high-resolution images of varying aspect ratios
  • Improves over DeepSeek-VL’s hybrid vision encoder
  • DeepSeek-VL vs. DeepSeek-VL2
    • (VL) Hybrid vision encoder = two fixed resolutions (384 × 384 and 1024 × 1024)
    • (VL2) Dynamic tiling vision encoding strategy
  • How?
    • Step 1) Dynamically segments high-resolution inputs into local tiles
    • Step 2) Processes each tile through a shared vision transformer
    • Step 3) Integrates the extracted features within the LLM
  • Result: Retains the advantages of ViTs with local attention (each 384 × 384 tile is processed independently, so attention cost stays manageable at high resolution)


P4. Component 2: Language

Multi-head Latent Attention (MLA) mechanism (feat. DeepSeek V2)

  • Significantly reduces computational cost by compressing the KV cache into a latent vector (see the sketch after this list)
  • Result: Faster inference & Increased throughput capacity
  • Details: Three MoE variants
    • 3B, 16B, and 27B (total params)
    • 0.57B, 2.4B, and 4.1B (activated params)
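
A minimal sketch of the latent-KV idea, assuming toy dimensions and omitting MLA's decoupled RoPE path; `LatentKVCache` and all sizes below are illustrative, not DeepSeek's released implementation.

```python
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    """Cache one compressed latent per token instead of full per-head keys/values."""
    def __init__(self, d_model=2048, n_heads=16, d_head=128, d_latent=512):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)        # compress hidden state
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct values
        self.cache = []                                                 # latents only

    def append(self, h):                     # h: (batch, d_model) for one new token
        self.cache.append(self.down_kv(h))   # store the compressed latent vector

    def expand(self):                        # rebuild K, V on demand at attention time
        c = torch.stack(self.cache, dim=1)   # (batch, seq, d_latent)
        b, t, _ = c.shape
        k = self.up_k(c).view(b, t, self.n_heads, self.d_head)
        v = self.up_v(c).view(b, t, self.n_heads, self.d_head)
        return k, v

# Per-token cache footprint: d_latent = 512 values here, versus
# 2 * n_heads * d_head = 4096 values for a conventional KV cache.
```

The throughput gain comes from the much smaller cache that has to be stored and read at every decoding step.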


P5. Component 3: Dataset

Enhance VL training data in terms of…

  • Quality, Quantity, Diversity

$\rightarrow$ Better generalization and performance across a broad spectrum of tasks!

  • e.g., Visual Question Answering (VQA), Optical Character Recognition (OCR), document/table/chart understanding, visual reasoning, and general chatbot applications.


2. Model Architecture

3 core modules

  • (1) Vision encoder
  • (2) Vision-language adaptor
  • (3) Mixture-of-Experts language model


Two major advancements

Building upon DeepSeek-VL's decoder-only, LLaVA-style architecture, DeepSeek-VL2 introduces … (see the pipeline sketch after this list)

  • (1) Dynamic tiling strategy
  • (2) DeepSeekMoE
    • Multi-head Latent Attention (MLA)
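
The decoder-only flow can be sketched as follows; all module names (`vision_encoder`, `adaptor`, `text_embed`, `moe_lm`) are placeholders for illustration, not the released DeepSeek-VL2 API.

```python
import torch

def vlm_forward(image_tiles, text_ids, vision_encoder, adaptor, text_embed, moe_lm):
    # (1) Vision encoder: each 384x384 tile -> a grid of visual embeddings.
    vis_feats = vision_encoder(image_tiles)             # (n_tiles, 729, 1152)
    # (2) Vision-language adaptor: compress tokens and project to the LM width.
    vis_tokens = adaptor(vis_feats)                     # (n_tiles, 196, d_model)
    vis_tokens = vis_tokens.flatten(0, 1).unsqueeze(0)  # (1, n_tiles * 196, d_model)
    # (3) MoE language model: visual tokens are placed before the text embeddings
    #     and the whole sequence is decoded autoregressively.
    txt_tokens = text_embed(text_ids)                   # (1, seq_len, d_model)
    return moe_lm(torch.cat([vis_tokens, txt_tokens], dim=1))
```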


figure2


(1) Dynamic Tiling Strategy

a) (Original DeepSeek-VL) Hybrid vision encoder

  • Coarse (1) + Fine-grained (2)

    • (1) SigLIP for coarse-grained feature extraction at 384 × 384 resolution
    • (2) SAM-B for fine-grained feature extraction at 1024 × 1024 resolution
  • Pros) Rich visual representations suitable for various vision-language tasks

  • Cons) Limited by the fixed 1024 × 1024 resolution constraint

    $\rightarrow$ Particularly challenging for processing images with larger resolutions and extreme aspect ratios


b) (Proposed DeepSeek-VL2) Dynamic tiling strategy

  • How? Splitting a high-resolution image into tiles
  • Effect? Efficient processing of high-resolution images with varying aspect ratios
  • Model: (Pretrained) single SigLIP-SO400M-384 vision encoder
  • Resolution & Ratios
    • (1) Base resolution = 384 × 384
    • (2) To accommodate different aspect ratios …
      • Set of candidate resolutions: $C_R = \{(m \cdot 384,\ n \cdot 384) \mid m \in \mathbb{N},\ n \in \mathbb{N},\ 1 \leq m, n, m \cdot n \leq 9\}$ (see the sketch below)
      • $m : n$ = aspect ratio
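
As a quick check of the definition above, a short snippet (my own, not from the repository) that enumerates $C_R$:

```python
BASE = 384  # base tile resolution

def candidate_resolutions(max_tiles=9):
    """All (m * 384, n * 384) with m, n >= 1 and m * n <= max_tiles."""
    return [(m * BASE, n * BASE)
            for m in range(1, max_tiles + 1)
            for n in range(1, max_tiles + 1)
            if m * n <= max_tiles]

print(len(candidate_resolutions()))  # 23 candidate resolutions under this constraint
```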


figure2


c) Details of Dynamic tiling strategy

  • Input = Image of size ( $H, W$ )

  • Step 1) Calculate the padding area

    • Required for resizing the image to each candidate resolution in $C_R$
  • Step 2) Select the resolution $\left(m_i \cdot 384,\ n_i \cdot 384\right)$ that minimizes the padding area

  • Step 3) Resize the image to the selected resolution

  • Step 4) Divide the resized image into $m_i \times n_i$ local tiles of $384 \times 384$ pixels (see the sketch below)

    ( + One global thumbnail tile )
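
A minimal sketch of Steps 1–4, assuming "padding area" means the unused canvas area after an aspect-preserving resize into each candidate resolution; the released implementation's exact resize/selection rules may differ (e.g., it may also weigh the effective resolution).

```python
BASE = 384
CANDIDATES = [(m, n) for m in range(1, 10) for n in range(1, 10) if m * n <= 9]  # grids of C_R

def select_grid(h, w):
    # Steps 1-2: choose the (m_i, n_i) grid whose (m_i*384) x (n_i*384) canvas wastes
    # the least area once the (h, w) image is scaled to fit inside it.
    best, best_pad = None, float("inf")
    for m, n in CANDIDATES:
        ch, cw = m * BASE, n * BASE
        scale = min(ch / h, cw / w)
        pad = ch * cw - int(h * scale) * int(w * scale)
        if pad < best_pad:
            best, best_pad = (m, n), pad
    return best

def tile_plan(h, w):
    m, n = select_grid(h, w)  # Step 3 would resize/pad the image to (m*384, n*384)
    return m, n, m * n + 1    # Step 4: m x n local tiles + 1 global thumbnail tile

print(tile_plan(1080, 1920))  # inspect the chosen grid for a Full-HD image
```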


For computational efficiency and context length management, we disable the dynamic tiling strategy when processing multiple (>2) images.

SigLIP-SO400M-384 vision encoder

  • Processes all ( $1+m_i \times n_i$ ) tiles!
  • Yields $27 \times 27 = 729$ visual embeddings of 1152 dimensions per tile


(2) Vision-Language Adaptor

2x2 pixel shuffle operation

  • To compress each tile’s visual tokens! (to 196 tokens per tile)
  • $27 \times 27 = 729$ tokens $\rightarrow$ $14 \times 14 = 196$ tokens (see the sketch below)
  • Add three special tokens when processing the ( $1+m_i \times n_i$ ) tiles
    • Refer to the figure
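
A rough sketch of the compression, assuming the 27 × 27 token grid is padded to 28 × 28 before the 2 × 2 shuffle (the handling of the odd grid size is not detailed here); in PyTorch terms the compressing, space-to-depth direction is `pixel_unshuffle`.

```python
import torch
import torch.nn.functional as F

def compress_tile(tile_feats):                                 # (729, 1152) tokens per tile
    x = tile_feats.view(1, 27, 27, 1152).permute(0, 3, 1, 2)   # (1, 1152, 27, 27)
    x = F.pad(x, (0, 1, 0, 1))                                 # assumed pad: 27 -> 28 per side
    x = F.pixel_unshuffle(x, downscale_factor=2)               # (1, 4608, 14, 14)
    return x.flatten(2).transpose(1, 2).squeeze(0)             # (196, 4608)

print(compress_tile(torch.randn(729, 1152)).shape)             # torch.Size([196, 4608])
```

The 2 × 2 shuffle trades spatial resolution for channel width (1152 → 4608), and the adaptor then projects these compressed tokens into the language model's embedding space.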


(3) DeepSeekMoE LLM

Based on DeepSeekMoE

  • Incorporates the MLA mechanism
    • Compressing the Key-Value cache into a latent vector
    • Enabling increased throughput capacity!
  • Incorporates an MoE architecture
    • Introduces a global bias term for each expert
    • To cost-effectively improve load balancing between experts (see the routing sketch below).
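
A hedged sketch of how a per-expert global bias can steer routing toward better load balance; the bias-update rule below is an assumption for illustration, not the exact procedure used in DeepSeek-VL2.

```python
import torch

def route_with_bias(scores, bias, top_k=2):
    # scores: (tokens, n_experts) router affinities; bias: (n_experts,) global bias.
    # The bias influences only which experts get selected ...
    topk_idx = (scores + bias).topk(top_k, dim=-1).indices
    # ... while the gating weights still come from the unbiased scores.
    gates = torch.gather(scores.softmax(dim=-1), 1, topk_idx)
    return topk_idx, gates

def update_bias(bias, topk_idx, n_experts, step=1e-3):
    # Assumed rule: lower the bias of overloaded experts and raise it for underloaded
    # ones, nudging future tokens toward experts that received less traffic.
    load = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    return bias - step * torch.sign(load - load.mean())

# Example usage with 8 tokens routed over 4 experts:
idx, gates = route_with_bias(torch.randn(8, 4), torch.zeros(4))
new_bias = update_bias(torch.zeros(4), idx, n_experts=4)
```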


figure2