DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Guo, Daya, et al. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv preprint arXiv:2501.12948 (2025).
References:
- https://aipapersacademy.com/deepseek-r1/
- https://arxiv.org/abs/2501.12948
Contents
- Introduction
- Recap: LLMs Training Process
- Pretraining
- Supervised Fine-tuning
- Reinforcement Learning (RL)
- DeepSeek-R1-Zero
- Rule-based RL
- Experiments
- DeepSeek-R1
- Need for DeepSeek-R1
- Training Pipeline
- Experiments
1. Introduction
LLMs: Paved the way toward artificial general intelligence (AGI)
OpenAI-o1: Innovative inference-time scaling techniques
\(\rightarrow\) Significantly enhance reasoning capabilities (but closed source)
DeepSeek-R1
- (1) SOTA + Open-source reasoning model
- (2) Large-scale reinforcement learning techniques
2. Recap: LLMs Training Process
(1) Pre-training
- Pre-trained on vast amounts of text and code
  \(\rightarrow\) Learns general-purpose knowledge
- Task: Next-token prediction (NTP)
- However, with ONLY pre-training …
  \(\rightarrow\) Struggles to follow human instructions!
  \(\rightarrow\) Necessity of SFT
(2) Supervised Fine-tuning
- Fine-tuned on an instruction dataset
- Instruction dataset
  - Created either by humans, from an (original) dataset, or by a model ( = self-instruct )
  - Consists of instruction-response pairs ( response = label )
  \(\rightarrow\) Model becomes better at following instructions!
(3) Reinforcement Learning (RL)
- Further improved using feedback (via RL)
- Reinforcement Learning from Human Feedback (RLHF)
  - Humans provide the feedback
  - But gathering large-scale, high-quality human feedback, especially for complex tasks, is challenging! \(\rightarrow\) RLAIF
- Reinforcement Learning from AI Feedback (RLAIF)
  - An AI model provides the feedback
  - Using a highly capable model (e.g., GPT-4)
3. DeepSeek-R1-Zero
Eliminates step 2) SFT ( applies RL directly to the pre-trained base model )
DeepSeek-R1-Zero
- Starts with a pre-trained model: DeepSeek-V3-Base (671B params)
- Stage 2) (X) No SFT stage
- Stage 3) (O) Instead of standard RL (e.g., RLHF, RLAIF) … uses rule-based RL!!
  \(\rightarrow\) Group Relative Policy Optimization (GRPO)
(1) Rule-based RL
Group Relative Policy Optimization (GRPO)
Procedure
- Step 1) The input is fed to the model
- Step 2) A group of multiple outputs is sampled
  - Each output = (reasoning process, answer)
- Step 3) GRPO method
  - Observes these sampled outputs
  - Trains the model to generate the preferred outputs
    \(\rightarrow\) By calculating a reward for each output using predefined rules
    ( a minimal sketch of the group-relative computation follows below )
Summary
- Does not use a neural model to generate rewards
  \(\rightarrow\) Simplifies and reduces the cost of the training process!
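To make Step 3 concrete, below is a minimal sketch of the group-relative advantage computation at the heart of GRPO: there is no learned critic, so the mean reward of the sampled group serves as the baseline and rewards are normalized within the group. The `reward_fn` and the toy example are placeholders, not DeepSeek's implementation; the full GRPO objective additionally applies a clipped, PPO-style policy update with a KL penalty against a reference model.

```python
import statistics
from typing import Callable, List

def group_relative_advantages(
    outputs: List[str],
    reward_fn: Callable[[str], float],
) -> List[float]:
    """Score a group of sampled outputs and normalize rewards within the group.

    GRPO skips a learned value (critic) model: the baseline for each output
    is the mean reward of its group, and the advantage is the z-score of the
    reward within that group.
    """
    rewards = [reward_fn(o) for o in outputs]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Toy usage: 4 outputs sampled for one prompt, scored by a placeholder
# rule-based reward (not the paper's actual reward function).
group = [
    "<think>...</think><answer>42</answer>",
    "<think>...</think><answer>41</answer>",
    "<think>...</think><answer>42</answer>",
    "no tags, wrong format",
]
toy_reward = lambda o: float("<answer>42</answer>" in o)
print(group_relative_advantages(group, toy_reward))  # correct outputs get higher advantage
```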
Predefined rules
- Accuracy (e.g., math problems, code problems with deterministic results)
  - Can reliably check whether the final answer provided by the model is correct
- Format
  - Another type of rule creates format rewards
  - The model is instructed to respond with …
    - Reasoning process within <think> tags
    - Answer within <answer> tags
    \(\rightarrow\) The format reward ensures the model follows this formatting!
( a sketch of such reward rules follows below )
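As a rough illustration only, the rules above could be implemented as simple reward functions like the following. The exact reward values, regex, and answer-matching logic are assumptions for this sketch, not the paper's implementation.

```python
import re

# Expected layout: reasoning inside <think> tags, final answer inside <answer> tags.
THINK_ANSWER = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def format_reward(output: str) -> float:
    """1.0 if the output follows the <think>...</think><answer>...</answer> format."""
    return 1.0 if THINK_ANSWER.fullmatch(output.strip()) else 0.0

def accuracy_reward(output: str, ground_truth: str) -> float:
    """1.0 if the extracted final answer matches the known solution.

    Works for deterministic tasks (math with a single final answer, code
    checked against test cases, etc.); string equality is a simplification here.
    """
    match = THINK_ANSWER.fullmatch(output.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(output: str, ground_truth: str) -> float:
    # Combining the two signals by a plain sum is an assumption of this sketch.
    return accuracy_reward(output, ground_truth) + format_reward(output)
```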
(2) Experiments
(1) Performance Insights
DeepSeek-R1-Zero
- Comparable to o1 and even surpasses it in some cases
\(\rightarrow\) Shows steady improvement over the course of RL training!!
(2) Self-Evolution Process of DeepSeek-R1-Zero
Self-evolution process of the model
\(\rightarrow\) Through RL, the model naturally learns to allocate more thinking time when solving reasoning tasks!
(3) Aha Moment
Given a math question, the model starts its reasoning process.
However, at a certain point, the model begins to reevaluate its solution!
\(\rightarrow\) Learns to reevaluate its initial approach & correct itself if needed!
4. DeepSeek-R1
(1) Need for DeepSeek-R1
Why do we need a second model ( = DeepSeek-R1 ) ?
( given the remarkable capabilities of DeepSeek-R1-Zero )
2 Reasons
- Readability Issues
- Language Consistency
- Frequently mixes languages within a single response
\(\rightarrow\) Makes DeepSeek-R1-Zero less user-friendly
Findings (ablation study)
- Guiding the model to stick to ONE language slightly hurts its performance
  ( \(\leftrightarrow\) humans, who usually stick to a single language )
  ( a crude sketch of such a language-consistency check follows below )
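To address the mixing issue, the paper adds a language-consistency reward during RL, based on the proportion of target-language words in the chain-of-thought. The sketch below only illustrates that idea; the Latin-vs-non-Latin character heuristic is my assumption, not the paper's method.

```python
def language_consistency_reward(cot: str, target: str = "english") -> float:
    """Fraction of alphabetic characters in the chain-of-thought whose script
    matches the target language (crude heuristic: Latin vs. everything else)."""
    def is_latin(ch: str) -> bool:
        return "a" <= ch.lower() <= "z"

    letters = [ch for ch in cot if ch.isalpha()]
    if not letters:
        return 0.0
    latin_ratio = sum(is_latin(ch) for ch in letters) / len(letters)
    return latin_ratio if target == "english" else 1.0 - latin_ratio

# A reasoning trace that mixes, e.g., Chinese into an English chain-of-thought
# receives a lower reward than a purely English one.
```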
(2) Training Pipeline
Phase 1) Cold Start
- Start with the pre-trained model DeepSeek-V3-Base
- Supervised fine-tuning
  - On a small dataset (thousands of samples) collected from DeepSeek-R1-Zero
    \(\rightarrow\) These samples are curated to be high-quality and readable
Phase 2) Reasoning Reinforcement Learning
- Same as DeepSeek-R1-Zero (Rule-based)
Phase 3) Rejection Sampling and SFT
- Generate many samples
  - From the model checkpoint of phase 2
- 3-1) Rejection sampling ( see the sketch after this list )
  - Only correct and readable samples are retained
    ( DeepSeek-V3 acts as a generative reward model to decide accept/reject )
  - Some of DeepSeek-V3's training data is also included in this phase
- 3-2) SFT with the above datasets
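A minimal sketch of the rejection-sampling filter in phase 3-1, assuming hypothetical `is_correct` and `is_readable` judges (in practice, rule checks and DeepSeek-V3 acting as a generative reward model fill these roles); the survivors form the SFT data for phase 3-2.

```python
from typing import Callable, Dict, List

Sample = Dict[str, str]  # e.g. {"prompt": ..., "response": ...}

def rejection_sample(
    candidates: List[Sample],
    is_correct: Callable[[Sample], bool],   # e.g. a rule check or an LLM judge
    is_readable: Callable[[Sample], bool],  # e.g. rejects language mixing, unreadable traces
) -> List[Sample]:
    """Keep only the samples that pass both filters."""
    return [c for c in candidates if is_correct(c) and is_readable(c)]
```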
Phase 4) Diverse Reinforcement Learning Phase
- Tasks (such as math): Rule-based rewards
- Other tasks: An LLM provides feedback to align the model with human preferences ( see the reward-dispatch sketch below )
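A rough sketch of how the phase-4 reward signal could be dispatched by task type: rule-based scoring for verifiable tasks and a model-based preference score for the rest. The function names and task-type labels are placeholders, not an actual API from the paper.

```python
from typing import Callable, Optional

def diverse_reward(
    task_type: str,
    output: str,
    ground_truth: Optional[str],
    rule_based_score: Callable[[str, str], float],  # e.g. accuracy + format checks
    model_based_score: Callable[[str], float],      # e.g. an LLM / reward-model judge
) -> float:
    """Verifiable tasks keep rule-based rewards; open-ended tasks use a
    model-based preference signal (helpfulness / harmlessness)."""
    if task_type in {"math", "code"} and ground_truth is not None:
        return rule_based_score(output, ground_truth)
    return model_based_score(output)
```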
(3) Experiments
- DeepSeek-R1-32B: a 32-billion-parameter distilled model
  \(\rightarrow\) Makes it a viable smaller alternative