DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Guo, Daya, et al. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv preprint arXiv:2501.12948 (2025).
References:
- https://aipapersacademy.com/deepseek-r1/
- https://arxiv.org/abs/2501.12948
Contents
- Introduction
- Recap: LLMs Training Process
    - Pretraining
    - Supervised Fine-tuning
    - Reinforcement Learning (RL)
- DeepSeek-R1-Zero
    - Rule-based RL
- Experiments
- DeepSeek-R1
    - Need for DeepSeek-R1
    - Training Pipeline
- Experiments
1. Introduction
LLMs: Paved the way toward artificial general intelligence (AGI)
OpenAI-o1: Innovative inference-time scaling techniques
\(\rightarrow\) Significantly enhance reasoning capabilities (but closed source)
DeepSeek-R1
- (1) SOTA + Open-source reasoning model
- (2) Large-scale reinforcement learning techniques
2. Recap: LLMs Training Process

(1) Pre-training
- Pre-trained on vast amounts of text and code \(\rightarrow\) Learns general-purpose knowledge
- Task: Next token prediction (NTP), as sketched below
- However, with ONLY pre-training… the model struggles to follow human instructions! \(\rightarrow\) Necessity of SFT
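As a concrete illustration, a minimal next-token prediction loss might look like this (a toy PyTorch sketch; the random logits stand in for a real model's output):

```python
import torch
import torch.nn.functional as F

# Toy setup: random token IDs and random logits standing in for a model.
vocab_size, seq_len, batch = 100, 8, 2
tokens = torch.randint(0, vocab_size, (batch, seq_len))
logits = torch.randn(batch, seq_len, vocab_size)

# Shift by one: the prediction at position t is scored against token t+1,
# so the model learns to continue the sequence.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions for positions 0..T-2
    tokens[:, 1:].reshape(-1),               # targets: the next tokens
)
print(loss.item())
```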
(2) Supervised Fine-tuning
- Fine-tuned on an instruction dataset
- Instruction dataset: made by humans, derived from an existing dataset, or generated by a model ( = self-instruct)
- Consists of instruction-response pairs ( response = label ), as in the sketch below
\(\rightarrow\) Model becomes better at following instructions!
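A minimal sketch of how one instruction-response pair becomes an SFT example (toy PyTorch code; the key point is that the loss is computed only on the response tokens):

```python
import torch
import torch.nn.functional as F

IGNORE = -100  # label index that cross_entropy ignores

instruction = torch.tensor([5, 17, 42])     # toy IDs for the instruction
response    = torch.tensor([8, 23, 91, 2])  # toy IDs for the response (= label)

input_ids = torch.cat([instruction, response])
labels = input_ids.clone()
labels[: len(instruction)] = IGNORE         # no loss on the instruction part

logits = torch.randn(len(input_ids), 1000)  # stand-in for model logits
# Shift by one as in pre-training; only response positions contribute.
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE)
print(loss.item())
```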
(3) Reinforcement Learning (RL)
- Further improved using feedback (feat. RL)
- Reinforcement Learning from Human Feedback (RLHF)
    - A human provides the feedback
    - But gathering large-scale, high-quality human feedback, especially for complex tasks, is challenging! \(\rightarrow\) RLAIF
- Reinforcement Learning from AI Feedback (RLAIF)
    - An AI model provides the feedback
    - Requires a highly capable model (e.g., GPT-4), as in the sketch below
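The RLAIF loop can be sketched roughly as follows (the `judge` callable and prompt format are hypothetical stand-ins, not a real API):

```python
def ai_feedback(judge, prompt: str, response: str) -> float:
    """Hypothetical judge call: a capable model rates a response.

    `judge` is assumed to be a callable that returns a numeric rating
    as a string, e.g. "7"; the rating replaces a human preference label.
    """
    rating = judge(f"Rate this answer to '{prompt}' on a 0-10 scale:\n{response}")
    return float(rating) / 10.0  # normalized reward in [0, 1]
```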
 
3. DeepSeek-R1-Zero
Eliminates step 2) SFT entirely
DeepSeek-R1-Zero
- Start with a pretrained model: DeepSeek-V3-Base (671B params)
- Stage 2) (X) No SFT stage
- Stage 3) (O) Instead of standard RL (e.g., RLHF, RLAIF)… use rule-based RL!! \(\rightarrow\) Group Relative Policy Optimization (GRPO)
(1) Rule-based RL

Group Relative Policy Optimization (GRPO)
Procedure
- Step 1) Input is fed to the model
- Step 2) A group of multiple outputs is sampled
    - Each output = (reasoning process, answer)
- Step 3) GRPO method
    - Observes these sampled outputs
    - Trains the model to generate the preferred outputs \(\rightarrow\) By calculating a reward for each output using predefined rules (see the sketch below)
Summary
- Does not use a neural model to generate rewards \(\rightarrow\) Simplifies and reduces the cost of the training process!
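The key mechanism is the group-relative advantage: each output's reward is standardized against the other outputs sampled for the same input, removing the need for a learned value or reward model. A minimal sketch following the paper's advantage formula (the clipping and KL terms of the full GRPO objective are omitted):

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: shape (G,), one scalar per sampled output in the group.

    Each output's advantage is its reward standardized within the group:
        A_i = (r_i - mean(r)) / std(r)
    Outputs better than the group average get positive advantage (reinforced);
    worse ones get negative advantage (suppressed).
    """
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: a group of 4 sampled answers, two correct (reward 1), two wrong.
print(grpo_advantages(torch.tensor([1.0, 0.0, 1.0, 0.0])))
# tensor([ 0.8660, -0.8660,  0.8660, -0.8660])
```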
Predefined rules
- Accuracy (e.g., math problems, code problems with deterministic results)
    - Can reliably check whether the final answer provided by the model is correct
- Format
    - Another type of rule creates format rewards, based on how the model is instructed to respond:
        - Reasoning process within <think> tags
        - Answer within <answer> tags
    \(\rightarrow\) The format reward ensures the model follows this formatting! (see the sketch below)
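A sketch of what these rule-based rewards could look like in code (an illustrative reconstruction; the paper describes the rules but does not publish them as code):

```python
import re

def format_reward(output: str) -> float:
    """1.0 if the output follows <think>...</think><answer>...</answer>."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, output.strip(), re.DOTALL) else 0.0

def accuracy_reward(output: str, ground_truth: str) -> float:
    """1.0 if the extracted final answer matches the known ground truth."""
    m = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == ground_truth else 0.0

out = "<think>2 + 2 = 4</think>\n<answer>4</answer>"
print(format_reward(out), accuracy_reward(out, "4"))  # 1.0 1.0
```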

4. Experiments
(1) Performance Insights

DeepSeek-R1-Zero
- Comparable to o1 and even surpasses it in some cases

\(\rightarrow\) Steady improvement over the course of RL training!!
(2) Self-Evolution Process of DeepSeek-R1-Zero

Self-evolution process of the model
\(\rightarrow\) Through RL, the model naturally learns to allocate more thinking time when solving reasoning tasks!
(3) Aha Moment
Given a math question, the model starts its reasoning process.
However, at a certain point, the model begins to reevaluate its solution!
\(\rightarrow\) Learns to reevaluate its initial approach and correct itself if needed!

5. DeepSeek-R1
(1) Need for DeepSeek-R1
Why do we need a second model ( = DeepSeek-R1 ) ?
( given the remarkable capabilities of DeepSeek-R1-Zero )
2 Reasons
- Readability Issues
- Language Consistency
    - Frequently mixes languages within a single response
\(\rightarrow\) Makes DeepSeek-R1-Zero less user-friendly
Findings (ablation study)
- Guiding the model to stick to ONE language slightly damages its performance ( \(\leftrightarrow\) in contrast to humans, who usually stick to a single language )
(2) Training Pipeline

Phase 1) Cold Start
- Start with the pre-trained model DeepSeek-V3-Base
- Supervised fine-tuning
    - On a small dataset (thousands of examples) collected from DeepSeek-R1-Zero
    \(\rightarrow\) Results: high-quality and readable
Phase 2) Reasoning Reinforcement Learning
- Same as DeepSeek-R1-Zero (rule-based)
Phase 3) Rejection Sampling and SFT
- Generate many samples from the model checkpoint of phase 2
- 3-1) Rejection sampling (see the sketch below)
    - Only correct and readable samples are retained ( a generative reward model, DeepSeek-V3, decides the accept/reject )
    - Some of DeepSeek-V3's training data is also included in this phase
- 3-2) SFT with the above datasets
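A sketch of the phase-3 filtering loop (the `generate`, `is_correct`, and `is_readable` helpers are hypothetical stand-ins; per the paper, DeepSeek-V3 acts as a generative reward model for part of the judging):

```python
def is_correct(prompt, sample) -> bool:
    """Hypothetical stand-in for the correctness judge (e.g., DeepSeek-V3)."""
    ...

def is_readable(sample) -> bool:
    """Hypothetical stand-in for the readability filter (no mixed languages, etc.)."""
    ...

def rejection_sample(model, prompts, n_samples=16):
    """Generate n_samples per prompt from the phase-2 checkpoint and keep
    only samples that pass both checks; survivors form the phase 3-2 SFT set."""
    kept = []
    for prompt in prompts:
        for _ in range(n_samples):
            sample = model.generate(prompt)  # hypothetical generation API
            if is_correct(prompt, sample) and is_readable(sample):
                kept.append((prompt, sample))
    return kept
```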
Phase 4) Diverse Reinforcement Learning Phase
- Verifiable tasks (such as math): rule-based rewards
- Other tasks: an LLM provides feedback to align the model with human preferences (see the sketch below)
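The phase-4 reward routing might look like the following sketch (illustrative; `preference_model_score` is a hypothetical stand-in for the human-preference reward model, and `accuracy_reward` is the rule-based check sketched earlier):

```python
def diverse_reward(task_type: str, output: str, ground_truth: str = "") -> float:
    """Route each sample to the appropriate reward source."""
    if task_type in {"math", "code"}:
        # Verifiable tasks keep the rule-based accuracy reward from R1-Zero.
        return accuracy_reward(output, ground_truth)
    # Open-ended tasks are scored by a model aligned with human preferences.
    return preference_model_score(output)  # hypothetical
```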
6. Experiments

- DeepSeek-R1-Distill-Qwen-32B: a 32-billion-parameter model distilled from DeepSeek-R1 \(\rightarrow\) Making it a viable smaller alternative
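Distillation here is plain SFT on teacher-generated data rather than logit matching; a sketch (all names hypothetical):

```python
# Hypothetical sketch: the student is fine-tuned on reasoning traces
# generated by the DeepSeek-R1 teacher, i.e., distillation = standard SFT.
distill_data = [(p, teacher.generate(p)) for p in prompts]  # teacher traces
student.finetune(distill_data)
```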
