DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning
TL;DR Highlight
Built an OpenAI o1-level reasoning model using pure RL without any human-labelled reasoning data, and released distilled small models.
Who Should Read
ML engineers building LLM-based code generation or math-solving services, or AI app developers wanting to understand reasoning model performance and safety before production deployment.
Core Mechanics
- Running pure RL without any SFT naturally produced advanced reasoning behaviors like self-verification, reflection, and alternative exploration (DeepSeek-R1-Zero)
- AIME 2024 Pass@1 improved from 15.6% to 77.9% over the course of pure-RL training, surpassing the average score of human participants
- Applied a four-stage pipeline (cold-start SFT > RL > rejection-sampling SFT > RL) to address the language mixing and readability issues of R1-Zero
- Used GRPO (Group Relative Policy Optimization) instead of PPO, removing the need for a separate value model and saving memory and compute
- Fine-tuning open-source base models with 800K reasoning samples from DeepSeek-R1 enables even 1.5B models to beat GPT-4o on math benchmarks
- Few-shot prompting actually hurts performance — zero-shot with only the problem and output format specified is optimal
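The GRPO idea above reduces to a simple baseline trick: sample a group of completions per prompt, then normalize each reward against the group's own mean and standard deviation instead of a learned value model. A minimal sketch of that advantage computation (the function name and zero-variance guard are illustrative, not from the paper):

```python
import statistics

def group_relative_advantages(rewards):
    """Group-relative advantage as in GRPO: for G sampled completions
    of one prompt, A_i = (r_i - mean(r)) / std(r).

    The group statistics replace PPO's learned value-model baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # illustrative guard for zero-variance groups
    return [(r - mean) / std for r in rewards]

# 4 sampled answers to one prompt, scored by a rule-based reward (1 = correct)
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # → [1.0, -1.0, -1.0, 1.0]
```

Because the baseline is computed per group, correct answers are rewarded relative to the model's other attempts on the same problem, which is what lets the setup run without a value network.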
Evidence
- AIME 2024 Pass@1: DeepSeek-R1 79.8% vs OpenAI o1-1217 79.2% vs GPT-4o 9.3%
- Codeforces percentile: DeepSeek-R1 96.3% (rating 2029) — top 3.7% among human participants
- MATH-500 Pass@1: DeepSeek-R1 97.3% vs GPT-4o 74.6% vs Claude-3.5-Sonnet 78.3%
- Distilled 1.5B model (DeepSeek-R1-Distill-Qwen-1.5B) achieved AIME Pass@1 28.9%, more than 3x ahead of GPT-4o (9.3%)
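For reading the Pass@1 figures above: when papers report Pass@1 from n > 1 samples per problem, the standard unbiased estimator (Chen et al., 2021) is commonly used; whether DeepSeek-R1's numbers use exactly this estimator is an assumption here, but it is the usual way to reproduce such a metric:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    given n sampled solutions of which c are correct."""
    if n - c < k:
        return 1.0  # fewer wrong samples than k: some correct one is always drawn
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 16 samples per problem, 4 judged correct
print(pass_at_k(16, 4, 1))  # pass@1 = 4/16 = 0.25
```

For k = 1 this collapses to the fraction of correct samples, averaged over problems.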
How to Apply
- Remove few-shot examples from reasoning API calls and switch to zero-shot with an explicit output format, e.g. 'Solve the following problem and put the final answer in \boxed{}'
- For services needing small models, use DeepSeek-R1-Distill-Qwen-7B (on HuggingFace) or SFT-distill your own domain data the same way.
- When deploying open-source models to production, reference the Risk Review Prompt from the paper and attach a post-processing filter to patch jailbreak vulnerabilities.
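For the SFT-distillation route above, the core data-prep step is turning each teacher trace into a plain supervised example whose target completion contains the chain of thought followed by the final answer. A minimal sketch (the helper name and field layout are illustrative; the `<think>` delimiters follow DeepSeek-R1's output convention, but verify against your training framework's expected schema):

```python
def to_sft_example(problem, reasoning, answer):
    """Format one distillation sample: the teacher's reasoning trace and
    final answer become the assistant target for standard SFT.

    NOTE: illustrative schema — adapt message/field names to your trainer.
    """
    return {
        "messages": [
            {"role": "user", "content": problem},
            {"role": "assistant",
             "content": f"<think>\n{reasoning}\n</think>\n{answer}"},
        ]
    }

example = to_sft_example(
    "What is 2 + 2?",
    "Adding 2 and 2 gives 4.",
    "\\boxed{4}",
)
print(example["messages"][1]["content"].startswith("<think>"))  # → True
```

The student model is then fine-tuned on these pairs with an ordinary cross-entropy objective; no RL is needed on the student side, which is what makes distillation cheap for small models.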
Code Example
# DeepSeek-R1 zero-shot inference call example
# pip install openai
from openai import OpenAI
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.deepseek.com",
)
# ✅ zero-shot: specify only the problem + output format
prompt = """
Solve the following problem step by step.
Put your final answer inside \\boxed{}.
Problem: Find all positive integers n such that n^2 + 1 is divisible by n + 1.
"""
response = client.chat.completions.create(
    model="deepseek-reasoner",  # DeepSeek-R1
    messages=[{"role": "user", "content": prompt}],
    # ❌ adding few-shot examples actually degrades performance
)
print(response.choices[0].message.content)
Original Abstract
General reasoning represents a long-standing and formidable challenge in artificial intelligence (AI). Recent breakthroughs, exemplified by large language models (LLMs) [1,2] and chain-of-thought (CoT) prompting [3], have achieved considerable success on foundational reasoning tasks. However, this success is heavily contingent on extensive human-annotated demonstrations and the capabilities of models are still insufficient for more complex problems. Here we show that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labelled reasoning trajectories. The proposed RL framework facilitates the emergent development of advanced reasoning patterns, such as self-reflection, verification and dynamic strategy adaptation. Consequently, the trained model achieves superior performance on verifiable tasks such as mathematics, coding competitions and STEM fields, surpassing its counterparts trained through conventional supervised learning on human demonstrations. Moreover, the emergent reasoning patterns exhibited by these large-scale models can be systematically used to guide and enhance the reasoning capabilities of smaller models. A new artificial intelligence model, DeepSeek-R1, is introduced, demonstrating that the reasoning abilities of large language models can be incentivized through pure reinforcement learning, removing the need for human-annotated demonstrations.