DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning
TL;DR Highlight
Built an OpenAI o1-level reasoning model using pure RL without any human-labelled reasoning data, and released distilled small models.
Who Should Read
ML engineers building LLM-based code generation or math-solving services, or AI app developers wanting to understand reasoning model performance and safety before production deployment.
Core Mechanics
- Running pure RL without any SFT naturally produced advanced reasoning behaviors like self-verification, reflection, and alternative exploration (DeepSeek-R1-Zero)
- AIME 2024 Pass@1 improved from 15.6% to 77.9% over the course of pure-RL training, surpassing the average score of human participants
- Applied a four-stage pipeline (cold-start SFT > RL > rejection-sampling SFT > RL) to address the language mixing and readability issues of R1-Zero
- Used GRPO (Group Relative Policy Optimization) instead of PPO, removing the need for a separate value model and saving memory and compute
- Fine-tuning open-source base models with 800K reasoning samples from DeepSeek-R1 enables even 1.5B models to beat GPT-4o on math benchmarks
- Few-shot prompting actually hurts performance — zero-shot with only the problem and output format specified is optimal
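The GRPO idea above reduces to a simple baseline trick: sample a group of completions per prompt, then normalize each reward against the group's own mean and standard deviation instead of a learned value model. A minimal sketch of that advantage computation (the function name and zero-variance guard are illustrative, not from the paper):

```python
import statistics

def group_relative_advantages(rewards):
    """Group-relative advantage as in GRPO: for G sampled completions
    of one prompt, A_i = (r_i - mean(r)) / std(r).

    The group statistics replace PPO's learned value-model baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # illustrative guard for zero-variance groups
    return [(r - mean) / std for r in rewards]

# 4 sampled answers to one prompt, scored by a rule-based reward (1 = correct)
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # → [1.0, -1.0, -1.0, 1.0]
```

Because the baseline is computed per group, correct answers are rewarded relative to the model's other attempts on the same problem, which is what lets the setup run without a value network.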
Evidence
- AIME 2024 Pass@1: DeepSeek-R1 79.8% vs OpenAI o1-1217 79.2% vs GPT-4o 9.3%
- Codeforces percentile: DeepSeek-R1 96.3% (rating 2029) — top 3.7% among human participants
- MATH-500 Pass@1: DeepSeek-R1 97.3% vs GPT-4o 74.6% vs Claude-3.5-Sonnet 78.3%
- Distilled 1.5B model (DeepSeek-R1-Distill-Qwen-1.5B) achieved AIME Pass@1 28.9%, more than 3x ahead of GPT-4o (9.3%)
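For reading the Pass@1 figures above: when papers report Pass@1 from n > 1 samples per problem, the standard unbiased estimator (Chen et al., 2021) is commonly used; whether DeepSeek-R1's numbers use exactly this estimator is an assumption here, but it is the usual way to reproduce such a metric:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    given n sampled solutions of which c are correct."""
    if n - c < k:
        return 1.0  # fewer wrong samples than k: some correct one is always drawn
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 16 samples per problem, 4 judged correct
print(pass_at_k(16, 4, 1))  # pass@1 = 4/16 = 0.25
```

For k = 1 this collapses to the fraction of correct samples, averaged over problems.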
How to Apply
- Remove few-shot examples from reasoning API calls and switch to zero-shot with an explicit output format, e.g. 'Solve the following problem and put the final answer in \boxed{}'
- For services needing small models, use DeepSeek-R1-Distill-Qwen-7B (on HuggingFace) or SFT-distill your own domain data the same way.
- When deploying open-source models to production, reference the Risk Review Prompt from the paper and attach a post-processing filter to patch jailbreak vulnerabilities.
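For the SFT-distillation route above, the core data-prep step is turning each teacher trace into a plain supervised example whose target completion contains the chain of thought followed by the final answer. A minimal sketch (the helper name and field layout are illustrative; the `<think>` delimiters follow DeepSeek-R1's output convention, but verify against your training framework's expected schema):

```python
def to_sft_example(problem, reasoning, answer):
    """Format one distillation sample: the teacher's reasoning trace and
    final answer become the assistant target for standard SFT.

    NOTE: illustrative schema — adapt message/field names to your trainer.
    """
    return {
        "messages": [
            {"role": "user", "content": problem},
            {"role": "assistant",
             "content": f"<think>\n{reasoning}\n</think>\n{answer}"},
        ]
    }

example = to_sft_example(
    "What is 2 + 2?",
    "Adding 2 and 2 gives 4.",
    "\\boxed{4}",
)
print(example["messages"][1]["content"].startswith("<think>"))  # → True
```

The student model is then fine-tuned on these pairs with an ordinary cross-entropy objective; no RL is needed on the student side, which is what makes distillation cheap for small models.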
Code Example
# DeepSeek-R1 zero-shot inference call example
# pip install openai
from openai import OpenAI
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.deepseek.com",
)
# ✅ zero-shot: specify only the problem + output format
prompt = """
Solve the following problem step by step.
Put your final answer inside \\boxed{}.
Problem: Find all positive integers n such that n^2 + 1 is divisible by n + 1.
"""
response = client.chat.completions.create(
    model="deepseek-reasoner",  # DeepSeek-R1
    messages=[{"role": "user", "content": prompt}],
    # ❌ adding few-shot examples actually degrades performance
)
print(response.choices[0].message.content)
Original Abstract
General reasoning represents a long-standing and formidable challenge in artificial intelligence (AI). Recent breakthroughs, exemplified by large language models (LLMs) [1,2] and chain-of-thought (CoT) prompting [3], have achieved considerable success on foundational reasoning tasks. However, this success is heavily contingent on extensive human-annotated demonstrations and the capabilities of models are still insufficient for more complex problems. Here we show that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labelled reasoning trajectories. The proposed RL framework facilitates the emergent development of advanced reasoning patterns, such as self-reflection, verification and dynamic strategy adaptation. Consequently, the trained model achieves superior performance on verifiable tasks such as mathematics, coding competitions and STEM fields, surpassing its counterparts trained through conventional supervised learning on human demonstrations. Moreover, the emergent reasoning patterns exhibited by these large-scale models can be systematically used to guide and enhance the reasoning capabilities of smaller models. A new artificial intelligence model, DeepSeek-R1, is introduced, demonstrating that the reasoning abilities of large language models can be incentivized through pure reinforcement learning, removing the need for human-annotated demonstrations.