LIMO: Less is More for Reasoning
TL;DR Highlight
SFT with just 800 curated examples reaches 63.3% on AIME24, using only 1% of the training data of prior methods and beating OpenAI-o1-preview.
Who Should Read
ML engineers running LLM fine-tuning pipelines or building math/reasoning specialized models. Teams wanting to maintain performance while reducing training data costs.
Core Mechanics
- SFT on Qwen2.5-32B-Instruct with 800 samples achieves 63.3% on AIME24, roughly 10x the score of the same base model trained on 100k NuminaMath samples (6.5%)
- Quality over quantity: 800 LIMO samples (78.1% average) outperform 114k OpenThoughts (58.3%) across all benchmarks including OOD
- Four hallmarks of high-quality reasoning chains: detailed step-by-step explanations; self-verification ('check', 'verify'); exploratory expressions ('perhaps', 'might'); logical connectors ('therefore', 'since')
- Even 400 samples lifted AIME24 from 16.5% to 57.5%: the key is problem selection by difficulty and quality, not data volume
- Pretraining quality is a prerequisite: the same 800 samples yield 9.2% on Qwen1.5-32B vs 63.3% on Qwen2.5-32B; the base model must already encode sufficient math knowledge
- Strong OOD generalization: outperforms models trained on 100x more data on Chinese college exams, graduate exams, GPQA, etc.
Evidence
- AIME24: LIMO 63.3% vs NuminaMath-100k 6.5% vs OpenAI-o1-preview 44.6% vs QwQ-32B-Preview 50.0%
- MATH500: LIMO 95.6% vs NuminaMath-100k 59.2% vs QwQ-32B-Preview 89.8%
- Overall benchmark average (including OOD): LIMO 78.1% vs OpenThoughts-114k 58.3% vs NuminaMath-100k 32.3%
- 400 samples alone achieved 57.5% on AIME24 (+41pp over the base model's 16.5%); returns diminish beyond 800 samples
How to Apply
- Two-stage difficulty filtering: first remove easy problems with Qwen2.5-Math-7B-Instruct, then sample each remaining problem 32 times with DeepSeek-R1-Distill-Qwen-32B and keep only problems solved in 1-3 of those attempts. This compresses hundreds of thousands of candidates down to ~2,000.
- Automatic reasoning-chain quality scoring, normalized by text length: step detail (30%), self-verification expressions (20%), exploratory expressions (25%), logical connectors (25%); use the top 800 for SFT.
- Base model selection matters more than data: models with sufficient math pretraining (Qwen2.5, DeepSeek series) are needed for the small-data SFT strategy to work.
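The second-stage difficulty filter described above can be sketched as follows. This is a minimal sketch, not the authors' code: `attempt_solve` is a hypothetical stand-in for sampling one solution from a strong model (e.g. DeepSeek-R1-Distill-Qwen-32B) and checking its final answer, and the toy demo below replaces real problems with success probabilities.

```python
import random

def difficulty_filter(problems, attempt_solve, n_samples=32, keep_range=(1, 3)):
    """Keep only hard-but-solvable problems: those the strong model
    solves in 1-3 of 32 sampled attempts."""
    kept = []
    for problem in problems:
        successes = sum(attempt_solve(problem) for _ in range(n_samples))
        if keep_range[0] <= successes <= keep_range[1]:
            kept.append(problem)
    return kept

# Toy demo: each "problem" is just a success probability, and attempt_solve
# simulates one sampled solution succeeding with that probability.
random.seed(0)
problems = [0.0, 0.05, 0.5, 0.95]  # unsolvable, hard, medium, easy
hard = difficulty_filter(problems, lambda p: random.random() < p)
```

Problems the model never solves (0 successes) are dropped along with easy ones, which is the point: the filter targets the frontier of the model's ability.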
Code Example
# LIMO-style reasoning chain quality scoring function
import re

def score_reasoning_chain(text: str) -> float:
    lowered = text.lower()
    length = len(text.split())
    if length == 0:
        return 0.0

    def freq(phrases):
        # Whole-word matching avoids substring hits (e.g. 'so' inside 'some');
        # normalizing by word count keeps long texts from being over-rewarded.
        return sum(len(re.findall(r'\b' + re.escape(p) + r'\b', lowered))
                   for p in phrases) / length

    # Self-verification expressions (20%)
    verify_score = freq(['check', 'verify', 'confirm', 'validate', 'let me verify'])
    # Exploratory expressions (25%)
    explore_score = freq(['perhaps', 'might', 'maybe', 'alternatively', 'let me try'])
    # Logical connectives (25%)
    logic_score = freq(['therefore', 'since', 'because', 'thus', 'hence', 'so'])
    # Length score (30%): longer chains tend to carry more detailed steps
    length_score = min(length / 2000, 1.0)  # capped at 2000 words

    return (
        length_score * 0.30
        + verify_score * 1000 * 0.20  # rescale the per-word frequencies
        + explore_score * 1000 * 0.25
        + logic_score * 1000 * 0.25
    )

# Usage example: dataset is a list of (question, reasoning, answer) triples
scored = [(score_reasoning_chain(r), q, r, a) for q, r, a in dataset]
scored.sort(key=lambda item: item[0], reverse=True)
top_800 = scored[:800]  # use only the top 800 as SFT data

Terminology
Related Resources
Original Abstract
We challenge the prevailing assumption that complex reasoning in large language models (LLMs) necessitates massive training data. We demonstrate that sophisticated mathematical reasoning can emerge with only a few examples. Specifically, through simple supervised fine-tuning, our model, LIMO, achieves 63.3% accuracy on AIME24 and 95.6% on MATH500, surpassing previous fine-tuned models (6.5% on AIME24, 59.2% on MATH500) while using only 1% of the training data required by prior approaches. Furthermore, LIMO exhibits strong out-of-distribution generalization, achieving a 45.8% absolute improvement across diverse benchmarks, outperforming models trained on 100x more data. Synthesizing these findings, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning can emerge through minimal but strategically designed demonstrations of cognitive processes. This hypothesis suggests that the threshold for eliciting complex reasoning is not dictated by task complexity but rather by two key factors: (1) the completeness of the model's pre-trained knowledge base and (2) the effectiveness of post-training examples in serving as "cognitive templates" that guide reasoning.