LIMO: Less is More for Reasoning
TL;DR Highlight
SFT with just 800 curated examples reaches 63.3% on AIME24, using only 1% of the training data of prior methods and beating OpenAI-o1-preview.
Who Should Read
ML engineers running LLM fine-tuning pipelines or building math/reasoning specialized models. Teams wanting to maintain performance while reducing training data costs.
Core Mechanics
- SFT on Qwen2.5-32B-Instruct with 800 samples achieves 63.3% on AIME24, roughly 10x the score of the same base model trained on 100k NuminaMath samples (6.5%)
- Quality over quantity: 800 LIMO samples (78.1% average) outperform 114k OpenThoughts (58.3%) across all benchmarks including OOD
- Four hallmarks of high-quality reasoning chains: detailed step-by-step explanations; self-verification ('check', 'verify'); exploratory expressions ('perhaps', 'might'); logical connectors ('therefore', 'since')
- Even 400 samples lifted AIME24 from 16.5% to 57.5%: the key is problem selection by difficulty and quality, not data volume
- Pretraining quality is a prerequisite: the same 800 samples yield 9.2% on Qwen1.5-32B vs 63.3% on Qwen2.5-32B; the base model must already encode sufficient math knowledge
- Strong OOD generalization: outperforms models trained on 100x more data on Chinese college exams, graduate exams, GPQA, etc.
Evidence
- AIME24: LIMO 63.3% vs NuminaMath-100k 6.5% vs OpenAI-o1-preview 44.6% vs QwQ-32B-Preview 50.0%
- MATH500: LIMO 95.6% vs NuminaMath-100k 59.2% vs QwQ-32B-Preview 89.8%
- Overall benchmark average (including OOD): LIMO 78.1% vs OpenThoughts-114k 58.3% vs NuminaMath-100k 32.3%
- 400 samples alone achieved 57.5% on AIME24 (+41pp over the base model's 16.5%); returns diminish beyond 800 samples
How to Apply
- Two-stage difficulty filtering: first remove easy problems with Qwen2.5-Math-7B-Instruct, then sample each remaining problem 32 times with DeepSeek-R1-Distill-Qwen-32B and keep only problems solved in 1-3 of those attempts. This compresses hundreds of thousands of candidates down to ~2,000.
- Automatic reasoning-chain quality scoring, normalized by text length: step detail (30%), self-verification expressions (20%), exploratory expressions (25%), logical connectors (25%); use the top 800 for SFT.
- Base model selection matters more than data: models with sufficient math pretraining (Qwen2.5, DeepSeek series) are needed for the small-data SFT strategy to work.
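The second-stage difficulty filter described above can be sketched as follows. This is a minimal sketch, not the authors' code: `attempt_solve` is a hypothetical stand-in for sampling one solution from a strong model (e.g. DeepSeek-R1-Distill-Qwen-32B) and checking its final answer, and the toy demo below replaces real problems with success probabilities.

```python
import random

def difficulty_filter(problems, attempt_solve, n_samples=32, keep_range=(1, 3)):
    """Keep only hard-but-solvable problems: those the strong model
    solves in 1-3 of 32 sampled attempts."""
    kept = []
    for problem in problems:
        successes = sum(attempt_solve(problem) for _ in range(n_samples))
        if keep_range[0] <= successes <= keep_range[1]:
            kept.append(problem)
    return kept

# Toy demo: each "problem" is just a success probability, and attempt_solve
# simulates one sampled solution succeeding with that probability.
random.seed(0)
problems = [0.0, 0.05, 0.5, 0.95]  # unsolvable, hard, medium, easy
hard = difficulty_filter(problems, lambda p: random.random() < p)
```

Problems the model never solves (0 successes) are dropped along with easy ones, which is the point: the filter targets the frontier of the model's ability.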
Code Example
# LIMO-style reasoning chain quality scoring function
import re

def score_reasoning_chain(text: str) -> float:
    lowered = text.lower()
    length = len(text.split())
    if length == 0:
        return 0.0

    def freq(phrases):
        # Whole-word matching avoids substring hits (e.g. 'so' inside 'some');
        # normalizing by word count keeps long texts from being over-rewarded.
        return sum(len(re.findall(r'\b' + re.escape(p) + r'\b', lowered))
                   for p in phrases) / length

    # Self-verification expressions (20%)
    verify_score = freq(['check', 'verify', 'confirm', 'validate', 'let me verify'])
    # Exploratory expressions (25%)
    explore_score = freq(['perhaps', 'might', 'maybe', 'alternatively', 'let me try'])
    # Logical connectives (25%)
    logic_score = freq(['therefore', 'since', 'because', 'thus', 'hence', 'so'])
    # Length score (30%): longer chains tend to carry more detailed steps
    length_score = min(length / 2000, 1.0)  # capped at 2000 words

    return (
        length_score * 0.30
        + verify_score * 1000 * 0.20  # rescale the per-word frequencies
        + explore_score * 1000 * 0.25
        + logic_score * 1000 * 0.25
    )

# Usage example: dataset is a list of (question, reasoning, answer) triples
scored = [(score_reasoning_chain(r), q, r, a) for q, r, a in dataset]
scored.sort(key=lambda item: item[0], reverse=True)
top_800 = scored[:800]  # use only the top 800 as SFT data

Terminology
Related Resources
Original Abstract
We challenge the prevailing assumption that complex reasoning in large language models (LLMs) necessitates massive training data. We demonstrate that sophisticated mathematical reasoning can emerge with only a few examples. Specifically, through simple supervised fine-tuning, our model, LIMO, achieves 63.3% accuracy on AIME24 and 95.6% on MATH500, surpassing previous fine-tuned models (6.5% on AIME24, 59.2% on MATH500) while using only 1% of the training data required by prior approaches. Furthermore, LIMO exhibits strong out-of-distribution generalization, achieving a 45.8% absolute improvement across diverse benchmarks, outperforming models trained on 100x more data. Synthesizing these findings, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning can emerge through minimal but strategically designed demonstrations of cognitive processes. This hypothesis suggests that the threshold for eliciting complex reasoning is not dictated by task complexity but rather by two key factors: (1) the completeness of the model's pre-trained knowledge base and (2) the effectiveness of post-training examples in serving as "cognitive templates" that guide reasoning.