O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agentic RL
TL;DR Highlight
A multi-agent system automatically generates high-quality training data, which is then refined with RL to build a deep research system on open-source models that surpasses GPT-5 and OpenAI O3.
Who Should Read
ML engineers fine-tuning open-source LLMs as research agents or designing complex multi-step reasoning pipelines involving web search and crawling. Researchers interested in automated high-quality synthetic training data generation pipelines.
Core Mechanics
- Decomposing queries into independent subtasks for parallel processing lifts GPT-5's overall score from 42.92 (sequential execution) to 49.60, and Comprehensiveness from 40.59 to 49.61
- Using Qwen-2.5-72B-Instruct as the base model, just two training stages (SFT + GRPO: Group Relative Policy Optimization) achieved 48.48, surpassing GPT-5 (46.77) and OpenAI O3 (43.71)
- From 5,000 seed queries, candidates are generated via a multi-agent workflow, then passed through rule-based hard filtering, LLM-as-a-Judge semantic filtering, and human spot-checking, retaining only 3,500+ high-quality SFT examples
- The RL reward function is designed as quality (Rbase, weight 0.6) + tool-use appropriateness (Rtool, 0.2) + format (Rformat, 0.2); RL recovers citation accuracy, which SFT had degraded from 44.27% to 29.13%, back to 31.99%
- Extending context length from 32k to 64k yields significant performance gains; 64k to 128k shows diminishing returns — providing practical guidelines for training data length design
- 10 reasoning steps is the sweet spot: it outperforms 5 steps (48.80 → 49.61) and performs on par with 20 steps at substantially lower cost
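The planner → parallel agents → summarizer pattern described above can be sketched with a thread pool. This is an illustrative skeleton, not the paper's implementation: `call_llm`, `plan_subtasks`, and `run_subtask` are hypothetical stubs standing in for a real chat-completion client and the full Think-Search-Observe loop.

```python
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    # Stub: replace with a real chat-completion client.
    return f"[answer to: {prompt}]"

def plan_subtasks(query: str) -> list[str]:
    # Planner: decompose the query into orthogonal subtasks.
    # A fixed decomposition here, purely for illustration.
    return [f"historical background of {query}",
            f"current state-of-the-art for {query}",
            f"metric comparison for {query}"]

def run_subtask(subtask: str) -> str:
    # Each agent would run its own Think-Search-Observe loop;
    # collapsed to a single LLM call in this sketch.
    return call_llm(subtask)

def deep_research(query: str) -> str:
    subtasks = plan_subtasks(query)
    # Independent subtasks run in parallel rather than sequentially.
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        answers = list(pool.map(run_subtask, subtasks))
    # Summarizer: integrate per-subtask answers into one report.
    return call_llm("Synthesize a report from:\n" + "\n".join(answers))
```

Because the subtasks are designed to be orthogonal, the parallel map needs no coordination between agents; only the final synthesis call sees all results.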
Evidence
- O-Researcher-RL achieves a RACE score of 48.48, setting SOTA among open-source deep research models — surpassing GPT-5 (46.77), OpenAI O3 (43.71), Tongyi-Deep Research (45.66), and MiroThinker (41.79)
- On DeepResearchGym-Commercial-100, O-Researcher-72B scores: Clarity 100.00 (perfect), Insight 99.3, Citation Precision 51.45 — the highest citation precision across all categories
- Applying the parallel execution workflow brings GPT-5's performance close to Gemini-2.5-Pro Deep Research's 48.88; without it, GPT-5's score drops to 42.92, a gap of over 6 points
- Effective Citations improved ~3x from the base model (Qwen-2.5-72B-Instruct) at 8.96 to O-Researcher-RL at 26.01; overall RACE score improved from 33.38 to 48.48, a gain of +15.10 points
How to Apply
- When handling complex research queries, adopt the pattern of 'planner decomposes into subtasks → independent agents execute Think-Search-Observe loops in parallel for each subtask → summarizer integrates results' to achieve significant improvements in Comprehensiveness and Insight over single LLM prompting
- When creating agent training data, collect not just the final answer but serialize the entire trace — <subtask_list> → <think> → <plan> → <web_search> → <observation> → <subtask_answer> → <suggested_answer> — using XML tags as SFT training data
- For RL reward design, reference the weight combination of 'quality 60% + tool use appropriateness 20% + format 20%', and apply lower bound (0 points for fewer than 2 tool calls) and upper bound (-1 point for more than 8 tool calls) penalties to suppress both excessive and insufficient searching
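The reward weighting and tool-call bounds above can be sketched as follows. The quality and format sub-rewards (`r_base`, `r_format`) would come from LLM judges in practice and are taken as plain inputs here; only the tool-use term and the weighted sum are modeled.

```python
def tool_reward(n_tool_calls: int) -> float:
    # Lower bound: fewer than 2 tool calls earns no tool credit.
    if n_tool_calls < 2:
        return 0.0
    # Upper bound: more than 8 tool calls is actively penalized.
    if n_tool_calls > 8:
        return -1.0
    # Within bounds: full tool-use credit.
    return 1.0

def total_reward(r_base: float, r_format: float, n_tool_calls: int) -> float:
    # Weighted sum from the paper: quality 0.6 + tool use 0.2 + format 0.2.
    return 0.6 * r_base + 0.2 * tool_reward(n_tool_calls) + 0.2 * r_format
```

With both sub-rewards at 1.0, a trajectory with 5 tool calls scores 1.0, one with a single call drops to 0.8, and one with 9 calls drops to 0.6, so the gradient pushes against both under- and over-searching.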
Code Example
# O-Researcher style deep research prompt template
SYSTEM_PROMPT = """
You are a deep research assistant. Use the following tools to answer questions.
Available Tools:
- <web_search>query1 | query2&serp_num=10</web_search>
- <crawl_page>https://example.com</crawl_page>
Workflow:
1. Start with <subtask_list> to decompose the main query into orthogonal sub-problems
2. For each subtask, follow: <think> → <plan> → tool calls → <observation> → <subtask_answer>
3. After all subtasks, synthesize into <suggested_answer>
Rules:
- <think> must appear before any plan or tool call
- Minimum 5 tool invocations, maximum 8 per subtask
- Final answer must include Introduction, Body, Conclusion, References
- Every key fact must include a citation like [1]
"""
# Example trace structure
example_trace = """
<subtask_list>
1. Analyze the historical background of [topic]
2. Examine current state-of-the-art approaches
3. Compare performance metrics across methods
</subtask_list>
<subtask>
Analyze the historical background of [topic]
</subtask>
<think>
I need to first understand the foundational work. Let me search for seminal papers.
</think>
<plan>
1. Search for early papers on [topic]
2. Crawl key reference pages
3. Synthesize timeline
</plan>
<web_search>history of [topic] seminal papers | [topic] survey 2024&serp_num=10</web_search>
<observation>
[search results here]
</observation>
<think>
Based on results, I should dig deeper into [specific aspect].
</think>
<crawl_page>https://relevant-paper-url.com</crawl_page>
<observation>
[page content]
</observation>
<subtask_answer>
[Synthesized answer for this subtask with citations [1][2]]
</subtask_answer>
<suggested_answer>
## Introduction
...
## Body
...
## Conclusion
...
## References
[1]. https://url - Paper Title
</suggested_answer>
"""Terminology
Original Abstract
The performance gap between closed-source and open-source large language models (LLMs) is largely attributed to disparities in access to high-quality training data. To bridge this gap, we introduce a novel framework for the automated synthesis of sophisticated, research-grade instructional data. Our approach centers on a multi-agent workflow where collaborative AI agents simulate complex tool-integrated reasoning to generate diverse and high-fidelity data end-to-end. Leveraging this synthesized data, we develop a two-stage training strategy that integrates supervised fine-tuning with a novel reinforcement learning method, designed to maximize model alignment and capability. Extensive experiments demonstrate that our framework empowers open-source models across multiple scales, enabling them to achieve new state-of-the-art performance on the major deep research benchmark. This work provides a scalable and effective pathway for advancing open-source LLMs without relying on proprietary data or models.