O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agentic RL
TL;DR Highlight
A multi-agent system automatically generates high-quality training data, which is then refined with RL to build a deep research system on open-source models that surpasses GPT-5 and OpenAI O3.
Who Should Read
ML engineers fine-tuning open-source LLMs as research agents or designing complex multi-step reasoning pipelines involving web search and crawling. Researchers interested in automated high-quality synthetic training data generation pipelines.
Core Mechanics
- Decomposing queries into independent subtasks for parallel processing lifts GPT-5's overall score from 42.92 (sequential execution) to 49.60, and Comprehensiveness from 40.59 to 49.61
- Using Qwen-2.5-72B-Instruct as the base model, just two training stages (SFT + GRPO: Group Relative Policy Optimization) achieved 48.48, surpassing GPT-5 (46.77) and OpenAI O3 (43.71)
- From 5,000 seed queries, candidates are generated via a multi-agent workflow, then passed through rule-based hard filtering, LLM-as-a-Judge semantic filtering, and human spot-checking, retaining only 3,500+ high-quality SFT examples
- The RL reward function is designed as quality (Rbase, weight 0.6) + tool-use appropriateness (Rtool, 0.2) + format (Rformat, 0.2); RL recovers citation accuracy, which SFT had degraded from 44.27% to 29.13%, back to 31.99%
- Extending context length from 32k to 64k yields significant performance gains; 64k to 128k shows diminishing returns — providing practical guidelines for training data length design
- 10 reasoning steps is the sweet spot: it outperforms 5 steps (48.80 → 49.61) and performs on par with 20 steps at substantially lower cost
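The planner → parallel agents → summarizer pattern described above can be sketched with a thread pool. This is an illustrative skeleton, not the paper's implementation: `call_llm`, `plan_subtasks`, and `run_subtask` are hypothetical stubs standing in for a real chat-completion client and the full Think-Search-Observe loop.

```python
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    # Stub: replace with a real chat-completion client.
    return f"[answer to: {prompt}]"

def plan_subtasks(query: str) -> list[str]:
    # Planner: decompose the query into orthogonal subtasks.
    # A fixed decomposition here, purely for illustration.
    return [f"historical background of {query}",
            f"current state-of-the-art for {query}",
            f"metric comparison for {query}"]

def run_subtask(subtask: str) -> str:
    # Each agent would run its own Think-Search-Observe loop;
    # collapsed to a single LLM call in this sketch.
    return call_llm(subtask)

def deep_research(query: str) -> str:
    subtasks = plan_subtasks(query)
    # Independent subtasks run in parallel rather than sequentially.
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        answers = list(pool.map(run_subtask, subtasks))
    # Summarizer: integrate per-subtask answers into one report.
    return call_llm("Synthesize a report from:\n" + "\n".join(answers))
```

Because the subtasks are designed to be orthogonal, the parallel map needs no coordination between agents; only the final synthesis call sees all results.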
Evidence
- O-Researcher-RL achieves a RACE score of 48.48, setting SOTA among open-source deep research models — surpassing GPT-5 (46.77), OpenAI O3 (43.71), Tongyi-Deep Research (45.66), and MiroThinker (41.79)
- On DeepResearchGym-Commercial-100, O-Researcher-72B scores: Clarity 100.00 (perfect), Insight 99.3, Citation Precision 51.45 — the highest citation precision across all categories
- Applying the parallel execution workflow brings GPT-5's performance close to Gemini-2.5-Pro Deep Research's 48.88; without it, GPT-5's score drops to 42.92, a gap of over 6 points
- Effective Citations improved ~3x from the base model (Qwen-2.5-72B-Instruct) at 8.96 to O-Researcher-RL at 26.01; overall RACE score improved from 33.38 to 48.48, a gain of +15.10 points
How to Apply
- When handling complex research queries, adopt the pattern of 'planner decomposes into subtasks → independent agents execute Think-Search-Observe loops in parallel for each subtask → summarizer integrates results' to achieve significant improvements in Comprehensiveness and Insight over single LLM prompting
- When creating agent training data, collect not just the final answer but serialize the entire trace — <subtask_list> → <think> → <plan> → <web_search> → <observation> → <subtask_answer> → <suggested_answer> — using XML tags as SFT training data
- For RL reward design, reference the weight combination of 'quality 60% + tool use appropriateness 20% + format 20%', and apply lower bound (0 points for fewer than 2 tool calls) and upper bound (-1 point for more than 8 tool calls) penalties to suppress both excessive and insufficient searching
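The reward weighting and tool-call bounds above can be sketched as follows. The quality and format sub-rewards (`r_base`, `r_format`) would come from LLM judges in practice and are taken as plain inputs here; only the tool-use term and the weighted sum are modeled.

```python
def tool_reward(n_tool_calls: int) -> float:
    # Lower bound: fewer than 2 tool calls earns no tool credit.
    if n_tool_calls < 2:
        return 0.0
    # Upper bound: more than 8 tool calls is actively penalized.
    if n_tool_calls > 8:
        return -1.0
    # Within bounds: full tool-use credit.
    return 1.0

def total_reward(r_base: float, r_format: float, n_tool_calls: int) -> float:
    # Weighted sum from the paper: quality 0.6 + tool use 0.2 + format 0.2.
    return 0.6 * r_base + 0.2 * tool_reward(n_tool_calls) + 0.2 * r_format
```

With both sub-rewards at 1.0, a trajectory with 5 tool calls scores 1.0, one with a single call drops to 0.8, and one with 9 calls drops to 0.6, so the gradient pushes against both under- and over-searching.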
Code Example
# O-Researcher style deep research prompt template
SYSTEM_PROMPT = """
You are a deep research assistant. Use the following tools to answer questions.
Available Tools:
- <web_search>query1 | query2&serp_num=10</web_search>
- <crawl_page>https://example.com</crawl_page>
Workflow:
1. Start with <subtask_list> to decompose the main query into orthogonal sub-problems
2. For each subtask, follow: <think> → <plan> → tool calls → <observation> → <subtask_answer>
3. After all subtasks, synthesize into <suggested_answer>
Rules:
- <think> must appear before any plan or tool call
- Minimum 5 tool invocations, maximum 8 per subtask
- Final answer must include Introduction, Body, Conclusion, References
- Every key fact must include a citation like [1]
"""
# Example trace structure
example_trace = """
<subtask_list>
1. Analyze the historical background of [topic]
2. Examine current state-of-the-art approaches
3. Compare performance metrics across methods
</subtask_list>
<subtask>
Analyze the historical background of [topic]
</subtask>
<think>
I need to first understand the foundational work. Let me search for seminal papers.
</think>
<plan>
1. Search for early papers on [topic]
2. Crawl key reference pages
3. Synthesize timeline
</plan>
<web_search>history of [topic] seminal papers | [topic] survey 2024&serp_num=10</web_search>
<observation>
[search results here]
</observation>
<think>
Based on results, I should dig deeper into [specific aspect].
</think>
<crawl_page>https://relevant-paper-url.com</crawl_page>
<observation>
[page content]
</observation>
<subtask_answer>
[Synthesized answer for this subtask with citations [1][2]]
</subtask_answer>
<suggested_answer>
## Introduction
...
## Body
...
## Conclusion
...
## References
[1]. https://url - Paper Title
</suggested_answer>
"""Terminology
Original Abstract
The performance gap between closed-source and open-source large language models (LLMs) is largely attributed to disparities in access to high-quality training data. To bridge this gap, we introduce a novel framework for the automated synthesis of sophisticated, research-grade instructional data. Our approach centers on a multi-agent workflow where collaborative AI agents simulate complex tool-integrated reasoning to generate diverse and high-fidelity data end-to-end. Leveraging this synthesized data, we develop a two-stage training strategy that integrates supervised fine-tuning with a novel reinforcement learning method, designed to maximize model alignment and capability. Extensive experiments demonstrate that our framework empowers open-source models across multiple scales, enabling them to achieve new state-of-the-art performance on the major deep research benchmark. This work provides a scalable and effective pathway for advancing open-source LLMs without relying on proprietary data or models.