Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks
TL;DR Highlight
A method that improves accuracy by having a separate aggregator agent explore and synthesize the trajectories produced by multiple parallel AI agents, instead of taking a simple vote over their final answers.
Who Should Read
ML engineers who operate deep research systems or complex multi-step web search agents and need a way to combine the results of parallel runs. Backend developers designing production systems that run multiple LLM agents simultaneously.
Core Mechanics
- Simply voting on the final answers of agents executed in parallel discards crucial information from their intermediate reasoning, while concatenating every full trajectory into the context exceeds the model's context window.
- AggAgent turns aggregation itself into an agent task, in which the aggregator explores the completed trajectories on demand with three lightweight tools: get_solution, search_trajectory, and get_segment.
- Coarse-to-fine strategy: first scan the final answers of all trajectories to find agreements and disagreements, then run keyword searches on the suspicious parts and read the matching segments to cross-check the agent's reasoning against the tool observations, which serve as ground truth.
- AggAgent can synthesize the correct answer by cross-referencing partial clues from each trajectory even when all 8 trajectories are individually incorrect. This is a key advantage of synthesis over simple selection.
- Aggregation cost stays bounded by roughly one agent rollout regardless of K: AggAgent's overhead is about 5.7% of the rollout cost at K=8, whereas Summary Aggregation requires K LLM calls and incurs a 41% overhead.
- Using a stronger model as the aggregator and weaker models for the parallel rollouts is effective: on BrowseComp-Plus, rolling out with GLM-4.7-Flash and aggregating with MiniMax-M2.5 exceeds Pass@8.
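The three tools named above can be pictured as read-only accessors over the stored trajectories. The following is a minimal sketch, not the paper's implementation: the `Step` and `Trajectory` record layout, the tool signatures, and the case-insensitive substring search are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    role: str      # assumed labels: "reasoning" (agent text) or "observation" (tool output)
    content: str


@dataclass
class Trajectory:
    steps: list[Step]
    solution: str  # final answer produced by this rollout


class TrajectoryTools:
    """Hypothetical read-only tool set exposed to the aggregator agent."""

    def __init__(self, trajectories: list[Trajectory]):
        self.trajectories = trajectories

    def get_solution(self, traj_id: int) -> str:
        """Coarse view: return only the final answer of one trajectory."""
        return self.trajectories[traj_id].solution

    def search_trajectory(self, traj_id: int, keyword: str) -> list[int]:
        """Return indices of steps whose content contains the keyword (case-insensitive)."""
        kw = keyword.lower()
        steps = self.trajectories[traj_id].steps
        return [i for i, step in enumerate(steps) if kw in step.content.lower()]

    def get_segment(self, traj_id: int, start: int, end: int) -> list[Step]:
        """Fine view: return steps[start:end] of one trajectory."""
        return self.trajectories[traj_id].steps[start:end]


# Toy data: two rollouts that disagree on the final answer.
trajs = [
    Trajectory(
        steps=[
            Step("reasoning", "Search for the founding year."),
            Step("observation", "The page says the lab was founded in 1998."),
            Step("reasoning", "So the answer is 1998."),
        ],
        solution="1998",
    ),
    Trajectory(
        steps=[Step("reasoning", "I recall it was founded in 2001.")],
        solution="2001",
    ),
]
tools = TrajectoryTools(trajs)
hits = tools.search_trajectory(0, "founded")      # step indices that mention the keyword
segment = tools.get_segment(0, hits[0], hits[0] + 1)
```

In this toy case the first trajectory's answer is backed by a tool observation, while the second rests only on the agent's recollection. That is the kind of distinction the cross-validation step is meant to surface.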
Evidence
- Averaged across 6 benchmarks, AggAgent beats the strongest existing method, Solution Aggregation, by up to 5.3 points, and by up to 10.3 points on the two deep research tasks.
- With GLM-4.7-Flash at K=8, AggAgent improves over Pass@1 by an average of 13.3 to 17.9 points, e.g., Healthbench-Hard 8.67 → 27.99 and ResearchRubrics 37.47 → 45.31.
- The additional aggregation cost of AggAgent at K=8 is only 5.7% of the rollout cost, much cheaper than Summary Aggregation (41%) and similar to Solution Aggregation (3.7%).
- With a strong aggregator (MiniMax-M2.5), AggAgent reaches 72.67 on BrowseComp-Plus, exceeding Pass@8 (72.00). This shows that synthesis can surpass the best individual rollout.
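To make the reported figures concrete, the arithmetic can be worked through in a few lines. This sketch uses only numbers quoted in this summary. The budget of 100.0 cost units is a hypothetical value chosen for readability, and the helper names are illustrative.

```python
# Figures quoted in the text above; nothing here is newly measured.
OVERHEAD_AT_K8 = {
    "Solution Aggregation": 0.037,
    "AggAgent": 0.057,
    "Summary Aggregation": 0.41,
}
PASS1_TO_AGGAGENT = {
    "Healthbench-Hard": (8.67, 27.99),
    "ResearchRubrics": (37.47, 45.31),
}


def total_cost(rollout_cost: float, method: str) -> float:
    """Rollout cost plus aggregation overhead, in the same (arbitrary) units."""
    return rollout_cost * (1.0 + OVERHEAD_AT_K8[method])


def gain(benchmark: str) -> float:
    """Score difference between AggAgent and Pass@1, rounded to two decimals."""
    before, after = PASS1_TO_AGGAGENT[benchmark]
    return round(after - before, 2)


# Hypothetical budget: 100.0 cost units spent on the K=8 rollouts.
for method in OVERHEAD_AT_K8:
    print(f"{method}: {total_cost(100.0, method):.1f} units in total")
for name in PASS1_TO_AGGAGENT:
    print(f"{name}: +{gain(name):.2f} points over Pass@1")
```

With these inputs the two example benchmarks gain 19.32 and 7.84 points respectively, and Summary Aggregation's overhead is roughly seven times AggAgent's (41% versus 5.7%).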
How to Apply
- Run K parallel agents, then launch a separate aggregator agent that first scans all final answers with get_solution and then uses search_trajectory and get_segment to verify the actual tool observations only for trajectories that disagree. This keeps full trajectories out of the context, so the aggregator's token cost does not grow linearly with K.
- Using an asymmetric strategy that separates the rollout model (cheap, small model) from the aggregation model (more powerful model) can improve cost-performance. Example: 8 parallel GLM-4.7-Flash + 1 MiniMax-M2.5 aggregation.
- For open-ended tasks like deep research, where the answer is distributed across multiple trajectories, specify the solution field of the finish tool as a long-form report so that the aggregator operates in synthesis mode instead of simply choosing the best-looking trajectory. The prompt from the paper's Appendix B can be used as is.
Code Example
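A minimal sketch of the workflow described under How to Apply, covering parallel rollouts and the coarse scan. It is not the authors' code. `run_rollout` is a hypothetical stub that returns canned answers where a real system would call an LLM agent with tools, and `select_for_inspection` hard-codes a simple rule (one representative per conflicting answer) where the real aggregator would decide for itself which trajectories to search and read.

```python
import asyncio

K = 8  # number of parallel rollouts


async def run_rollout(question: str, seed: int) -> dict:
    """Stand-in for one agent rollout (hypothetical); returns a canned trajectory."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    answer = "Paris" if seed % 4 else "Lyon"  # canned disagreement for illustration
    return {
        "id": seed,
        "solution": answer,
        "steps": [f"observation: source {seed} names {answer} as the capital"],
    }


def coarse_scan(trajectories: list[dict]) -> dict[str, list[int]]:
    """Coarse step: read only the final answers and group trajectory ids by answer."""
    groups: dict[str, list[int]] = {}
    for traj in trajectories:
        groups.setdefault(traj["solution"], []).append(traj["id"])
    return groups


def select_for_inspection(groups: dict[str, list[int]]) -> list[int]:
    """Fine step (assumed rule): on disagreement, pick one trajectory per distinct answer."""
    if len(groups) <= 1:
        return []  # unanimous answers: no trajectory needs to be opened
    return sorted(ids[0] for ids in groups.values())


async def main(question: str):
    # Launch all K rollouts concurrently; gather preserves submission order.
    trajectories = await asyncio.gather(
        *(run_rollout(question, seed) for seed in range(K))
    )
    groups = coarse_scan(trajectories)
    return groups, select_for_inspection(groups)


groups, to_inspect = asyncio.run(main("What is the capital of France?"))
print(groups)
print("inspect:", to_inspect)
```

Only the trajectories listed in `to_inspect` would then be searched and read segment by segment, which is what keeps the aggregator's context small compared with concatenating all K trajectories.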
Terminology
Related Resources
Original Abstract
We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, where multiple rollouts are generated in parallel and aggregated into a final response. While such scaling has proven effective for chain-of-thought reasoning, agentic tasks pose unique challenges: trajectories are long, multi-turn, and tool-augmented, and outputs are often open-ended. Aggregating only final answers discards rich information from trajectories, while concatenating all trajectories exceeds the model's context window. To address this, we propose AggAgent, an aggregation agent that treats parallel trajectories as an environment. We equip it with lightweight tools to inspect candidate solutions and search across trajectories, enabling it to navigate and synthesize information on demand. Across six benchmarks and three model families (GLM-4.7, Qwen3.5, MiniMax-M2.5), AggAgent outperforms all existing aggregation methods, by up to 5.3% absolute on average and 10.3% on two deep research tasks, while adding minimal overhead, as the aggregation cost remains bounded by a single agentic rollout. Our findings establish agentic aggregation as an effective and cost-efficient approach to parallel test-time scaling.