Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

Apr 13, 2026 • Yoonsang Lee, Howard Yen, Xi Ye +1

TL;DR Highlight

Instead of taking a simple vote, a separate agent explores and synthesizes the trajectories produced in parallel by multiple AI agents, which improves accuracy.

Who Should Read

ML engineers who run deep research systems or complex multi-step web search agents and are deciding how to combine the results of parallel runs. Backend developers designing production systems that run multiple LLM agents simultaneously.

Core Mechanics

Voting only on the final answers of agents run in parallel discards crucial information from their intermediate reasoning, while putting every full trajectory into context exceeds the token limit.
AggAgent turns aggregation itself into an agent task: it explores a set of completed trajectories on demand with three lightweight tools, get_solution, search_trajectory, and get_segment.
Coarse-to-fine strategy: first scan the final answers of all trajectories to identify agreements and disagreements, then keyword-search the suspicious parts and read the matching segments to cross-check the agent's reasoning against the tool observations (treated as ground truth).
AggAgent can synthesize the correct answer by cross-referencing partial clues from each trajectory even when all 8 trajectories are incorrect. This is the key advantage of synthesis over simple selection.
Aggregation cost stays bounded by a single agent rollout: at K=8, AggAgent adds about 5.7% overhead, whereas Summary Aggregation requires K LLM calls and adds 41%.
Using a stronger model as the aggregator and weaker models for the parallel rollouts is effective: on BrowseComp-Plus, rolling out with GLM-4.7-Flash and aggregating with MiniMax-M2.5 exceeds Pass@8.
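
The three tools named above can be pictured as read-only accessors over the stored trajectories. The sketch below is illustrative only: the tool names come from the paper summary, but the `Step`, `Trajectory`, and `TrajectoryStore` types, the argument names, and the plain substring search are assumptions, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str  # assumed labels: "thought", "tool_call", or "observation"
    text: str


@dataclass
class Trajectory:
    steps: list[Step]
    solution: str  # final answer produced by this rollout


class TrajectoryStore:
    """Hypothetical read-only environment over K completed trajectories."""

    def __init__(self, trajectories: list[Trajectory]):
        self.trajectories = trajectories

    def get_solution(self, traj_id: int) -> str:
        """Return only the final answer of one trajectory."""
        return self.trajectories[traj_id].solution

    def search_trajectory(self, traj_id: int, keyword: str) -> list[int]:
        """Return indices of steps whose text contains the keyword (case-insensitive)."""
        needle = keyword.lower()
        steps = self.trajectories[traj_id].steps
        return [i for i, step in enumerate(steps) if needle in step.text.lower()]

    def get_segment(self, traj_id: int, start: int, end: int) -> list[Step]:
        """Return the steps in the half-open range [start, end) of one trajectory."""
        return self.trajectories[traj_id].steps[start:end]


store = TrajectoryStore([
    Trajectory([Step("thought", "Check the founding year"),
                Step("observation", "The page says it was founded in 1998")], "1998"),
    Trajectory([Step("thought", "I recall 1999")], "1999"),
])
hits = store.search_trajectory(0, "founded")       # step indices worth reading
evidence = store.get_segment(0, hits[0], hits[0] + 1)
```

Because each call returns only a small slice, the aggregator's context holds the answers and the few segments it chose to read rather than all K trajectories.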

Evidence

Averaged across six benchmarks, AggAgent beats the strongest existing method, Solution Aggregation, by up to 5.3 points, and by up to 10.3 points on the two deep research tasks.
With GLM-4.7-Flash at K=8, AggAgent improves on Pass@1 by 13.3 to 17.9 points on average, e.g., Healthbench-Hard 8.67 → 27.99 and ResearchRubrics 37.47 → 45.31.
At K=8, AggAgent's additional aggregation cost is only 5.7% of the rollout cost, far below Summary Aggregation (41%) and close to Solution Aggregation (3.7%).
With a strong aggregator (MiniMax-M2.5), AggAgent reaches 72.67 on BrowseComp-Plus, exceeding Pass@8 (72.00). This shows that synthesis can surpass the best individual rollout.
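
To make the overhead figures concrete, the arithmetic below converts them into rollout-equivalents. It assumes each percentage is measured against the combined cost of all K=8 rollouts; that reading is an interpretation of the summary, and the function name is made up for illustration.

```python
def total_cost_in_rollouts(k: int, overhead_fraction: float) -> float:
    """Total cost in units of one rollout: K rollouts plus aggregation overhead.

    Assumption: overhead_fraction is relative to the combined cost of all K rollouts.
    """
    return k * (1.0 + overhead_fraction)


# Overhead fractions at K=8 as reported in the summary above.
overheads = {
    "Solution Aggregation": 0.037,
    "AggAgent": 0.057,
    "Summary Aggregation": 0.41,
}

for name, fraction in overheads.items():
    total = total_cost_in_rollouts(8, fraction)
    print(f"{name}: {total:.2f} rollout-equivalents in total")
```

Under this assumption AggAgent's aggregation step costs 8 × 0.057 ≈ 0.46 of one rollout, which is consistent with the claim that aggregation stays bounded by a single rollout.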

How to Apply

Run K agents in parallel, then launch a separate aggregator agent that first scans all final answers with get_solution and uses search_trajectory and get_segment to verify the actual tool observations only for trajectories that disagree. This avoids loading entire trajectories into context, which keeps token cost from growing linearly with the number of trajectories.
An asymmetric setup that separates the rollout model (a cheap, small model) from the aggregation model (a more powerful model) can improve cost-performance. Example: 8 parallel GLM-4.7-Flash rollouts plus 1 MiniMax-M2.5 aggregation.
For open-ended tasks such as deep research, where the answer is spread across multiple trajectories, specify the solution field of the finish tool as a long-form report so the aggregator operates in synthesis mode instead of simply choosing the best-looking trajectory. The prompt in Appendix B of the paper can be used as is.
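
The coarse first pass described above can be sketched as a plain function that groups final answers and flags the trajectories worth a closer look. This is a simplification under stated assumptions: the `normalize` and `coarse_scan` helpers are hypothetical, exact string matching after normalization stands in for the aggregator's own judgment, and the most common answer is used only to decide where to look, not as the final answer.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(answer.lower().split())


def coarse_scan(solutions: list[str]) -> tuple[str, list[int]]:
    """Return the most common normalized answer and the ids that disagree with it."""
    if not solutions:
        raise ValueError("need at least one solution")
    normalized = [normalize(s) for s in solutions]
    top_answer, _ = Counter(normalized).most_common(1)[0]
    dissenting = [i for i, ans in enumerate(normalized) if ans != top_answer]
    return top_answer, dissenting


top, to_inspect = coarse_scan(["Paris", " paris ", "Lyon", "PARIS"])
# to_inspect lists the trajectory ids to examine with search_trajectory / get_segment
```

When every answer agrees, `to_inspect` is empty and no fine-grained reads are needed; otherwise only the flagged trajectories are opened.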

Code Example

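
A minimal sketch of the overall flow, assuming the caller supplies two callables: `rollout_fn` (runs one agent and returns its trajectory) and `aggregate_fn` (runs the aggregator over all trajectories). Both callables, their signatures, and the function name `parallel_then_aggregate` are hypothetical; only the model names and K=8 come from the summary above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def parallel_then_aggregate(
    question: str,
    rollout_fn: Callable[[str, str, int], Any],          # (model, question, seed) -> trajectory
    aggregate_fn: Callable[[str, str, list[Any]], Any],  # (model, question, trajectories) -> answer
    k: int = 8,
    rollout_model: str = "GLM-4.7-Flash",    # cheap model for the K rollouts
    aggregator_model: str = "MiniMax-M2.5",  # stronger model for the single aggregation
) -> Any:
    """Run K independent rollouts in parallel, then one aggregation pass."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        futures = [
            pool.submit(rollout_fn, rollout_model, question, seed)
            for seed in range(k)
        ]
        # Collect in submission order so trajectory ids stay stable.
        trajectories = [future.result() for future in futures]
    return aggregate_fn(aggregator_model, question, trajectories)
```

Threads are used here on the assumption that rollouts are dominated by network calls to a model API; a different executor may suit other setups.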

Terminology

trajectory: A continuous record of an agent's thoughts, tool calls, and observations during problem-solving. Similar to a person's complete problem-solving notebook.

parallel scaling: A method that raises the probability of finding the correct answer by having multiple agents solve the same problem simultaneously and independently. Similar to having several people each take a test and then pooling their answers.

test-time scaling: A strategy that improves performance at inference time by using more computation instead of further training the model. Similar to allowing more time to think during a test.

long-horizon agentic task: A complex task that requires dozens to hundreds of rounds of web search, document reading, code execution, and similar actions. Unlike simple Q&A, it requires planning and execution over multiple steps.

majority voting: A method of selecting the final answer from multiple model outputs by choosing the most frequent answer. The same principle as a simple majority vote.

Best-of-N (BoN): A method of selecting, among N outputs, the answer the model itself rates as most confident. A self-scoring method.
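
The two selection rules just defined can be contrasted in a few lines. This is a toy illustration: the function names are made up, and the `confidences` list stands in for whatever self-reported scores a model would produce.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Pick the most frequent answer (ties go to the answer seen first)."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: list[str], confidences: list[float]) -> str:
    """Pick the answer with the highest self-assigned confidence score."""
    if not answers or len(answers) != len(confidences):
        raise ValueError("answers and confidences must be non-empty and aligned")
    best_index = max(range(len(answers)), key=lambda i: confidences[i])
    return answers[best_index]


answers = ["A", "A", "B"]
confidences = [0.4, 0.5, 0.9]  # hypothetical self-scores
voted = majority_vote(answers)
chosen = best_of_n(answers, confidences)
```

Both rules only select among existing final answers; neither looks inside the trajectories that produced them.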

ROUGE-L: A metric for measuring the similarity between two texts. It scores similarity based on the longest common subsequence of words shared by the two texts.
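
A compact way to see what ROUGE-L measures is to compute it directly from the longest common subsequence (LCS). The sketch assumes whitespace tokenization, lowercasing, and an equally weighted F-measure; published evaluations may tokenize or weight precision and recall differently.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the harmonic mean of LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
```

In this example the shared subsequence is "the cat on the mat" (5 of 6 words on each side), so precision and recall are both 5/6.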

LLM-as-a-judge: A method in which another LLM, rather than a person, evaluates the quality of model outputs. Typically a powerful model such as GPT-4 serves as the grader.

Related Resources

AggAgent GitHub Repository

Original Abstract

We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, where multiple rollouts are generated in parallel and aggregated into a final response. While such scaling has proven effective for chain-of-thought reasoning, agentic tasks pose unique challenges: trajectories are long, multi-turn, and tool-augmented, and outputs are often open-ended. Aggregating only final answers discards rich information from trajectories, while concatenating all trajectories exceeds the model's context window. To address this, we propose AggAgent, an aggregation agent that treats parallel trajectories as an environment. We equip it with lightweight tools to inspect candidate solutions and search across trajectories, enabling it to navigate and synthesize information on demand. Across six benchmarks and three model families (GLM-4.7, Qwen3.5, MiniMax-M2.5), AggAgent outperforms all existing aggregation methods (by up to 5.3% absolute on average and 10.3% on two deep research tasks) while adding minimal overhead, as the aggregation cost remains bounded by a single agentic rollout. Our findings establish agentic aggregation as an effective and cost-efficient approach to parallel test-time scaling.