WebAnchor: Anchoring Agent Planning to Stabilize Long-Horizon Web Reasoning
TL;DR Highlight
Identifies the 'plan anchor' phenomenon, where a web search agent's first planning step disproportionately determines overall task success, and introduces Anchor-GRPO, a two-stage reinforcement learning framework that optimizes that step.
Who Should Read
ML engineers building LLM-based web search agents or Deep Research pipelines, especially those looking to improve planning quality in multi-step agents or exploring RL fine-tuning strategies.
Core Mechanics
- Discovered the 'plan anchor' phenomenon: the first planning step has a disproportionate impact on overall task success. When the first step is wrong, average Pass@8 drops by 30.89% on the English BrowseComp benchmark
- Conventional GRPO (a reinforcement learning algorithm) fails to properly optimize the first planning step because it distributes rewards evenly across the entire trajectory
- Anchor-GRPO separates learning into planning (Stage 1) and execution (Stage 2) — Stage 1 focuses exclusively on optimizing the first step using 'Plan Rubrics' extracted from self-play experiences
- Plan Rubrics score planning quality on a 0–5 scale across 6 dimensions including Goal Alignment, Subgoal Coverage, and Tool Appropriateness, used as a dense reward signal
- Stage 2 optimizes the execution phase with a sparse reward based solely on final answer correctness, keeping planning and execution consistently aligned
- Strong scalability confirmed: performance consistently improves as both model size (3B→30B) and context length (32k→64k) increase
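The two reward signals described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the equal weighting of the six rubric dimensions, the normalization to [0, 1], and the function names `rubric_reward`, `group_relative_advantages`, and `stage2_reward` are all assumptions. The input is assumed to be a judge's JSON in the layout shown in the Code Example section. The advantage step follows the usual GRPO recipe of standardizing rewards within a sampled group; population standard deviation is used here, and implementations differ on that detail.

```python
import json
from statistics import mean, pstdev

# Dimension names follow the Code Example section; equal weighting is an assumption.
DIMENSIONS = (
    "Goal Alignment",
    "Subgoal Coverage",
    "Tool Appropriateness",
    "Logical Ordering",
    "Actionability",
    "Clarity and Conciseness",
)
MAX_SCORE = 5


def rubric_reward(judge_json: str) -> float:
    """Stage 1 (hypothetical): map per-dimension 0-5 scores to a dense reward in [0, 1]."""
    scores = json.loads(judge_json)
    total = 0.0
    for name in DIMENSIONS:
        # Clamp so a malformed judge output cannot push the reward out of range.
        total += min(max(float(scores[name]["score"]), 0.0), MAX_SCORE)
    return total / (MAX_SCORE * len(DIMENSIONS))


def stage2_reward(answer_is_correct: bool) -> float:
    """Stage 2 (hypothetical): sparse terminal reward based only on the final answer."""
    return 1.0 if answer_is_correct else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, Stage 1 would feed `rubric_reward` values for a group of sampled first-step plans into `group_relative_advantages`, while Stage 2 would do the same with `stage2_reward` values for full trajectories.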
Evidence
- WebAnchor-30B achieves Pass@1 76.4% on the GAIA benchmark — surpassing both OpenAI-o3 (70.5%) and WebSailor-32B (53.2%)
- WebAnchor-30B achieves Pass@1 46.0% on BrowseComp — improving over baseline GRPO (44.0%) and First-step GRPO (43.3%)
- Average Pass@8 drop when the first step is wrong: BC-ZH 28.76%, BC-EN 30.89%, GAIA 23.63%
- Planner Dense Reward (rubric-based) Pass@1 46.0% vs Naive Plan Reward 44.2% vs 0-1 Terminal Reward 43.3% — structured dense reward shows a clear advantage
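For readers reproducing numbers like Pass@1 and Pass@8, the sketch below shows the widely used unbiased pass@k estimator from code-generation evaluation. The summary does not state how the paper computes these metrics, so treating this estimator as the one behind the figures above is an assumption; `pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled rollouts, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n rollouts contains no correct rollout.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `k=1` the estimate reduces to the fraction of correct rollouts, `c / n`.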
How to Apply
- When training multi-step agents with RL, consider masked credit assignment for the planning stage: let gradients flow only through the tokens of the first reasoning step and block them for all subsequent steps
- Collect successful and failed agent trajectories, use an LLM to auto-generate plan quality rubrics, then apply this rubric-as-reward-model pipeline to RAG agents or code agents
- The two-stage RL training structure (plan optimization, then execution optimization) can serve as a reference design for stabilizing fine-tuning of long-horizon research agents in the style of OpenAI Deep Research
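The masked credit assignment idea from the first bullet can be illustrated with a dependency-free sketch. It assumes each generated token carries a step index (0 for the first reasoning step) and uses a plain policy-gradient objective; the clipping and KL terms of a full GRPO loss are omitted, and `first_step_mask` and `masked_pg_loss` are hypothetical names. In a real trainer these would be tensor operations, with the mask multiplied into the per-token loss.

```python
def first_step_mask(step_ids: list[int]) -> list[int]:
    """Return 1 for tokens in the first reasoning step (step id 0), else 0."""
    return [1 if s == 0 else 0 for s in step_ids]


def masked_pg_loss(logprobs: list[float], step_ids: list[int], advantage: float) -> float:
    """Policy-gradient loss averaged over first-step tokens only."""
    mask = first_step_mask(step_ids)
    n_active = sum(mask)
    if n_active == 0:
        return 0.0
    # Tokens with mask 0 contribute nothing to the sum, so in an autograd
    # framework they would receive no gradient from this loss.
    weighted = sum(lp * m for lp, m in zip(logprobs, mask))
    return -advantage * weighted / n_active
```

Changing the log-probability of any later-step token leaves the loss unchanged, which is the property the masking is meant to provide.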
Code Example
# Example Plan Rubrics evaluation prompt (based on paper Appendix A.2.2)
prompt = """
You are tasked with evaluating the following plan for a web information seeking task.
Score each dimension from 0 to 5:
- Goal Alignment: Does the plan focus on what the user actually wants?
- Subgoal Coverage: How thoroughly is the problem broken down into subtasks?
- Tool Appropriateness: Are the correct sources/tools selected?
- Logical Ordering: Is the reasoning flow natural and efficient?
- Actionability: Are the instructions concrete and actually executable?
- Clarity and Conciseness: Is it easy to read and follow?
Query: {query}
Plan: {plan}
Output JSON:
{{
"Goal Alignment": {{"score": 0-5, "comment": "..."}},
"Subgoal Coverage": {{"score": 0-5, "comment": "..."}},
"Tool Appropriateness": {{"score": 0-5, "comment": "..."}},
"Logical Ordering": {{"score": 0-5, "comment": "..."}},
"Actionability": {{"score": 0-5, "comment": "..."}},
"Clarity and Conciseness": {{"score": 0-5, "comment": "..."}},
"total_score": 0-30,
"overall_comment": "..."
}}
"""
# Usage example (requires the `anthropic` package and an API key in the environment)
from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": prompt.format(
            query="Who was the first person to climb Everest without oxygen?",
            plan="1. Search for Everest no-oxygen summit records. 2. Check Wikipedia and mountaineering databases. 3. Verify date and nationality.",
        ),
    }],
)
print(response.content[0].text)
Original Abstract
Large Language Model (LLM)-based agents have shown strong capabilities in web information seeking, with reinforcement learning (RL) becoming a key optimization paradigm. However, planning remains a bottleneck, as existing methods struggle with long-horizon strategies. Our analysis reveals a critical phenomenon, plan anchor, where the first reasoning step disproportionately impacts downstream behavior in long-horizon web reasoning tasks. Current RL algorithms fail to account for this by uniformly distributing rewards across the trajectory. To address this, we propose Anchor-GRPO, a two-stage RL framework that decouples planning and execution. In Stage 1, the agent optimizes its first-step planning using fine-grained rubrics derived from self-play experiences and human calibration. In Stage 2, execution is aligned with the initial plan through sparse rewards, ensuring stable and efficient tool usage. We evaluate Anchor-GRPO on four benchmarks: BrowseComp, BrowseComp-Zh, GAIA, and XBench-DeepSearch. Across models from 3B to 30B, Anchor-GRPO outperforms baseline GRPO and First-step GRPO, improving task success and tool efficiency. Notably, WebAnchor-30B achieves 46.0% pass@1 on BrowseComp and 76.4% on GAIA. Anchor-GRPO also demonstrates strong scalability, getting higher accuracy as model size and context length increase.