WebAnchor: Anchoring Agent Planning to Stabilize Long-Horizon Web Reasoning

Jan 6, 2026 • Xinmiao Yu, Liwen Zhang, Xiaocheng Feng +4

TL;DR Highlight

A framework built on the 'plan anchor' phenomenon, in which a web search AI agent's first plan disproportionately determines overall task success, and which optimizes that first plan with reinforcement learning.

Who Should Read

ML engineers building LLM-based web search agents or Deep Research pipelines, especially developers looking to improve planning quality in multi-step agents or exploring RL fine-tuning strategies.

Core Mechanics

Discovered the 'plan anchor' phenomenon, where the first planning step has a disproportionate impact on overall task success rate: when the first step is wrong, average Pass@8 drops by 30.89% on the English BrowseComp benchmark
Conventional GRPO (a reinforcement learning algorithm) fails to properly optimize the first planning step because it distributes rewards evenly across the entire trajectory
Anchor-GRPO separates learning into planning (Stage 1) and execution (Stage 2) — Stage 1 focuses exclusively on optimizing the first step using 'Plan Rubrics' extracted from self-play experiences
Plan Rubrics score planning quality on a 0–5 scale across 6 dimensions including Goal Alignment, Subgoal Coverage, and Tool Appropriateness, used as a dense reward signal
Stage 2 optimizes the execution phase with a sparse reward based solely on final answer correctness, keeping planning and execution consistently aligned
Strong scalability confirmed: performance consistently improves as both model size (3B→30B) and context length (32k→64k) increase
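
To make the reward flow concrete, the following is a minimal sketch of the group-relative advantage at the core of GRPO, assuming the usual formulation (each rollout's reward minus the group mean, divided by the group standard deviation). The function name `group_relative_advantages`, the use of the population standard deviation, and all reward values are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]

# Stage 1 (hypothetical numbers): dense rubric rewards for first-step plans, scaled to [0, 1]
stage1_rewards = [0.9, 0.6, 0.3, 0.6]
# Stage 2 (hypothetical numbers): sparse 0/1 rewards for final answer correctness
stage2_rewards = [1.0, 0.0, 0.0, 1.0]

stage1_advantages = group_relative_advantages(stage1_rewards)
stage2_advantages = group_relative_advantages(stage2_rewards)
print(stage1_advantages)
print(stage2_advantages)
```

In this sketch the dense Stage 1 rewards separate three quality levels of plans within one group, while the sparse Stage 2 rewards can only separate correct from incorrect rollouts.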

Evidence

WebAnchor-30B achieves Pass@1 76.4% on the GAIA benchmark — surpassing both OpenAI-o3 (70.5%) and WebSailor-32B (53.2%)
WebAnchor-30B achieves Pass@1 46.0% on BrowseComp — improving over baseline GRPO (44.0%) and First-step GRPO (43.3%)
Average Pass@8 drop when the first step is wrong: BC-ZH 28.76%, BC-EN 30.89%, GAIA 23.63%
Planner Dense Reward (rubric-based) Pass@1 46.0% vs Naive Plan Reward 44.2% vs 0-1 Terminal Reward 43.3% — structured dense reward shows a clear advantage

How to Apply

When training multi-step agents with RL, introduce masked credit assignment: let gradients flow only through the first reasoning step and block them for all subsequent steps
Collect successful and failed agent trajectories, use an LLM to auto-generate plan quality rubrics, then apply this rubric-as-reward-model pipeline to RAG agents or code agents
The two-stage RL training structure (plan optimization → execution optimization) can serve as a reference design for improving stability when fine-tuning long-horizon research agents such as OpenAI Deep Research
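
The masked credit assignment idea can be sketched as a per-token loss mask. This is a simplified illustration, not the paper's implementation: it assumes a trajectory is given as per-token log-probabilities plus the token count of each step, uses a plain REINFORCE-style objective instead of GRPO's clipped ratio, and shares one scalar advantage across the trajectory. `first_step_mask` and `masked_policy_loss` are hypothetical names.

```python
def first_step_mask(step_lengths):
    """Return a 0/1 mask per token: 1 for tokens in the first step, 0 elsewhere."""
    mask = []
    for i, n in enumerate(step_lengths):
        mask.extend([1.0 if i == 0 else 0.0] * n)
    return mask

def masked_policy_loss(token_logprobs, mask, advantage):
    """REINFORCE-style loss averaged over unmasked tokens only."""
    active = sum(mask)
    if active == 0:
        return 0.0
    total = sum(-advantage * lp * m for lp, m in zip(token_logprobs, mask))
    return total / active

# Hypothetical trajectory: a 3-token plan followed by two execution steps.
step_lengths = [3, 2, 4]
token_logprobs = [-0.2, -0.5, -0.3, -1.0, -0.7, -0.4, -0.9, -0.1, -0.6]

mask = first_step_mask(step_lengths)
loss = masked_policy_loss(token_logprobs, mask, advantage=1.5)
print(mask)
print(loss)
```

In an autograd framework the same mask would be multiplied into the per-token loss tensor, so tokens after the first step contribute zero gradient.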

Code Example

# Example Plan Rubrics evaluation prompt (based on paper Appendix A.2.2)
prompt = """
You are tasked with evaluating the following plan for a web information seeking task.
Score each dimension from 0 to 5:

- Goal Alignment: Does the plan focus on what the user actually wants?
- Subgoal Coverage: How thoroughly is the problem broken down into subtasks?
- Tool Appropriateness: Are the correct sources/tools selected?
- Logical Ordering: Is the reasoning flow natural and efficient?
- Actionability: Are the instructions concrete and actually executable?
- Clarity and Conciseness: Is it easy to read and follow?

Query: {query}
Plan: {plan}

Output JSON:
{{
  "Goal Alignment": {{"score": 0-5, "comment": "..."}},
  "Subgoal Coverage": {{"score": 0-5, "comment": "..."}},
  "Tool Appropriateness": {{"score": 0-5, "comment": "..."}},
  "Logical Ordering": {{"score": 0-5, "comment": "..."}},
  "Actionability": {{"score": 0-5, "comment": "..."}},
  "Clarity and Conciseness": {{"score": 0-5, "comment": "..."}},
  "total_score": 0-30,
  "overall_comment": "..."
}}
"""

# Usage example (requires the `anthropic` package and an ANTHROPIC_API_KEY environment variable)
from anthropic import Anthropic
client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": prompt.format(
            query="Who was the first person to climb Everest without oxygen?",
            plan="1. Search for Everest no-oxygen summit records. 2. Check Wikipedia and mountaineering databases. 3. Verify date and nationality."
        )
    }]
)
print(response.content[0].text)
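
To use the judge's output as a dense reward, the JSON has to be parsed and reduced to a scalar. The sketch below shows one way to do that under stated assumptions: the judge returns the JSON object requested by the prompt (possibly surrounded by extra text), all six dimensions are weighted equally, and the sum is divided by 30 to land in [0, 1]. This summary does not specify how the paper aggregates rubric scores, so `rubric_reward`, the equal weighting, and the zero reward for unparseable output are illustrative choices.

```python
import json
import re

DIMENSIONS = [
    "Goal Alignment", "Subgoal Coverage", "Tool Appropriateness",
    "Logical Ordering", "Actionability", "Clarity and Conciseness",
]

def rubric_reward(judge_text):
    """Parse the outermost {...} span of the judge output and map it to [0, 1]."""
    match = re.search(r"\{.*\}", judge_text, flags=re.DOTALL)
    if match is None:
        return 0.0  # assumption: unparseable judgments earn no reward
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return 0.0
    total = 0.0
    for name in DIMENSIONS:
        entry = data.get(name)
        raw = entry.get("score", 0) if isinstance(entry, dict) else 0
        try:
            score = float(raw)
        except (TypeError, ValueError):
            score = 0.0
        total += min(max(score, 0.0), 5.0)  # clamp to the 0-5 scale
    return total / (5.0 * len(DIMENSIONS))

# Hypothetical judge output with every dimension scored 4 out of 5.
sample = "Evaluation:\n" + json.dumps(
    {name: {"score": 4, "comment": "ok"} for name in DIMENSIONS}
)
print(rubric_reward(sample))
```

Recomputing the total from the per-dimension scores, rather than trusting the judge's own `total_score` field, avoids arithmetic slips by the judge model.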

Terminology

GRPO: Group Relative Policy Optimization. A reinforcement learning technique that generates multiple responses simultaneously and trains the model by comparing better and worse outputs against each other.

Long-Horizon: A lengthy task that requires many steps to complete. 'What should I have for lunch today' is short-horizon; 'start a company and take it public' is long-horizon.

Plan Anchor: A phenomenon discovered in this paper. The first planning step fixes the direction of the entire task like an anchor, just as buttoning the first button wrong means everything that follows will be misaligned no matter how well you do it.

Sparse Reward: A reward scheme that ignores intermediate steps and only rewards the final outcome. Like in Go, giving no score for individual moves and only awarding +1 point for winning the game.

Dense Reward: A reward scheme that provides fine-grained feedback at every step. Intermediate plan quality is also converted into a score and used as a training signal.

ReAct: A portmanteau of Reasoning + Acting. An agent pattern where an LLM iteratively cycles through Thought → Action → Observation to solve problems.

Pass@1: The probability of getting the correct answer on a single attempt. Pass@3 is the probability of getting it right at least once in three attempts.
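
For reference, Pass@k is commonly estimated from n sampled attempts, of which c are correct, with the unbiased estimator 1 - C(n-c, k)/C(n, k) popularized by the HumanEval benchmark. The sketch below assumes that estimator; this summary does not state how the paper computes its Pass@1 and Pass@8 figures, and the sample counts are made up.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one of k samples is correct),
    given c correct answers among n sampled attempts."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct answer
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 8 attempts on one question, 2 of them correct.
print(pass_at_k(n=8, c=2, k=1))
print(pass_at_k(n=8, c=2, k=3))
```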

Original Abstract

Large Language Model (LLM)-based agents have shown strong capabilities in web information seeking, with reinforcement learning (RL) becoming a key optimization paradigm. However, planning remains a bottleneck, as existing methods struggle with long-horizon strategies. Our analysis reveals a critical phenomenon, plan anchor, where the first reasoning step disproportionately impacts downstream behavior in long-horizon web reasoning tasks. Current RL algorithms fail to account for this by uniformly distributing rewards across the trajectory. To address this, we propose Anchor-GRPO, a two-stage RL framework that decouples planning and execution. In Stage 1, the agent optimizes its first-step planning using fine-grained rubrics derived from self-play experiences and human calibration. In Stage 2, execution is aligned with the initial plan through sparse rewards, ensuring stable and efficient tool usage. We evaluate Anchor-GRPO on four benchmarks: BrowseComp, BrowseComp-Zh, GAIA, and XBench-DeepSearch. Across models from 3B to 30B, Anchor-GRPO outperforms baseline GRPO and First-step GRPO, improving task success and tool efficiency. Notably, WebAnchor-30B achieves 46.0% pass@1 on BrowseComp and 76.4% on GAIA. Anchor-GRPO also demonstrates strong scalability, getting higher accuracy as model size and context length increase.