Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
TL;DR Highlight
The MADQA benchmark (2,250 questions over 800 PDFs) shows that even top AI agents match human accuracy only through brute-force search, not the strategic navigation humans use.
Who Should Read
Researchers building document understanding and multi-document QA systems, and teams evaluating AI agents on realistic long-document tasks.
Core Mechanics
- Introduced MADQA: a benchmark of 800 PDFs with 2,250 questions requiring strategic multi-document navigation
- Questions require understanding document structure, cross-referencing sections, and strategic navigation decisions
- Top AI agents fail to replicate human-like strategic document traversal patterns
- Humans navigate documents by forming and testing hypotheses; agents tend toward naive sequential scanning
- The benchmark reveals a gap in agents' ability to build and use document structure models
- Existing RAG approaches fall short because they don't model navigational intent
Evidence
- The best agents can match human searchers in raw accuracy, but they succeed on largely different questions and fail to close a nearly 20% gap to oracle performance
- Agent navigation patterns, analyzed via retrieval traces, show brute-force search rather than strategic targeting
- Humans reach comparable accuracy with far fewer document accesses than agents
- Performance gap widens on questions requiring multi-hop reasoning across document sections
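The accuracy-effort trade-off above can be made concrete with a small trace-analysis sketch. This is illustrative only: the trace format and the repeat-rate heuristic are assumptions, not the paper's evaluation harness.

```python
from collections import Counter

def trace_stats(trace):
    """Summarize a retrieval trace: a list of (query, retrieved_pages) steps.

    Returns total document accesses (effort) and the fraction of steps that
    re-issue an already-seen query, a crude signal of unproductive loops.
    """
    queries = [q for q, _ in trace]
    accesses = sum(len(pages) for _, pages in trace)
    repeats = sum(c - 1 for c in Counter(queries).values())
    repeat_rate = repeats / len(queries) if queries else 0.0
    return {"accesses": accesses, "repeat_rate": repeat_rate}

# A human-like trace: one targeted access.
human = [("Q3 revenue table", [("rpt.pdf", 12)])]
# A brute-force trace: more accesses, repeated queries.
agent = [("revenue", [("a.pdf", 1), ("b.pdf", 2)]),
         ("revenue", [("a.pdf", 1), ("c.pdf", 7)]),
         ("revenue 2023", [("rpt.pdf", 12)])]
```

Comparing the two traces at equal answer accuracy makes the efficiency gap visible even when raw scores are tied.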
How to Apply
- Use MADQA to benchmark your document agent before deploying in production PDF-heavy workflows
- If scores are low, investigate navigation strategy — agents need explicit planning steps before diving into documents
- Consider adding document structure understanding (table of contents, section headers) as explicit context for the agent
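One lightweight way to supply that structure is to prepend a per-document outline to the agent's context. The sketch below assumes outlines in the (level, title, page) shape that tools like PyMuPDF's `Document.get_toc()` return; the rendering format is an assumption, not a prescription from the paper.

```python
def toc_context(outlines):
    """Render per-document outlines as a compact context block.

    `outlines` maps filename -> list of (level, title, page) entries.
    """
    lines = []
    for fname, entries in outlines.items():
        lines.append(f"## {fname}")
        for level, title, page in entries:
            lines.append(f"{'  ' * (level - 1)}- {title} (p.{page})")
    return "\n".join(lines)

outlines = {"annual_report.pdf": [(1, "Overview", 1),
                                  (1, "Financials", 10),
                                  (2, "Revenue", 12)]}
# Prepend toc_context(outlines) to the system prompt so the agent can jump
# straight to 'Financials > Revenue' instead of scanning pages sequentially.
```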
Code Example
# BM25 + MLLM agent core loop (Python sketch of the paper's Algorithm 1;
# render_page_as_image, img_to_base64, fallback_answer, and the vlm_client
# interface are illustrative assumptions, not the paper's exact API)
from whoosh import index, qparser
from PIL import Image
import base64

SYSTEM_PROMPT = """
You are a document QA assistant with access to a search tool.
The answer is definitely in the documents.
If search returns no results, try different terms (synonyms, abbreviations, rephrasing).
Once relevant pages are found, provide:
1. answer: list of short answer values (exact document words preferred)
2. citations: list of {file, page} dicts
"""

def bm25_agent(question: str, search_index, vlm_client, max_steps=10, top_k=5):
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": question}]
    tools = [{
        "name": "search_documents",
        "description": ("Search the document collection. Supports boolean ops "
                        "(AND/OR/NOT), quoted phrases, and wildcards (*). "
                        "Example: '\"annual report\" AND revenue'"),
        "parameters": {"query": {"type": "string"}},
    }]
    for step in range(max_steps):
        # On the last step, withhold the tool to force a structured answer.
        force_answer = (step == max_steps - 1)
        response = vlm_client.chat(
            messages=messages,
            tools=None if force_answer else tools,
        )
        if response.type == "answer":
            return response.answer, response.citations
        if response.type == "tool_call" and response.tool == "search_documents":
            query = response.args["query"]
            # BM25 retrieval, then render hits as page images for the VLM.
            results = search_index.search(query, limit=top_k)  # (file, page) tuples
            page_images = [render_page_as_image(f, p) for f, p in results]
            messages.append({"role": "tool", "content": [
                {"type": "image", "data": img_to_base64(img)}
                for img in page_images
            ]})
    return fallback_answer(messages)
# Key point: vary queries aggressively when search fails.
# In the paper's analysis, Claude Sonnet 4.5's high average query drift (0.38)
# is associated with strong performance, while near-repetition like
# GPT-4.1 Nano's (drift 0.10) degrades results.
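The drift figures above can be approximated with a simple proxy. The paper's exact metric is not specified here; this sketch assumes drift is the average token-level dissimilarity (1 - Jaccard overlap) between consecutive queries.

```python
def query_drift(queries):
    """Average 1 - Jaccard overlap between consecutive queries.

    A rough proxy for query drift: 0.0 means the agent repeats itself;
    values near 1.0 mean each reformulation is fresh.
    (The paper's actual definition may differ.)
    """
    if len(queries) < 2:
        return 0.0
    drifts = []
    for a, b in zip(queries, queries[1:]):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        drifts.append(1 - len(ta & tb) / len(ta | tb))
    return sum(drifts) / len(drifts)
```

Logging this number per episode gives a quick check on whether an agent is exploring or stuck in a loop.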
Original Abstract
Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.