Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
TL;DR Highlight
The MADQA benchmark (2,250 questions over 800 PDFs) shows that even top AI agents match human accuracy only through brute-force search, not the strategic navigation humans use.
Who Should Read
Researchers building document understanding and multi-document QA systems, and teams evaluating AI agents on realistic long-document tasks.
Core Mechanics
- Introduced MADQA: a benchmark of 800 PDFs with 2,250 questions requiring strategic multi-document navigation
- Questions require understanding document structure, cross-referencing sections, and strategic navigation decisions
- Top AI agents fail to replicate human-like strategic document traversal patterns
- Humans navigate documents by forming and testing hypotheses; agents tend toward naive sequential scanning
- The benchmark reveals a gap in agents' ability to build and use document structure models
- Existing RAG approaches fall short because they don't model navigational intent
Evidence
- The best agents can match human searchers in raw accuracy, but they succeed on largely different questions and fail to close a nearly 20% gap to oracle performance
- Agent navigation patterns, analyzed via retrieval traces, show brute-force search rather than strategic targeting
- Humans reach comparable accuracy with far fewer document accesses than agents
- Performance gap widens on questions requiring multi-hop reasoning across document sections
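The accuracy-effort trade-off above can be made concrete with a small trace-analysis sketch. This is illustrative only: the trace format and the repeat-rate heuristic are assumptions, not the paper's evaluation harness.

```python
from collections import Counter

def trace_stats(trace):
    """Summarize a retrieval trace: a list of (query, retrieved_pages) steps.

    Returns total document accesses (effort) and the fraction of steps that
    re-issue an already-seen query, a crude signal of unproductive loops.
    """
    queries = [q for q, _ in trace]
    accesses = sum(len(pages) for _, pages in trace)
    repeats = sum(c - 1 for c in Counter(queries).values())
    repeat_rate = repeats / len(queries) if queries else 0.0
    return {"accesses": accesses, "repeat_rate": repeat_rate}

# A human-like trace: one targeted access.
human = [("Q3 revenue table", [("rpt.pdf", 12)])]
# A brute-force trace: more accesses, repeated queries.
agent = [("revenue", [("a.pdf", 1), ("b.pdf", 2)]),
         ("revenue", [("a.pdf", 1), ("c.pdf", 7)]),
         ("revenue 2023", [("rpt.pdf", 12)])]
```

Comparing the two traces at equal answer accuracy makes the efficiency gap visible even when raw scores are tied.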
How to Apply
- Use MADQA to benchmark your document agent before deploying in production PDF-heavy workflows
- If scores are low, investigate navigation strategy — agents need explicit planning steps before diving into documents
- Consider adding document structure understanding (table of contents, section headers) as explicit context for the agent
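One lightweight way to supply that structure is to prepend a per-document outline to the agent's context. The sketch below assumes outlines in the (level, title, page) shape that tools like PyMuPDF's `Document.get_toc()` return; the rendering format is an assumption, not a prescription from the paper.

```python
def toc_context(outlines):
    """Render per-document outlines as a compact context block.

    `outlines` maps filename -> list of (level, title, page) entries.
    """
    lines = []
    for fname, entries in outlines.items():
        lines.append(f"## {fname}")
        for level, title, page in entries:
            lines.append(f"{'  ' * (level - 1)}- {title} (p.{page})")
    return "\n".join(lines)

outlines = {"annual_report.pdf": [(1, "Overview", 1),
                                  (1, "Financials", 10),
                                  (2, "Revenue", 12)]}
# Prepend toc_context(outlines) to the system prompt so the agent can jump
# straight to 'Financials > Revenue' instead of scanning pages sequentially.
```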
Code Example
# BM25 + MLLM agent core loop (Python sketch of the paper's Algorithm 1;
# render_page_as_image, img_to_base64, fallback_answer, and the vlm_client
# interface are illustrative assumptions, not the paper's exact API)
from whoosh import index, qparser
from PIL import Image
import base64

SYSTEM_PROMPT = """
You are a document QA assistant with access to a search tool.
The answer is definitely in the documents.
If search returns no results, try different terms (synonyms, abbreviations, rephrasing).
Once relevant pages are found, provide:
1. answer: list of short answer values (exact document words preferred)
2. citations: list of {file, page} dicts
"""

def bm25_agent(question: str, search_index, vlm_client, max_steps=10, top_k=5):
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": question}]
    tools = [{
        "name": "search_documents",
        "description": ("Search the document collection. Supports boolean ops "
                        "(AND/OR/NOT), quoted phrases, and wildcards (*). "
                        "Example: '\"annual report\" AND revenue'"),
        "parameters": {"query": {"type": "string"}},
    }]
    for step in range(max_steps):
        # On the last step, withhold the tool to force a structured answer.
        force_answer = (step == max_steps - 1)
        response = vlm_client.chat(
            messages=messages,
            tools=None if force_answer else tools,
        )
        if response.type == "answer":
            return response.answer, response.citations
        if response.type == "tool_call" and response.tool == "search_documents":
            query = response.args["query"]
            # BM25 retrieval, then render hits as page images for the VLM.
            results = search_index.search(query, limit=top_k)  # (file, page) tuples
            page_images = [render_page_as_image(f, p) for f, p in results]
            messages.append({"role": "tool", "content": [
                {"type": "image", "data": img_to_base64(img)}
                for img in page_images
            ]})
    return fallback_answer(messages)
# Key point: vary queries aggressively when search fails.
# In the paper's analysis, Claude Sonnet 4.5's high average query drift (0.38)
# is associated with strong performance, while near-repetition like
# GPT-4.1 Nano's (drift 0.10) degrades results.
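The drift figures above can be approximated with a simple proxy. The paper's exact metric is not specified here; this sketch assumes drift is the average token-level dissimilarity (1 - Jaccard overlap) between consecutive queries.

```python
def query_drift(queries):
    """Average 1 - Jaccard overlap between consecutive queries.

    A rough proxy for query drift: 0.0 means the agent repeats itself;
    values near 1.0 mean each reformulation is fresh.
    (The paper's actual definition may differ.)
    """
    if len(queries) < 2:
        return 0.0
    drifts = []
    for a, b in zip(queries, queries[1:]):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        drifts.append(1 - len(ta & tb) / len(ta | tb))
    return sum(drifts) / len(drifts)
```

Logging this number per episode gives a quick check on whether an agent is exploring or stuck in a loop.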
Original Abstract
Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.