Over-Searching in Search-Augmented Large Language Models
TL;DR Highlight
A systematic study on how LLMs equipped with search tools wastefully repeat searches even for unanswerable questions, driving up costs and error rates.
Who Should Read
Backend/ML engineers operating AI agents with RAG or web search integration — especially developers who have experienced rising search costs with no improvement in quality.
Core Mechanics
- Adding search raises answer accuracy by an average of 24% on answerable questions, but drops abstention accuracy (correctly refusing to answer unanswerable questions) by an average of 12.8%
- Reasoning models like o4-mini and Deep Research systems exhibit more severe over-searching, and increasing reasoning effort leads to a monotonic rise in the cost-efficiency metric TPC (Tokens Per Correctness)
- Using a noisy corpus (C5) causes TPC to spike an average of 3.6x compared to Wikipedia — the lower the retrieval quality, the more the model searches
- In multi-turn conversations, a higher proportion of answerable questions in earlier turns progressively degrades abstention ability — a 'snowball effect'
- Including negative evidence ('this question cannot be answered') in search results dramatically improves abstention accuracy, but only 13–22% of real-world corpus documents contain such signals
- Models search an average of 70.5% more than optimal (0.62 actual search calls per query vs. 0.36 optimal)
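The last figure follows directly from the reported per-query call averages. A quick sketch of the arithmetic (the small gap vs. the paper's 70.5% is presumably rounding in the reported counts):

```python
# Mean search calls per query, from the paper's reported averages.
actual_calls = 0.62    # calls models actually issue
optimal_calls = 0.36   # oracle policy: search only when it helps

# Relative over-search: extra calls as a fraction of the optimal count.
oversearch_rate = (actual_calls - optimal_calls) / optimal_calls
print(f"{oversearch_rate:.1%}")  # ~72% from these rounded figures
```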
Evidence
- Answer accuracy +24.0% and abstention accuracy -12.8% when search is added (average across 10 models, Table 2)
- Deep Research (o4-mini-deep-research) reaches a TPC of 38,900, 221x higher than the base model (GPT-4o-mini)
- C5 noisy corpus averages TPC 2,607 vs. Wikipedia-Latest 733 (a 3.6x difference, Table 4)
- Few-shot prompting improves abstention accuracy by an average of +13.2pp, with only a minor -1.8pp drop in answer accuracy (Table 6)
How to Apply
- Adding an explicit instruction such as "respond with I don't know for unanswerable questions" along with 3–5 few-shot examples in the system prompt can raise abstention accuracy by 10pp or more (see the Appendix G prompt in the paper)
- Inserting a self-evaluation step before retrieval, where the model first judges whether the question is answerable, can reduce unnecessary search calls while minimizing answer-accuracy loss
- Adding synthetic negative-evidence documents ("this information is unknown/uncertain") to the RAG corpus improves abstention but with limited effect (+3.6%), so prioritizing negative-evidence exposure during the search-result reranking stage may be a more effective architectural improvement
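The pre-retrieval self-evaluation step can be sketched as a small gate in front of the search tool. All names here are illustrative, not the paper's implementation; `judge` stands in for an LLM call, and the toy stub only exists to make the sketch runnable:

```python
def should_search(question: str, judge) -> bool:
    """Ask a judge model whether the question is answerable before
    spending a search call. `judge` is any callable that takes a prompt
    string and returns 'answerable' or 'unanswerable'."""
    verdict = judge(
        "Decide whether the following question is answerable from factual "
        "knowledge. Reply with exactly 'answerable' or 'unanswerable'.\n"
        f"Question: {question}"
    )
    return verdict.strip().lower() == "answerable"

# Toy stand-in judge so the sketch runs; in practice this is an LLM call.
def toy_judge(prompt: str) -> str:
    q = prompt.split("Question:")[-1].lower()
    return "unanswerable" if ("2050" in q or "capital of the moon" in q) else "answerable"

print(should_search("What is the capital of France?", toy_judge))                   # True
print(should_search("Who will be the president of the United States in 2050?", toy_judge))  # False
```

Only when the gate returns True does the agent invoke retrieval; otherwise it abstains immediately, saving the search call and the extra input tokens that inflate TPC.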
Code Example
# Based on paper Appendix G: Few-Shot Abstention Prompt
SYSTEM_PROMPT = """
Answer the given question. Be aware that the question may be unanswerable.
If you think the question is unanswerable, briefly explain your reasoning and respond "I don't know".
Otherwise, try your best to answer the question.
Examples:
- Question: Who will be the president of the United States in 2050?
Reasoning: It is impossible to know the president in 2050 because it is an unknown future.
Answer: I don't know
- Question: What is the capital of the moon?
Reasoning: The Moon does not have a capital.
Answer: I don't know
- Question: What is the weather like?
Reasoning: The question is incomplete and ambiguous — no location or time specified.
Answer: I don't know
"""
# Example TPC (Tokens Per Correctness) calculation
def calculate_tpc(results, lambda_coeff=0.25, mu_coeff=500):
    """Aggregate generation, input, and search costs per correct answer.

    results: list of {generated_tokens, input_tokens, search_calls, is_correct}
    lambda_coeff weights input tokens; mu_coeff prices each search call.
    """
    total_cost = 0
    total_correct = 0
    for result in results:
        cost = (result['generated_tokens']
                + lambda_coeff * result['input_tokens']
                + mu_coeff * result['search_calls'])
        total_cost += cost
        total_correct += int(result['is_correct'])
    if total_correct == 0:
        return float('inf')
    return total_cost / total_correct
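A quick sanity check of the TPC formula with hypothetical numbers (not from the paper): one correct answer costing 100 generated tokens, 400 input tokens, and one search call, plus one incorrect answer.

```python
# Hypothetical two-query run; lambda = 0.25 weights input tokens,
# mu = 500 prices each search call, as in the example above.
LAMBDA, MU = 0.25, 500
results = [
    {"generated_tokens": 100, "input_tokens": 400, "search_calls": 1, "is_correct": True},
    {"generated_tokens": 50, "input_tokens": 200, "search_calls": 0, "is_correct": False},
]
total_cost = sum(r["generated_tokens"] + LAMBDA * r["input_tokens"] + MU * r["search_calls"]
                 for r in results)
total_correct = sum(r["is_correct"] for r in results)
tpc = total_cost / total_correct
print(tpc)  # 800.0: costs (100+100+500) + (50+50+0), divided by 1 correct answer
```

Note how a single search call (mu = 500) dominates the token costs here, which is exactly why unnecessary searches inflate TPC so sharply.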
Original Abstract
Search-augmented large language models (LLMs) excel at knowledge-intensive tasks by integrating external retrieval. However, they often over-search -- unnecessarily invoking search tool even when it does not improve response quality, which leads to computational inefficiency and hallucinations by incorporating irrelevant context. In this work, we conduct a systematic evaluation of over-searching across multiple dimensions, including query types, model categories, retrieval conditions, and multi-turn conversations. Our finding shows: (i) search generally improves answer accuracy on answerable queries but harms abstention on unanswerable ones; (ii) over-searching is more pronounced in complex reasoning models and deep research systems, is exacerbated by noisy retrieval, and compounds across turns in multi-turn conversations; and (iii) the composition of retrieved evidence is crucial, as the presence of negative evidence improves abstention. To quantify over-searching, we introduce Tokens Per Correctness (TPC), an evaluation metric that captures the performance-cost trade-off for search-augmented LLMs. Lastly, we investigate mitigation approaches at both the query and retrieval levels and release the OverSearchQA to foster continued research into efficient search-augmented LLMs.