Over-Searching in Search-Augmented Large Language Models
TL;DR Highlight
A systematic study on how LLMs equipped with search tools wastefully repeat searches even for unanswerable questions, driving up costs and error rates.
Who Should Read
Backend/ML engineers operating AI agents with RAG or web search integration — especially developers who have experienced rising search costs with no improvement in quality.
Core Mechanics
- Adding search raises answer accuracy by an average of 24% on answerable questions, but drops abstention accuracy (correctly refusing to answer unanswerable questions) by an average of 12.8%
- Reasoning models like o4-mini and Deep Research systems exhibit more severe over-searching, and increasing reasoning effort leads to a monotonic rise in the cost-efficiency metric TPC (Tokens Per Correctness)
- Using a noisy corpus (C5) causes TPC to spike an average of 3.6x compared to Wikipedia — the lower the retrieval quality, the more the model searches
- In multi-turn conversations, a higher proportion of answerable questions in earlier turns progressively degrades abstention ability — a 'snowball effect'
- Including negative evidence ('this question cannot be answered') in search results dramatically improves abstention accuracy, but only 13–22% of real-world corpus documents contain such signals
- Models search an average of 70.5% more than optimal (0.62 actual search calls per query vs. 0.36 optimal)
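The last figure follows directly from the reported per-query call averages. A quick sketch of the arithmetic (the small gap vs. the paper's 70.5% is presumably rounding in the reported counts):

```python
# Mean search calls per query, from the paper's reported averages.
actual_calls = 0.62    # calls models actually issue
optimal_calls = 0.36   # oracle policy: search only when it helps

# Relative over-search: extra calls as a fraction of the optimal count.
oversearch_rate = (actual_calls - optimal_calls) / optimal_calls
print(f"{oversearch_rate:.1%}")  # ~72% from these rounded figures
```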
Evidence
- Answer accuracy +24.0% and abstention accuracy -12.8% when search is added (average across 10 models, Table 2)
- Deep Research (o4-mini-deep-research) reaches a TPC of 38,900, 221x higher than the base model (GPT-4o-mini)
- C5 noisy corpus averages TPC 2,607 vs. Wikipedia-Latest 733 (a 3.6x difference, Table 4)
- Few-shot prompting improves abstention accuracy by an average of +13.2pp, with only a minor -1.8pp drop in answer accuracy (Table 6)
How to Apply
- Adding an explicit instruction such as "respond with I don't know for unanswerable questions" along with 3–5 few-shot examples in the system prompt can raise abstention accuracy by 10pp or more (see the Appendix G prompt in the paper)
- Inserting a self-evaluation step before retrieval, where the model first judges whether the question is answerable, can reduce unnecessary search calls while minimizing answer-accuracy loss
- Adding synthetic negative-evidence documents ("this information is unknown/uncertain") to the RAG corpus improves abstention but with limited effect (+3.6%), so prioritizing negative-evidence exposure during the search-result reranking stage may be a more effective architectural improvement
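The pre-retrieval self-evaluation step can be sketched as a small gate in front of the search tool. All names here are illustrative, not the paper's implementation; `judge` stands in for an LLM call, and the toy stub only exists to make the sketch runnable:

```python
def should_search(question: str, judge) -> bool:
    """Ask a judge model whether the question is answerable before
    spending a search call. `judge` is any callable that takes a prompt
    string and returns 'answerable' or 'unanswerable'."""
    verdict = judge(
        "Decide whether the following question is answerable from factual "
        "knowledge. Reply with exactly 'answerable' or 'unanswerable'.\n"
        f"Question: {question}"
    )
    return verdict.strip().lower() == "answerable"

# Toy stand-in judge so the sketch runs; in practice this is an LLM call.
def toy_judge(prompt: str) -> str:
    q = prompt.split("Question:")[-1].lower()
    return "unanswerable" if ("2050" in q or "capital of the moon" in q) else "answerable"

print(should_search("What is the capital of France?", toy_judge))                   # True
print(should_search("Who will be the president of the United States in 2050?", toy_judge))  # False
```

Only when the gate returns True does the agent invoke retrieval; otherwise it abstains immediately, saving the search call and the extra input tokens that inflate TPC.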
Code Example
# Based on paper Appendix G: Few-Shot Abstention Prompt
SYSTEM_PROMPT = """
Answer the given question. Be aware that the question may be unanswerable.
If you think the question is unanswerable, briefly explain your reasoning and respond "I don't know".
Otherwise, try your best to answer the question.
Examples:
- Question: Who will be the president of the United States in 2050?
Reasoning: It is impossible to know the president in 2050 because it is an unknown future.
Answer: I don't know
- Question: What is the capital of the moon?
Reasoning: The Moon does not have a capital.
Answer: I don't know
- Question: What is the weather like?
Reasoning: The question is incomplete and ambiguous — no location or time specified.
Answer: I don't know
"""
# Example TPC (Tokens Per Correctness) calculation
def calculate_tpc(results, lambda_coeff=0.25, mu_coeff=500):
    """Aggregate generation, input, and search costs per correct answer.

    results: list of {generated_tokens, input_tokens, search_calls, is_correct}
    lambda_coeff weights input tokens; mu_coeff prices each search call.
    """
    total_cost = 0
    total_correct = 0
    for result in results:
        cost = (result['generated_tokens']
                + lambda_coeff * result['input_tokens']
                + mu_coeff * result['search_calls'])
        total_cost += cost
        total_correct += int(result['is_correct'])
    if total_correct == 0:
        return float('inf')
    return total_cost / total_correct
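A quick sanity check of the TPC formula with hypothetical numbers (not from the paper): one correct answer costing 100 generated tokens, 400 input tokens, and one search call, plus one incorrect answer.

```python
# Hypothetical two-query run; lambda = 0.25 weights input tokens,
# mu = 500 prices each search call, as in the example above.
LAMBDA, MU = 0.25, 500
results = [
    {"generated_tokens": 100, "input_tokens": 400, "search_calls": 1, "is_correct": True},
    {"generated_tokens": 50, "input_tokens": 200, "search_calls": 0, "is_correct": False},
]
total_cost = sum(r["generated_tokens"] + LAMBDA * r["input_tokens"] + MU * r["search_calls"]
                 for r in results)
total_correct = sum(r["is_correct"] for r in results)
tpc = total_cost / total_correct
print(tpc)  # 800.0: costs (100+100+500) + (50+50+0), divided by 1 correct answer
```

Note how a single search call (mu = 500) dominates the token costs here, which is exactly why unnecessary searches inflate TPC so sharply.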
Original Abstract
Search-augmented large language models (LLMs) excel at knowledge-intensive tasks by integrating external retrieval. However, they often over-search -- unnecessarily invoking search tool even when it does not improve response quality, which leads to computational inefficiency and hallucinations by incorporating irrelevant context. In this work, we conduct a systematic evaluation of over-searching across multiple dimensions, including query types, model categories, retrieval conditions, and multi-turn conversations. Our finding shows: (i) search generally improves answer accuracy on answerable queries but harms abstention on unanswerable ones; (ii) over-searching is more pronounced in complex reasoning models and deep research systems, is exacerbated by noisy retrieval, and compounds across turns in multi-turn conversations; and (iii) the composition of retrieved evidence is crucial, as the presence of negative evidence improves abstention. To quantify over-searching, we introduce Tokens Per Correctness (TPC), an evaluation metric that captures the performance-cost trade-off for search-augmented LLMs. Lastly, we investigate mitigation approaches at both the query and retrieval levels and release the OverSearchQA to foster continued research into efficient search-augmented LLMs.