The Over-Searching Problem in Search-Augmented LLMs

Over-Searching in Search-Augmented Large Language Models

Jan 9, 2026 • Roy Xie, Deepak Gopinath, David Qiu +4

TL;DR Highlight

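A paper that systematically analyzes how LLMs equipped with search tools keep searching needlessly, even on questions that cannot be answered, driving up both cost and error rates.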

Who Should Read

Backend and ML engineers running AI agents with RAG or web search attached, especially developers who have seen search costs climb while quality stays flat.

Core Mechanics

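Adding search raises accuracy on answerable questions by 24% on average, but the rate at which the model correctly refuses unanswerable questions with "I don't know" (abstention accuracy) drops by 12.8% on average.
Over-searching is more severe in reasoning models such as o4-mini and in Deep Research systems, and TPC (the token cost per correct answer) rises monotonically as reasoning effort increases, meaning cost-efficiency gets worse.
With a noisy corpus (C5), TPC jumps by an average of 3.6x relative to Wikipedia: the worse the retrieval quality, the more the model searches.
In multi-turn conversations a "snowball effect" appears: the more answerable questions there were in earlier turns, the further abstention ability degrades.
When search results include negative evidence stating that the question cannot be answered, abstention accuracy rises dramatically, but such documents make up only 13-22% of real corpora.
Models search 70.5% more than optimal on average (0.62 actual vs. 0.36 optimal searches).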

Evidence

Adding search: answer accuracy +24.0%, abstention accuracy -12.8% (average over 10 models, Table 2).
Deep Research (o4-mini-deep-research) has a TPC of 38,900, which is 221x that of the base model (GPT-4o-mini).
C5 noisy corpus: average TPC of 2,607 vs. 733 for Wikipedia-Latest (a 3.6x gap, Table 4).
Few-shot prompting improves abstention accuracy by 13.2 percentage points on average, at the cost of a small 1.8 percentage point drop in answer accuracy (Table 6).

How to Apply

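Adding an explicit system-prompt instruction to answer "I don't know" to unanswerable questions, together with 3-5 few-shot examples, can raise abstention accuracy by more than 10 percentage points (see the prompt in Appendix G of the paper).
Adding a self-evaluation step before search, in which the model first judges whether the question is answerable, can cut unnecessary search calls while minimizing the loss in answer accuracy.
Adding synthetic negative evidence documents ("this information is unknown/uncertain") to the RAG corpus improves abstention, but the effect is limited (+3.6%), so an architectural change that surfaces negative evidence first at the search-result reranking stage may be more effective.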
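
The second and third points can be sketched together as a small pipeline. This is a minimal illustration, not the paper's implementation: the `Doc` type, its `is_negative` flag, and the `is_answerable`, `search`, and `generate` callables are all hypothetical stand-ins for whatever model and retriever a given system uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Doc:
    text: str
    # Hypothetical flag: True if the document states that the answer is
    # unknown or uncertain (i.e. it is negative evidence).
    is_negative: bool


def prioritize_negative_evidence(docs: List[Doc]) -> List[Doc]:
    """Move negative-evidence docs to the front, keeping the original
    retrieval order otherwise (Python's sort is stable)."""
    return sorted(docs, key=lambda d: not d.is_negative)


def answer_with_gate(
    question: str,
    is_answerable: Callable[[str], bool],        # assumed self-evaluation step
    search: Callable[[str], List[Doc]],          # assumed retriever
    generate: Callable[[str, List[Doc]], str],   # assumed answer generator
) -> dict:
    # Step 1: self-evaluate before spending any search budget.
    if not is_answerable(question):
        return {"answer": "I don't know", "search_calls": 0}
    # Step 2: search once, then expose negative evidence first.
    docs = prioritize_negative_evidence(search(question))
    return {"answer": generate(question, docs), "search_calls": 1}
```

In practice `is_answerable` would itself be a model call, so the gate only pays off when that call is cheaper than the searches it avoids.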

Code Example

# Based on Appendix G of the paper: few-shot abstention prompt
SYSTEM_PROMPT = """
Answer the given question. Be aware that the question may be unanswerable.
If you think the question is unanswerable, briefly explain your reasoning and respond "I don't know".
Otherwise, try your best to answer the question.

Examples:
- Question: Who will be the president of the United States in 2050?
  Reasoning: It is impossible to know the president in 2050 because it is an unknown future.
  Answer: I don't know

- Question: What is the capital of the moon?
  Reasoning: Moon does not have a capital.
  Answer: I don't know

- Question: What is the weather like?
  Reasoning: The question is incomplete and ambiguous — no location or time specified.
  Answer: I don't know
"""

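# Example TPC (Tokens Per Correctness) computation
def calculate_tpc(queries, results, lambda_coeff=0.25, mu_coeff=500):
    """
    queries: list of {type: 'answerable'|'unanswerable'} (not used in the calculation itself)
    results: list of {generated_tokens, input_tokens, search_calls, is_correct}
    Returns total cost divided by the number of correct results (inf if none are correct).
    """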
    total_cost = 0
    total_correct = 0
    
    for result in results:
        cost = (result['generated_tokens']
                + lambda_coeff * result['input_tokens']
                + mu_coeff * result['search_calls'])
        total_cost += cost
        total_correct += int(result['is_correct'])
    
    if total_correct == 0:
        return float('inf')
    return total_cost / total_correct
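
A short usage sketch for the TPC computation above. The `tpc` helper restates the same cost formula so the block runs on its own, and every token count below is a made-up illustration value, not a figure from the paper.

```python
def tpc(results, lambda_coeff=0.25, mu_coeff=500):
    """Same formula as calculate_tpc: total cost / number of correct answers."""
    total_cost = sum(
        r["generated_tokens"]
        + lambda_coeff * r["input_tokens"]
        + mu_coeff * r["search_calls"]
        for r in results
    )
    total_correct = sum(int(r["is_correct"]) for r in results)
    return float("inf") if total_correct == 0 else total_cost / total_correct


# Hypothetical runs: both get one of two queries right, but the second
# run issues three searches per query and reads far more input tokens.
no_search = [
    {"generated_tokens": 100, "input_tokens": 200, "search_calls": 0, "is_correct": True},
    {"generated_tokens": 100, "input_tokens": 200, "search_calls": 0, "is_correct": False},
]
over_search = [
    {"generated_tokens": 100, "input_tokens": 2000, "search_calls": 3, "is_correct": True},
    {"generated_tokens": 100, "input_tokens": 2000, "search_calls": 3, "is_correct": False},
]

print(tpc(no_search))    # 300.0
print(tpc(over_search))  # 4200.0
```

With accuracy held equal, the extra searches show up purely as a higher TPC, which is the pattern the metric is meant to expose.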

Terminology

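abstention: The model declining to answer by saying "I don't know" or "this cannot be answered". This is far better than confidently giving a wrong answer.

TPC (Tokens Per Correctness): The average number of tokens spent to obtain one correct answer. Lower is more efficient. If this figure rises as the model searches more, that is a sign of over-searching.

over-searching: The model continuing to call the search tool on questions it already knows the answer to or that are fundamentally unanswerable. Cost rises while quality does not improve.

negative evidence: Documents among the search results stating that the question cannot be answered or that the information is uncertain. Their presence makes it easier for the model to decide to abstain.

reasoning model: A model such as o1, o4-mini, or DeepSeek-R1 that is trained to go through a long reasoning process before answering. Over-searching is more severe in these than in standard models.

RAG (Retrieval-Augmented Generation): An approach in which the model retrieves external documents and answers with reference to their content. Used when up-to-date information or domain-specific knowledge is needed.

snowball effect: In multi-turn conversations, the search pattern of earlier turns carrying over into later turns. If many earlier questions were answerable, the model later tries to force an answer to unanswerable questions as well.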

Related Resources

https://github.com/apple/ml-over-searching

Original Abstract

Search-augmented large language models (LLMs) excel at knowledge-intensive tasks by integrating external retrieval. However, they often over-search -- unnecessarily invoking search tool even when it does not improve response quality, which leads to computational inefficiency and hallucinations by incorporating irrelevant context. In this work, we conduct a systematic evaluation of over-searching across multiple dimensions, including query types, model categories, retrieval conditions, and multi-turn conversations. Our finding shows: (i) search generally improves answer accuracy on answerable queries but harms abstention on unanswerable ones; (ii) over-searching is more pronounced in complex reasoning models and deep research systems, is exacerbated by noisy retrieval, and compounds across turns in multi-turn conversations; and (iii) the composition of retrieved evidence is crucial, as the presence of negative evidence improves abstention. To quantify over-searching, we introduce Tokens Per Correctness (TPC), an evaluation metric that captures the performance-cost trade-off for search-augmented LLMs. Lastly, we investigate mitigation approaches at both the query and retrieval levels and release the OverSearchQA to foster continued research into efficient search-augmented LLMs.