Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents

Apr 3, 2026 • Delip Rao, Eric Wong, Chris Callison-Burch

TL;DR Highlight

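Between 3% and 13% of the citation URLs generated by major LLMs such as GPT-5.1, Gemini, and Claude are fabricated and never existed. The open-source tool urlhealth cuts non-resolving citations in final responses to under 1%.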

Who Should Read

Backend developers who build or operate services whose LLM responses include citation URLs, and engineers who want to make AI research agents or RAG-based report generation systems more reliable.

Core Mechanics

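Across 10 models measured on DRBench (53,090 URLs), 3-13% of citation URLs are hallucinated, meaning they have no record in the Wayback Machine at all, and 5-18% are non-resolving overall.
Deep research agents (gemini-2.5-pro-deepresearch, openai-deepresearch) generate 41-113 URLs per query, far more citations than search-augmented LLMs, but they also hallucinate at a higher rate (10.7% vs 4.8%).
For some OpenAI models, such as GPT-4.1 and gpt-4o-search-preview, every non-resolving URL is fabricated (0% stale), whereas 65% of openai-deepresearch's non-resolving URLs are stale URLs that once existed and later disappeared. The cause of failure therefore differs by model.
Non-resolving rates vary about twofold by field, from Business (5.4%) to Theology (11.4%). claude-sonnet-4-5 reaches 17.4% in Healthcare/Medicine, which makes medical information services a particular risk.
More citations do not mean higher quality. gpt-5.1 generates 46.4 URLs per question with an 8.5% non-resolving rate, twice that of gemini-2.5-pro (4.2%), which generates 10.7 URLs per question.
Wiring the open-source tool urlhealth into an agent's self-correction loop reduces non-resolving URLs 26-fold for GPT-5.1 and 79-fold for Gemini, bringing the non-resolving rate of final responses below 1%.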
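
A self-correction loop of this kind can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `check` stands in for a URL validator such as urlhealth and `propose` for a model call that suggests a replacement. Both names, the status strings, and the retry limit are assumptions made for this example.

```python
from typing import Callable, Optional

def self_correct(
    urls: list[str],
    check: Callable[[str], str],                    # hypothetical: returns "LIVE" or another status
    propose: Callable[[str, str], Optional[str]],   # hypothetical: model suggests a replacement URL
    max_rounds: int = 3,
) -> list[str]:
    """Keep only citations the checker reports as LIVE, asking the model for replacements."""
    kept: list[str] = []
    for url in urls:
        candidate: Optional[str] = url
        tried: set[str] = set()
        for _ in range(max_rounds):
            if candidate is None or candidate in tried:
                candidate = None  # the model gave up or repeated a failed URL
                break
            tried.add(candidate)
            status = check(candidate)
            if status == "LIVE":
                break  # verified, keep this candidate
            candidate = propose(candidate, status)
        else:
            candidate = None  # ran out of rounds without a verified URL
        if candidate is not None:
            kept.append(candidate)
    return kept
```

The `tried` set guards against a model that re-proposes a URL that has already failed verification, so the citation is dropped rather than retried indefinitely.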

Evidence

Hallucinated URL rate on DRBench: from 3.0% (claude-3-5-sonnet) to 13.3% (gemini-2.5-pro-deepresearch); non-resolving URLs overall: 5.4% to 18.5% (with bootstrap 95% CIs).
Non-resolving rate after applying urlhealth: GPT-5.1 16.0% → 0.6% (26×), Gemini 6.1% → 0.1% (79×), Claude 4.9% → 0.8% (6.4×), all with p < 10⁻³⁵.
Deep research agents vs search-augmented LLMs: hallucination rate 10.7% vs 4.8% (z=15.15, p < 10⁻⁵¹); non-resolving rate 16.2% vs 6.8% (z=20.20, p < 10⁻⁸⁹).
Across 168,021 URLs from ExpertQA, non-resolving rates by field range from Business 5.4% to Theology 11.4% (z=4.83, p < 10⁻⁵). For claude-sonnet-4-5, the spread across fields runs from 4.0% (mathematics) to 17.4% (medicine), a 4.3-fold difference.
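
The z statistics here compare two proportions. As a rough illustration of how such a statistic is commonly computed, the sketch below implements a pooled two-proportion z-test. The paper's exact procedure is not described in this summary, and the counts in the usage line are made up for the example.

```python
import math

def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """Pooled two-proportion z statistic for the rates x1/n1 and x2/n2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)          # common rate under the null hypothesis
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical counts: 50 failures out of 100 URLs vs 30 failures out of 100 URLs
z = two_proportion_z(50, 100, 30, 100)
print(f"z = {z:.2f}")
```

With tens of thousands of URLs per group, even a gap of a few percentage points yields a very large z, which is consistent with the tiny p-values reported.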

How to Apply

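In the post-processing stage for LLM responses, install urlhealth with pip, classify each URL extracted from the response as LIVE, DEAD, LIKELY_HALLUCINATED, or UNKNOWN, and remove or flag LIKELY_HALLUCINATED URLs before they reach the user.
In an agent pipeline, registering urlhealth as a callable tool lets the model verify and replace its own citation URLs in a self-correction loop. Small models with weak tool-use skills, such as gpt-5-nano, tend to ignore the verification result and re-propose the same URL, so use a model at the GPT-5.1 level or above.
For Q&A services in fields with high non-resolving rates, such as medicine, theology, and classics, make urlhealth verification a mandatory step and branch on the result: replace stale URLs (recorded in the Wayback Machine) with archive links, and remove hallucinated URLs (no record) entirely.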
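
The branching in the last step could look roughly like this. The sketch takes precomputed statuses as a plain dict rather than calling urlhealth, so the status strings, the URL regex, the `[citation removed]` marker, and the `web.archive.org/web/<url>` rewrite (which asks the Wayback Machine for a capture of that URL) are all assumptions to verify against your own setup.

```python
import re

# Loose URL pattern for illustration; stops at whitespace and common closing delimiters.
URL_RE = re.compile(r"https?://[^\s)\]>\"']+")

def rewrite_citations(text: str, statuses: dict[str, str]) -> str:
    """Swap stale URLs for archive links and drop likely-hallucinated ones."""
    def fix(match: re.Match) -> str:
        raw = match.group(0)
        url = raw.rstrip(".,;:")            # trailing punctuation is rarely part of the URL
        tail = raw[len(url):]
        status = statuses.get(url, "UNKNOWN")
        if status == "DEAD":                # existed once: point to the archive instead
            return f"https://web.archive.org/web/{url}{tail}"
        if status == "LIKELY_HALLUCINATED": # no archive record: remove the link
            return f"[citation removed]{tail}"
        return raw                          # LIVE or UNKNOWN: leave untouched
    return URL_RE.sub(fix, text)
```

UNKNOWN URLs are left in place here because a timeout or bot block says nothing about whether the page exists; a stricter service might flag them for review instead.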

Code Example

# pip install urlhealth

from urlhealth import check_url, URLStatus

def filter_hallucinated_citations(urls: list[str]) -> dict[str, list[str]]:
    """
    Check the URLs extracted from an LLM response and bucket them as
    live / stale / hallucinated / unknown.
    """
    results = {"live": [], "stale": [], "hallucinated": [], "unknown": []}

    for url in urls:
        status = check_url(url)  # HTTP HEAD + Wayback Machine lookup

        if status == URLStatus.LIVE:
            results["live"].append(url)
        elif status == URLStatus.DEAD:          # Wayback record exists → stale
            results["stale"].append(url)
        elif status == URLStatus.LIKELY_HALLUCINATED:  # no Wayback record
            results["hallucinated"].append(url)
        else:                                   # timeout, bot blocking, etc.
            results["unknown"].append(url)

    return results

# Usage example
urls = [
    "https://example.com/real-paper",
    "https://fake-journal.org/nonexistent-article-2024",
]
result = filter_hallucinated_citations(urls)
print(f"Live: {len(result['live'])}")
print(f"Fabricated URLs (to remove): {len(result['hallucinated'])}")
print(f"Vanished URLs (archive link possible): {len(result['stale'])}")
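
For readers curious how a stale-versus-hallucinated check can work in principle, the sketch below queries the Internet Archive's public availability endpoint using only the standard library. It is not urlhealth's implementation: the function names and status strings are illustrative, and a real checker would also need a liveness probe, retries, and handling for bot-blocked sites.

```python
import json
import urllib.parse
import urllib.request

def has_snapshot(payload: dict) -> bool:
    """True if a Wayback availability response reports an archived capture."""
    closest = payload.get("archived_snapshots", {}).get("closest", {})
    return bool(closest.get("available"))

def classify(resolves: bool, archived: bool) -> str:
    """Map liveness and archive presence onto the failure taxonomy."""
    if resolves:
        return "LIVE"
    return "DEAD" if archived else "LIKELY_HALLUCINATED"

def fetch_availability(url: str, timeout: float = 10.0) -> dict:
    """Query the archive.org availability API (performs a network call)."""
    query = urllib.parse.urlencode({"url": url})
    endpoint = f"https://archive.org/wayback/available?{query}"
    with urllib.request.urlopen(endpoint, timeout=timeout) as resp:
        return json.load(resp)
```

Keeping `has_snapshot` and `classify` free of network access makes the decision logic easy to unit test; only `fetch_availability` touches the network.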

Terminology

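hallucinated URL: A fake URL made up by an LLM, that is, a plausible-looking address that never actually existed. The paper classifies a URL this way when it has no record in the Wayback Machine.

stale URL: A URL that once existed but can no longer be reached. This is link rot, the natural disappearance of web pages, not fabrication.

Deep Research Agent: An AI agent that goes beyond a single search and repeats a search → read → synthesize cycle over multiple steps to produce a long report. Examples include Google Gemini Deep Research and OpenAI Deep Research.

Wayback Machine: A repository of web page snapshots run by the Internet Archive (archive.org). It is used to check whether a given URL actually existed in the past.

link rot: The phenomenon of links dying over time as web pages disappear or their URLs change. More than 70% of the URLs in the Harvard Law Review have already rotted.

agentic self-correction: A loop in which an agent verifies its own output (here, citation URLs) with a tool and fixes any problems itself, improving quality without human intervention.

RAG: Retrieval-Augmented Generation. A technique that retrieves external documents and generates an answer grounded in their content, referring to actual sources instead of relying only on what the model has memorized.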

Original Abstract

Large language models and deep research agents supply citation URLs to support their claims, yet the reliability of these citations has not been systematically measured. We address six research questions about citation URL validity using 10 models and agents on DRBench (53,090 URLs) and 3 models on ExpertQA (168,021 URLs across 32 academic fields). We find that 3-13% of citation URLs are hallucinated (they have no record in the Wayback Machine and likely never existed), while 5-18% are non-resolving overall. Deep research agents generate substantially more citations per query than search-augmented LLMs but hallucinate URLs at higher rates. Domain effects are pronounced: non-resolving rates range from 5.4% (Business) to 11.4% (Theology), with per-model effects even larger. Decomposing failures reveals that some models fabricate every non-resolving URL, while others show substantial link-rot fractions indicating genuine retrieval. As a solution, we release urlhealth, an open-source tool for URL liveness checking and stale-vs-hallucinated classification using the Wayback Machine. In agentic self-correction experiments, models equipped with urlhealth reduce non-resolving citation URLs by 6-79× to under 1%, though effectiveness depends on the model's tool-use competence. The tool and all data are publicly available. Our characterization findings, failure taxonomy, and open-source tooling establish that citation URL validity is both measurable at scale and correctable in practice.