A Complete Taxonomy of LLM Hallucination: From Definition to Mitigation Strategies

A comprehensive taxonomy of hallucinations in Large Language Models

Aug 3, 2025 • Manuel Cossio

TL;DR Highlight

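A comprehensive report that covers why LLMs state falsehoods, which kinds of hallucination exist, and how to prevent them, from mathematical proof to practical countermeasures, all in one place.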

Who Should Read

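Backend and AI developers who run into hallucination problems while putting LLM-based services into production. In particular, engineers who must design highly reliable AI systems in domains where a wrong answer is critical, such as medicine, law, and finance.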

Core Mechanics

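Hallucination is mathematically unavoidable: computability theory shows that for any LLM architecture and any training method, there are inputs on which the model is guaranteed to be wrong.
Hallucination is not a single phenomenon: it is classified into at least 14 types, including Intrinsic (contradicts the input), Extrinsic (contradicts training data or reality), Factuality (factual error), and Faithfulness (departure from the input or instructions).
Code generation hallucinates too: code that is syntactically valid but does not match the requirements, or that fails only on a specific execution path, is a typical 'Code Generation Hallucination'.
A model cannot fix its own hallucinations: internal mechanisms such as Chain-of-Thought cannot fundamentally solve the problem on their own, so external guardrails are essential.
Users are part of the problem: cognitive biases such as automation bias ("it's an AI, so it must be right") and confirmation bias (believing what one wants to believe) let hallucinations go unnoticed, and showing warning messages does not change behavior.
RAG (generation combined with external document retrieval) is currently the most effective mitigation strategy, but it is not perfect: if the model misinterprets the retrieved documents, hallucination still occurs.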
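
To make the code-generation item concrete, here is a minimal sketch of a snippet that passes a syntax check yet fails on one execution path. The function `mean_latency_ms`, the helper names, and the test inputs are all hypothetical and chosen only for illustration; they do not come from the report.

```python
import ast

# Hypothetical model output: syntactically valid, but one execution path fails.
GENERATED_SOURCE = '''
def mean_latency_ms(samples):
    return sum(samples) / len(samples)
'''

def passes_syntax_check(source: str) -> bool:
    """Return True if the source parses; this says nothing about behavior."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def failing_inputs(source: str, func_name: str, cases: list) -> list:
    """Run the generated function on each case and collect inputs that raise."""
    namespace: dict = {}
    exec(source, namespace)  # in practice, run untrusted code only in a sandbox
    func = namespace[func_name]
    failures = []
    for case in cases:
        try:
            func(case)
        except Exception as exc:
            failures.append((case, type(exc).__name__))
    return failures

syntax_ok = passes_syntax_check(GENERATED_SOURCE)
failures = failing_inputs(GENERATED_SOURCE, "mean_latency_ms", [[10, 20, 30], [5], []])
print(syntax_ok, failures)
```

The syntax check succeeds, while executing the edge case (an empty list) exposes the failure. This is the kind of external check the guardrail item refers to, as opposed to asking the model to review its own output.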

Evidence

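Logical inconsistency accounts for 19% of all hallucination cases, temporal disorientation for 12%, and ethical violations for 6%.
There are cases where GPT-4's incorrect answers were rated as more persuasive than human experts' correct answers in blind evaluation (Bubeck et al., 2023).
The MedHallu benchmark consists of 10,000 QA pairs based on PubMedQA and is specialized for medical hallucination detection.
According to Epoch AI data, accuracy on MATH Level 5 improves by about 17 percentage points for every 10x increase in training compute, suggesting that more compute translates directly into fewer hallucinations.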
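
As a rough aid for reading the compute figure, the sketch below turns "about 17 percentage points per 10x" into a log-linear rule of thumb. Treating the trend as exactly log-linear is an assumption made here for illustration, and the function name is hypothetical; accuracy is capped at 100%, so the trend cannot hold indefinitely.

```python
import math

PP_PER_10X = 17.0  # percentage points per 10x training compute, as quoted above

def projected_gain_pp(compute_ratio: float, pp_per_10x: float = PP_PER_10X) -> float:
    """Estimate the accuracy gain (in percentage points) for a compute multiplier.

    Assumes a purely log-linear trend, which is a simplification.
    """
    if compute_ratio <= 0:
        raise ValueError("compute_ratio must be positive")
    return pp_per_10x * math.log10(compute_ratio)

gain_10x = projected_gain_pp(10)
gain_100x = projected_gain_pp(100)
print(gain_10x, gain_100x)
```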

How to Apply

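Identify the hallucination types for your domain first: for code generation, build test cases using the CodeHaluEval criteria (input-conflicting, context-conflicting, fact-conflicting); for medical or legal domains, use domain-specific benchmarks such as MedHallu or Med-HALT to diagnose the risk types up front.
Respond with a layered RAG + guardrail architecture: use RAG to supply factual grounding, then add a logic validator (formula verification, code execution verification) and a factual filter (knowledge graph cross-check) on top, so that you do not depend on a single line of defense.
Always surface uncertainty in the UI: add confidence scores, source labels such as 'This content is based on external documents', and a 'Why this answer?' button to reduce users' automation bias.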
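
The layered guardrail idea above can be sketched as a chain of independent validators whose findings are merged. Everything here is a toy under stated assumptions: `arithmetic_validator` stands in for a logic validator, `make_fact_filter` stands in for a knowledge graph cross-check using a small hand-written dictionary, and all names are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

Validator = Callable[[str], list[str]]

@dataclass
class GuardrailResult:
    passed: bool
    issues: list[str] = field(default_factory=list)

# Matches simple claims such as "2 + 2 = 5" (non-negative numbers only).
_ARITHMETIC = re.compile(
    r"(\d+(?:\.\d+)?)\s*([+\-*])\s*(\d+(?:\.\d+)?)\s*=\s*(\d+(?:\.\d+)?)"
)
_OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

def arithmetic_validator(answer: str) -> list[str]:
    """Logic layer: recompute simple arithmetic claims found in the answer."""
    issues = []
    for left, op, right, claimed in _ARITHMETIC.findall(answer):
        expected = _OPS[op](float(left), float(right))
        if abs(expected - float(claimed)) > 1e-9:
            issues.append(
                f"arithmetic: {left} {op} {right} = {claimed} (expected {expected:g})"
            )
    return issues

def make_fact_filter(known_facts: dict[str, str]) -> Validator:
    """Factual layer (toy): if a known subject is mentioned, its value must be too."""
    def check(answer: str) -> list[str]:
        text = answer.lower()
        return [
            f"fact: '{subject}' should mention '{value}'"
            for subject, value in known_facts.items()
            if subject.lower() in text and value.lower() not in text
        ]
    return check

def run_guardrails(answer: str, validators: list[Validator]) -> GuardrailResult:
    """Run every layer and merge the findings; any issue fails the answer."""
    issues = [issue for validate in validators for issue in validate(answer)]
    return GuardrailResult(passed=not issues, issues=issues)

layers = [arithmetic_validator, make_fact_filter({"Eiffel Tower": "Paris"})]
bad = run_guardrails("The Eiffel Tower is in Rome, and 2 + 2 = 5.", layers)
good = run_guardrails("The Eiffel Tower is in Paris, and 2 + 2 = 4.", layers)
print(bad.passed, bad.issues)
print(good.passed, good.issues)
```

Keeping each layer as a plain function makes it easy to add or remove checks per domain, and the merged issue list is one possible input for the uncertainty indicators shown in the UI.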

Code Example

# Simple example of RAG plus a fact-validation layer
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def rag_with_validation(user_query: str, retrieved_docs: list[str]) -> dict:
    """Answer from the retrieved documents, then check the answer against them."""
    context = "\n\n".join(retrieved_docs)

    # Step 1: generate a RAG-based answer
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Answer the question using the documents below.
Mark anything that is not in the documents as 'no support in documents'.

[Reference documents]
{context}

[Question]
{user_query}"""
        }]
    )
    answer = response.content[0].text

    # Step 2: second pass that rates how well the documents support the answer
    validation = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": f"""Check whether the answer below is supported by the provided documents.
Rate your confidence as high/medium/low and explain the reason in one line.

[Documents]
{context}

[Answer]
{answer}

Output format: {{"confidence": "high|medium|low", "reason": "..."}}"""
        }]
    )

    return {
        "answer": answer,
        "validation": validation.content[0].text,  # raw text, not yet parsed
        "sources": retrieved_docs
    }

# Usage example
result = rag_with_validation(
    user_query="What are the side effects of this drug?",
    retrieved_docs=["Clinical trial results document...", "FDA approval materials..."]
)
print(result)
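
The "validation" field returned by `rag_with_validation` is raw model text, so a caller would typically parse it before acting on it. The following is a minimal sketch of one way to do that; `parse_validation` and `should_show_answer` are hypothetical helpers, and falling back to "low" on unparseable output is a design assumption, not something the report prescribes.

```python
import json
import re

ALLOWED_LEVELS = ["low", "medium", "high"]  # ordered from least to most confident

def parse_validation(raw: str) -> dict:
    """Extract the JSON object from the validator reply; fall back to 'low'."""
    data = {}
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
        except json.JSONDecodeError:
            data = {}
    confidence = str(data.get("confidence", "")).lower()
    if confidence not in ALLOWED_LEVELS:
        return {"confidence": "low", "reason": "unparseable validator output"}
    return {"confidence": confidence, "reason": str(data.get("reason", ""))}

def should_show_answer(validation: dict, minimum: str = "medium") -> bool:
    """Gate the answer on a minimum confidence level."""
    return ALLOWED_LEVELS.index(validation["confidence"]) >= ALLOWED_LEVELS.index(minimum)

parsed = parse_validation('Sure: {"confidence": "High", "reason": "matches doc 1"}')
fallback = parse_validation("I could not decide.")
print(parsed, should_show_answer(parsed))
print(fallback, should_show_answer(fallback))
```

Defaulting to "low" when the reply cannot be parsed keeps the gate conservative: a malformed validator response never lets an unchecked answer through.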

Terminology

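Hallucination: The phenomenon in which an LLM confidently generates content that is plausible but not true. The model is not lying 'intentionally'; it strings together statistically plausible next words, and the result drifts away from the facts.

Intrinsic Hallucination: The model gives an answer that contradicts itself within the given input. For example, stating 'born in 1980' and then switching to 'born in 1975' within the same summary.

Extrinsic Hallucination: The model invents information that is neither in the input nor true in reality. For example, fabricating a nonexistent fact such as 'the Parisian tiger went extinct in 1885'.

RAG: Retrieval-Augmented Generation. A technique that retrieves external documents and attaches them as reference material when the LLM answers. Instead of relying only on the model's memory, it supplies search results from sources such as Wikipedia or an internal database as context, which reduces factual errors.

Automation Bias: The human psychological tendency to believe 'the machine must be right' even when the AI is wrong. It gets worse under time pressure or when the user lacks domain expertise.

Guardrail: A safety mechanism that validates or restricts LLM output in post-processing. Examples include formula validators, knowledge graph cross-checks, and refusing to answer when confidence falls below a threshold.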

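FActScore: A fine-grained evaluation metric that breaks a generated text, such as a summary, into small units and checks whether each one is supported by the source document. It catches subtle factual errors better than ROUGE/BLEU.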
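
The fine-grained scoring idea can be illustrated as the fraction of individual claims that a source supports. This is not the FActScore implementation: the claims are split by hand and the word-overlap check is a deliberately naive stand-in, both assumptions made only to keep the sketch self-contained.

```python
from typing import Callable

def supported_fraction(claims: list[str], is_supported: Callable[[str], bool]) -> float:
    """Return the share of claims that the checker marks as supported."""
    if not claims:
        raise ValueError("claims must not be empty")
    return sum(1 for claim in claims if is_supported(claim)) / len(claims)

def make_keyword_checker(source: str) -> Callable[[str], bool]:
    """Toy support check: every word of the claim must occur in the source."""
    source_words = set(source.lower().replace(".", " ").split())

    def check(claim: str) -> bool:
        return all(word in source_words for word in claim.lower().replace(".", " ").split())

    return check

source = "Marie Curie was born in Warsaw in 1867. She won two Nobel Prizes."
claims = [
    "Marie Curie was born in Warsaw",
    "Marie Curie was born in 1867",
    "She won three Nobel Prizes",  # unsupported: the source says two
]
score = supported_fraction(claims, make_keyword_checker(source))
print(round(score, 3))
```

Scoring per claim rather than per document is what lets a single wrong detail show up in the result, which surface-overlap metrics tend to miss.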

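Calibration: A measure of how closely a model's confidence (its confidence score) matches its actual accuracy. A poorly calibrated model shows high confidence even in wrong answers, which misleads users.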
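
One common way to quantify calibration is expected calibration error (ECE): predictions are grouped into confidence bins, and the gap between average confidence and accuracy is averaged across bins, weighted by bin size. The report summary does not name a specific metric, so choosing ECE here is an assumption, and the sample values are made up for illustration.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece

# Overconfident model: claims 95% confidence but is right only half the time.
overconfident = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [True, False, False, True])
# Well-calibrated model: 50% confidence, right half the time.
calibrated = expected_calibration_error([0.5, 0.5], [True, False])
print(round(overconfident, 2), round(calibrated, 2))
```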

Original Abstract

Large language models (LLMs) have revolutionized natural language processing, yet their propensity for hallucination, generating plausible but factually incorrect or fabricated content, remains a critical challenge. This report provides a comprehensive taxonomy of LLM hallucinations, beginning with a formal definition and a theoretical framework that posits its inherent inevitability in computable LLMs, irrespective of architecture or training. It explores core distinctions, differentiating between intrinsic (contradicting input context) and extrinsic (inconsistent with training data or reality), as well as factuality (absolute correctness) and faithfulness (adherence to input). The report then details specific manifestations, including factual errors, contextual and logical inconsistencies, temporal disorientation, ethical violations, and task-specific hallucinations across domains like code generation and multimodal applications. It analyzes the underlying causes, categorizing them into data-related issues, model-related factors, and prompt-related influences. Furthermore, the report examines cognitive and human factors influencing hallucination perception, surveys evaluation benchmarks and metrics for detection, and outlines architectural and systemic mitigation strategies. Finally, it introduces web-based resources for monitoring LLM releases and performance. This report underscores the complex, multifaceted nature of LLM hallucinations and emphasizes that, given their theoretical inevitability, future efforts must focus on robust detection, mitigation, and continuous human oversight for responsible and reliable deployment in critical applications.