A Comprehensive Analysis of LLM Hallucination: Prompting Strategy vs. Model-Intrinsic Problems | AI Paper Digest

TL;DR Highlight

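The proposed framework attributes LLM hallucinations to either prompt flaws or model flaws, and defines two metrics to quantify each side's contribution.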

Who Should Read

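ML engineers and backend developers who run LLMs such as GPT-4 or LLaMA in production and want to reduce hallucination (answers that contradict the facts). Useful when you cannot tell whether to change the prompt or change the model.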

Core Mechanics

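Proposes an attribution framework that classifies the cause of a hallucination as either insufficiently optimized prompting or an intrinsic limitation of the model
Two metrics, Prompt Sensitivity (PS) and Model Variability (MV), quantify which side a hallucination comes from
Compares recent models including GPT-4, LLaMA 2, and DeepSeek on the TruthfulQA and HallucinationEval benchmarks
Chain-of-Thought (CoT, prompting the model to reason step by step) significantly reduces hallucination in high-PS cases
When the cause is an intrinsic model limitation, prompt improvements alone fall short and a model-level response is needed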
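
The abstract names PS and MV but does not give their formulas, so the following is only a minimal sketch of one way such metrics could be computed. It assumes PS is the spread of one model's hallucination rate across prompt styles and MV is the spread across models for one prompt style. The function names, the `rates` layout, and every number are hypothetical placeholders, not results or definitions from the paper.

```python
from statistics import pstdev

# Hypothetical hallucination rates: rates[model][prompt_style], each in [0, 1].
# Illustrative placeholders only, not figures from the paper.
rates = {
    "model_a": {"plain": 0.40, "cot": 0.10, "role": 0.25},
    "model_b": {"plain": 0.35, "cot": 0.33, "role": 0.36},
}


def prompt_sensitivity(rates, model):
    """Assumed PS proxy: spread of one model's hallucination rate across prompts."""
    return pstdev(list(rates[model].values()))


def model_variability(rates, prompt_style):
    """Assumed MV proxy: spread of the hallucination rate across models for one prompt."""
    return pstdev([per_model[prompt_style] for per_model in rates.values()])


if __name__ == "__main__":
    for model in rates:
        print(f"PS({model}) = {prompt_sensitivity(rates, model):.3f}")
    for style in rates["model_a"]:
        print(f"MV({style}) = {model_variability(rates, style):.3f}")
```

Under these assumptions, a model whose rate swings widely between prompt styles gets a larger PS value than one whose rate barely moves, which matches the intuition behind the two metrics without claiming to reproduce them.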

Evidence

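Factuality is evaluated across multiple models and conditions using two established benchmarks, TruthfulQA and HallucinationEval
CoT prompting is reported to reduce hallucination "significantly" in high-PS (Prompt Sensitivity) scenarios (the abstract gives no specific figures)
Multiple SOTA (state-of-the-art) models, including GPT-4, LLaMA 2, and DeepSeek, are compared under controlled prompting conditions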
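
A comparison of several models under controlled prompting conditions can be pictured as a grid of hallucination rates, one cell per (model, prompt style). The sketch below is an assumption-heavy illustration, not the paper's evaluation harness: the templates, the two-question evaluation set, the substring correctness check, and the stub models are all hypothetical stand-ins for real benchmarks and real API calls.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical prompt templates; "{q}" is replaced by the question.
TEMPLATES: Dict[str, str] = {
    "plain": "{q}",
    "cot": "Think step by step, then answer. Question: {q}",
}

# Tiny illustrative evaluation set of (question, lowercase reference answer) pairs.
EVAL_SET: List[Tuple[str, str]] = [
    ("What is the capital of South Korea?", "seoul"),
    ("What is the capital of France?", "paris"),
]

Model = Callable[[str], str]


def hallucination_rate(model: Model, template: str,
                       eval_set: List[Tuple[str, str]]) -> float:
    """Fraction of answers that do not contain the reference (crude check)."""
    wrong = 0
    for question, reference in eval_set:
        answer = model(template.format(q=question))
        if reference not in answer.lower():
            wrong += 1
    return wrong / len(eval_set)


def evaluate_grid(models: Dict[str, Model], templates: Dict[str, str],
                  eval_set: List[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Hallucination rate for every (model, prompt style) pair."""
    return {
        name: {style: hallucination_rate(model, tpl, eval_set)
               for style, tpl in templates.items()}
        for name, model in models.items()
    }


def stub_prompt_sensitive(prompt: str) -> str:
    """Stub model that is only correct when asked to reason step by step."""
    if "step by step" in prompt:
        return "Seoul" if "Korea" in prompt else "Paris"
    return "I believe it is Busan."


def stub_always_wrong(prompt: str) -> str:
    """Stub model that is wrong regardless of the prompt."""
    return "Atlantis"


if __name__ == "__main__":
    grid = evaluate_grid(
        {"sensitive": stub_prompt_sensitive, "wrong": stub_always_wrong},
        TEMPLATES, EVAL_SET,
    )
    print(grid)
```

In real use, each stub would be replaced by a function that calls an actual model, and the substring check by the benchmark's own scoring procedure.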

How to Apply

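If hallucination is frequent, measure PS first: ask the same question through several different prompts, and if the results vary widely, treat it as a prompt problem → switch to CoT or a structured prompt
If hallucination persists no matter how many times you change the prompt, treat it as a high-MV (Model Variability) case → consider replacing or fine-tuning the model
Add a TruthfulQA-based evaluation set to your internal LLM pipeline tests and monitor the hallucination rate on a regular schedule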
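
The decision rule in the steps above can be sketched as a small triage function over per-prompt hallucination rates for one model. This is a minimal sketch under stated assumptions: the function name, the return labels, and both thresholds are hypothetical and would need tuning against your own evaluation data.

```python
from typing import Dict, Tuple


def triage(rates_by_prompt: Dict[str, float],
           spread_threshold: float = 0.15,
           acceptable_rate: float = 0.20) -> Tuple[str, str]:
    """Return (decision, best_prompt_style) for one model.

    rates_by_prompt maps a prompt style to its measured hallucination rate.
    Both thresholds are hypothetical defaults, not values from the paper.
    """
    best_style = min(rates_by_prompt, key=rates_by_prompt.get)
    best = rates_by_prompt[best_style]
    worst = max(rates_by_prompt.values())

    if worst <= acceptable_rate:
        # Every prompt style is already within the acceptable rate.
        return "acceptable", best_style
    if worst - best >= spread_threshold:
        # The rate depends strongly on the prompt: fix the prompt first.
        return "improve_prompt", best_style
    # The rate stays high whatever the prompt: look at the model itself.
    return "model_level", best_style


if __name__ == "__main__":
    print(triage({"plain": 0.40, "cot": 0.10}))
    print(triage({"plain": 0.35, "cot": 0.33}))
    print(triage({"plain": 0.05, "cot": 0.02}))
```

Running the same function on each scheduled evaluation gives a simple signal for when a prompt change is worth trying before a model replacement or fine-tuning effort.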

Code Example

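# Rough Prompt Sensitivity (PS) check: ask one question through several prompt variants.
# Requires the openai package (v1 or later) and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

question = "What is the capital of South Korea?"
# Reference answer for a crude substring correctness check (an assumption of this sketch)
reference = "seoul"

prompts = [
    question,
    f"Answer the following question using only facts: {question}",
    f"Think step by step before answering (Chain-of-Thought). Question: {question}",
    f"You are a fact-checking expert. Answer the following question accurately: {question}",
]

responses = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    responses.append(response.choices[0].message.content or "")

print("=== Responses by prompt ===")
for i, r in enumerate(responses, start=1):
    print(f"[Prompt {i}] {r[:100]}...\n")

# Free-form answers rarely match verbatim, so compare correctness rather than raw text
correct = [reference in r.lower() for r in responses]
print(f"Correct responses: {sum(correct)} / {len(correct)}")
if all(correct):
    print("No hallucination observed for this question")
elif any(correct):
    print("High PS (correctness depends on the prompt; improve the prompt)")
else:
    print("Low PS (wrong under every prompt; possibly a model-intrinsic issue)")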

Terminology

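Hallucination: The phenomenon where an LLM confidently gives an answer that sounds plausible but is not true. Like a person pretending to know something they do not.

Chain-of-Thought (CoT): A prompting technique that tells the model to "think it through slowly, step by step." The same principle as getting more math problems right when you write out your work.

Prompt Sensitivity (PS): The degree to which the model's answer changes when the prompt is altered only slightly. High PS suggests that prompt design is a major cause of the hallucination.

Model Variability (MV): Answer instability that comes from the model itself, independent of the prompt. When MV is high, even a well-written prompt can only help so much.

TruthfulQA: An established benchmark dataset that measures how truthful an LLM's answers are. It consists of questions that people often get wrong.

Attribution: Identifying why a given outcome occurred. Here, the process of analyzing whether a hallucination is the fault of the prompt or of the model.

SOTA (State-of-the-Art): A collective term for the latest models with the highest performance at the present time.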

Original Abstract

Hallucination in Large Language Models (LLMs) refers to outputs that appear fluent and coherent but are factually incorrect, logically inconsistent, or entirely fabricated. As LLMs are increasingly deployed in education, healthcare, law, and scientific research, understanding and mitigating hallucinations has become critical. In this work, we present a comprehensive survey and empirical analysis of hallucination attribution in LLMs. Introducing a novel framework to determine whether a given hallucination stems from not optimize prompting or the model's intrinsic behavior. We evaluate state-of-the-art LLMs—including GPT-4, LLaMA 2, DeepSeek, and others—under various controlled prompting conditions, using established benchmarks (TruthfulQA, HallucinationEval) to judge factuality. Our attribution framework defines metrics for Prompt Sensitivity (PS) and Model Variability (MV), which together quantify the contribution of prompts vs. model-internal factors to hallucinations. Through extensive experiments and comparative analyses, we identify distinct patterns in hallucination occurrence, severity, and mitigation across models. Notably, structured prompt strategies such as chain-of-thought (CoT) prompting significantly reduce hallucinations in prompt-sensitive scenarios, though intrinsic model limitations persist in some cases. These findings contribute to a deeper understanding of LLM reliability and provide insights for prompt engineers, model developers, and AI practitioners. We further propose best practices and future directions to reduce hallucinations in both prompt design and model development pipelines.