MedHallu: A Comprehensive Benchmark for Detecting LLM Hallucinations in the Medical Domain

MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models

Feb 20, 2025 • Shrey Pandit, Jiawei Xu, Junyuan Hong +4

TL;DR Highlight

A 10,000-sample benchmark for medical hallucination detection, a task hard enough that even GPT-4o reaches an F1 of only 0.625 on the hard category.

Who Should Read

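Backend and ML engineers who build medical AI services or fact-verification pipelines for LLM output, especially developers integrating an LLM into a healthcare chatbot or a clinical decision support system.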

Core Mechanics

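Releases the MedHallu benchmark: 10,000 medical QA pairs built from PubMedQA, split into three difficulty levels (easy, medium, hard).
Even GPT-4o reaches an F1 of only 0.625 when detecting hallucinations in the 'hard' category, and the medically fine-tuned UltraMedical does not do meaningfully better.
Medically fine-tuned models (BioMistral, OpenBioLLM, and others) detect hallucinations worse than general-purpose LLMs: they may generate well, but they are weak at detection.
Adding relevant medical knowledge to the prompt (the context-aware setting) raises F1 by 0.251 on average for general LLMs; GPT-4o mini jumps from 0.607 to 0.841.
Adding a 'not sure' option, so the model can decline to answer when it is uncertain, improves precision by up to 38%.
The harder a hallucination is to detect, the closer it is semantically to the ground truth, as confirmed with bidirectional entailment clustering.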
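
The bidirectional entailment check mentioned in the last point can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: `entails` is a hypothetical stand-in for a call to an NLI model, the greedy clustering is one simple way to group texts, and `toy_entails` is a word-overlap rule that exists only so the example runs.

```python
from typing import Callable, List

# Hypothetical signature: entails(premise, hypothesis) -> bool.
# In practice this would wrap an NLI model; that wiring is not shown here.
Entails = Callable[[str, str], bool]


def bidirectionally_equivalent(a: str, b: str, entails: Entails) -> bool:
    """Treat two texts as the same meaning only if each entails the other."""
    return entails(a, b) and entails(b, a)


def cluster_by_meaning(texts: List[str], entails: Entails) -> List[List[str]]:
    """Greedy clustering: add each text to the first cluster whose
    representative (its first member) is bidirectionally equivalent to it."""
    clusters: List[List[str]] = []
    for text in texts:
        for cluster in clusters:
            if bidirectionally_equivalent(cluster[0], text, entails):
                cluster.append(text)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([text])
    return clusters


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Toy rule, NOT real entailment: the premise 'entails' the hypothesis
    if it contains every word of the hypothesis."""
    return set(hypothesis.lower().split()) <= set(premise.lower().split())


answers = [
    "insulin lowers blood glucose",
    "blood glucose insulin lowers",
    "insulin raises blood glucose",
]
clusters = cluster_by_meaning(answers, toy_entails)
print(len(clusters), "clusters")
```

Under this test, a hallucinated answer that lands in the same cluster as the ground-truth answer is indistinguishable from it in meaning, which is the intuition behind the finding that hard cases sit close to the ground truth.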

Evidence

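GPT-4o F1 on hard-category hallucination detection: 0.625 without knowledge, 0.811 with knowledge.
Average F1 of general LLMs: 0.533 without knowledge, 0.784 with knowledge (+0.251). Medically fine-tuned LLMs go from 0.522 to 0.660 (+0.138), roughly half the gain.
With the 'not sure' option plus provided knowledge, Qwen2.5-14B reaches an F1 of 88.8% and GPT-4o 84.9%.
Incomplete Information is the hardest category to detect, at 54% detection accuracy; Evidence Fabrication is the easiest, at 76.6%.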
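
The gains quoted above can be rechecked with a few lines of arithmetic. The numbers are copied from the Evidence list; the dictionary keys are labels made up for this sketch.

```python
# F1 scores from the Evidence list: (without knowledge, with knowledge).
reported_f1 = {
    "general LLMs (average)": (0.533, 0.784),
    "medical fine-tuned LLMs (average)": (0.522, 0.660),
    "GPT-4o, hard category": (0.625, 0.811),
}

# Absolute F1 gain from adding knowledge to the prompt.
gains = {
    name: round(with_k - without_k, 3)
    for name, (without_k, with_k) in reported_f1.items()
}

# How much of the general-LLM gain the medical fine-tuned models achieve.
gain_ratio = gains["medical fine-tuned LLMs (average)"] / gains["general LLMs (average)"]

for name, gain in gains.items():
    print(f"{name}: +{gain:.3f}")
print(f"fine-tuned gain / general gain = {gain_ratio:.2f}")
```

The ratio comes out at about 0.55, which is what "roughly half the gain" refers to.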

How to Apply

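In a verification pipeline for medical LLM output, offer three options ('hallucinated / not-hallucinated / not sure') instead of a plain binary verdict to raise precision. This is especially useful for filtering out cases with a high risk of misdiagnosis.
Include medical knowledge relevant to the question (for example, documents retrieved via RAG) as context in the detection prompt; F1 rises substantially regardless of model size. Even a small model such as GPT-4o mini reaches 0.84 when given knowledge.
General-purpose hallucination benchmarks (HaluEval and the like) are not enough on their own for evaluating medical AI. Use MedHallu's easy/medium/hard split to measure separately the 'hard' cases that are dangerous in a real service.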
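
One way to score three-way verdicts is sketched below. The scoring rule is an assumption, since this summary does not say how the paper treats abstentions: here a 'not sure' prediction never counts as a positive, so it cannot become a false positive, but it does count as a miss when the answer really was hallucinated. The label values 0/1/2 follow the prompt in the Code Example section, and the toy labels are invented for illustration.

```python
from typing import Dict, List

FACTUAL, HALLUCINATED, NOT_SURE = 0, 1, 2


def score_with_abstention(gold: List[int], pred: List[int]) -> Dict[str, float]:
    """Precision/recall/F1 for the 'hallucinated' class.

    Assumption: NOT_SURE is excluded from positives (helps precision)
    and counts as a miss for hallucinated items (hurts recall).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == HALLUCINATED and p == HALLUCINATED)
    fp = sum(1 for g, p in pairs if g == FACTUAL and p == HALLUCINATED)
    fn = sum(1 for g, p in pairs if g == HALLUCINATED and p != HALLUCINATED)
    abstained = sum(1 for p in pred if p == NOT_SURE)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "abstention_rate": abstained / len(pred) if pred else 0.0,
    }


# Toy labels, invented for illustration only.
gold = [1, 1, 1, 0, 0, 0]
binary_pred = [1, 1, 0, 1, 1, 0]     # forced to choose 0 or 1
three_way_pred = [1, 1, 2, 2, 1, 0]  # two uncertain cases become 'not sure'

binary_scores = score_with_abstention(gold, binary_pred)
three_way_scores = score_with_abstention(gold, three_way_pred)
print(binary_scores)
print(three_way_scores)
```

Tracking `abstention_rate` alongside precision matters in practice: a detector that answers 'not sure' on most inputs can look precise while being of little use.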

Code Example

# MedHallu-style medical hallucination detection prompt (knowledge-included version)

detection_prompt = """
You are an AI assistant with extensive knowledge in medicine.
Given a question and an answer, determine if the answer contains hallucinated information.

Hallucination types to check:
- Misinterpretation of Question: off-topic or irrelevant response
- Incomplete Information: omits essential details
- Mechanism/Pathway Misattribution: false biological mechanisms
- Evidence Fabrication: invented statistics or clinical outcomes

World Knowledge: {knowledge}
Question: {question}
Answer: {answer}

Options:
0 - The answer is factual
1 - The answer is hallucinated
2 - Not sure

Return only the integer (0, 1, or 2):
"""

# Usage example
knowledge = "Type 1 Diabetes is caused by autoimmune destruction of pancreatic beta cells..."
question = "What is the primary cause of Type 1 Diabetes?"
answer = "A viral infection that specifically targets the pancreas."

prompt = detection_prompt.format(
    knowledge=knowledge,
    question=question,
    answer=answer
)
# Sending this prompt to a model is expected to return '1' (hallucinated)
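
Models do not always return a bare integer, so the reply usually needs defensive parsing before it enters a pipeline. The helper below is a sketch; `parse_verdict` and `LABELS` are hypothetical names introduced here, and the choice to return `None` for unparseable replies is an assumption about how a caller would want to handle them.

```python
import re
from typing import Optional

# Label names for the three options in the prompt above.
LABELS = {0: "factual", 1: "hallucinated", 2: "not sure"}


def parse_verdict(reply: str) -> Optional[str]:
    """Map a raw model reply to a label name.

    Returns None when the reply contains no standalone 0, 1, or 2,
    so the caller can retry or route the item to human review.
    """
    match = re.search(r"\b([012])\b", reply)
    if match is None:
        return None
    return LABELS[int(match.group(1))]


print(parse_verdict("1"))
print(parse_verdict("Answer: 2"))
print(parse_verdict("I cannot determine this."))
```

Treating an unparseable reply as its own outcome, rather than silently mapping it to 'factual', keeps formatting failures from being counted as passed checks.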

Terminology

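Hallucination: The phenomenon where an LLM generates content that sounds plausible but is not true. Similar to lying with confidence.

F1 Score: The harmonic mean of precision (how accurate the positive predictions are) and recall (how completely the positives were found). Closer to 1 is better.

Bidirectional Entailment: A technique that checks in both directions whether two sentences logically entail each other. If A→B and B→A, the two sentences have the same meaning.

Zero-shot: Having a model perform a task directly, with no examples. Like solving a problem you first see on exam day.

TextGrad: An optimization technique that uses textual feedback like backpropagation to automatically improve prompts or outputs.

NLI (Natural Language Inference): The task of judging the logical relationship (entailment / contradiction / neutral) between two sentences, that is, whether a premise supports a hypothesis.

Fine-tuned Medical LLM: An LLM further trained on medical data. Think of it as a model that went on to study medicine separately.

PubMedQA: A biomedical question-answering dataset built from PubMed papers. Includes 1,000 expert-labeled samples.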

Original Abstract

Advancements in Large Language Models (LLMs) and their increasing use in medical question-answering necessitate rigorous evaluation of their reliability. A critical challenge lies in hallucination, where models generate plausible yet factually incorrect outputs. In the medical domain, this poses serious risks to patient safety and clinical decision-making. To address this, we introduce MedHallu, the first benchmark specifically designed for medical hallucination detection. MedHallu comprises 10,000 high-quality question-answer pairs derived from PubMedQA, with hallucinated answers systematically generated through a controlled pipeline. Our experiments show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with this binary hallucination detection task, with the best model achieving an F1 score as low as 0.625 for detecting "hard" category hallucinations. Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to ground truth. Through experiments, we also show incorporating domain-specific knowledge and introducing a "not sure" category as one of the answer categories improves the precision and F1 scores by up to 38% relative to baselines.