Survey and analysis of hallucinations in large language models: attribution to prompting strategies or model behavior
TL;DR Highlight
A framework for distinguishing whether an LLM hallucination is caused by the prompt or by the model itself.
Who Should Read
Researchers and practitioners working on LLM reliability and honesty, especially those debugging unexpected model outputs.
Core Mechanics
- Proposes a taxonomy separating prompt-induced hallucinations from model-intrinsic ones
- Prompt-induced issues: misleading context, adversarial instructions, and ambiguous phrasing can trigger incorrect outputs even from capable models
- Model-intrinsic issues: factual gaps, reasoning failures, and overconfidence persist regardless of prompt quality
- Introduces two diagnostic metrics, Prompt Sensitivity (PS) and Model Variability (MV), to attribute a given failure to either source
- Suggests targeted mitigations: prompt redesign for prompt-induced issues vs. fine-tuning / RLHF for model-intrinsic ones
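The summary does not give formulas for PS and MV, so the following is only a minimal sketch of how the two metrics could be computed. It assumes PS is the spread of one model's hallucination rate across prompt styles and MV is the spread across models for one prompt style. The function names, the use of population standard deviation, and every number in `rates` are illustrative assumptions, not the paper's definitions or results.

```python
from statistics import pstdev

# Hypothetical hallucination rates (fraction of wrong answers) per model and
# prompt style. These values are invented for illustration only.
rates = {
    "model_a": {"plain": 0.30, "facts_only": 0.20, "cot": 0.10},
    "model_b": {"plain": 0.40, "facts_only": 0.38, "cot": 0.36},
}


def prompt_sensitivity(rates, model):
    """Assumed PS: spread of one model's hallucination rate across prompts."""
    return pstdev(list(rates[model].values()))


def model_variability(rates, prompt):
    """Assumed MV: spread of the hallucination rate across models for one prompt."""
    return pstdev([per_prompt[prompt] for per_prompt in rates.values()])


if __name__ == "__main__":
    for model in rates:
        print(f"PS({model}) = {prompt_sensitivity(rates, model):.3f}")
    for prompt in rates["model_a"]:
        print(f"MV({prompt}) = {model_variability(rates, prompt):.3f}")
```

Under these assumptions, a model whose error rate moves a lot when only the prompt changes (here `model_a`) scores a higher PS, which points toward prompt redesign. A model that stays wrong regardless of prompt (here `model_b`) scores a low PS with a high error rate, which points toward model-level fixes.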
Evidence
- Evaluated on GPT-4, LLaMA 2, DeepSeek, and other LLMs under controlled prompting conditions, using TruthfulQA and HallucinationEval to judge factuality
- Shows that many apparent model failures are actually prompt-induced and fixable without retraining
- Provides case studies illustrating each failure mode
How to Apply
- When an LLM gives a wrong or deceptive answer, run the diagnostic checklist: is the prompt ambiguous or adversarial? If yes, fix the prompt first.
- If the error persists across well-formed prompts, treat it as a model-intrinsic issue and consider fine-tuning or RLHF.
- Use this framework when building evaluation suites to avoid misattributing model failures.
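The checklist above can be sketched as a small decision function. This is a hedged illustration, not the paper's procedure: `attribute_failure`, its arguments, and the returned labels are hypothetical names, and it assumes answers have already been graded as correct or incorrect and that someone has judged which prompts count as well-formed.

```python
def attribute_failure(correct_by_prompt, well_formed):
    """Classify a failure from per-prompt correctness.

    correct_by_prompt: dict mapping a prompt label to True if the answer was correct.
    well_formed: set of prompt labels judged unambiguous and non-adversarial.
    """
    well_formed_results = [
        correct_by_prompt[label] for label in well_formed if label in correct_by_prompt
    ]
    if not well_formed_results:
        return "undetermined"  # no well-formed prompt was tested
    if all(correct_by_prompt.values()):
        return "no failure"
    if not any(well_formed_results):
        # The error persists across every well-formed prompt.
        return "model-intrinsic"
    if all(well_formed_results):
        # Only ambiguous or adversarial prompts produced errors.
        return "prompt-induced"
    return "mixed"  # some well-formed prompts fail; inspect further


if __name__ == "__main__":
    graded = {"ambiguous": False, "clear": True, "cot": True}
    print(attribute_failure(graded, well_formed={"clear", "cot"}))
```

A "prompt-induced" result suggests fixing the prompt first, while "model-intrinsic" suggests the fine-tuning or RLHF route. The "mixed" label is an added assumption for cases the checklist does not cover.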
Code Example
# Prompt Sensitivity (PS) quick measurement example
# Requires the openai package (v1 or later) and the OPENAI_API_KEY environment variable.
import openai

question = "What is the capital of South Korea?"
prompts = [
    f"{question}",
    f"Answer the following question based on facts only: {question}",
    f"Think step by step before answering (Chain-of-Thought). Question: {question}",
    f"You are a fact-checking expert. Answer the following question accurately: {question}",
]

# Ask the same question under each prompt style; temperature=0 reduces sampling noise
responses = []
for prompt in prompts:
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    responses.append(response.choices[0].message.content)

# If responses differ from each other, PS is high → hallucination can be reduced by improving prompts
# Note: exact string comparison is a crude proxy; differently worded but equally correct answers also count as different
print("=== Response Comparison by Prompt ===")
for i, r in enumerate(responses, start=1):
    print(f"[Prompt {i}] {r[:100]}...\n")

unique_responses = set(responses)
print(f"Unique response count: {len(unique_responses)} / {len(responses)}")
print("PS high (prompt improvement needed)" if len(unique_responses) > 1 else "PS low (may be an intrinsic model issue)")
Terminology
Related Papers
Multilingual Reasoning Cascades Need More Context
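Keeping the original question through the final stage of a translation cascade pipeline substantially improves multilingual performance without additional training.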
Less Back-and-Forth: A Comparative Study of Structured Prompting
Structuring prompts as a checklist improves LLM answer quality and uses fewer tokens.
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
Proposes DISCA, an inference-time technique that adjusts LLM outputs to match each country's moral values without retraining.
Using Claude Code: The unreasonable effectiveness of HTML
Summarizes why the Claude Code team began to prefer HTML over Markdown as an LLM output format, along with its practical advantages; directly relevant to workflows that build documents, specs, and dashboards with AI.
When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling
Disagreement-guided routing boosts LLM accuracy on math and code by 3-7% with adaptive problem solving.
Less Is More: Engineering Challenges of On-Device Small Language Model Integration in a Mobile Application
Five failure modes and eight practical solutions emerged after five days of running on-device SLMs (Gemma 4 E2B, Qwen3 0.6B) with Wordle.
Original Abstract
Hallucination in Large Language Models (LLMs) refers to outputs that appear fluent and coherent but are factually incorrect, logically inconsistent, or entirely fabricated. As LLMs are increasingly deployed in education, healthcare, law, and scientific research, understanding and mitigating hallucinations has become critical. In this work, we present a comprehensive survey and empirical analysis of hallucination attribution in LLMs, introducing a novel framework to determine whether a given hallucination stems from suboptimal prompting or the model's intrinsic behavior. We evaluate state-of-the-art LLMs (including GPT-4, LLaMA 2, DeepSeek, and others) under various controlled prompting conditions, using established benchmarks (TruthfulQA, HallucinationEval) to judge factuality. Our attribution framework defines metrics for Prompt Sensitivity (PS) and Model Variability (MV), which together quantify the contribution of prompts vs. model-internal factors to hallucinations. Through extensive experiments and comparative analyses, we identify distinct patterns in hallucination occurrence, severity, and mitigation across models. Notably, structured prompt strategies such as chain-of-thought (CoT) prompting significantly reduce hallucinations in prompt-sensitive scenarios, though intrinsic model limitations persist in some cases. These findings contribute to a deeper understanding of LLM reliability and provide insights for prompt engineers, model developers, and AI practitioners. We further propose best practices and future directions to reduce hallucinations in both prompt design and model development pipelines.