Survey and analysis of hallucinations in large language models: attribution to prompting strategies or model behavior | AI Paper Digest

TL;DR Highlight

A framework for determining whether an LLM hallucination is caused by the prompt or by the model itself.

Who Should Read

Researchers and practitioners working on LLM reliability and honesty, especially those debugging unexpected model outputs.

Core Mechanics

Proposes a taxonomy separating prompt-induced hallucinations from model-intrinsic ones
Prompt-induced issues: misleading context, adversarial instructions, or ambiguous phrasing can trigger incorrect outputs even from capable models
Model-intrinsic issues: factual gaps, reasoning failures, and overconfidence persist regardless of prompt quality
Introduces diagnostic techniques to attribute a given failure to either source
Suggests targeted mitigations: prompt redesign for prompt-induced issues vs. fine-tuning / RLHF for model-intrinsic ones
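
The taxonomy above can be written down as a small lookup table. This is only an illustrative sketch: the names `Source`, `FAILURE_MODES`, `MITIGATIONS`, and `suggest_mitigation` are hypothetical, and the failure-mode labels are taken from the bullets above rather than from any code released with the paper.

```python
from enum import Enum


class Source(Enum):
    PROMPT_INDUCED = "prompt-induced"
    MODEL_INTRINSIC = "model-intrinsic"


# Failure modes named in this digest, grouped by where they originate.
FAILURE_MODES = {
    "misleading context": Source.PROMPT_INDUCED,
    "adversarial instructions": Source.PROMPT_INDUCED,
    "ambiguous phrasing": Source.PROMPT_INDUCED,
    "factual gap": Source.MODEL_INTRINSIC,
    "reasoning failure": Source.MODEL_INTRINSIC,
    "overconfidence": Source.MODEL_INTRINSIC,
}

# Mitigation suggested for each source.
MITIGATIONS = {
    Source.PROMPT_INDUCED: "prompt redesign",
    Source.MODEL_INTRINSIC: "fine-tuning / RLHF",
}


def suggest_mitigation(failure_mode: str) -> str:
    """Return the source and the suggested mitigation for a known failure mode."""
    source = FAILURE_MODES[failure_mode]
    return f"{source.value}: {MITIGATIONS[source]}"


print(suggest_mitigation("ambiguous phrasing"))
print(suggest_mitigation("factual gap"))
```

Keeping the mapping in data rather than in branching logic makes it easy to add new failure modes as they are identified.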

Evidence

Evaluated across multiple LLMs and task types to validate the prompt vs. model attribution framework
Shows that many apparent model failures are actually prompt-induced and fixable without retraining
Provides case studies illustrating each failure mode

How to Apply

When an LLM gives a wrong or deceptive answer, run the diagnostic checklist: is the prompt ambiguous or adversarial? If yes, fix the prompt first.
If the error persists across well-formed prompts, treat it as a model-intrinsic issue and consider fine-tuning or RLHF.
Use this framework when building evaluation suites to avoid misattributing model failures.
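
The triage steps above could be sketched roughly as follows. The function `triage`, the stub `fake_model`, and the substring-based correctness check are all assumptions made for illustration; in practice `ask` would wrap a real model call and `is_correct` would be a proper grader.

```python
from typing import Callable, Sequence


def triage(
    ask: Callable[[str], str],
    original_prompt: str,
    well_formed_prompts: Sequence[str],
    is_correct: Callable[[str], bool],
) -> str:
    """Attribute a failure to the prompt or the model, following the steps above."""
    if is_correct(ask(original_prompt)):
        return "no failure"
    # Retry with cleaned-up prompts: if any of them works, the prompt was the problem.
    if any(is_correct(ask(p)) for p in well_formed_prompts):
        return "prompt-induced"
    # The error persists across well-formed prompts.
    return "model-intrinsic"


# Stub standing in for a real model: it only answers correctly
# when the prompt names the country unambiguously.
def fake_model(prompt: str) -> str:
    return "Seoul" if "South Korea" in prompt else "Pyongyang"


verdict = triage(
    ask=fake_model,
    original_prompt="What is the capital of Korea?",  # ambiguous on purpose
    well_formed_prompts=["What is the capital of South Korea?"],
    is_correct=lambda answer: "seoul" in answer.lower(),
)
print(verdict)
```

In an evaluation suite, storing this verdict alongside each failing item helps keep prompt-induced failures from being counted against the model.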

Code Example

# Prompt Sensitivity (PS) quick check: ask one question under several prompt
# styles and see whether correctness changes with the prompt.
# Requires openai>=1.0 and the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

question = "What is the capital of South Korea?"
expected = "seoul"  # reference answer used to judge each response (crude substring check)

prompts = [
    question,
    f"Answer the following question based on facts only: {question}",
    f"Think step by step before answering (Chain-of-Thought). Question: {question}",
    f"You are a fact-checking expert. Answer the following question accurately: {question}",
]

responses = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    responses.append(response.choices[0].message.content or "")

# Compare correctness rather than raw text: differently worded prompts
# rarely produce identical strings even when the answer is the same.
correct = [expected in r.lower() for r in responses]

print("=== Response Comparison by Prompt ===")
for i, (r, ok) in enumerate(zip(responses, correct), start=1):
    print(f"[Prompt {i}] {'correct' if ok else 'wrong'}: {r[:100]}...\n")

print(f"Correct responses: {sum(correct)} / {len(correct)}")
if all(correct):
    print("No failure observed")
elif any(correct):
    # Correctness depends on the prompt, so hallucination can likely be reduced by improving prompts
    print("PS high (prompt improvement needed)")
else:
    print("PS low (may be an intrinsic model issue)")
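
The snippet above probes a single question on a single model. The abstract also mentions a Model Variability (MV) metric, but this digest does not reproduce either formula, so the sketch below rests on an assumed reading: PS as the spread of one model's hallucination rate across prompt styles, and MV as the spread across models under one fixed prompt. The rates are made-up illustrative numbers, not results from the paper, and `spread`, `prompt_sensitivity`, and `model_variability` are hypothetical helper names.

```python
# Hypothetical hallucination rates (fraction of wrong answers) per model and prompt style.
# These values are invented for illustration only.
rates = {
    "model_a": {"plain": 0.40, "cot": 0.10},
    "model_b": {"plain": 0.30, "cot": 0.30},
}


def spread(values):
    """Range (max - min), used here as a simple stand-in for a variability score."""
    values = list(values)
    return max(values) - min(values)


def prompt_sensitivity(rates, model):
    """Assumed PS: how far one model's hallucination rate moves across prompts."""
    return spread(rates[model].values())


def model_variability(rates, prompt):
    """Assumed MV: how far the rate moves across models for one fixed prompt."""
    return spread(per_prompt[prompt] for per_prompt in rates.values())


for model in rates:
    print(f"PS({model}) = {prompt_sensitivity(rates, model):.2f}")
for prompt in rates["model_a"]:
    print(f"MV({prompt}) = {model_variability(rates, prompt):.2f}")
```

Under this reading, a model whose rate drops sharply with a better prompt (`model_a` here) is a candidate for prompt redesign, while a model whose rate stays flat across prompts (`model_b`) points toward an intrinsic limitation.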

Terminology

Hallucination: When an LLM generates text that is factually incorrect or unsupported, stated with false confidence.

Prompt-induced failure: An incorrect output caused by the structure or content of the prompt, not a fundamental model limitation.

Model-intrinsic failure: An incorrect output that persists regardless of prompt quality, indicating a gap in the model's knowledge or reasoning.

RLHF: Reinforcement Learning from Human Feedback. A training technique that uses human preference signals to align model outputs with desired behavior.

Original Abstract

Hallucination in Large Language Models (LLMs) refers to outputs that appear fluent and coherent but are factually incorrect, logically inconsistent, or entirely fabricated. As LLMs are increasingly deployed in education, healthcare, law, and scientific research, understanding and mitigating hallucinations has become critical. In this work, we present a comprehensive survey and empirical analysis of hallucination attribution in LLMs, introducing a novel framework to determine whether a given hallucination stems from suboptimal prompting or from the model's intrinsic behavior. We evaluate state-of-the-art LLMs (including GPT-4, LLaMA 2, DeepSeek, and others) under various controlled prompting conditions, using established benchmarks (TruthfulQA, HallucinationEval) to judge factuality. Our attribution framework defines metrics for Prompt Sensitivity (PS) and Model Variability (MV), which together quantify the contribution of prompts vs. model-internal factors to hallucinations. Through extensive experiments and comparative analyses, we identify distinct patterns in hallucination occurrence, severity, and mitigation across models. Notably, structured prompt strategies such as chain-of-thought (CoT) prompting significantly reduce hallucinations in prompt-sensitive scenarios, though intrinsic model limitations persist in some cases. These findings contribute to a deeper understanding of LLM reliability and provide insights for prompt engineers, model developers, and AI practitioners. We further propose best practices and future directions to reduce hallucinations in both prompt design and model development pipelines.