Delusions of Large Language Models
TL;DR Highlight
The 'Delusion' phenomenon, where LLMs produce wrong answers with high confidence, is far harder to catch than ordinary hallucination and resists correction via fine-tuning or self-reflection.
Who Should Read
AI engineers and product builders dealing with hallucination issues in LLM-based services, especially developers trying to improve reliability with RAG or multi-agent systems.
Core Mechanics
- A new concept distinct from hallucination is defined: 'Delusion', cases where the model gives a wrong answer yet holds high confidence in it. A wrong answer is classified as a delusion when its confidence exceeds the model's average confidence on correct answers.
- On TriviaQA with Qwen2.5-7B, 20–78% of incorrect answers are classified as Delusion. Even scaling the model up to 72B does not significantly reduce the proportion of delusions among wrong answers.
- Even when prompted to refuse with 'I don't know,' Delusion has a much lower refusal rate than ordinary Hallucination: the model holds on to its delusions more strongly.
- Even after SFT (fine-tuning) to learn 'refuse what you don't know,' the refusal rate for Delusion remains lower than for Hallucination. Delusion is difficult to fix through fine-tuning.
- Even with a self-reflection prompt (re-examining previous answers), Delusion is maintained at a higher rate than Hallucination.
- Training data noise causes Delusion: the more frequently the same wrong answer appears in training data (higher noise intensity), the sharper the increase in Delusion rate.
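The classification rule above (threshold = average confidence on correct answers) can be sketched as follows. The `records` structure and field names are illustrative, not from the paper:

```python
from statistics import mean

def classify_delusions(records):
    """Split wrong answers into delusions vs. ordinary hallucinations.

    Each record is a dict with 'correct' (bool) and 'confidence' (float in
    [0, 1]). The threshold is the mean confidence over correct answers; wrong
    answers at or above it are flagged as delusions, the rest as hallucinations.
    """
    threshold = mean(r["confidence"] for r in records if r["correct"])
    delusions, hallucinations = [], []
    for r in records:
        if r["correct"]:
            continue
        (delusions if r["confidence"] >= threshold else hallucinations).append(r)
    return threshold, delusions, hallucinations

# Toy data: two correct answers (mean confidence 0.8) and two wrong ones.
records = [
    {"correct": True, "confidence": 0.9},
    {"correct": True, "confidence": 0.7},
    {"correct": False, "confidence": 0.95},  # high-confidence error -> delusion
    {"correct": False, "confidence": 0.3},   # low-confidence error -> hallucination
]
threshold, delusions, hallucinations = classify_delusions(records)
print(threshold, len(delusions), len(hallucinations))  # threshold ~0.8, 1 delusion, 1 hallucination
```

In practice the confidence would come from the model's token probabilities (e.g., mean log-probability of the answer span); the toy values here just demonstrate the thresholding.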
Evidence
- With RAG applied, delusion rate dropped from 7.3% → 2.6% (−64.7%) for Llama-3.1-8B and from 8.3% → 2.7% (−67.8%) for Qwen2.5-7B.
- With multi-agent voting (unanimous agreement among 3 models), Mistral-7B's delusion rate dropped from 14.6% → 1.3% (−91.3%).
- After removing delusion-inducing similar samples from training data and retraining, the delusion ratio decreased by 71.3% (Table 2: 9.9% → 3.8%).
- Even after training with 90% SFT refuse data, Llama-3.1-8B's hallucination refusal rate was 99.1% while its delusion refusal rate remained at 96.2%, maintaining the gap.
How to Apply
- A RAG pipeline is also effective at reducing delusion: grounding with around 20 retrieved passages is sufficient to cut delusion by ~65%.
- When adding multi-agent voting for critical responses, trusting an answer only when 2 or more out of 3 models agree can reduce delusion by over 80%.
- When managing fine-tuning data quality, check whether the same wrong answer repeats across multiple samples and remove similar samples with cosine similarity > 0.9—this is effective for preventing delusion.
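The data-quality check above can be sketched as a greedy similarity filter. As an assumption, this sketch uses bag-of-words cosine similarity as a cheap stand-in for learned embedding similarity; the summary does not say which embedding the paper used:

```python
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words term counts (a cheap stand-in
    for embedding-based similarity)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def dedup_samples(samples, threshold=0.9):
    """Greedily keep each sample only if it is below the similarity
    threshold against every already-kept sample."""
    kept = []
    for s in samples:
        if all(cosine_sim(s, k) < threshold for k in kept):
            kept.append(s)
    return kept

samples = [
    "Q: capital of France? A: Lyon",
    "Q: the capital of France? A: Lyon",  # near-duplicate repeating the same wrong answer
    "Q: largest planet? A: Jupiter",
]
print(dedup_samples(samples))  # the near-duplicate is dropped
```

For a real fine-tuning set you would swap `cosine_sim` for similarity over sentence embeddings; the greedy keep-or-drop loop stays the same.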
Code Example
```python
# Example of filtering delusions with multi-agent voting
from collections import Counter

import anthropic


def get_answer(client, model, question):
    response = client.messages.create(
        model=model,
        max_tokens=128,
        messages=[{"role": "user", "content": f"Answer concisely: {question}"}],
    )
    return response.content[0].text.strip()


def multi_agent_vote(question, threshold=2):
    client = anthropic.Anthropic()
    models = [
        "claude-opus-4-6",
        "claude-sonnet-4-6",
        "claude-haiku-4-5-20251001",
    ]
    # Normalize casing so trivial formatting differences don't split the vote
    answers = [get_answer(client, m, question).lower() for m in models]
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    if top_count >= threshold:
        return top_answer, True   # Majority agreement: treat as trustworthy
    return None, False            # No agreement: suspected delusion, reject


question = "Who wrote A Song of Ice and Fire?"
answer, trusted = multi_agent_vote(question, threshold=2)
print(f"Answer: {answer}, Trusted: {trusted}")
```
Original Abstract
Large Language Models often generate factually incorrect but plausible outputs, known as hallucinations. We identify a more insidious phenomenon, LLM delusion, defined as high belief hallucinations, incorrect outputs with abnormally high confidence, making them harder to detect and mitigate. Unlike ordinary hallucinations, delusions persist with low uncertainty, posing significant challenges to model reliability. Through empirical analysis across different model families and sizes on several Question Answering tasks, we show that delusions are prevalent and distinct from hallucinations. LLMs exhibit lower honesty with delusions, which are harder to override via finetuning or self reflection. We link delusion formation with training dynamics and dataset noise and explore mitigation strategies such as retrieval augmented generation and multi agent debating to mitigate delusions. By systematically investigating the nature, prevalence, and mitigation of LLM delusions, our study provides insights into the underlying causes of this phenomenon and outlines future directions for improving model reliability.