Delusions of Large Language Models
TL;DR Highlight
The 'Delusion' phenomenon, where LLMs produce wrong answers with high confidence, is far harder to catch than ordinary hallucination and resists correction via fine-tuning or self-reflection.
Who Should Read
AI engineers and product builders dealing with hallucination issues in LLM-based services, especially developers trying to improve reliability with RAG or multi-agent systems.
Core Mechanics
- A new concept distinct from hallucination is defined: 'Delusion', cases where the model gives a wrong answer yet holds high confidence in it. A wrong answer is classified as a delusion when its confidence exceeds the model's average confidence on correct answers.
- On TriviaQA with Qwen2.5-7B, 20–78% of incorrect answers are classified as Delusion. Even scaling the model up to 72B does not significantly reduce the proportion of delusions among wrong answers.
- Even when prompted to refuse with 'I don't know,' Delusion has a much lower refusal rate than ordinary Hallucination: the model holds on to its delusions more strongly.
- Even after SFT (fine-tuning) to learn 'refuse what you don't know,' the refusal rate for Delusion remains lower than for Hallucination. Delusion is difficult to fix through fine-tuning.
- Even with a self-reflection prompt (re-examining previous answers), Delusion is maintained at a higher rate than Hallucination.
- Training data noise causes Delusion: the more frequently the same wrong answer appears in training data (higher noise intensity), the sharper the increase in Delusion rate.
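The classification rule above (threshold = average confidence on correct answers) can be sketched as follows. The `records` structure and field names are illustrative, not from the paper:

```python
from statistics import mean

def classify_delusions(records):
    """Split wrong answers into delusions vs. ordinary hallucinations.

    Each record is a dict with 'correct' (bool) and 'confidence' (float in
    [0, 1]). The threshold is the mean confidence over correct answers; wrong
    answers at or above it are flagged as delusions, the rest as hallucinations.
    """
    threshold = mean(r["confidence"] for r in records if r["correct"])
    delusions, hallucinations = [], []
    for r in records:
        if r["correct"]:
            continue
        (delusions if r["confidence"] >= threshold else hallucinations).append(r)
    return threshold, delusions, hallucinations

# Toy data: two correct answers (mean confidence 0.8) and two wrong ones.
records = [
    {"correct": True, "confidence": 0.9},
    {"correct": True, "confidence": 0.7},
    {"correct": False, "confidence": 0.95},  # high-confidence error -> delusion
    {"correct": False, "confidence": 0.3},   # low-confidence error -> hallucination
]
threshold, delusions, hallucinations = classify_delusions(records)
print(threshold, len(delusions), len(hallucinations))  # threshold ~0.8, 1 delusion, 1 hallucination
```

In practice the confidence would come from the model's token probabilities (e.g., mean log-probability of the answer span); the toy values here just demonstrate the thresholding.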
Evidence
- With RAG applied, delusion rate dropped from 7.3% → 2.6% (−64.7%) for Llama-3.1-8B and from 8.3% → 2.7% (−67.8%) for Qwen2.5-7B.
- With multi-agent voting (unanimous agreement among 3 models), Mistral-7B's delusion rate dropped from 14.6% → 1.3% (−91.3%).
- After removing delusion-inducing similar samples from training data and retraining, the delusion ratio decreased by 71.3% (Table 2: 9.9% → 3.8%).
- Even after training with 90% SFT refuse data, Llama-3.1-8B's hallucination refusal rate was 99.1% while its delusion refusal rate remained at 96.2%, maintaining the gap.
How to Apply
- A RAG pipeline is also effective at reducing delusion: grounding with around 20 retrieved passages is sufficient to cut delusion by ~65%.
- When adding multi-agent voting for critical responses, trusting an answer only when 2 or more out of 3 models agree can reduce delusion by over 80%.
- When managing fine-tuning data quality, check whether the same wrong answer repeats across multiple samples and remove similar samples with cosine similarity > 0.9—this is effective for preventing delusion.
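The data-quality check above can be sketched as a greedy similarity filter. As an assumption, this sketch uses bag-of-words cosine similarity as a cheap stand-in for learned embedding similarity; the summary does not say which embedding the paper used:

```python
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words term counts (a cheap stand-in
    for embedding-based similarity)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def dedup_samples(samples, threshold=0.9):
    """Greedily keep each sample only if it is below the similarity
    threshold against every already-kept sample."""
    kept = []
    for s in samples:
        if all(cosine_sim(s, k) < threshold for k in kept):
            kept.append(s)
    return kept

samples = [
    "Q: capital of France? A: Lyon",
    "Q: the capital of France? A: Lyon",  # near-duplicate repeating the same wrong answer
    "Q: largest planet? A: Jupiter",
]
print(dedup_samples(samples))  # the near-duplicate is dropped
```

For a real fine-tuning set you would swap `cosine_sim` for similarity over sentence embeddings; the greedy keep-or-drop loop stays the same.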
Code Example
```python
# Example of filtering delusions with multi-agent voting
from collections import Counter

import anthropic


def get_answer(client, model, question):
    response = client.messages.create(
        model=model,
        max_tokens=128,
        messages=[{"role": "user", "content": f"Answer concisely: {question}"}],
    )
    return response.content[0].text.strip()


def multi_agent_vote(question, threshold=2):
    client = anthropic.Anthropic()
    models = [
        "claude-opus-4-6",
        "claude-sonnet-4-6",
        "claude-haiku-4-5-20251001",
    ]
    # Normalize casing so trivial formatting differences don't split the vote
    answers = [get_answer(client, m, question).lower() for m in models]
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    if top_count >= threshold:
        return top_answer, True   # Majority agreement: treat as trustworthy
    return None, False            # No agreement: suspected delusion, reject


question = "Who wrote A Song of Ice and Fire?"
answer, trusted = multi_agent_vote(question, threshold=2)
print(f"Answer: {answer}, Trusted: {trusted}")
```
Original Abstract
Large Language Models often generate factually incorrect but plausible outputs, known as hallucinations. We identify a more insidious phenomenon, LLM delusion, defined as high belief hallucinations, incorrect outputs with abnormally high confidence, making them harder to detect and mitigate. Unlike ordinary hallucinations, delusions persist with low uncertainty, posing significant challenges to model reliability. Through empirical analysis across different model families and sizes on several Question Answering tasks, we show that delusions are prevalent and distinct from hallucinations. LLMs exhibit lower honesty with delusions, which are harder to override via finetuning or self reflection. We link delusion formation with training dynamics and dataset noise and explore mitigation strategies such as retrieval augmented generation and multi agent debating to mitigate delusions. By systematically investigating the nature, prevalence, and mitigation of LLM delusions, our study provides insights into the underlying causes of this phenomenon and outlines future directions for improving model reliability.