Hallucination Detection and Mitigation in Large Language Models
TL;DR Highlight
A 3-stage framework for systematically detecting and reducing LLM hallucinations by root cause in high-stakes domains like finance and law
Who Should Read
Backend and ML engineers dealing with 'the model confidently says wrong things' in production LLM services, especially teams operating RAG or fine-tuning pipelines in domains like finance, law, and healthcare, where wrong answers are costly.
Core Mechanics
- Classifies hallucination causes into 3 categories — model (architecture/training objective), data (knowledge gaps/noise), context (prompt ambiguity/RAG conflicts) — enabling targeted countermeasures
- Organizes detection methods into 5 approaches: probabilistic/semantic entropy, internal state monitoring, external fact-checking, self-consistency checking, and RACE (reasoning-answer consistency evaluation)
- Provides 5 mitigation toolboxes: RAG knowledge grounding, confidence calibration (Temperature Scaling/Isotonic Regression), prompt engineering, decoding control, and fine-tuning
- RACE framework is key: catches cases where 'the answer is correct but the reasoning is wrong' — detecting when financial regulations are cited incorrectly while the conclusion happens to be right
- Open-weight models (LLaMA, Mistral) support advanced detection like MC Dropout and ensemble variance, while closed-weight models (GPT-4 API etc.) are limited to sampling-based proxy measurements
- Demonstrates a closed-loop architecture with detection→mitigation→verification→improvement across 3 tiers (Model/Context/Data Tier) through a financial document extraction case study
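The paper does not include a reference implementation of RACE; the sketch below shows only the shape of the idea, with the two LLM calls injected as callables (the names `race_check`, `generate`, and `verify` are mine, not the paper's) so the control flow is testable offline. In practice `verify` would be a second LLM pass or a rule-based checker that tests whether the cited reasoning actually entails the answer.

```python
from typing import Callable

def race_check(question: str,
               generate: Callable[[str], tuple],
               verify: Callable[[str, str, str], bool]) -> dict:
    """Reasoning-Answer Consistency Evaluation (RACE) sketch.

    generate(question) -> (reasoning, answer)
    verify(question, reasoning, answer) -> True if the reasoning
    actually supports the answer.
    """
    reasoning, answer = generate(question)
    consistent = verify(question, reasoning, answer)
    return {
        "answer": answer,
        "reasoning": reasoning,
        # Flag even correct-looking answers when the reasoning fails:
        # the 'right conclusion, wrong citation' case from the paper.
        "is_hallucination_risk": not consistent,
    }
```

The point of the separate `verify` step is exactly the failure mode described above: an answer that matches ground truth can still ride on a fabricated regulatory citation, and answer-only checks never see it.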
Evidence
- The paper focuses on methodology framework without specific numerical benchmarks — no quantitative win rates or accuracy numbers are presented
- ECE example: for a bin where mean confidence is 0.92 but actual accuracy is 0.75, the bin's calibration gap is |0.75-0.92|=0.17 (each bin's gap is weighted by its sample fraction in the final ECE)
- Temperature Scaling example: overconfident model (confidence 0.95 → actual accuracy 75%) with T*=1.5 produces calibrated confidence of 0.78
- Semantic Entropy example: 5 responses with 4 agreeing (p=0.8) and 1 dissenting (p=0.2) gives Hs≈0.50; unanimous agreement gives Hs=0
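The entropy and ECE figures above can be reproduced in a few lines of plain Python (the Temperature Scaling number depends on per-class logits the summary does not state, so it is not reproduced here):

```python
import math

def semantic_entropy(probs):
    """Shannon entropy (in nats) over semantic-cluster probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# 4 of 5 responses agree (p=0.8), 1 dissents (p=0.2)
hs = semantic_entropy([0.8, 0.2])       # ~0.50 nats
hs_unanimous = semantic_entropy([1.0])  # 0.0: no semantic spread

def ece_bin_term(accuracy, confidence, bin_weight=1.0):
    """One bin's contribution to ECE: |acc - conf|, weighted by the
    fraction of samples falling in that bin."""
    return bin_weight * abs(accuracy - confidence)

gap = ece_bin_term(0.75, 0.92)  # 0.17
```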
How to Apply
- If you've already invested in RAG: add a fact-checking layer between retrieved documents and model output, run the same prompt 5 times at temperature=0.5, and escalate to human review when there's no consensus — a self-consistency check on top of existing RAG
- For closed-weight APIs (GPT-4 etc.) where internal logits aren't accessible: fire an 'On a scale 0-1, how confident are you?' self-declared uncertainty prompt right after the response, and suppress or flag output below 0.7 as a quick starting point (self-reported confidence is itself poorly calibrated, so treat 0.7 as a tunable threshold)
- For financial/legal document extraction pipelines, apply the 3-tier architecture directly: Model Tier (Temperature Scaling for confidence calibration) → Context Tier (layering 'use only verifiable data from the source' in prompts) → Data Tier (cross-validate extracted values against external DB)
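For the Model Tier step, the summary names Temperature Scaling but gives no code. A minimal pure-Python sketch for binary confidences, fitting T* by grid search on a held-out validation set (function names and the grid range are my own choices; production code would use multi-class softmax logits and an optimizer, which is also why this binary form does not reproduce the 0.95→0.78 figure quoted above):

```python
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def _logit(p):
    p = min(max(p, 1e-6), 1.0 - 1e-6)  # avoid log(0) at the edges
    return math.log(p / (1.0 - p))

def fit_temperature(confidences, labels):
    """Find T minimizing negative log-likelihood of the calibrated
    confidences on validation (confidence, correct?) pairs."""
    def nll(T):
        total = 0.0
        for p, y in zip(confidences, labels):
            q = _sigmoid(_logit(p) / T)
            total -= math.log(q) if y == 1 else math.log(1.0 - q)
        return total
    # Coarse grid search over T in [0.5, 5.0]; T > 1 softens
    # overconfident predictions, T < 1 sharpens underconfident ones.
    return min((t / 100 for t in range(50, 501)), key=nll)

def calibrate(p, T):
    """Apply the fitted temperature to a raw confidence."""
    return _sigmoid(_logit(p) / T)
```

For example, a model that reports 0.95 confidence while being right only 75% of the time gets T* > 1, and `calibrate(0.95, T*)` lands near the true 0.75 accuracy.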
Code Example
```python
# Self-consistency based hallucination detection example
import openai
from collections import Counter

def self_consistency_check(prompt: str, n_runs: int = 5,
                           temperature: float = 0.5) -> dict:
    client = openai.OpenAI()
    responses = []
    for _ in range(n_runs):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            max_tokens=200,
        )
        responses.append(resp.choices[0].message.content.strip())
    # Exact-string agreement; for free-form answers, cluster by
    # semantic similarity instead of literal equality.
    counts = Counter(responses)
    most_common, freq = counts.most_common(1)[0]
    confidence = freq / n_runs
    return {
        "answer": most_common,
        "confidence": confidence,
        "is_hallucination_risk": confidence < 0.6,  # unstable if below 60%
        "all_responses": responses,
    }
```
```python
# Self-declared uncertainty prompt example
def get_with_uncertainty(question: str) -> dict:
    client = openai.OpenAI()
    # Step 1: generate the answer deterministically
    answer_resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
        temperature=0,
    )
    answer = answer_resp.choices[0].message.content
    # Step 2: ask the model to self-assess its confidence
    confidence_resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
            {"role": "user", "content": (
                "On a scale from 0 to 1, how confident are you that the "
                "above answer is factually correct? Reply with only a number."
            )},
        ],
        temperature=0,
    )
    try:
        confidence = float(confidence_resp.choices[0].message.content.strip())
        confidence = min(max(confidence, 0.0), 1.0)  # clamp stray values
    except ValueError:
        confidence = 0.5  # unparsable reply -> treat as uncertain
    return {
        "answer": answer,
        "self_declared_confidence": confidence,
        "needs_verification": confidence < 0.7,
    }
```
Original Abstract
Large Language Models (LLMs) and Large Reasoning Models (LRMs) offer transformative potential for high-stakes domains like finance and law, but their tendency to hallucinate, generating factually incorrect or unsupported content, poses a critical reliability risk. This paper introduces a comprehensive operational framework for hallucination management, built on a continuous improvement cycle driven by root cause awareness. We categorize hallucination sources into model, data, and context-related factors, allowing targeted interventions over generic fixes. The framework integrates multi-faceted detection methods (e.g., uncertainty estimation, reasoning consistency) with stratified mitigation strategies (e.g., knowledge grounding, confidence calibration). We demonstrate its application through a tiered architecture and a financial data extraction case study, where model, context, and data tiers form a closed feedback loop for progressive reliability enhancement. This approach provides a systematic, scalable methodology for building trustworthy generative AI systems in regulated environments.