Hallucination Detection and Mitigation in Large Language Models
TL;DR Highlight
A 3-stage framework for systematically detecting and reducing LLM hallucinations by root cause in high-stakes domains like finance and law
Who Should Read
Backend and ML engineers dealing with 'the model confidently says wrong things' in production LLM services, especially teams operating RAG or fine-tuning pipelines in domains like finance, law, and healthcare, where wrong answers are costly.
Core Mechanics
- Classifies hallucination causes into 3 categories — model (architecture/training objective), data (knowledge gaps/noise), context (prompt ambiguity/RAG conflicts) — enabling targeted countermeasures
- Organizes detection methods into 5 approaches: probabilistic/semantic entropy, internal state monitoring, external fact-checking, self-consistency checking, and RACE (reasoning-answer consistency evaluation)
- Provides 5 mitigation toolboxes: RAG knowledge grounding, confidence calibration (Temperature Scaling/Isotonic Regression), prompt engineering, decoding control, and fine-tuning
- RACE framework is key: catches cases where 'the answer is correct but the reasoning is wrong' — detecting when financial regulations are cited incorrectly while the conclusion happens to be right
- Open-weight models (LLaMA, Mistral) support advanced detection like MC Dropout and ensemble variance, while closed-weight models (GPT-4 API etc.) are limited to sampling-based proxy measurements
- Demonstrates a closed-loop architecture with detection→mitigation→verification→improvement across 3 tiers (Model/Context/Data Tier) through a financial document extraction case study
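The paper does not include a reference implementation of RACE; the sketch below shows only the shape of the idea, with the two LLM calls injected as callables (the names `race_check`, `generate`, and `verify` are mine, not the paper's) so the control flow is testable offline. In practice `verify` would be a second LLM pass or a rule-based checker that tests whether the cited reasoning actually entails the answer.

```python
from typing import Callable

def race_check(question: str,
               generate: Callable[[str], tuple],
               verify: Callable[[str, str, str], bool]) -> dict:
    """Reasoning-Answer Consistency Evaluation (RACE) sketch.

    generate(question) -> (reasoning, answer)
    verify(question, reasoning, answer) -> True if the reasoning
    actually supports the answer.
    """
    reasoning, answer = generate(question)
    consistent = verify(question, reasoning, answer)
    return {
        "answer": answer,
        "reasoning": reasoning,
        # Flag even correct-looking answers when the reasoning fails:
        # the 'right conclusion, wrong citation' case from the paper.
        "is_hallucination_risk": not consistent,
    }
```

The point of the separate `verify` step is exactly the failure mode described above: an answer that matches ground truth can still ride on a fabricated regulatory citation, and answer-only checks never see it.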
Evidence
- The paper focuses on methodology framework without specific numerical benchmarks — no quantitative win rates or accuracy numbers are presented
- ECE example: for a bin where mean confidence is 0.92 but actual accuracy is 0.75, the bin's calibration gap is |0.75-0.92|=0.17 (each bin's gap is weighted by its sample fraction in the final ECE)
- Temperature Scaling example: overconfident model (confidence 0.95 → actual accuracy 75%) with T*=1.5 produces calibrated confidence of 0.78
- Semantic Entropy example: 5 responses with 4 agreeing (p=0.8) and 1 dissenting (p=0.2) gives Hs≈0.50; unanimous agreement gives Hs=0
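The entropy and ECE figures above can be reproduced in a few lines of plain Python (the Temperature Scaling number depends on per-class logits the summary does not state, so it is not reproduced here):

```python
import math

def semantic_entropy(probs):
    """Shannon entropy (in nats) over semantic-cluster probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# 4 of 5 responses agree (p=0.8), 1 dissents (p=0.2)
hs = semantic_entropy([0.8, 0.2])       # ~0.50 nats
hs_unanimous = semantic_entropy([1.0])  # 0.0: no semantic spread

def ece_bin_term(accuracy, confidence, bin_weight=1.0):
    """One bin's contribution to ECE: |acc - conf|, weighted by the
    fraction of samples falling in that bin."""
    return bin_weight * abs(accuracy - confidence)

gap = ece_bin_term(0.75, 0.92)  # 0.17
```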
How to Apply
- If you've already invested in RAG: add a fact-checking layer between retrieved documents and model output, run the same prompt 5 times at temperature=0.5, and escalate to human review when there's no consensus — a self-consistency check on top of existing RAG
- For closed-weight APIs (GPT-4 etc.) where internal logits aren't accessible: fire an 'On a scale 0-1, how confident are you?' self-declared uncertainty prompt right after the response, and suppress or flag output below 0.7 as a quick starting point (self-reported confidence is itself poorly calibrated, so treat 0.7 as a tunable threshold)
- For financial/legal document extraction pipelines, apply the 3-tier architecture directly: Model Tier (Temperature Scaling for confidence calibration) → Context Tier (layering 'use only verifiable data from the source' in prompts) → Data Tier (cross-validate extracted values against external DB)
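For the Model Tier step, the summary names Temperature Scaling but gives no code. A minimal pure-Python sketch for binary confidences, fitting T* by grid search on a held-out validation set (function names and the grid range are my own choices; production code would use multi-class softmax logits and an optimizer, which is also why this binary form does not reproduce the 0.95→0.78 figure quoted above):

```python
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def _logit(p):
    p = min(max(p, 1e-6), 1.0 - 1e-6)  # avoid log(0) at the edges
    return math.log(p / (1.0 - p))

def fit_temperature(confidences, labels):
    """Find T minimizing negative log-likelihood of the calibrated
    confidences on validation (confidence, correct?) pairs."""
    def nll(T):
        total = 0.0
        for p, y in zip(confidences, labels):
            q = _sigmoid(_logit(p) / T)
            total -= math.log(q) if y == 1 else math.log(1.0 - q)
        return total
    # Coarse grid search over T in [0.5, 5.0]; T > 1 softens
    # overconfident predictions, T < 1 sharpens underconfident ones.
    return min((t / 100 for t in range(50, 501)), key=nll)

def calibrate(p, T):
    """Apply the fitted temperature to a raw confidence."""
    return _sigmoid(_logit(p) / T)
```

For example, a model that reports 0.95 confidence while being right only 75% of the time gets T* > 1, and `calibrate(0.95, T*)` lands near the true 0.75 accuracy.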
Code Example
```python
# Self-consistency based hallucination detection example
import openai
from collections import Counter

def self_consistency_check(prompt: str, n_runs: int = 5,
                           temperature: float = 0.5) -> dict:
    client = openai.OpenAI()
    responses = []
    for _ in range(n_runs):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            max_tokens=200,
        )
        responses.append(resp.choices[0].message.content.strip())
    # Exact-string agreement; for free-form answers, cluster by
    # semantic similarity instead of literal equality.
    counts = Counter(responses)
    most_common, freq = counts.most_common(1)[0]
    confidence = freq / n_runs
    return {
        "answer": most_common,
        "confidence": confidence,
        "is_hallucination_risk": confidence < 0.6,  # unstable if below 60%
        "all_responses": responses,
    }
```
```python
# Self-declared uncertainty prompt example
def get_with_uncertainty(question: str) -> dict:
    client = openai.OpenAI()
    # Step 1: generate the answer deterministically
    answer_resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
        temperature=0,
    )
    answer = answer_resp.choices[0].message.content
    # Step 2: ask the model to self-assess its confidence
    confidence_resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
            {"role": "user", "content": (
                "On a scale from 0 to 1, how confident are you that the "
                "above answer is factually correct? Reply with only a number."
            )},
        ],
        temperature=0,
    )
    try:
        confidence = float(confidence_resp.choices[0].message.content.strip())
        confidence = min(max(confidence, 0.0), 1.0)  # clamp stray values
    except ValueError:
        confidence = 0.5  # unparsable reply -> treat as uncertain
    return {
        "answer": answer,
        "self_declared_confidence": confidence,
        "needs_verification": confidence < 0.7,
    }
```
Original Abstract
Large Language Models (LLMs) and Large Reasoning Models (LRMs) offer transformative potential for high-stakes domains like finance and law, but their tendency to hallucinate, generating factually incorrect or unsupported content, poses a critical reliability risk. This paper introduces a comprehensive operational framework for hallucination management, built on a continuous improvement cycle driven by root cause awareness. We categorize hallucination sources into model, data, and context-related factors, allowing targeted interventions over generic fixes. The framework integrates multi-faceted detection methods (e.g., uncertainty estimation, reasoning consistency) with stratified mitigation strategies (e.g., knowledge grounding, confidence calibration). We demonstrate its application through a tiered architecture and a financial data extraction case study, where model, context, and data tiers form a closed feedback loop for progressive reliability enhancement. This approach provides a systematic, scalable methodology for building trustworthy generative AI systems in regulated environments.