Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation
TL;DR Highlight
A training-free framework that lets vision-language models self-correct hallucinations by collecting visual evidence via SAM3 for iterative verification
Who Should Read
ML engineers and researchers tackling hallucination in multimodal AI services built on vision-language models, especially developers deploying open-weight models such as Qwen3-VL or InternVL in production.
Core Mechanics
- Training-free, operates only at test time — corrects hallucinations at inference without touching model parameters
- Uses SAM3 (an external tool that finds and segments concepts in images) to collect segmentation, bounding boxes, and zoomed views, converting them to reusable structured text evidence
- Decomposes answers into individual claims (existence/color/count/position) and verifies each against evidence, assigning verdicts (supported/contradicted/insufficient) with confidence scores
- Evidence-gated updates to prevent over-correction — only flips answers when confidence, evidence strength, and coverage all meet thresholds
- Up to 3 rounds with early stopping; most cases finish in 1-2 rounds — only 477 out of 9,000 cases (5.3%) went to round 3
- Qwen3-VL 8B: +3.31% average on POPE, +28.34 points on MME-Hallucination. In human preference evaluation, selected in 41 of 60 cases (68.3%) vs 2nd place DeGF at 13.3%
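The round structure described above can be sketched as a small control loop. This is a hypothetical outline, not the paper's implementation: `ground`, `verify`, and `refine` are caller-supplied stand-ins for the SAM3 grounding and LVLM-judge calls.

```python
def kestrel_loop(answer, claims, ground, verify, refine, max_rounds=3):
    """Run up to max_rounds of ground -> verify -> refine.

    Early-stops after two consecutive all-supported rounds. `ground`,
    `verify`, and `refine` are stand-ins for the SAM3 and LVLM-judge
    steps (hypothetical interface, not a real API).
    """
    supported_streak = 0
    for _ in range(max_rounds):
        evidence = ground(claims)             # SAM3: masks, boxes, crops
        verdicts = verify(claims, evidence)   # LVLM judge verdicts
        if all(v["status"] == "supported" for v in verdicts):
            supported_streak += 1
            if supported_streak >= 2:
                break                         # early stopping
        else:
            supported_streak = 0
            answer, claims = refine(answer, claims, verdicts)  # gated update
    return answer
```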
Evidence
- POPE benchmark: +3.31% average over Qwen3-VL, +3.03% over InternVL3.5. Additional +1.38-1.47pp over previous best baseline OPERA
- MME-Hallucination: Qwen3-VL 731.66 → 760.00 (+28.34), +16.67 over OPERA (743.33). InternVL3.5: 743.33 → 763.34 (+20.01)
- Human preference study (n=60): Kestrel 68.3% vs DeGF 13.3% vs Woodpecker 11.7% vs RITUAL 6.7% vs VCD 0.0%
- Efficiency: round 1 processes all 9,000 cases, round 2 only 4,978 (55%), round 3 only 477 (5.3%). Latency: baseline 0.78 s vs Kestrel 18.75 s (≈24×); GPU memory: 17,428 MB → 21,472 MB (≈1.23×)
How to Apply
- In VQA or image captioning services, add a post-processing layer after model output that decomposes answers into claims (e.g., 'there's a red car' → type:existence, target:car) and uses SAM3 API to detect the object and collect evidence
- In confidence-critical VQA pipelines (medical imaging, insurance claim photo analysis), adjust the confidence threshold (paper: 0.82-0.90) for conservative update criteria to filter hallucinations without over-correction
- In latency-tolerant batch processing environments (e.g., overnight batch image captioning), wrapping Qwen3-VL or InternVL3.5 with the Kestrel pipeline provides immediate accuracy gains without additional training
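The claim decomposition in the first bullet ('there's a red car' → typed claims) can be illustrated with a toy rule-based splitter. In the paper the LVLM itself emits this structure, so the function below is purely hypothetical and only shows the target format:

```python
COLOR_WORDS = {"red", "blue", "green", "black", "white", "yellow"}

def decompose(statement):
    """Split "there's a red car" into typed, independently checkable claims.

    Toy illustration of the claim schema; real Kestrel delegates this to
    the LVLM rather than using rules.
    """
    words = statement.lower().replace("'", " ").split()
    target = words[-1]                        # naive: last word is the object
    colors = [w for w in words if w in COLOR_WORDS]
    claims = [{"id": "c1", "type": "existence",
               "text": f"A {target} exists in the image", "targets": [target]}]
    if colors:
        claims.append({"id": "c2", "type": "color",
                       "text": f"The {target} is {colors[0]}",
                       "targets": [target]})
    return claims
```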
Code Example
# Kestrel core flow (simplified pseudocode; API names are illustrative)
# Step 1: Initialization - initial answer + claim generation
initial_prompt = """
You are given an image and a Yes/No question.
Determine the answer and output one verifiable claim.
- answer must be exactly "Yes" or "No"
- output exactly one claim with fields: id, type, text, targets
- type must be one of: existence, color, count, position
- text must be concrete and visually checkable
Return JSON only:
{
"answer": "Yes|No",
"verifiable_claims": [
{"id": "c1", "type": "existence", "text": "A red car exists in the image", "targets": ["car"]}
]
}
Question: {question}
"""
# Step 2: Agent Grounding - collecting visual evidence with SAM3
# SAM3 API call (concept-based segmentation)
visual_evidence = sam3.segment(image, concept=claim.targets[0])
evidence = {
    "e_seg_car": visual_evidence.overlay,               # segmentation overlay
    "e_count_car": len(visual_evidence.masks),          # instance count
    "e_crop_car": visual_evidence.crop_zoom,            # zoomed-in view
    "e_pos_car": derive_position(visual_evidence.bbox)  # position text
}
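```python
# `derive_position` is left undefined above; the sketch below is a plausible
# stand-in, not the paper's mapping. It buckets a bounding box into coarse
# position text using image thirds, and assumes (unlike the one-argument
# call above) that the image (width, height) is passed in explicitly.
def derive_position(bbox, image_size):
    """Map bbox (x0, y0, x1, y1) to text like 'top-left' via image thirds."""
    x0, y0, x1, y1 = bbox
    w, h = image_size
    cx, cy = (x0 + x1) / 2 / w, (y0 + y1) / 2 / h   # normalized center
    col = "left" if cx < 1/3 else ("right" if cx > 2/3 else "center")
    row = "top" if cy < 1/3 else ("bottom" if cy > 2/3 else "middle")
    return f"{row}-{col}"
```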
# Step 3: Claim-level Verification
verification_prompt = """
You are a strict verifier. Judge each claim using ONLY the provided evidence.
For each claim, choose exactly one status: supported | contradicted | insufficient
- supported: evidence clearly confirms the claim
- contradicted: evidence clearly refutes the claim
- insufficient: evidence is missing or ambiguous
Do NOT use common sense.
Return JSON only:
{
"verdict": "supported|contradicted|insufficient",
"checked": [
{
"claim_id": "c1",
"status": "contradicted",
"confidence": 0.92,
"why": "e_count_car shows 0 instances detected",
"citations": ["e_count_car", "e_seg_car"]
}
]
}
Question: {question}
Claims: {claims_json}
Evidence: {evidence_json}
"""
# Step 4: Evidence-gated Self-Refinement
# confidence thresholds per claim type (values from the paper)
THRESHOLD = {
    "existence": 0.85,
    "count": 0.85,
    "color": 0.82,
    "position": 0.90
}
def should_update(verdict, claim_type):
    """Allow an answer update only when the contradicting evidence is strong enough"""
    if verdict["status"] == "contradicted":
        return verdict["confidence"] >= THRESHOLD[claim_type]
    return False  # keep the current answer when uncertain
# max 3 rounds; early stop after 2 consecutive supported verdicts
consecutive_supported = 0
for round_i in range(3):
    if consecutive_supported >= 2:
        break  # early stopping
    # ... repeat Steps 2-4 above
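Putting Steps 3 and 4 together for a binary POPE-style question, the gated flip could look like the sketch below. This is hypothetical: the paper's refinement also handles open-ended answers, and the per-claim "type" field in `checked` is an assumption so the right threshold can be looked up.

```python
def apply_gated_update(answer, checked, thresholds):
    """Flip a Yes/No answer only when a contradicted claim clears its gate.

    `checked` follows the verifier JSON schema, extended with a per-claim
    "type" field (an assumption, not in the paper's schema).
    """
    for verdict in checked:
        claim_type = verdict.get("type", "existence")
        if (verdict["status"] == "contradicted"
                and verdict["confidence"] >= thresholds[claim_type]):
            return "No" if answer == "Yes" else "Yes"
    return answer  # supported / insufficient / low-confidence: no change
```

With the thresholds above, a contradicted existence claim at confidence 0.92 flips the answer, while one at 0.80 leaves it unchanged.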
Original Abstract
Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly limits their deployment. As training LVLMs to avoid hallucinations becomes prohibitively expensive at larger scales, training-free methods offer a cheap and flexible solution to this problem, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework for LVLM hallucination mitigation that combines an explicit visual-grounding agent with an evidence-verified self-refinement mechanism. In detail, Kestrel first collects explicit visual evidence and converts tool outputs into reusable, structured textual evidence. Second, to take full advantage of this evidence, Kestrel verifies it via an LVLM judge for evidence checking, then iteratively self-refines answers based on the verified evidence to reduce the risk of over-correction. Extensive experiments show that Kestrel improves performance over strong baselines across hallucination benchmarks (e.g., an average +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL), while providing transparent verification traces for hallucination diagnosis and analysis -- e.g., with both the integrated self-refinement module and the grounding agent contributing an average +2.0% gain on POPE.