Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation
TL;DR Highlight
A training-free framework that lets vision-language models self-correct hallucinations by collecting visual evidence via SAM3 for iterative verification
Who Should Read
ML engineers and researchers tackling hallucination in multimodal AI services built on vision-language models, especially developers deploying open-weight models such as Qwen3-VL or InternVL in production.
Core Mechanics
- Training-free, operates only at test time — corrects hallucinations at inference without touching model parameters
- Uses SAM3 (an external tool that finds and segments concepts in images) to collect segmentation, bounding boxes, and zoomed views, converting them to reusable structured text evidence
- Decomposes answers into individual claims (existence/color/count/position) and verifies each against evidence, assigning verdicts (supported/contradicted/insufficient) with confidence scores
- Evidence-gated updates to prevent over-correction — only flips answers when confidence, evidence strength, and coverage all meet thresholds
- Up to 3 rounds with early stopping; most cases finish in 1-2 rounds — only 477 out of 9,000 cases (5.3%) went to round 3
- Qwen3-VL 8B: +3.31% average on POPE, +28.34 points on MME-Hallucination. In human preference evaluation, selected in 41 of 60 cases (68.3%) vs 2nd place DeGF at 13.3%
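The round structure described above can be sketched as a small control loop. This is a hypothetical outline, not the paper's implementation: `ground`, `verify`, and `refine` are caller-supplied stand-ins for the SAM3 grounding and LVLM-judge calls.

```python
def kestrel_loop(answer, claims, ground, verify, refine, max_rounds=3):
    """Run up to max_rounds of ground -> verify -> refine.

    Early-stops after two consecutive all-supported rounds. `ground`,
    `verify`, and `refine` are stand-ins for the SAM3 and LVLM-judge
    steps (hypothetical interface, not a real API).
    """
    supported_streak = 0
    for _ in range(max_rounds):
        evidence = ground(claims)             # SAM3: masks, boxes, crops
        verdicts = verify(claims, evidence)   # LVLM judge verdicts
        if all(v["status"] == "supported" for v in verdicts):
            supported_streak += 1
            if supported_streak >= 2:
                break                         # early stopping
        else:
            supported_streak = 0
            answer, claims = refine(answer, claims, verdicts)  # gated update
    return answer
```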
Evidence
- POPE benchmark: +3.31% average over Qwen3-VL, +3.03% over InternVL3.5. Additional +1.38-1.47pp over previous best baseline OPERA
- MME-Hallucination: Qwen3-VL 731.66 → 760.00 (+28.34), +16.67 over OPERA (743.33). InternVL3.5: 743.33 → 763.34 (+20.01)
- Human preference study (n=60): Kestrel 68.3% vs DeGF 13.3% vs Woodpecker 11.7% vs RITUAL 6.7% vs VCD 0.0%
- Efficiency: round 1 processes all 9,000 cases, round 2 only 4,978 (55%), round 3 only 477 (5.3%). Latency: baseline 0.78 s vs Kestrel 18.75 s (≈24×); GPU memory: 17,428 MB → 21,472 MB (≈1.23×)
How to Apply
- In VQA or image captioning services, add a post-processing layer after model output that decomposes answers into claims (e.g., 'there's a red car' → type:existence, target:car) and uses SAM3 API to detect the object and collect evidence
- In confidence-critical VQA pipelines (medical imaging, insurance claim photo analysis), adjust the confidence threshold (paper: 0.82-0.90) for conservative update criteria to filter hallucinations without over-correction
- In latency-tolerant batch processing environments (e.g., overnight batch image captioning), wrapping Qwen3-VL or InternVL3.5 with the Kestrel pipeline provides immediate accuracy gains without additional training
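The claim decomposition in the first bullet ('there's a red car' → typed claims) can be illustrated with a toy rule-based splitter. In the paper the LVLM itself emits this structure, so the function below is purely hypothetical and only shows the target format:

```python
COLOR_WORDS = {"red", "blue", "green", "black", "white", "yellow"}

def decompose(statement):
    """Split "there's a red car" into typed, independently checkable claims.

    Toy illustration of the claim schema; real Kestrel delegates this to
    the LVLM rather than using rules.
    """
    words = statement.lower().replace("'", " ").split()
    target = words[-1]                        # naive: last word is the object
    colors = [w for w in words if w in COLOR_WORDS]
    claims = [{"id": "c1", "type": "existence",
               "text": f"A {target} exists in the image", "targets": [target]}]
    if colors:
        claims.append({"id": "c2", "type": "color",
                       "text": f"The {target} is {colors[0]}",
                       "targets": [target]})
    return claims
```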
Code Example
# Kestrel core flow (simplified pseudocode; API names are illustrative)
# Step 1: Initialization - initial answer + claim generation
initial_prompt = """
You are given an image and a Yes/No question.
Determine the answer and output one verifiable claim.
- answer must be exactly "Yes" or "No"
- output exactly one claim with fields: id, type, text, targets
- type must be one of: existence, color, count, position
- text must be concrete and visually checkable
Return JSON only:
{
"answer": "Yes|No",
"verifiable_claims": [
{"id": "c1", "type": "existence", "text": "A red car exists in the image", "targets": ["car"]}
]
}
Question: {question}
"""
# Step 2: Agent Grounding - collecting visual evidence with SAM3
# SAM3 API call (concept-based segmentation)
visual_evidence = sam3.segment(image, concept=claim.targets[0])
evidence = {
    "e_seg_car": visual_evidence.overlay,               # segmentation overlay
    "e_count_car": len(visual_evidence.masks),          # instance count
    "e_crop_car": visual_evidence.crop_zoom,            # zoomed-in view
    "e_pos_car": derive_position(visual_evidence.bbox)  # position text
}
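```python
# `derive_position` is left undefined above; the sketch below is a plausible
# stand-in, not the paper's mapping. It buckets a bounding box into coarse
# position text using image thirds, and assumes (unlike the one-argument
# call above) that the image (width, height) is passed in explicitly.
def derive_position(bbox, image_size):
    """Map bbox (x0, y0, x1, y1) to text like 'top-left' via image thirds."""
    x0, y0, x1, y1 = bbox
    w, h = image_size
    cx, cy = (x0 + x1) / 2 / w, (y0 + y1) / 2 / h   # normalized center
    col = "left" if cx < 1/3 else ("right" if cx > 2/3 else "center")
    row = "top" if cy < 1/3 else ("bottom" if cy > 2/3 else "middle")
    return f"{row}-{col}"
```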
# Step 3: Claim-level Verification
verification_prompt = """
You are a strict verifier. Judge each claim using ONLY the provided evidence.
For each claim, choose exactly one status: supported | contradicted | insufficient
- supported: evidence clearly confirms the claim
- contradicted: evidence clearly refutes the claim
- insufficient: evidence is missing or ambiguous
Do NOT use common sense.
Return JSON only:
{
"verdict": "supported|contradicted|insufficient",
"checked": [
{
"claim_id": "c1",
"status": "contradicted",
"confidence": 0.92,
"why": "e_count_car shows 0 instances detected",
"citations": ["e_count_car", "e_seg_car"]
}
]
}
Question: {question}
Claims: {claims_json}
Evidence: {evidence_json}
"""
# Step 4: Evidence-gated Self-Refinement
# confidence thresholds per claim type (values from the paper)
THRESHOLD = {
    "existence": 0.85,
    "count": 0.85,
    "color": 0.82,
    "position": 0.90
}
def should_update(verdict, claim_type):
    """Allow an answer update only when the contradicting evidence is strong enough"""
    if verdict["status"] == "contradicted":
        return verdict["confidence"] >= THRESHOLD[claim_type]
    return False  # keep the current answer when uncertain
# max 3 rounds; early stop after 2 consecutive supported verdicts
consecutive_supported = 0
for round_i in range(3):
    if consecutive_supported >= 2:
        break  # early stopping
    # ... repeat Steps 2-4 above
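Putting Steps 3 and 4 together for a binary POPE-style question, the gated flip could look like the sketch below. This is hypothetical: the paper's refinement also handles open-ended answers, and the per-claim "type" field in `checked` is an assumption so the right threshold can be looked up.

```python
def apply_gated_update(answer, checked, thresholds):
    """Flip a Yes/No answer only when a contradicted claim clears its gate.

    `checked` follows the verifier JSON schema, extended with a per-claim
    "type" field (an assumption, not in the paper's schema).
    """
    for verdict in checked:
        claim_type = verdict.get("type", "existence")
        if (verdict["status"] == "contradicted"
                and verdict["confidence"] >= thresholds[claim_type]):
            return "No" if answer == "Yes" else "Yes"
    return answer  # supported / insufficient / low-confidence: no change
```

With the thresholds above, a contradicted existence claim at confidence 0.92 flips the answer, while one at 0.80 leaves it unchanged.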
Original Abstract
Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly limits their deployment. As training LVLMs to avoid hallucinations becomes prohibitively expensive at larger scales, training-free methods offer a cheap and flexible solution to this problem, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework for LVLM hallucination mitigation that combines an explicit visual-grounding agent with an evidence-verified self-refinement mechanism. In detail, Kestrel first collects explicit visual evidence and converts tool outputs into reusable, structured textual evidence. Second, to take full advantage of this evidence, Kestrel verifies it via an LVLM judge for evidence checking, then iteratively self-refines answers based on the verified evidence to reduce the risk of over-correction. Extensive experiments show that Kestrel improves performance over strong baselines across hallucination benchmarks (e.g., an average +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL), while providing transparent verification traces for hallucination diagnosis and analysis -- e.g., with both the integrated self-refinement module and the grounding agent contributing an average +2.0% gain on POPE.