Kestrel: Grounding-Based Self-Refinement for LVLM Hallucination Mitigation

Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation

Mar 17, 2026 • Jiawei Mao, Hardy Chen, Haoqin Tu +7

TL;DR Highlight

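A training-free framework that lets vision-language models correct their own hallucinations (claiming things are present that are not) by collecting visual evidence with SAM3 and iteratively verifying answers against it.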

Who Should Read

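ML engineers and researchers who need to address hallucination in vision-language models within multimodal AI services, especially developers looking to put open-weight models such as Qwen3-VL or InternVL into production.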

Core Mechanics

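Training-free and test-time only: hallucinations are corrected at inference without touching model parameters.
Uses SAM3 (an external tool that finds and segments concepts in an image) to collect segmentations, bounding boxes, and zoomed-in views, then converts them into textual evidence that serves as reusable, structured data.
Splits the answer into individual claims such as existence, color, count, and position, verifies each claim against the evidence, and assigns a verdict (supported/contradicted/insufficient) with a confidence score.
To prevent over-correction, applies an evidence-gated update that flips the answer only when confidence, evidence strength, and coverage all meet their thresholds.
Runs at most 3 rounds, but early stopping ends most cases within 1 to 2 rounds: only 477 of 9000 cases (5.3%) reached round 3.
With Qwen3-VL 8B, improves the POPE average by +3.31% and MME-Hallucination by +28.34 points. In the human preference evaluation it was chosen in 41 of 60 cases (68.3%), far ahead of second-place DeGF (13.3%).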
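
The evidence-gated update can be sketched as a single predicate. This is a minimal illustration, not the paper's implementation: the per-type confidence thresholds are the ones quoted in this summary, while `MIN_CITATIONS` (a stand-in for evidence strength), `MIN_COVERAGE`, and the `Check` fields are hypothetical, since the summary does not say how strength or coverage are scored.

```python
from dataclasses import dataclass

# Per-type confidence thresholds as quoted in this summary (0.82 to 0.90).
CONFIDENCE_THRESHOLD = {"existence": 0.85, "count": 0.85, "color": 0.82, "position": 0.90}

# Hypothetical gates: the summary does not define strength or coverage scoring.
MIN_CITATIONS = 2    # stand-in for "evidence strength"
MIN_COVERAGE = 1.0   # fraction of claim targets backed by at least one evidence item


@dataclass
class Check:
    claim_type: str       # "existence" | "color" | "count" | "position"
    status: str           # "supported" | "contradicted" | "insufficient"
    confidence: float
    citations: list[str]  # ids of the evidence items the verifier cited
    coverage: float


def may_flip_answer(check: Check) -> bool:
    """Allow a flip only for a contradicted claim that passes every gate."""
    if check.status != "contradicted":
        return False
    return (
        check.confidence >= CONFIDENCE_THRESHOLD[check.claim_type]
        and len(check.citations) >= MIN_CITATIONS
        and check.coverage >= MIN_COVERAGE
    )
```

Requiring all gates at once is what makes the update conservative: a confident verdict backed by a single weak evidence item still leaves the answer unchanged.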

Evidence

POPE benchmark: average gains of +3.31% over Qwen3-VL and +3.03% over InternVL3.5, plus a further +1.38 to +1.47 percentage points over OPERA, the strongest existing baseline.
MME-Hallucination: 731.66 → 760.00 (+28.34) with Qwen3-VL, which is +16.67 points over OPERA (743.33); 743.33 → 763.34 (+20.01) with InternVL3.5.
Human preference study (n=60): Kestrel 68.3% vs DeGF 13.3% vs Woodpecker 11.7% vs RITUAL 6.7% vs VCD 0.0%.
Efficiency: all 9000 cases are processed in the first iteration, 4978 (55%) proceed to the second, and only 477 (5.3%) to the third. Latency rises from the 0.78s baseline to 18.75s with Kestrel (×24), and GPU memory from 17428MB to 21472MB (×1.23).
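
The efficiency figures can be checked with a few lines of arithmetic. The inputs are the numbers quoted above; the derived quantities (cases stopping per round, average rounds per case) are computed here and are not reported in the summary.

```python
# Figures quoted in the Evidence section.
cases_entering_round = [9000, 4978, 477]
total_cases = cases_entering_round[0]

# Cases that finished after round 1 and after round 2.
stopped_after = [
    cases_entering_round[0] - cases_entering_round[1],  # 4022
    cases_entering_round[1] - cases_entering_round[2],  # 4501
]
share_done_within_two = sum(stopped_after) / total_cases  # about 0.947

# Average number of rounds executed per case.
avg_rounds = sum(cases_entering_round) / total_cases  # about 1.61

latency_ratio = 18.75 / 0.78   # about 24.0
memory_ratio = 21472 / 17428   # about 1.23

print(f"done within two rounds: {share_done_within_two:.1%}")
print(f"average rounds per case: {avg_rounds:.2f}")
print(f"latency x{latency_ratio:.1f}, GPU memory x{memory_ratio:.2f}")
```

So although the cap is 3 rounds, a case runs about 1.6 rounds on average, and the roughly 24x latency already reflects that early stopping.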

How to Apply

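In a VQA or image-captioning service, you can add a post-processing layer that runs after the model answers: decompose the answer into claims (e.g., 'There is a red car' → type: existence, target: car), then detect the relevant objects via the SAM3 API to collect evidence.
In VQA pipelines where reliability matters (medical imaging, insurance accident photo analysis, etc.), tuning the confidence threshold (paper: 0.82 to 0.90) toward a conservative update criterion helps filter out hallucinations while limiting over-correction.
In batch settings that tolerate latency (e.g., overnight bulk caption generation), wrapping Qwen3-VL or InternVL3.5 in the Kestrel pipeline can raise accuracy without additional training.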
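
A post-processing layer first has to turn the model's reply into claims it can ground. The sketch below assumes the model returns JSON only, in the shape used by the initialization prompt in the Code Example section; `parse_initial_reply` and its filtering policy are illustrative names and choices, not part of the paper.

```python
import json

ALLOWED_TYPES = {"existence", "color", "count", "position"}


def parse_initial_reply(raw_reply: str) -> tuple[str, list[dict]]:
    """Parse the model's JSON reply into (answer, groundable claims)."""
    data = json.loads(raw_reply)
    answer = data["answer"]
    if answer not in ("Yes", "No"):
        raise ValueError(f"answer must be 'Yes' or 'No', got {answer!r}")

    claims = []
    for claim in data.get("verifiable_claims", []):
        # Skip claims that cannot be grounded: unknown type or no target concept.
        if claim.get("type") not in ALLOWED_TYPES or not claim.get("targets"):
            continue
        claims.append(claim)
    return answer, claims


reply = (
    '{"answer": "Yes", "verifiable_claims": ['
    '{"id": "c1", "type": "existence", "text": "A red car exists in the image",'
    ' "targets": ["car"]}]}'
)
answer, claims = parse_initial_reply(reply)
print(answer, [c["targets"] for c in claims])
```

Each surviving claim's `targets` list is what would be passed to the segmentation tool as the concept to look for.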

Code Example

# Kestrel core flow: simplified pseudocode

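# Step 1: Initialization - generate the initial answer and one claim
# Fill in {question} with str.replace, not str.format: the JSON braces below are literal.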
initial_prompt = """
You are given an image and a Yes/No question.
Determine the answer and output one verifiable claim.
- answer must be exactly "Yes" or "No"
- output exactly one claim with fields: id, type, text, targets
- type must be one of: existence, color, count, position
- text must be concrete and visually checkable

Return JSON only:
{
  "answer": "Yes|No",
  "verifiable_claims": [
    {"id": "c1", "type": "existence", "text": "A red car exists in the image", "targets": ["car"]}
  ]
}
Question: {question}
"""

# Step 2: Agent Grounding - collect visual evidence with SAM3
# `sam3.segment` and the result fields below are placeholders for a
# concept-based segmentation call, not a documented SAM3 API.
visual_evidence = sam3.segment(image, concept=claim.targets[0])
evidence = {
    "e_seg_car": visual_evidence.overlay,      # segmentation overlay
    "e_count_car": len(visual_evidence.masks),  # instance count
    "e_crop_car": visual_evidence.crop_zoom,    # zoomed-in crop
    "e_pos_car": derive_position(visual_evidence.bbox)  # position as text
}

# Step 3: Claim-level Verification
verification_prompt = """
You are a strict verifier. Judge each claim using ONLY the provided evidence.
For each claim, choose exactly one status: supported | contradicted | insufficient
- supported: evidence clearly confirms the claim
- contradicted: evidence clearly refutes the claim  
- insufficient: evidence is missing or ambiguous
Do NOT use common sense.

Return JSON only:
{
  "verdict": "supported|contradicted|insufficient",
  "checked": [
    {
      "claim_id": "c1",
      "status": "contradicted",
      "confidence": 0.92,
      "why": "e_count_car shows 0 instances detected",
      "citations": ["e_count_car", "e_seg_car"]
    }
  ]
}
Question: {question}
Claims: {claims_json}
Evidence: {evidence_json}
"""

# Step 4: Evidence-gated Self-Refinement
# Confidence thresholds (per the paper)
THRESHOLD = {
    "existence": 0.85,
    "count": 0.85,
    "color": 0.82,
    "position": 0.90
}

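def should_update(verdict, claim_type):
    """Allow an answer update only when the evidence is strong enough.

    `verdict` is one entry of the verifier's "checked" list.
    """
    if verdict["status"] == "contradicted":
        return verdict["confidence"] >= THRESHOLD[claim_type]
    return False  # when uncertain, keep the current answer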

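# At most 3 rounds; stop early after 2 consecutive "supported" verdicts
consecutive_supported = 0
for round_i in range(3):
    if consecutive_supported >= 2:
        break  # early stopping
    # ... repeat the steps above, updating consecutive_supported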
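
The loop can be made runnable by injecting the verifier as a function. This is a sketch under assumptions: `refine` and the `verify` callable are hypothetical names, the stopping rule is the one given in the pseudocode above (the paper's exact criterion may differ), and the gate checks confidence only.

```python
from typing import Callable

MAX_ROUNDS = 3
THRESHOLD = {"existence": 0.85, "count": 0.85, "color": 0.82, "position": 0.90}


def refine(
    answer: str,
    claim_type: str,
    verify: Callable[[str, int], dict],
) -> tuple[str, int]:
    """Run up to MAX_ROUNDS of verify-then-maybe-flip; return (answer, rounds_used)."""
    consecutive_supported = 0
    rounds_used = 0
    for round_i in range(MAX_ROUNDS):
        rounds_used += 1
        check = verify(answer, round_i)  # returns {"status": ..., "confidence": ...}
        if check["status"] == "supported":
            consecutive_supported += 1
            if consecutive_supported >= 2:
                break  # early stopping
            continue
        consecutive_supported = 0
        gated = check["confidence"] >= THRESHOLD[claim_type]
        if check["status"] == "contradicted" and gated:
            answer = "No" if answer == "Yes" else "Yes"  # flip the Yes/No answer
    return answer, rounds_used


def stub_verify(answer: str, round_i: int) -> dict:
    """Stand-in verifier: contradicts "Yes" with high confidence, supports "No"."""
    if answer == "Yes":
        return {"status": "contradicted", "confidence": 0.92}
    return {"status": "supported", "confidence": 0.90}


print(refine("Yes", "existence", stub_verify))
```

With the stub, the initial "Yes" is flipped in round 1 and the new answer is then supported twice, so all three rounds are used.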

Terminology

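LVLM: Short for Large Vision-Language Model, a large AI model that understands images and text together. Examples include GPT-4V and Qwen3-VL, which can answer questions about a photo.

Hallucination: When an AI states that something absent from the image is present, or asserts incorrect information with confidence, much like a person who insists they saw something that was not there.

SAM3: The third version of the Segment Anything Model, introduced as "Segment Anything with Concepts". Given a concept as text, such as 'car' or 'cat', it automatically finds the location and region of matching objects in the image.

Training-free: An approach that works only at inference time, without further training of the model. It is cheap and can be applied to existing models without retraining.

Claim-level Verification: Rather than verifying the whole answer at once, the answer is split into small units (claims) such as 'does it exist?' or 'how many are there?', and each is checked against the evidence.

Evidence-gated Update: A safeguard that changes the answer only when the evidence is strong enough. If the confidence score does not clear the threshold, the change is deferred and stronger evidence is collected in the next round.

POPE: Polling-based Object Probing Evaluation, a hallucination benchmark that systematically tests whether an LVLM claims objects are present when they are absent from the image.

MME-Hallucination: The hallucination-related subset of the MME benchmark, scoring model accuracy on four categories: existence, count, position, and color.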
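
POPE poses yes/no questions about object presence, so it can be scored with standard binary-classification metrics. The sketch below treats "yes" as the positive class, which makes a hallucinated object a false positive. `pope_metrics` is an illustrative helper, and this summary does not state which of these metrics the reported POPE gains refer to.

```python
def pope_metrics(predictions: list[str], labels: list[str]) -> dict[str, float]:
    """Score yes/no answers, treating "yes" as the positive class."""
    pairs = [(p.strip().lower(), l.strip().lower()) for p, l in zip(predictions, labels)]
    if not pairs:
        raise ValueError("no predictions to score")

    tp = sum(p == "yes" and l == "yes" for p, l in pairs)
    fp = sum(p == "yes" and l == "no" for p, l in pairs)  # hallucinated objects
    fn = sum(p == "no" and l == "yes" for p, l in pairs)
    tn = sum(p == "no" and l == "no" for p, l in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(pairs),
    }


scores = pope_metrics(["Yes", "No", "No", "No"], ["yes", "no", "no", "yes"])
print(scores)
```

A high yes-ratio combined with low precision is the typical signature of a model that over-reports objects.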

Original Abstract

Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly narrows their deployment. As training these LVLMs to avoid hallucinations becomes prohibitively expensive for larger models, training-free methods offer a cheap and flexible solution to this problem, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework for LVLM hallucination mitigation that combines an explicit visual-grounding agent with evidence-verified self-refinement mechanism. In detail, Kestrel first collects explicit visual evidence and converts tool outputs into reusable and structured textual evidence. Second, to take full advantage of these evidence, Kestrel verifies them via an LVLM judge for evidence checking, then iteratively self-refine answers based on verified evidence to reduce the risk of over-correction. Extensive experiments show that Kestrel improves performance over strong baselines across hallucination benchmarks (e.g., average +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL), while providing transparent verification traces for hallucination diagnosis and analysis -- e.g., both the integrated self-refinement module and grounding agent contributing an average +2.0% gain on POPE.