Uncertainty Quantification for Multimodal LLMs Using Incoherence-adjusted Semantic Volume

Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume

Feb 27, 2026 • Gregory Kang Ruey Lau, Hieu Dao, Nicole Lin +1

TL;DR Highlight

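An uncertainty quantification framework that flags, without external tools, the queries a multimodal LLM is likely to get wrong, so they can be routed automatically to human experts or a larger model.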

Who Should Read

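ML engineers who deploy multimodal LLMs handling image, audio, or video input in production and want to build a pipeline that escalates likely model errors to humans or a larger model. Developers running reliability-critical services such as medical image analysis or multimodal QA systems.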

Core Mechanics

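Measures uncertainty as a single score by combining the semantic dispersion of multiple model-generated responses (Semantic Volume) with a coherence score based on the model's own token probabilities (Incoherence Score).
Runs on the MLLM's own embeddings and token probabilities alone, with no external NLI or reward model; applies to image, audio, and video without modality-specific implementation.
The two signals cover different error types: Semantic Volume (U) catches wrong answers that the model is overconfident about but that are semantically diverse, while Quadratic Entropy (Q) catches wrong answers whose probability mass is spread out.
Also applies to black-box APIs such as GPT-4o and Claude 3.5 Haiku by using a small open-source model (Llava-v1.5-13b) as a local proxy.
With only k=5 samples it reaches AUROC 0.87, far above single-sample methods (AUROC 0.63), at roughly 1000x lower computational overhead than Semantic Entropy.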
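
To make the two signals concrete, the following minimal sketch computes a semantic volume term and a quadratic entropy term on toy inputs. The function names, the eps jitter, and the 1/(2k) normalization follow the snippet in the Code Example section rather than a confirmed reference implementation, and the random vectors are stand-ins for real MLLM embeddings.

```python
import numpy as np

def semantic_volume(phi, eps=1e-6):
    """Scaled log-determinant of the jittered Gram matrix of unit-norm embeddings."""
    phi = phi / np.linalg.norm(phi, axis=1, keepdims=True)
    k = phi.shape[0]
    gram = phi @ phi.T + eps * np.eye(k)
    _, logdet = np.linalg.slogdet(gram)
    return logdet / (2 * k)

def quadratic_entropy(probs):
    """Mean of (1 - p) over the sampled responses' sequence probabilities."""
    return float(np.mean(1.0 - np.asarray(probs)))

rng = np.random.default_rng(0)
base = rng.normal(size=(1, 16))
similar = base + 0.01 * rng.normal(size=(5, 16))  # five near-identical responses
diverse = rng.normal(size=(5, 16))                # five unrelated responses

u_similar = semantic_volume(similar)
u_diverse = semantic_volume(diverse)

# Hypothetical sequence probabilities: a confident model vs. a diffuse one.
q_confident = quadratic_entropy([0.9, 0.95, 0.9, 0.85, 0.9])
q_diffuse = quadratic_entropy([0.1, 0.2, 0.05, 0.1, 0.15])
```

Near-duplicate responses make the Gram matrix almost singular, so U becomes strongly negative, while spread-out responses push it toward zero. Q rises as the sampled responses' probabilities fall.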

Evidence

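Average AUROC of 0.81 on image-text benchmarks, consistently ahead of the runner-up (Eigenscore, 0.80); on the adversarial dataset AdVQA, 78.7 vs. 77.5 for Eigenscore.
CPC (linearity of confidence) averages 90.4 overall, more than 11% above the runner-up, meaning the uncertainty score scales linearly with the actual error rate.
Lowest average ECE (calibration error) at 0.062 (Semantic Entropy 0.211, Eigenscore 0.227), so min-max normalization alone is enough to use the score as a proxy for error probability.
For image generation (AnyGPT), Pearson correlation with CLIP score is 81.5 vs. 44.0 for PUNC (a method designed specifically for image generation).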
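
For readers who want to reproduce this kind of evaluation on their own data, here is a minimal sketch of the two headline metrics. It assumes a pairwise (rank-based) AUROC and an equal-width-bin ECE applied to min-max-normalized scores; the paper's exact binning and evaluation protocol are not stated on this page, and the scores and labels below are made up for illustration.

```python
def auroc(uncertainty, is_error):
    """Probability that a wrong answer scores higher than a correct one (ties count half)."""
    wrong = [u for u, e in zip(uncertainty, is_error) if e]
    right = [u for u, e in zip(uncertainty, is_error) if not e]
    if not wrong or not right:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum((w > r) + 0.5 * (w == r) for w in wrong for r in right)
    return wins / (len(wrong) * len(right))

def min_max_normalize(scores):
    """Rescale scores to [0, 1] so they can be read as error-probability proxies."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def expected_calibration_error(error_probs, is_error, n_bins=10):
    """Weighted gap between mean predicted error probability and observed error rate per bin."""
    n = len(error_probs)
    bins = [[] for _ in range(n_bins)]
    for p, e in zip(error_probs, is_error):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((p, e))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        error_rate = sum(e for _, e in members) / len(members)
        ece += (len(members) / n) * abs(mean_p - error_rate)
    return ece

# Hypothetical raw uncertainty scores and error labels (1 = answer was wrong).
raw_scores = [-3.0, -2.5, -1.0, 0.0, 1.0]
is_error = [0, 0, 1, 0, 1]

auroc_value = auroc(raw_scores, is_error)
ece_value = expected_calibration_error(min_max_normalize(raw_scores), is_error)
```

AUROC depends only on the ranking of the scores, so it needs no normalization; ECE compares magnitudes, which is why the scores are rescaled first.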

How to Apply

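In the MLLM inference pipeline, sample k=5 to 10 responses, compute the UMPIRE score from the last-layer EOS token embeddings (ϕ) and token probabilities (p), and route to a larger model or a human when the score exceeds a threshold. k=5 is 10x faster than k=50 while still far outperforming single-sample methods.
When using a black-box API such as GPT-4o, run a small open-source MLLM (e.g. Llava-7b) as a local proxy to compute UMPIRE scores for the black-box responses; this works without fine-tuning the proxy.
Use the adaptive α strategy, which sets α automatically from a small number of unlabeled instances (1 to 5% of the total). α is computed from the ratio of the medians of U and Q and comes close to optimal without a labeled dev set.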
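
The adaptive α and routing steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the function names are hypothetical, α is taken as the median of U divided by the median of Q, and the absolute value is an added assumption to keep α positive, since the log-determinant term is typically negative for unit-norm embeddings. The U and Q values are invented calibration data.

```python
from statistics import median

def adaptive_alpha(u_values, q_values, tiny=1e-9):
    """Ratio of median magnitudes, so that alpha * Q lands on the same scale as U.

    Assumption: abs() is applied to the median of U to keep alpha positive.
    """
    return abs(median(u_values)) / (median(q_values) + tiny)

def umpire_scores(u_values, q_values, alpha):
    """Combine the two signals as V = U + alpha * Q for each instance."""
    return [u + alpha * q for u, q in zip(u_values, q_values)]

def route(score, lo, hi, threshold=0.5):
    """Min-max normalize with calibration-set bounds, then escalate above the threshold."""
    normalized = (score - lo) / (hi - lo) if hi > lo else 0.0
    return "escalate" if normalized > threshold else "answer"

# Hypothetical U and Q values from a small unlabeled calibration batch.
u_cal = [-4.0, -3.0, -2.5, -1.0, -0.5]
q_cal = [0.1, 0.2, 0.4, 0.6, 0.8]

alpha = adaptive_alpha(u_cal, q_cal)
cal_scores = umpire_scores(u_cal, q_cal, alpha)
lo, hi = min(cal_scores), max(cal_scores)
decisions = [route(s, lo, hi) for s in cal_scores]
```

No correctness labels are used anywhere above, which is the point of the strategy. The 0.5 threshold is only a starting value and would normally be tuned to the escalation budget.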

Code Example

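import torch
import numpy as np

def compute_sequence_log_prob(output, model):
    """
    Sum of log-probabilities of the generated tokens (assumes batch size 1).
    Needs generate(..., output_scores=True). Scores are taken after top_p
    filtering, so the result is relative to the truncated sampling distribution.
    """
    transition_scores = model.compute_transition_scores(
        output.sequences, output.scores, normalize_logits=True
    )
    return transition_scores[0].sum().item()

def compute_umpire(model, tokenizer, query, k=10, eps=1e-6, alpha=None):
    """
    Compute the UMPIRE uncertainty score (Algorithm 1).
    model: HuggingFace MLLM (e.g. LLaVA)
    tokenizer: not used in this snippet
    query: multimodal input (e.g. image + text), a dict unpacked into generate()
    k: number of sampled responses
    """
    embeddings = []  # k vectors of shape [d]
    probs = []       # k scalars

    for _ in range(k):
        with torch.no_grad():
            # Sample at temperature 1
            output = model.generate(
                **query,
                do_sample=True,
                temperature=1.0,
                top_p=0.9,
                output_hidden_states=True,
                output_scores=True,
                return_dict_in_generate=True,
            )

        # hidden_states is indexed [generation step][layer] -> [batch, seq, d].
        # Take the final step, last layer, last position (the state that emitted
        # the final token, normally EOS), then L2-normalize.
        last_hidden = output.hidden_states[-1][-1][0, -1, :].float()
        phi = last_hidden / last_hidden.norm()
        embeddings.append(phi.cpu().numpy())

        # Joint probability of the whole response
        log_prob = compute_sequence_log_prob(output, model)
        probs.append(np.exp(log_prob))

    Phi = np.stack(embeddings)  # [k, d]
    p = np.array(probs)         # [k]

    # Semantic Volume (U)
    G = Phi @ Phi.T + eps * np.eye(k)  # Gram matrix [k, k]
    _, logdet = np.linalg.slogdet(G)
    U = logdet / (2 * k)

    # Quadratic Entropy (Q) = incoherence score
    Q = np.mean(1 - p)

    # Adaptive alpha. This per-instance ratio is only a placeholder: it makes
    # alpha * Q roughly equal to U, so V collapses to about 2 * U. In practice,
    # pass an alpha computed from the median ratio over a batch.
    if alpha is None:
        alpha = U / (Q + 1e-9)

    # UMPIRE score
    V = U + alpha * Q
    return V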

# Usage example: a high V means high uncertainty -> escalate
# threshold = 0.5  # set after min-max normalization on an unlabeled set
# if compute_umpire(model, tokenizer, query) > threshold:
#     route_to_human_or_larger_model(query)
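
When the response comes from a black-box API instead of the local model, the proxy has to score a token sequence it did not sample. One plausible way to do this, assumed here and not confirmed by this page, is teacher forcing: feed the response through the proxy and sum the log-probabilities it assigns to each response token. The helper below is hypothetical and works on a plain logits array so that it stays independent of any particular model library.

```python
import numpy as np

def sequence_log_prob(logits, token_ids):
    """Sum of log-softmax probabilities of the given response tokens.

    logits: [T, V] array; row t holds the proxy's next-token logits at the
            position that predicts token_ids[t].
    token_ids: length-T sequence of response token ids.
    """
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(log_probs[np.arange(len(token_ids)), token_ids].sum())

# Toy check: 3 response tokens over a 4-token vocabulary with uniform logits,
# so every token has probability 1/4.
toy_logits = np.zeros((3, 4))
toy_tokens = [2, 0, 1]
lp = sequence_log_prob(toy_logits, toy_tokens)
```

Exponentiating this value gives the sequence probability p that feeds the Q term in the snippet above.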

Terminology

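MLLM: A large language model that understands images, audio, and video as well as text. Examples include GPT-4o, LLaVA, and Phi-4.

AUROC: A metric for how well the uncertainty score separates correct answers from incorrect ones. Closer to 1 is better; 0.5 is equivalent to random.

ECE: Expected Calibration Error. Measures whether a model that reports '70% confidence' is actually right 70% of the time. Lower values mean the predicted probabilities match reality more closely.

Confabulation: A model producing a plausible but wrong answer about something it does not know, as if it knew. Often also called 'hallucination'.

DPP: Determinantal Point Process. A mathematical tool that quantifies how diverse a set of items is via a determinant. Also used in recommender systems to ensure diverse results.

Quadratic Entropy: The probability that two independent samples yield different answers. It is less sensitive than Shannon entropy to extremely low probability values, so it can be estimated stably from few samples.

Semantic Volume: The 'volume' occupied by the responses when plotted as points in embedding space. The more semantically diverse the responses, the larger the volume and the higher the uncertainty.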
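
The Quadratic Entropy definition and the `np.mean(1 - p)` line in the code example are linked by a short identity, written out below. The notation is introduced here for illustration: p(y) is the model's probability of response y, and responses are compared as exact sequences.

```latex
Q \;=\; \Pr_{y,\,y' \sim p}\!\left[\, y \neq y' \,\right]
  \;=\; 1 - \sum_{y} p(y)^2
  \;=\; \mathbb{E}_{y \sim p}\!\left[\, 1 - p(y) \,\right]
  \;\approx\; \frac{1}{k} \sum_{i=1}^{k} \bigl( 1 - p(y_i) \bigr),
  \qquad y_i \sim p .
```

The last step is a Monte Carlo average over the k sampled responses, so each term stays bounded in [0, 1]. This is consistent with the claim that the estimate is stable with few samples, unlike a log-probability term that can grow without bound.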

Related Resources

https://github.com/daohieu17ctt/UMPIRE

Original Abstract

Despite their capabilities, Multimodal Large Language Models (MLLMs) may produce plausible but erroneous outputs, hindering reliable deployment. Accurate uncertainty metrics could enable escalation of unreliable queries to human experts or larger models for improved performance. However, existing uncertainty metrics have practical constraints, such as being designed only for specific modalities, reliant on external tools, or computationally expensive. We introduce UMPIRE, a training-free uncertainty quantification framework for MLLMs that works efficiently across various input and output modalities without external tools, relying only on the models' own internal modality features. UMPIRE computes the incoherence-adjusted semantic volume of sampled MLLM responses for a given task instance, effectively capturing both the global semantic diversity of samples and the local incoherence of responses based on internal model confidence. We propose uncertainty desiderata for MLLMs and provide theoretical analysis motivating UMPIRE's design. Extensive experiments show that UMPIRE consistently outperforms baseline metrics in error detection and uncertainty calibration across image, audio, and video-text benchmarks, including adversarial and out-of-distribution settings. We also demonstrate UMPIRE's generalization to non-text output tasks, including image and audio generation.