Out-of-Distribution Detection with a Multimodal LLM: MM-OOD

Vision Also You Need: Navigating Out-of-Distribution Detection with Multimodal Large Language Model

Jan 20, 2026 • Haoran Xu, Yanlin Liu, Zizhao Tong +8

TL;DR Highlight

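MM-OOD is a framework that adds joint image-and-text multimodal reasoning to OOD detection, which previously relied on text alone, and catches anomalous samples more reliably in a zero-shot setting on top of CLIP.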

Who Should Read

ML engineers who need to detect out-of-distribution (OOD) samples in production CV systems. It is especially useful when building a zero-shot OOD detection pipeline without additional training.

Core Mechanics

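The prior method EOE envisioned outlier classes through LLM text reasoning alone; MM-OOD uses the joint image-and-text understanding of a multimodal LLM such as LLaVA to generate more diverse outliers.
For near OOD (distinguishing visually similar classes, such as dog vs. wolf), real ID images are fed directly into the MLLM to extract labels of similar outlier classes.
For far OOD (distinguishing semantically unrelated classes, such as food vs. car), a three-stage sketch-generate-elaborate framework is introduced: sketch outliers with text → generate OOD images with Stable Diffusion v1.5 → feed those images back into the MLLM to refine the final labels.
Feeding ID images directly into the MLLM causes an inductive bias in which the MLLM explores only the neighborhood of the ID space. For far OOD, this bias is bypassed by feeding in OOD images produced by a generative model.
GPT-4 first generates broad categories, LLaVA-1.5-7B or Qwen2-VL then proposes outlier class labels, and the CLIP text encoder is used to compute the ID/OOD classification score.
In real settings where it is unclear whether the OOD samples are near or far, the results of the two branches can be mixed at a 0.5 ratio.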

Evidence

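Near OOD: FPR95 of 3.84% on ImageNet-10 (3.17 percentage points better than EOE at 7.01%, and 9.97 percentage points better than Energy at 13.81%).
Far OOD average (L=12×K): FPR95 4.33%, AUROC 99.56%, ahead of all compared methods: EOE, MaxLogit, Energy, and MCM.
On the Food-101 dataset with LLaVA, average FPR95 of 1.12% (about a 50% improvement over EOE at 2.22%).
With LLaVA-1.5, average FPR95 1.94% and AUROC 99.62%, versus 3.13% and 99.35% for EOE (a consistent advantage across several settings of M, the number of primary categories).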
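
The figures above mix two kinds of improvement: absolute differences in percentage points and relative reductions. The short sketch below recomputes both from the reported FPR95 values; the helper name `improvement` is introduced here only for illustration.

```python
def improvement(baseline_fpr, our_fpr):
    """Return (absolute gain in percentage points, relative reduction in %)."""
    abs_pp = baseline_fpr - our_fpr
    rel_pct = abs_pp / baseline_fpr * 100
    return round(abs_pp, 2), round(rel_pct, 1)


vs_eoe_imagenet10 = improvement(7.01, 3.84)     # EOE vs MM-OOD, near OOD
vs_energy_imagenet10 = improvement(13.81, 3.84) # Energy vs MM-OOD, near OOD
vs_eoe_food101 = improvement(2.22, 1.12)        # EOE vs MM-OOD, Food-101
```

The Food-101 result is the one quoted as "about 50%", which is a relative reduction, while the ImageNet-10 numbers are quoted as percentage-point differences.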

How to Apply

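To add OOD detection to a CLIP-based image classifier: generate broad categories for the ID classes with GPT-4 → feed ID images plus a text prompt into LLaVA to generate outlier class labels → encode the ID and outlier labels together with the CLIP text encoder → compute the detection score as `S(x) = max_ID_score - 0.25 * max_OOD_score`.
For systems where far OOD matters, such as medical imaging or autonomous driving, apply the sketch-generate-elaborate pattern: (1) sketch text outlier labels with an LLM, (2) generate the corresponding images with Stable Diffusion, (3) feed the generated images plus text into the MLLM to refine the final outlier labels. Either LLaVA or Qwen2-VL can be used.
When the type of OOD (near or far) is not known in advance, generate outlier labels from both branches, mix them 0.5:0.5, and feed them to the CLIP classifier; this works without any separate configuration.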
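
Written out, the detection score from the first step can be read as follows. This is a reconstruction from the inline formula and the snippet in the Code Example section, not a quotation from the paper. The symbols $f_{\text{img}}$, $f_{\text{txt}}$, and $t_i$ are notation assumed here, with $K$ ID labels followed by $L$ outlier labels, and the softmax is taken over raw cosine similarities as in that snippet.

```latex
s_i(x) = \cos\big(f_{\text{img}}(x),\, f_{\text{txt}}(t_i)\big), \qquad
p_i(x) = \frac{e^{s_i(x)}}{\sum_{j=1}^{K+L} e^{s_j(x)}}

S(x) = \max_{1 \le i \le K} p_i(x) \;-\; \beta \max_{K < k \le K+L} p_k(x), \qquad \beta = 0.25
```

A larger $S(x)$ means the image matches some ID label better than any outlier label, so it is more likely to be in-distribution.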

Code Example


# Example MLLM prompt for near OOD detection (based on Appendix A of the paper)
prompt_template = """
Q: Given the image category [{id_class}] and this image,
please suggest visually similar categories that are not directly
related or belong to the same primary group as [{id_class}].
Provide suggestions that share visual characteristics but are
from broader and different domains than [{id_class}].

A: There are {num_outliers} classes similar to [{id_class}],
and they are from broader and different domains than [{id_class}]:
"""

# Far OOD: sketch-generate-elaborate flow
# Note: prompt_sketch, prompt_select_representative, and prompt_elaborate are
# placeholder prompt builders that are not defined in this snippet.
def sketch_generate_elaborate(id_labels, mllm, diffusion_model):
    # 1. Sketch: draft outlier classes from text alone
    sketch_labels = mllm(prompt_sketch(id_labels))

    # 2. Generate: create an image for a representative outlier label
    representative = mllm(prompt_select_representative(sketch_labels))
    ood_image = diffusion_model.generate(representative)

    # 3. Elaborate: feed the generated image into the MLLM to refine the final labels
    final_labels = mllm(prompt_elaborate(id_labels, ood_image))
    return final_labels

# CLIP-based detection score
import torch
import torch.nn.functional as F

def compute_ood_score(image_feat, text_feats_id, text_feats_ood, beta=0.25):
    # Cosine similarity between one image feature (D,) and every label feature (K+L, D)
    all_feats = torch.cat([text_feats_id, text_feats_ood], dim=0)
    logits = F.cosine_similarity(image_feat.unsqueeze(0), all_feats)
    # Softmax over the ID and outlier labels together
    softmax_all = torch.softmax(logits, dim=0)

    K = text_feats_id.shape[0]  # number of ID labels
    id_score = softmax_all[:K].max()
    ood_score = softmax_all[K:].max()

    return id_score - beta * ood_score  # higher means more likely ID
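
As a usage illustration, the scoring step can also be written for a batch of images. The sketch below is an assumption-laden variant rather than the paper's code: the function name `ood_scores` and the `temperature` parameter are introduced here (the default of 1.0 reproduces the unscaled softmax above), and the toy one-hot features stand in for real CLIP embeddings.

```python
import torch
import torch.nn.functional as F


def ood_scores(image_feats, text_feats_id, text_feats_ood, beta=0.25, temperature=1.0):
    """Score a batch of images; higher means more likely in-distribution."""
    text_feats = torch.cat([text_feats_id, text_feats_ood], dim=0)
    # Normalizing both sides makes the dot product equal to cosine similarity
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.T / temperature  # shape (N, K+L)
    probs = logits.softmax(dim=-1)

    k = text_feats_id.shape[0]
    id_part = probs[:, :k].max(dim=-1).values
    ood_part = probs[:, k:].max(dim=-1).values
    return id_part - beta * ood_part


# Toy features: two ID labels, one outlier label, in a 3-d embedding space
eye = torch.eye(3)
text_id, text_ood = eye[:2], eye[2:]
images = torch.stack([eye[0], eye[2]])  # first matches an ID label, second the outlier
scores = ood_scores(images, text_id, text_ood)
```

With these toy inputs the first image receives the higher score. With real CLIP features the unscaled softmax is fairly flat, so a smaller `temperature` may be worth trying, though the right value is not given in the source.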

Terminology

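OOD (Out-of-Distribution): A sample from outside the data distribution the model was trained on. If a model trained only on dog photos is shown a cat, that is OOD. It corresponds to unexpected inputs arriving in a real service.

CLIP: A model from OpenAI trained on image-text pairs. It embeds images and text into the same vector space, so it can quantify how closely an image matches a description such as "a dog".

MLLM: A large language model that accepts both text and images as input. GPT-4V, LLaVA, and Qwen2-VL are examples.

LLaVA: An open-source multimodal LLM. It connects a CLIP vision encoder to a LLaMA language model and can answer in text about an image.

FPR95: The fraction of OOD samples misclassified as ID when 95% of true ID samples are correctly identified. Lower is better; 0% is perfect.

AUROC: Area under the ROC curve. 1.0 is perfect separation and 0.5 is chance level. It summarizes how well ID and OOD are distinguished.

Zero-shot: The ability to recognize a class without ever having trained on samples of that class, so new categories can be handled immediately without training data.

CoT (Chain of Thought): A prompting technique that makes an LLM go through intermediate reasoning steps instead of answering directly. It tends to improve accuracy on complex problems.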
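
The FPR95 and AUROC definitions above can be made concrete with a small dependency-free sketch. The function names and the toy scores are illustrative assumptions. Scores follow the convention used in this document, where higher means more likely ID, and the AUROC is computed pairwise, which is simple but quadratic in the number of samples.

```python
import math


def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted at the threshold that keeps at least 95% of ID samples."""
    ranked = sorted(id_scores, reverse=True)
    n_keep = max(1, math.ceil(0.95 * len(ranked)))
    threshold = ranked[n_keep - 1]
    return sum(s >= threshold for s in ood_scores) / len(ood_scores)


def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample (ties count half)."""
    wins = 0.0
    for a in id_scores:
        for b in ood_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


id_scores = [0.9, 0.8, 0.7, 0.6]
ood_scores = [0.65, 0.3, 0.2, 0.1]
fpr95 = fpr_at_95_tpr(id_scores, ood_scores)
area = auroc(id_scores, ood_scores)
```

In this toy case one of the four OOD samples scores above the lowest-scoring ID sample, which is what drives both metrics away from their ideal values.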

Original Abstract

Out-of-Distribution (OOD) detection is a critical task that has garnered significant attention. The emergence of CLIP has spurred extensive research into zero-shot OOD detection, often employing a training-free approach. Current methods leverage expert knowledge from large language models (LLMs) to identify potential outliers. However, these approaches tend to over-rely on knowledge in the text space, neglecting the inherent challenges involved in detecting out-of-distribution samples in the image space. In this paper, we propose a novel pipeline, MM-OOD, which leverages the multimodal reasoning capabilities of MLLMs and their ability to conduct multi-round conversations for enhanced outlier detection. Our method is designed to improve performance in both near OOD and far OOD tasks. Specifically, (1) for near OOD tasks, we directly feed ID images and corresponding text prompts into MLLMs to identify potential outliers; and (2) for far OOD tasks, we introduce the sketch-generate-elaborate framework: first, we sketch outlier exposure using text prompts, then generate corresponding visual OOD samples, and finally elaborate by using multimodal prompts. Experiments demonstrate that our method achieves significant improvements on widely used multimodal datasets such as Food-101, while also validating its scalability on ImageNet-1K.