Strengthening Visual-Centric Instruction Following in MLLMs: The VC-IFEval Benchmark

Empowering Reliable Visual-Centric Instruction Following in MLLMs

Jan 6, 2026 • Wei He, Feng Ju, Zhiyuan Fan +3

TL;DR Highlight

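The authors built a benchmark that checks whether a multimodal model actually consults the image, plus a 10k-sample fine-tuning dataset; existing evaluations could be passed without the image at all.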

Who Should Read

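ML engineers and researchers who evaluate or fine-tune multimodal LLMs (models that process images and text). Especially anyone who wants to tell whether a model really looks at the image when answering or answers from text patterns alone.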

Core Mechanics

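Existing benchmarks such as MM-IFEval can be passed by satisfying the textual conditions alone, with no image needed; a model can get by on language habits rather than real visual understanding
The VC-IFEngine pipeline automatically generates image-specific visual constraints across 10 categories, including Spatial, Attribute, and Comparative
A 10k-sample SFT dataset (VC-IFInstruct) and a 10k-sample DPO (preference learning) dataset (VC-IFDPO) are planned for release
Rejected samples for DPO are generated either by removing a portion of the constraints or by editing the image with Stable Diffusion; removing 100% of the constraints gave the best performance
Even Qwen2.5-VL-32B, the strongest open-source model, reaches only 67.3% on VC-IFEval, so current models are weaker at visual instruction following than one might expect
Evaluation combines two methods, direct judgment by GPT-4o and a comparison of responses with and without the image, to filter out language bias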
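
The constraint-removal route to rejected samples can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `build_preference_prompts`, the prompt layout, and the `remove_ratio` parameter are all assumptions made for the example. The idea is to answer the full prompt for the chosen response and a weakened prompt (with some or all constraints dropped) for the rejected one.

```python
import random


def build_preference_prompts(instruction, constraints, remove_ratio=1.0, seed=0):
    """Return (full_prompt, weakened_prompt) for one preference pair.

    Hypothetical helper: the weakened prompt drops a fraction of the
    constraints, so a response to it tends to violate them.
    """
    if not 0.0 < remove_ratio <= 1.0:
        raise ValueError("remove_ratio must be in (0, 1]")
    rng = random.Random(seed)
    # Drop at least one constraint so the two prompts really differ.
    n_remove = max(1, round(len(constraints) * remove_ratio)) if constraints else 0
    removed = set(rng.sample(range(len(constraints)), n_remove))
    kept = [c for i, c in enumerate(constraints) if i not in removed]

    def render(active):
        if not active:
            return instruction
        return instruction + "\nConstraints:\n" + "\n".join(f"- {c}" for c in active)

    return render(constraints), render(kept)


full_prompt, weakened_prompt = build_preference_prompts(
    "Describe the image.",
    [
        "Mention only objects left of the door.",
        "Name the color of the largest object.",
    ],
    remove_ratio=1.0,  # the summary reports 100% removal as the best setting
)
```

In a DPO record the prompt would stay `full_prompt` for both responses: the chosen response is generated from `full_prompt`, the rejected one from `weakened_prompt`, both with the same image.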

Evidence

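Qwen2.5-VL-7B-Instruct: VC-IFEval rises from 57.3% to 63.0% after SFT on VC-IFInstruct and to 66.1% after DPO (+8.8 percentage points over the base model)
LLaVA-NeXT-Llama3-8B: 50.1% to 53.3% after SFT and 60.2% after DPO (+10.1 percentage points over the base model)
Human agreement is 92% for the comparative evaluation and 90% for the direct evaluation, which validates the reliability of the automatic evaluation
For 80% of the VC-IFDPO preference data, both annotators verified the chosen/rejected pair as correct; inter-annotator agreement (IAA) is Cohen's κ = 0.86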
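
For readers unfamiliar with the agreement statistic, Cohen's κ corrects the raw agreement rate between two annotators for the agreement expected by chance. The sketch below implements the standard formula κ = (p_o − p_e) / (1 − p_e); the toy labels are made up for illustration and are not the paper's annotation data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy example (illustrative labels only).
annotator_1 = ["valid", "valid", "invalid", "invalid"]
annotator_2 = ["valid", "valid", "invalid", "valid"]
kappa = cohens_kappa(annotator_1, annotator_2)
```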

How to Apply

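To check whether your model actually consults the image: generate a response with the image and one without it for the same question, then ask GPT-4o for an 'Influenced / Not influenced' verdict
When building data with visual constraints: pick the categories that fit the image from the 10 available, such as Spatial (position), Attribute (color/texture), Comparative (comparison), and Counting (number of objects), and rewrite them into concrete constraints
If you need rejected samples for DPO training, 'removing constraints' is more effective than 'removing the image'; dropping the image actually weakens the learning signal for visual grounding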
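
The with/without-image check from the first step can be wrapped in a small harness. This is a sketch under stated assumptions: `generate` and `judge` are hypothetical callables you supply (a wrapper around your MLLM and a wrapper around a judge model such as GPT-4o), and `prompt_template` is any string with `{question}`, `{answer_with_image}`, and `{answer_without_image}` placeholders. The returned fraction is a simple illustrative aggregate, not necessarily the paper's exact score.

```python
def image_influence_rate(samples, generate, judge, prompt_template):
    """Fraction of samples whose answer the judge deems influenced by the image.

    samples:  list of dicts with "question" and "image" keys (assumed layout)
    generate: callable(question, image_or_None) -> answer string (hypothetical)
    judge:    callable(prompt) -> "Influenced" or "Not influenced" (hypothetical)
    """
    if not samples:
        return 0.0
    influenced = 0
    for sample in samples:
        question = sample["question"]
        # Same question, answered once with the image and once without it.
        answer_with = generate(question, sample["image"])
        answer_without = generate(question, None)
        verdict = judge(
            prompt_template.format(
                question=question,
                answer_with_image=answer_with,
                answer_without_image=answer_without,
            )
        )
        # Exact match avoids counting "Not influenced" as "Influenced".
        if verdict.strip().lower() == "influenced":
            influenced += 1
    return influenced / len(samples)
```

Keeping the model and the judge behind plain callables makes the harness easy to test with stubs before spending API calls on a real judge.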

Code Example

# VC-IFEval-style comparative evaluation prompt (measures image influence)
comparative_judge_prompt = """
You are evaluating whether the availability of IMAGE caused a substantive influence on the model's answer.
You will be given the question and two answers:
- Answer A: produced WITH image available.
- Answer B: produced WITHOUT image.

Guidelines:
- If Answer A contains details that plausibly come from visual evidence (objects, layout, colors, counts, attributes)
  and such details are missing/incorrect in Answer B, or the final conclusions differ BECAUSE of visual cues,
  judge it as "Influenced".
- If both answers are essentially the same in conclusions and key details (only minor wording differs),
  judge "Not influenced".

Question: {question}
Answer A (WITH image): {answer_with_image}
Answer B (WITHOUT image): {answer_without_image}

Return exactly one word: Influenced or Not influenced.
"""

# Direct evaluation prompt (checks whether constraints are satisfied)
direct_judge_prompt = """
You are asked to judge whether the AI assistant's response fully complies with each listed constraint.
1. Each judgment should be grounded in the visual evidence provided by the image.
2. Assign 1 point if completely satisfied; assign 0 otherwise.

<start of response> {prediction} <end of response>
<start of constraint list> {constraints} <end of constraint list>

Output format: Judgement: ... Summary: constraint_1: x/1, constraint_2: x/1, ...
"""
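
The judge replies still need to be turned into numbers. The sketch below parses the two output formats requested by the prompts above; the helper names and the averaging in `constraint_accuracy` are assumptions for illustration, one plausible way to aggregate per-constraint scores rather than the paper's exact metric definition.

```python
import re


def parse_direct_judgement(text):
    """Extract {constraint_index: 0 or 1} from a 'constraint_k: x/1' summary."""
    # Only look after the last "Summary:" marker, if the judge emitted one.
    summary = text.rsplit("Summary:", 1)[-1]
    pairs = re.findall(r"constraint_(\d+)\s*:\s*([01])\s*/\s*1", summary)
    return {int(index): int(score) for index, score in pairs}


def constraint_accuracy(judgements):
    """Mean per-constraint score over a list of judge replies."""
    scores = [s for text in judgements for s in parse_direct_judgement(text).values()]
    return sum(scores) / len(scores) if scores else 0.0


def parse_influence_verdict(text):
    """Map the comparative judge's reply to True (influenced) or False."""
    cleaned = text.strip().strip(".\"'").lower()
    if cleaned == "not influenced":
        return False
    if cleaned == "influenced":
        return True
    raise ValueError(f"unexpected verdict: {text!r}")


reply = "Judgement: ... Summary: constraint_1: 1/1, constraint_2: 0/1"
scores = parse_direct_judgement(reply)
```

Raising on an unexpected verdict, instead of defaulting to one class, keeps malformed judge replies from silently skewing the score.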

Terminology

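MLLM: Multimodal Large Language Model. A large language model that processes images and text together. GPT-4o and Qwen2.5-VL are examples.

SFT: Supervised Fine-Tuning. A training method that shows the model reference answers to imitate, much like studying worked examples at school and then solving similar problems.

DPO: Direct Preference Optimization. A training method that shows the model pairs of a 'good answer' and a 'bad answer' and steers it toward the good one. It learns preferences more simply than reinforcement learning.

Instruction Following: The ability to do exactly what the user asks, that is, to respect conditions such as 'write it in under 300 words' or 'describe only what is in the foreground'.

Visual Grounding: Basing a generated answer on the actual content of the image. A model that answers from imagination without looking at the image has no visual grounding.

CFA: Constraint-following Accuracy. A score that measures how well the constraints were followed.

IIS: Image Influence Score. A score that measures how much the response changes between having and not having the image. The higher it is, the more the model really relied on the image.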

Original Abstract

Evaluating the instruction-following (IF) capabilities of Multimodal Large Language Models (MLLMs) is essential for rigorously assessing how faithfully model outputs adhere to user-specified intentions. Nevertheless, existing benchmarks for evaluating MLLMs' instruction-following capability primarily focus on verbal instructions in the textual modality. These limitations hinder a thorough analysis of instruction-following capabilities, as they overlook the implicit constraints embedded in the semantically rich visual modality. To address this gap, we introduce VC-IFEval, a new benchmark accompanied by a systematically constructed dataset that evaluates MLLMs' instruction-following ability under multimodal settings. Our benchmark systematically incorporates vision-dependent constraints into instruction design, enabling a more rigorous and fine-grained assessment of how well MLLMs align their outputs with both visual input and textual instructions. Furthermore, by fine-tuning MLLMs on our dataset, we achieve substantial gains in visual instruction-following accuracy and adherence. Through extensive evaluation across representative MLLMs, we provide new insights into the strengths and limitations of current models.