VisualPRM: A Process Reward Model for Multimodal Reasoning

VisualPRM: An Effective Process Reward Model for Multimodal Reasoning

Mar 13, 2025 • Weiyun Wang, Zhangwei Gao, Lianjie Chen +12

TL;DR Highlight

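An 8B-parameter judge model that scores the correctness of each solution step in image+text reasoning problems. Plugged into an existing model, it can raise reasoning performance by up to 8.4 points.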

Who Should Read

ML engineers who run or want to improve math/reasoning pipelines built on multimodal LLMs (image+text), especially teams looking to raise reasoning quality with Best-of-N sampling.

Core Mechanics

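A PRM (Process Reward Model), which scores the correctness of each step rather than the solution as a whole, consistently outperforms both an ORM (which looks only at the outcome) and Self-Consistency under Best-of-N.
VisualPRM400K, a 400K-sample multimodal process supervision dataset, is built with an automated pipeline: at each step, multiple completions are sampled to compute the expected accuracy (mc).
Applying a Best-of-8 strategy to InternVL2.5-8B, MiniCPM-V2.6, Qwen2.5-VL-7B, and InternVL2.5-78B improves overall performance by 8.4, 8.0, 3.7, and 5.9 points, respectively.
Using existing open-source MLLMs (including InternVL2.5-78B) as the judge model yields almost no improvement. The cause is a positive bias: they judge most steps as 'correct'.
VisualPRM computes the scores for all steps in a single forward pass, so its inference latency is much lower than that of an MLLM-as-Judge, which produces its judgment autoregressively.
VisualProcessBench, an evaluation benchmark with 2,866 samples and 26,950 human-annotated step-wise correctness labels, is released.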

Evidence

At Best-of-8, VisualPRM beats the ORM by 1.5 points and Self-Consistency by 2.4 points (on InternVL2.5-8B); increasing N to 128 widens the gaps to 4.3 and 3.1 points, respectively.
VisualProcessBench F1: VisualPRM 62.0 vs GPT-4o 60.3 vs Gemini-2.0-Flash 62.3. An 8B open-source model reaches step-verification ability on par with proprietary models.
Also effective on text-only benchmarks: Qwen2.5-7B gains +6.1 pt on MATH-500 and +5.0 pt on GPQA-Diamond; InternVL2.5-8B gains +9.4 pt on MATH-500.
As N grows from 8 to 128, VisualPRM keeps improving (InternVL2.5-8B: 41.2→44.0), while the ORM plateaus or declines beyond N=64.

How to Apply

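Sample the same problem N times (8 to 32) at temperature=0.7 with an existing multimodal model (InternVL2.5, Qwen2.5-VL, etc.), compute the mean step score of each solution with VisualPRM-8B, and select the highest-scoring answer. This improves reasoning quality without fine-tuning the model.
For aggregating step scores, 'mean' is the most stable; avoid 'max', which degrades performance because scores concentrate on the easy early steps.
If you need your own process supervision data, you can replicate the VisualPRM400K pipeline as is: Monte Carlo sample 16 completions at each step to estimate the expected accuracy (mc), and treat a step as correct if mc>0.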
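
The Monte Carlo labeling step described above can be sketched roughly as follows. This is a minimal illustration, not the VisualPRM400K pipeline itself: `estimate_mc`, `label_steps`, and `toy_complete` are hypothetical names, and the sketch assumes that a completion counts as a hit when its final answer matches the reference answer exactly. In practice `complete` would call your policy model to finish the solution from the given step prefix.

```python
from typing import Callable, List, Tuple

def estimate_mc(
    prefix: List[str],
    complete: Callable[[List[str]], str],
    gold_answer: str,
    num_samples: int = 16,
) -> float:
    """Fraction of sampled completions from `prefix` whose final answer matches."""
    hits = sum(complete(prefix) == gold_answer for _ in range(num_samples))
    return hits / num_samples

def label_steps(
    steps: List[str],
    complete: Callable[[List[str]], str],
    gold_answer: str,
    num_samples: int = 16,
) -> List[Tuple[float, bool]]:
    """Return (mc, is_correct) per step; a step is labeled correct when mc > 0."""
    labels = []
    for i in range(len(steps)):
        mc = estimate_mc(steps[: i + 1], complete, gold_answer, num_samples)
        labels.append((mc, mc > 0))
    return labels

# Hypothetical stand-in for a policy model: once a prefix contains a step
# marked "wrong", every completion ends in the wrong answer.
def toy_complete(prefix: List[str]) -> str:
    return "0" if any("wrong" in step for step in prefix) else "42"

labels = label_steps(["step a", "step b (wrong)", "step c"], toy_complete, "42")
print(labels)
```

With a real, stochastic policy model the mc values would fall anywhere between 0 and 1; the toy stand-in is deterministic only so that the labeling logic is easy to follow.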

Code Example

# Best-of-N selection (sketch). `generate` and `step_probs` are hypothetical
# wrappers around your own policy-model and VisualPRM-8B inference code.
from typing import Any, Callable, List, Sequence

def mean_step_score(probs: Sequence[float]) -> float:
    """Aggregate per-step P('+') values by averaging."""
    if not probs:
        raise ValueError("a candidate must contain at least one step")
    return sum(probs) / len(probs)

def best_of_n(
    image: Any,
    question: str,
    generate: Callable[[Any, str], List[str]],
    step_probs: Callable[[Any, str, List[str]], List[float]],
    n: int = 8,
) -> List[str]:
    # 1. Sample N candidate solutions from the policy model (e.g. InternVL2.5-8B
    #    at temperature=0.7); each candidate is a list of solution steps.
    candidates = [generate(image, question) for _ in range(n)]
    # 2. Score each candidate with the PRM: step_probs returns, for every step,
    #    the probability of the '+' token given the steps up to and including it.
    scores = [mean_step_score(step_probs(image, question, c)) for c in candidates]
    # 3. Return the candidate with the highest mean step score.
    return candidates[scores.index(max(scores))]

# VisualPRM prompt format (multi-turn chat)
# Turn 1: <image> + question + step_0
# Turn 2: step_1  → model predicts '+' or '-'
# Turn N: step_n  → model predicts '+' or '-'
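
The example above averages step scores. The toy comparison below shows why the choice of aggregation can change which candidate wins. The step scores are invented for illustration and are not results from the paper; `pick_best` and `mean_score` are hypothetical helper names.

```python
from typing import Callable, Dict, List, Sequence

def mean_score(probs: Sequence[float]) -> float:
    """Average of the per-step scores."""
    return sum(probs) / len(probs)

def pick_best(
    candidates: Dict[str, List[float]],
    aggregate: Callable[[Sequence[float]], float],
) -> str:
    """Return the name of the candidate with the highest aggregated score."""
    return max(candidates, key=lambda name: aggregate(candidates[name]))

# Invented per-step P('+') values for two candidate solutions.
candidates = {
    "A": [0.95, 0.90, 0.85],  # plausible at every step
    "B": [0.99, 0.40, 0.20],  # confident first step, then the reasoning breaks down
}

best_by_mean = pick_best(candidates, mean_score)  # A: 0.90 vs B: 0.53
best_by_max = pick_best(candidates, max)          # A: 0.95 vs B: 0.99
print(best_by_mean, best_by_max)
```

Under 'max', a single high-scoring early step is enough for candidate B to win even though its later steps are scored as likely wrong; under 'mean', those later steps pull its score down.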

Terminology

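PRM: Short for Process Reward Model. A judge model that checks each step of a solution one by one instead of scoring only the final answer. It is like an exam grader who awards partial credit across the whole solution process rather than looking only at the answer.

ORM: Short for Outcome Reward Model. It ignores the solution process and scores only whether the final answer is correct. The counterpart of a PRM.

Best-of-N: A strategy in which the model generates N answers to the same problem and the one the judge model rates highest is selected as the final answer. Similar to buying N lottery tickets and turning in the best one.

Test-Time Scaling: A technique that raises performance by spending more compute at inference time, without retraining the model. In effect, paying more to make more attempts and pick the best.

Self-Consistency: A method that solves the same problem multiple times and selects the most frequent answer. Same principle as a majority vote.

Monte Carlo sampling: A technique that estimates a value by making many random trials and averaging. It works like rolling a die 100 times to estimate the probability of a 6; here, a solution is completed multiple times from a given step to estimate that step's accuracy.

MLLM: Short for Multimodal Large Language Model. A large language model that can understand and respond to images as well as text. Models such as GPT-4o and Gemini fall into this category.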
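
As a rough illustration of the majority vote, the sketch below picks the most frequent final answer among sampled solutions. The function name `self_consistency` and the sample answers are made up for the example; it assumes final answers have already been extracted and normalized to comparable strings.

```python
from collections import Counter
from typing import List

def self_consistency(answers: List[str]) -> str:
    """Return the most frequent final answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]

# Invented final answers from 8 sampled solutions to the same problem.
sampled_answers = ["12", "15", "12", "12", "9", "15", "12", "12"]
final_answer = self_consistency(sampled_answers)
print(final_answer)
```

Unlike a PRM or ORM, this needs no judge model, but it can only look at final answers and ignores the quality of the intermediate steps.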

Original Abstract

We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families with Best-of-N (BoN) evaluation strategies. Specifically, our model improves the reasoning performance of three types of MLLMs and four different model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that our model exhibits superior performance compared to Outcome Reward Models and Self-Consistency during BoN evaluation. To facilitate the training of multimodal PRMs, we construct a multimodal process supervision dataset VisualPRM400K using an automated data pipeline. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels, to measure the abilities of PRMs to detect erroneous steps in multimodal reasoning tasks. We hope that our work can inspire more future research and contribute to the development of MLLMs. Our model, data, and benchmark are released in https://internvl.github.io/blog/2025-03-13-VisualPRM/.