Human-Level Reasoning through Multimodal World Models: The Role of Visual Generation

Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models

Jan 27, 2026 • Jialong Wu, Xiaoying Zhang, Hongyi Yuan +7

TL;DR Highlight

'Visual CoT', which generates images as it reasons, beats text-only CoT on spatial and physical problems, with gaps of more than 26 percentage points in the best case.

Who Should Read

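ML engineers building VLM-based agents or spatial and physical reasoning systems. Researchers weighing whether to use a multimodal model's image generation capability for reasoning.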

Core Mechanics

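Models that rely only on text reasoning (verbal CoT) do well at math and code but fail at spatial and physical problems that even a young child can solve, because they have no pathway for visual information.
In a UMM (a unified model that generates both text and images), 'interleaved CoT', which generates images partway through reasoning, yields large gains on paper folding, object manipulation, and ball trajectory prediction.
Visual world modeling is also 4x more sample-efficient: it reaches the same performance with only one quarter of the training data the text approach needs.
On simple grid problems such as Maze and Sokoban, image generation does not help, because tracking two coordinates in text is already enough.
Probing the model's internal layers on the maze task shows that position state is implicitly encoded in the hidden representations even without explicit coordinates (an emergent implicit world model).
The performance gap between visual CoT and text CoT persists after additional RL (reinforcement learning) training, which suggests a structural advantage rather than a chance difference.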
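The probing idea in the maze point can be illustrated with a small probe classifier. The sketch below is not the paper's code and does not touch a real model: it assumes synthetic "hidden states" built as a random linear mix of a grid position plus noise, and the names `make_synthetic_hidden_states`, `fit_centroid_probe`, and `predict` are hypothetical. It only shows the general technique of checking whether a position can be read out of a representation that never stores coordinates explicitly.

```python
import random

def make_synthetic_hidden_states(n, dim, grid, noise, seed):
    """Fake 'hidden states' (an assumption): a fixed random linear mix of (x, y) plus Gaussian noise."""
    rng = random.Random(seed)
    mix = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
    data = []
    for _ in range(n):
        x, y = rng.randrange(grid), rng.randrange(grid)
        h = [x * mix[0][d] + y * mix[1][d] + rng.gauss(0, noise) for d in range(dim)]
        data.append((h, (x, y)))
    return data

def fit_centroid_probe(train):
    """Nearest-centroid probe: store one mean hidden vector per position label."""
    sums, counts = {}, {}
    for h, label in train:
        acc = sums.setdefault(label, [0.0] * len(h))
        for d, v in enumerate(h):
            acc[d] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in s] for label, s in sums.items()}

def predict(centroids, h):
    """Return the position label whose centroid is closest to h (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(h, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

data = make_synthetic_hidden_states(n=1000, dim=16, grid=4, noise=0.3, seed=0)
train, test = data[:800], data[800:]
centroids = fit_centroid_probe(train)
accuracy = sum(predict(centroids, h) == label for h, label in test) / len(test)
print(f"held-out probe accuracy: {accuracy:.3f}")
```

High held-out accuracy on real hidden states would indicate that position is decodable from the representation. On this synthetic data it is high by construction, so the sketch demonstrates the procedure, not the finding.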

Evidence

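Paper folding accuracy: visual CoT 39.2% vs text CoT 27.4% vs implicit CoT 21.1% (BAGEL-7B, SFT)
Multi-step object manipulation: visual CoT 66.6% vs implicit CoT 40.0% (text CoT omitted because of the limits of coordinate representations)
3D cube projection: visual CoT 76.8% vs text CoT 60.2%. The view-synthesis fidelity of the text world model is close to 0%, while the visual one exceeds 50%.
Even recent commercial models such as GPT-4o, Gemini 3 Pro, and o3 average only about 32 to 60% on VisWorld-Eval, so spatial reasoning remains an open problem.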
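The percentage-point gaps implied by the figures above can be worked out directly. The short sketch below only re-derives differences from the accuracies listed in this section. The dictionary layout and the helper name `gaps_vs_visual` are illustrative choices, not part of the paper.

```python
# Accuracies (%) copied from the Evidence list above.
scores = {
    "paper_folding": {"visual": 39.2, "verbal": 27.4, "implicit": 21.1},
    "object_manipulation": {"visual": 66.6, "implicit": 40.0},  # verbal CoT omitted in the source
    "cube_projection": {"visual": 76.8, "verbal": 60.2},
}

def gaps_vs_visual(task_scores):
    """Percentage-point gap between visual CoT and each other method for one task."""
    visual = task_scores["visual"]
    return {m: round(visual - s, 1) for m, s in task_scores.items() if m != "visual"}

gaps = {task: gaps_vs_visual(s) for task, s in scores.items()}
for task, g in gaps.items():
    print(task, g)
```

Among the numbers listed here, the largest gap is 26.6 points on object manipulation, measured against implicit CoT because the verbal baseline is omitted for that task.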

How to Apply

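For tasks that need spatial rotation, physical simulation, or multi-view understanding, design the pipeline to generate images in the middle of the reasoning chain (interleaved CoT), and choose a UMM such as BAGEL as the backbone.
When the state reduces to a few coordinates, as in simple path finding or grid puzzles, skip image generation: text CoT alone is enough, which saves unnecessary generation cost.
Use the seven VisWorld-Eval tasks (paper folding, object manipulation, ball tracking, cube projection, real-world spatial reasoning, maze, Sokoban) as the evaluation standard to diagnose where your own model's spatial reasoning bottlenecks are.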
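The three recommendations above could be wired together roughly as follows. This is a minimal sketch under stated assumptions: the task identifiers, the routing table (including placing real-world spatial reasoning on the visual side), and the helpers `choose_cot_mode`, `accuracy_by_task`, and `weakest_task` are hypothetical, and the example results are made up purely to exercise the code.

```python
from collections import defaultdict

# Routing table inferred from this summary; it is an assumption, not an official config.
VISUAL_TASKS = {"paper_folding", "object_manipulation", "ball_tracking",
                "cube_projection", "real_world_spatial"}
VERBAL_TASKS = {"maze", "sokoban"}

def choose_cot_mode(task):
    """Return 'interleaved' for tasks that favor visual world modeling, else 'verbal'."""
    if task in VISUAL_TASKS:
        return "interleaved"
    if task in VERBAL_TASKS:
        return "verbal"
    raise ValueError(f"unknown task: {task}")

def accuracy_by_task(results):
    """results: iterable of (task, is_correct) pairs -> {task: accuracy in [0, 1]}."""
    totals, correct = defaultdict(int), defaultdict(int)
    for task, ok in results:
        totals[task] += 1
        correct[task] += int(ok)
    return {t: correct[t] / totals[t] for t in totals}

def weakest_task(results):
    """Name of the task with the lowest accuracy, i.e. the likely bottleneck."""
    acc = accuracy_by_task(results)
    return min(acc, key=acc.get)

# Made-up evaluation records, for illustration only.
results = [("maze", True), ("maze", True), ("paper_folding", False),
           ("paper_folding", True), ("cube_projection", True)]
print(choose_cot_mode("paper_folding"), choose_cot_mode("maze"))
print(accuracy_by_task(results), weakest_task(results))
```

Keeping the routing decision in one table makes it easy to move a task between modes once measured accuracy and generation cost are available for it.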

Code Example

# Example interleaved CoT prompt in the style of VisWorld-Eval (paper folding)

system_prompt = """
You are a multimodal reasoning assistant.
When solving spatial tasks, generate intermediate images to visualize each step.
Use <image> tags to indicate where you would generate an image.
"""

user_prompt = """
An image shows a sheet of paper folded twice with a hole punched through.
Step-by-step, reverse the folding process to find the total number of holes.

Reasoning format:
<think>
1. Analyze the current folded state.
2. Reverse fold 2: [verbal reasoning] → <image> (generate: unfolded state after step 2)
3. Reverse fold 1: [verbal reasoning] → <image> (generate: fully unfolded paper)
4. Count holes.
</think>
Answer: [number]
"""
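A response that follows this prompt format mixes text with `<image>` placeholders, so a caller would need to split it into steps before handing each placeholder to an image generator. The sketch below shows one way to do that with the standard library. The function name `split_interleaved_trace` and the example trace are hypothetical, and the `<image>` tag convention is taken from the prompt above, not from any model's documented output format.

```python
import re

def split_interleaved_trace(trace):
    """Split a reasoning trace into ('text', str) and ('image', None) steps at <image> tags."""
    steps = []
    # The capturing group keeps the <image> delimiters in the split result.
    for part in re.split(r"(<image>)", trace):
        if part == "<image>":
            steps.append(("image", None))
        elif part.strip():
            steps.append(("text", part.strip()))
    return steps

# Hypothetical model output, written by hand for illustration.
example_trace = (
    "Reverse fold 2: the crease is vertical, so the hole mirrors left to right. <image> "
    "Reverse fold 1: the crease is horizontal, so both holes mirror top to bottom. <image> "
    "Count holes: 4."
)
steps = split_interleaved_trace(example_trace)
for kind, text in steps:
    print(kind, text)
```

Each `("image", None)` entry marks the point where a UMM would emit an intermediate image. A text-only baseline could drop those entries and keep the rest of the trace unchanged.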

Terminology

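UMM: A unified multimodal model that can generate both text and images. Where earlier VLMs could only 'read' images, a UMM can also 'write' them.

CoT: Chain-of-Thought. A technique in which the model outputs its intermediate reasoning step by step before the final answer, much like a person writing out their work when solving a problem.

Interleaved CoT: A reasoning style that alternates text reasoning steps with image generation steps: explain in words, draw a picture, then analyze in words again.

World Model: An internal 'world simulator' held by the AI. A representation that lets it predict the outcome of an action in its head without actually carrying it out.

SFT: Supervised Fine-Tuning. Training the model directly on worked examples (CoT plus answer), similar to cooking by following a recipe.

RLVR: Reinforcement Learning from Verifiable Rewards. Reinforcement learning that rewards correct answers and penalizes wrong ones. Well suited to tasks that can be graded automatically, such as math and code.

World Reconstruction: The ability to mentally reconstruct the full 3D structure from partial observations (a few views), like imagining the back of an object from only front and side photos.

MOMDP: Multi-Observable Markov Decision Process. A mathematical framework in which the same world state can be observed through multiple views (text, images, and so on).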
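The MOMDP idea, one underlying state seen through several observation functions, can be made concrete with a toy grid world. This is an illustrative sketch only: `GridState`, `step`, `observe_verbal`, and `observe_visual` are hypothetical names, the character grid merely stands in for an image, and none of it reproduces the paper's formal definition.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GridState:
    """Underlying world state: agent position on a width x height grid."""
    x: int
    y: int
    width: int = 3
    height: int = 3

MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    """Deterministic transition; a move that would leave the grid keeps the agent in place."""
    dx, dy = MOVES[action]
    nx, ny = state.x + dx, state.y + dy
    if 0 <= nx < state.width and 0 <= ny < state.height:
        return GridState(nx, ny, state.width, state.height)
    return state

def observe_verbal(state):
    """Verbal view: the state rendered as a coordinate string."""
    return f"agent at ({state.x}, {state.y})"

def observe_visual(state):
    """Visual view: the same state rendered as rows of characters ('A' marks the agent)."""
    return ["".join("A" if (c, r) == (state.x, state.y) else "." for c in range(state.width))
            for r in range(state.height)]

s = step(step(GridState(0, 0), "right"), "down")
print(observe_verbal(s))
print("\n".join(observe_visual(s)))
```

Both observation functions read the same `GridState`, so a verbal world model and a visual world model are predicting different views of one transition, not different worlds.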

Original Abstract

Humans construct internal world models and reason by manipulating the concepts within these models. Recent advances in AI, particularly chain-of-thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. Expert-level performance in formal and abstract domains such as mathematics and programming has been achieved in current systems by relying predominantly on verbal reasoning. However, they still lag far behind humans in domains like physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable of both verbal and visual generation has therefore sparked interest in more human-like reasoning grounded in complementary multimodal pathways, though their benefits remain unclear. From a world-model perspective, this paper presents the first principled study of when and how visual generation benefits reasoning. Our key position is the visual superiority hypothesis: for certain tasks--particularly those grounded in the physical world--visual generation more naturally serves as world models, whereas purely verbal world models encounter bottlenecks arising from representational limitations or insufficient prior knowledge. Theoretically, we formalize internal world modeling as a core component of CoT reasoning and analyze distinctions among different forms of world models. Empirically, we identify tasks that necessitate interleaved visual-verbal CoT reasoning, constructing a new evaluation suite, VisWorld-Eval. Controlled experiments on a state-of-the-art UMM show that interleaved CoT significantly outperforms purely verbal CoT on tasks that favor visual world modeling, but offers no clear advantage otherwise. Together, this work clarifies the potential of multimodal world modeling for more powerful, human-like multimodal AI.