Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models
TL;DR Highlight
'Visual CoT' that generates images mid-reasoning outperforms text-only CoT by up to 26 percentage points on spatial and physics problems
Who Should Read
ML engineers developing VLM-based agents or spatial/physics reasoning systems. Researchers considering whether to leverage image generation capabilities for reasoning in multimodal models.
Core Mechanics
- Text-only reasoning (verbal CoT) works well for math and code but fails at spatial/physics problems that even children can solve, because it lacks a visual information pathway
- In UMMs (unified text+image generation models), using 'interleaved CoT' that generates images mid-reasoning dramatically improves performance on paper folding, object manipulation, and ball trajectory prediction
- Visual world modeling is also 4x more sample-efficient — achieving the same performance with 1/4 the training data compared to text-only approaches
- For simple grid problems like mazes and Sokoban, image generation doesn't help — text is already sufficient for tracking 2 coordinates
- When probing model internal layers on maze problems, position state is implicitly encoded in hidden representations even without explicit coordinates (emergent implicit world model)
- The performance gap between visual CoT and text CoT persists even after RL training — confirming a structural advantage, not a coincidence
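The interleaved generation described above can be sketched as a chain of mixed step types. This is an illustrative rendering only: the step classes, the `<image: ...>` marker format, and the renderer are assumptions for exposition, not the paper's actual BAGEL implementation (a real UMM would replace each image step with an actually generated image).

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class TextStep:
    content: str      # one verbal reasoning step

@dataclass
class ImageStep:
    description: str  # what the generated intermediate image should depict

def render_interleaved_cot(steps: List[Union[TextStep, ImageStep]]) -> str:
    """Render a mixed visual-verbal chain; image steps become markers
    that a UMM backbone would fill with generated images."""
    parts = []
    for step in steps:
        if isinstance(step, ImageStep):
            parts.append(f"<image: {step.description}>")
        else:
            parts.append(step.content)
    return "\n".join(parts)

# Example chain for the paper-folding task:
chain = [
    TextStep("Analyze the folded paper and locate the punched hole."),
    ImageStep("paper after reversing fold 2"),
    TextStep("Track how the hole duplicates across the crease."),
    ImageStep("fully unfolded paper"),
    TextStep("Count the holes in the final image."),
]
cot = render_interleaved_cot(chain)
```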
Evidence
- Paper folding accuracy: visual CoT 39.2% vs text CoT 27.4% vs implicit CoT 21.1% (BAGEL-7B SFT)
- Multi-step object manipulation: visual CoT 66.6% vs implicit CoT 40.0% (text CoT omitted due to coordinate representation limitations)
- 3D cube projection: visual CoT 76.8% vs text CoT 60.2% — text world model's view synthesis fidelity is near 0% while visual achieves 50%+
- Even the latest commercial models (GPT-4o, Gemini 3 Pro, o3) average only 32–60% on VisWorld-Eval, so spatial reasoning remains unsolved
How to Apply
- For tasks requiring spatial rotation, physics simulation, or multi-view understanding, design pipelines with interleaved image generation mid-reasoning chain — choose a UMM like BAGEL as backbone
- For simple path finding or grid puzzles where state can be summarized in a few coordinates, skip image generation and use text CoT alone — saving unnecessary generation cost
- Use VisWorld-Eval's 7 tasks (paper folding, object manipulation, ball tracking, cube projection, real-world spatial reasoning, maze, Sokoban) as evaluation criteria to diagnose spatial reasoning bottlenecks in your own models
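The routing guidance above can be expressed as a simple dispatcher. The task names follow VisWorld-Eval, but the coordinate-count threshold is an illustrative assumption, not a number from the paper:

```python
# Route a task to visual or text CoT based on the findings above.
# Grid tasks whose state collapses to a few coordinates stay on text
# CoT; physically grounded spatial tasks get interleaved visual CoT.
GRID_TASKS = {"maze", "sokoban"}
SPATIAL_TASKS = {"paper_folding", "object_manipulation",
                 "ball_tracking", "cube_projection",
                 "real_world_spatial"}

def choose_cot_mode(task: str, state_coords: int) -> str:
    if task in GRID_TASKS or state_coords <= 2:  # illustrative threshold
        return "text_cot"    # cheaper: skip image generation entirely
    if task in SPATIAL_TASKS:
        return "visual_cot"  # interleave image generation mid-chain
    return "text_cot"        # default to the cheaper mode
```

For example, `choose_cot_mode("maze", 2)` routes to text CoT, while `choose_cot_mode("paper_folding", 10)` routes to visual CoT.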
Code Example
# VisWorld-Eval style interleaved CoT prompt example (paper folding)
system_prompt = """
You are a multimodal reasoning assistant.
When solving spatial tasks, generate intermediate images to visualize each step.
Use <image> tags to indicate where you would generate an image.
"""
user_prompt = """
An image shows a sheet of paper folded twice with a hole punched through.
Step-by-step, reverse the folding process to find the total number of holes.
Reasoning format:
<think>
1. Analyze the current folded state.
2. Reverse fold 2: [verbal reasoning] → <image> (generate: unfolded state after step 2)
3. Reverse fold 1: [verbal reasoning] → <image> (generate: fully unfolded paper)
4. Count holes.
</think>
Answer: [number]
"""Terminology
Original Abstract
Humans construct internal world models and reason by manipulating the concepts within these models. Recent advances in AI, particularly chain-of-thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. Expert-level performance in formal and abstract domains such as mathematics and programming has been achieved in current systems by relying predominantly on verbal reasoning. However, they still lag far behind humans in domains like physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable of both verbal and visual generation has therefore sparked interest in more human-like reasoning grounded in complementary multimodal pathways, though their benefits remain unclear. From a world-model perspective, this paper presents the first principled study of when and how visual generation benefits reasoning. Our key position is the visual superiority hypothesis: for certain tasks--particularly those grounded in the physical world--visual generation more naturally serves as world models, whereas purely verbal world models encounter bottlenecks arising from representational limitations or insufficient prior knowledge. Theoretically, we formalize internal world modeling as a core component of CoT reasoning and analyze distinctions among different forms of world models. Empirically, we identify tasks that necessitate interleaved visual-verbal CoT reasoning, constructing a new evaluation suite, VisWorld-Eval. Controlled experiments on a state-of-the-art UMM show that interleaved CoT significantly outperforms purely verbal CoT on tasks that favor visual world modeling, but offers no clear advantage otherwise. Together, this work clarifies the potential of multimodal world modeling for more powerful, human-like multimodal AI.