Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
TL;DR Highlight
Chain-of-Thought reasoning decreases accuracy across 17 models on image-based spatial reasoning tasks.
Who Should Read
ML engineers developing services that analyze spatial relationships (location, direction, distance, etc.) within images using multimodal LLMs, or AI application developers who default to CoT prompting.
Core Mechanics
- Unlike in math and logic tasks, Chain-of-Thought (CoT) prompting reduces accuracy in visual spatial reasoning by an average of 3%.
- 7 out of 8 Multimodal Reasoning Models (MRMs) trained with Reinforcement Learning (RL) performed *worse* at spatial reasoning than their base Qwen2.5-VL-7B-Instruct model—expensive training can be counterproductive.
- Even ViGoRL-7B-Spatial, trained specifically for spatial reasoning, underperformed its base model (−2%), as did TreeVGR (−1.57%); Vision-G1 (+0.6%) was the sole exception.
- The No-Image++ experiment—replacing images with a gray screen and adding a 'cannot determine from the image' option—showed MRMs confidently selecting incorrect answers based on textual knowledge alone, even fabricating spatial coordinates.
- GPT-5 and GPT-5-nano also show +0.65% and +1.23% higher accuracy with Non-CoT prompting compared to CoT, mirroring the trend observed in open-source models. GPT-4o and GPT-4.1-mini show minimal CoT gains (under 0.5%) that don't justify the added inference cost.
- Models with concise, non-repetitive CoT traces (GPT family, ~350 characters) experience less performance degradation than open-source models with lengthy, looping traces (~3600 characters). Verbose reasoning is suspected to induce hallucinations.
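The verbosity point above can be checked mechanically before deciding on a prompting strategy. The sketch below (all names and thresholds are illustrative, not from the paper) flags reasoning traces that are long or highly repetitive, the pattern associated with the degraded open-source results:

```python
def trace_stats(trace: str, ngram: int = 8) -> tuple[int, float]:
    """Return (length in characters, repetition ratio) for a CoT trace.

    Repetition ratio = 1 - unique n-grams / total n-grams; values near 1
    indicate the looping traces associated with degraded spatial accuracy.
    """
    words = trace.split()
    grams = [tuple(words[i:i + ngram])
             for i in range(max(len(words) - ngram + 1, 1))]
    repetition = 1.0 - len(set(grams)) / len(grams)
    return len(trace), repetition

def is_suspect(trace: str, max_chars: int = 1000, max_rep: float = 0.3) -> bool:
    """Flag traces that are long or repetitive (thresholds are illustrative)."""
    length, rep = trace_stats(trace)
    return length > max_chars or rep > max_rep
```

A ~350-character GPT-style trace would typically pass this check, while a ~3600-character looping trace would trip both thresholds.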
Evidence
- Across 17 models and 13 spatial benchmarks, CoT prompting resulted in an average 3% accuracy decrease compared to Non-CoT. Qwen2.5-VL-7B: Non-CoT 62.68% vs CoT 59.68%.
- GThinker-7B was the outlier, with the largest drop (−23.14%) under *Non-CoT* prompts: instead of following the direct-answer instruction, it repeatedly emitted `tool_call` tokens until hitting the maximum token limit.
- In the No-Image++ experiment, MRM accuracy for selecting 'cannot determine from the image' was: GThinker 5.55%, R1-Onevision 11.22%, Vision-R1 7.29%—below random chance. The base Qwen2.5-VL-7B achieved 76.41%.
- Qwen3-VL-8B-Thinking (a model enhanced for spatial awareness) showed Non-CoT outperforming CoT on 8 out of 13 datasets, with an average difference of +0.64%.
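Comparisons like those above reduce to a simple per-benchmark aggregation. A minimal sketch (function names and all figures except Qwen2.5-VL-7B's 62.68%/59.68% split are hypothetical):

```python
def cot_delta(non_cot: dict[str, float], cot: dict[str, float]) -> dict[str, float]:
    """Per-benchmark accuracy change from Non-CoT to CoT (negative = CoT hurts)."""
    return {name: cot[name] - non_cot[name] for name in non_cot}

def summarize(deltas: dict[str, float]) -> tuple[float, int]:
    """Mean delta and the number of benchmarks where Non-CoT wins."""
    mean = sum(deltas.values()) / len(deltas)
    non_cot_wins = sum(d < 0 for d in deltas.values())
    return mean, non_cot_wins

# Qwen2.5-VL-7B aggregate from the evidence; "OtherBench" is a made-up entry.
deltas = cot_delta({"SpatialAvg": 62.68, "OtherBench": 50.0},
                   {"SpatialAvg": 59.68, "OtherBench": 51.0})
```

Running `summarize(deltas)` on your own benchmark results makes the cost/benefit of CoT explicit before you commit to a prompting strategy.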
How to Apply
- When processing 'object location/direction/distance' questions in multimodal apps, switch from CoT prompts to direct-answer prompts (Non-CoT). For example, configure the system prompt without 'think' tags: 'You are a spatial-reasoning assistant. Answer the question directly.'
- If your spatial reasoning pipeline uses CoT-trained MRMs (e.g., GThinker, ViGoRL), benchmark them against their shared base model (Qwen2.5-VL-7B-Instruct) running Non-CoT; the base model may match or beat them while reducing both cost and latency.
- To reduce model hallucinations in visually-critical functions (e.g., robot navigation, object relationship extraction in images), incorporate a No-Image++-style internal reliability test into your QA pipeline—input a blank image and include a 'cannot determine' option.
Code Example
# Non-CoT prompt example (spatial reasoning task)
base_system_prompt = (
    "You are a spatial-reasoning assistant. "
    "The user asks a question, and the Assistant solves it."
)

# CoT prompt (do not use - performance degradation in spatial tasks)
cot_system_prompt = (
    "You are a spatial-reasoning assistant. "
    "First output the thinking process in <think></think> tags "
    "and then output the final answer in <answer></answer> tags."
)

# Recommended: call directly with Non-CoT
messages = [
    {"role": "system", "content": base_system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "<your_image_path_or_url>"},
            {
                "type": "text",
                "text": (
                    "Where is the red box relative to the blue sphere?\n"
                    "Options:\nA. Left\nB. Right\nC. Above\nD. Below\n"
                    "Please select the correct answer (letter and option text) "
                    "from the options above."
                ),
            },
        ],
    },
]

# No-Image++ reliability test: blank image + 'cannot determine' option
def add_cannot_determine_option(options: list[str]) -> list[str]:
    return options + ["Cannot determine from the image"]
# If the model doesn't choose 'cannot determine' with a blank image → hallucination risk signal
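To turn that signal into a number, the blank-image trials can be scored for how often the model actually picks the sentinel option. A minimal sketch (the function name is hypothetical; the benchmark figures in the docstring come from the evidence above):

```python
def cannot_determine_rate(
    answers: list[str],
    sentinel: str = "Cannot determine from the image",
) -> float:
    """Fraction of blank-image trials where the model picked the sentinel option.

    Rates near or below chance (the paper reports 5.55-11.22% for several MRMs,
    vs 76.41% for the base Qwen2.5-VL-7B) signal answers driven by textual priors
    rather than the image.
    """
    if not answers:
        return 0.0
    hits = sum(sentinel.lower() in a.lower() for a in answers)
    return hits / len(answers)
```

A low rate on blank-image inputs is a strong reason to gate that model out of visually-critical paths such as robot navigation.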
Original Abstract
Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT prompted MLMs suffer from severe shortcut learning, and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.