Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation
TL;DR Highlight
CRYSTAL is a benchmark that verifies, step by step, whether a multimodal model's reasoning process is actually correct, even when it reaches the right final answer.
Who Should Read
Researchers evaluating multimodal reasoning quality, and teams that need to distinguish genuine reasoning ability from lucky correct answers.
Core Mechanics
- Introduced CRYSTAL: a benchmark that evaluates the correctness of reasoning steps, not just final answer accuracy
- Many multimodal models get correct final answers through flawed or spurious reasoning chains
- CRYSTAL provides step-by-step annotations that allow verification of each reasoning step independently
- Models that score well on answer accuracy can score poorly on reasoning quality — exposing reasoning shortcuts
- The benchmark covers visual reasoning tasks where intermediate steps can be objectively verified
- Analysis reveals consistent patterns of 'lucky correct' answers in current multimodal models
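The failure mode these points describe is easy to see in miniature. Below is a toy sketch with hypothetical per-instance records (illustrative data, not CRYSTAL results) showing how answer accuracy and reasoning accuracy come apart when some answers are "lucky correct":

```python
# Hypothetical per-instance records: did the model get the answer right,
# and were its reasoning steps actually correct?
records = [
    {"answer_correct": True,  "steps_correct": True},
    {"answer_correct": True,  "steps_correct": False},  # lucky correct
    {"answer_correct": True,  "steps_correct": False},  # lucky correct
    {"answer_correct": False, "steps_correct": False},
]

answer_acc = sum(r["answer_correct"] for r in records) / len(records)
reasoning_acc = sum(r["steps_correct"] for r in records) / len(records)

print(answer_acc)     # high: 3 of 4 answers are right
print(reasoning_acc)  # low: only 1 of 4 reasoning traces is right
```

Answer-accuracy-only evaluation sees 75% here; step-level evaluation sees 25%. That spread is exactly what CRYSTAL measures.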
Evidence
- Significant gap between answer accuracy and reasoning accuracy on CRYSTAL for major multimodal models
- Models with >70% answer accuracy show <50% correct reasoning traces in many categories
- Human performance shows much tighter coupling between answer accuracy and reasoning quality
- CRYSTAL reveals reasoning shortcuts that are invisible to answer-accuracy-only evaluation
How to Apply
- Use CRYSTAL to diagnose whether your multimodal model is actually reasoning or pattern-matching to answers
- Low reasoning accuracy on CRYSTAL with high answer accuracy indicates the model is using shortcuts that will fail on out-of-distribution inputs
- For training: use CRYSTAL's step annotations as a training signal to improve intermediate reasoning quality (process reward models)
Code Example
# CRYSTAL evaluation prompt (request model to output in this format)
system_prompt = """
You are a vision-language model. Analyze the provided image(s) and user text silently.
Return ONLY a valid JSON object with this schema:
{"reasoning_steps": [], "answer": ""}
Rules for "reasoning_steps":
- Include enough steps to make the answer evident without filler.
- Write single-clause sentences, each adding a new, directly checkable fact.
- No multi-sentence items. No internal monologue.
Rules for "answer":
- Ground strictly in visible content and given text.
- Multiple-choice: return only the best LETTER (e.g., "B").
- Numeric: include units.
"""
# Match F1 calculation example (using sentence-transformers)
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-distilroberta-v1')
THRESHOLD = 0.35

def compute_match_f1(predicted_steps, reference_steps):
    if not predicted_steps and not reference_steps:
        return 1.0
    if not predicted_steps or not reference_steps:
        return 0.0
    pred_emb = model.encode(predicted_steps)
    ref_emb = model.encode(reference_steps)
    sim_matrix = cosine_similarity(pred_emb, ref_emb)
    # Greedy 1:1 matching: pair highest-similarity steps first
    matched_pred, matched_ref = set(), set()
    pairs = [(sim_matrix[i, j], i, j)
             for i in range(len(predicted_steps))
             for j in range(len(reference_steps))
             if sim_matrix[i, j] >= THRESHOLD]
    pairs.sort(reverse=True)
    for score, i, j in pairs:
        if i not in matched_pred and j not in matched_ref:
            matched_pred.add(i)
            matched_ref.add(j)
    tp = len(matched_pred)
    precision = tp / max(len(predicted_steps), 1)
    recall = tp / max(len(reference_steps), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
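The abstract also describes an Ordered Match F1 that penalizes disordered chains; its exact definition is not reproduced here. One plausible sketch (an assumption, not the paper's code) keeps only the matched pairs whose relative order agrees between prediction and reference, via a longest-increasing-subsequence count, and substitutes that count for `tp` in the F1 above:

```python
from bisect import bisect_left

def ordered_tp(matched_pairs):
    """Count of matches that can be kept while preserving order.

    matched_pairs is a list of (pred_index, ref_index) tuples, e.g. the
    pairs retained by the greedy matching. Sorting by pred_index and
    taking the longest strictly increasing run of ref_index keeps only
    matches whose relative order agrees in both sequences.
    """
    ref_order = [j for _, j in sorted(matched_pairs)]
    tails = []  # tails[k] = smallest tail of an increasing run of length k+1
    for j in ref_order:
        pos = bisect_left(tails, j)
        if pos == len(tails):
            tails.append(j)
        else:
            tails[pos] = j
    return len(tails)
```

For example, if prediction steps 0, 1, 2 match reference steps 1, 0, 2, one match must be discarded to restore order, so only 2 of the 3 matches count toward the ordered score.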
# CPR (Causal Process Reward) calculation
def compute_cpr(answer_correct, f1_step, aw=0.65, sw=0.35, lambda_penalty=0.3):
    if answer_correct:
        return aw * 1.0 + sw * f1_step
    else:
        return sw * f1_step * lambda_penalty
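Plugging the default weights into the reward formula makes its asymmetry concrete: a correct answer earns the full answer weight plus a step-alignment bonus, while an incorrect answer keeps only a heavily discounted step term.

```python
# Reward values under the defaults aw=0.65, sw=0.35, lambda_penalty=0.3,
# for a trajectory whose step-level Match F1 is 0.8.
aw, sw, lam = 0.65, 0.35, 0.3
f1 = 0.8

reward_correct = aw * 1.0 + sw * f1   # answer right: aw + sw * f1
reward_incorrect = sw * f1 * lam      # answer wrong: discounted step term only

print(reward_correct)    # ~0.93
print(reward_incorrect)  # ~0.084
```

Roughly an 11x gap between the two cases, so the reward still favors correct answers strongly while leaving a small gradient toward better reasoning steps even on failures.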
Original Abstract
We introduce **CRYSTAL** (**C**lear **R**easoning via **Y**ielded **S**teps, **T**raceability and **L**ogic), a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. We propose two complementary metrics: *Match F1*, which scores step-level precision and recall via semantic similarity matching, and *Ordered Match F1*, which further penalizes disordered reasoning chains. References are constructed through a Delphi-inspired pipeline where four independent MLLMs generate trajectories, aggregated via semantic clustering and validated through human quality gates. Evaluation of 20 MLLMs, including commercial frontier systems not used during benchmark construction, reveals systematic failures invisible to accuracy: universal cherry-picking (precision far exceeds recall), non-monotonic scaling trade-offs, and disordered reasoning where no competitive model preserves more than 60% of matched steps in correct order. Beyond evaluation, we propose the **Causal Process Reward (CPR)**, a multiplicative reward that couples answer correctness with step-level alignment, and **CPR-Curriculum**, which progressively increases reasoning difficulty during training. CPR-Curriculum achieves +32% Match F1 via GRPO where additive reward strategies fail, improving reasoning without manual step annotation.