Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation
TL;DR Highlight
CRYSTAL is a benchmark that verifies, step by step, whether a multimodal model's reasoning process is actually correct, even when it reaches the right final answer.
Who Should Read
Researchers evaluating multimodal reasoning quality, and teams that need to distinguish genuine reasoning ability from lucky correct answers.
Core Mechanics
- Introduced CRYSTAL: a benchmark that evaluates the correctness of reasoning steps, not just final answer accuracy
- Many multimodal models get correct final answers through flawed or spurious reasoning chains
- CRYSTAL provides step-by-step annotations that allow verification of each reasoning step independently
- Models that score well on answer accuracy can score poorly on reasoning quality — exposing reasoning shortcuts
- The benchmark covers visual reasoning tasks where intermediate steps can be objectively verified
- Analysis reveals consistent patterns of 'lucky correct' answers in current multimodal models
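The failure mode these points describe is easy to see in miniature. Below is a toy sketch with hypothetical per-instance records (illustrative data, not CRYSTAL results) showing how answer accuracy and reasoning accuracy come apart when some answers are "lucky correct":

```python
# Hypothetical per-instance records: did the model get the answer right,
# and were its reasoning steps actually correct?
records = [
    {"answer_correct": True,  "steps_correct": True},
    {"answer_correct": True,  "steps_correct": False},  # lucky correct
    {"answer_correct": True,  "steps_correct": False},  # lucky correct
    {"answer_correct": False, "steps_correct": False},
]

answer_acc = sum(r["answer_correct"] for r in records) / len(records)
reasoning_acc = sum(r["steps_correct"] for r in records) / len(records)

print(answer_acc)     # high: 3 of 4 answers are right
print(reasoning_acc)  # low: only 1 of 4 reasoning traces is right
```

Answer-accuracy-only evaluation sees 75% here; step-level evaluation sees 25%. That spread is exactly what CRYSTAL measures.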
Evidence
- Significant gap between answer accuracy and reasoning accuracy on CRYSTAL for major multimodal models
- Models with >70% answer accuracy show <50% correct reasoning traces in many categories
- Human performance shows much tighter coupling between answer accuracy and reasoning quality
- CRYSTAL reveals reasoning shortcuts that are invisible to answer-accuracy-only evaluation
How to Apply
- Use CRYSTAL to diagnose whether your multimodal model is actually reasoning or pattern-matching to answers
- Low reasoning accuracy on CRYSTAL with high answer accuracy indicates the model is using shortcuts that will fail on out-of-distribution inputs
- For training: use CRYSTAL's step annotations as a training signal to improve intermediate reasoning quality (process reward models)
Code Example
# CRYSTAL evaluation prompt (request model to output in this format)
system_prompt = """
You are a vision-language model. Analyze the provided image(s) and user text silently.
Return ONLY a valid JSON object with this schema:
{"reasoning_steps": [], "answer": ""}
Rules for "reasoning_steps":
- Include enough steps to make the answer evident without filler.
- Write single-clause sentences, each adding a new, directly checkable fact.
- No multi-sentence items. No internal monologue.
Rules for "answer":
- Ground strictly in visible content and given text.
- Multiple-choice: return only the best LETTER (e.g., "B").
- Numeric: include units.
"""
# Match F1 calculation example (using sentence-transformers)
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-distilroberta-v1')
THRESHOLD = 0.35

def compute_match_f1(predicted_steps, reference_steps):
    if not predicted_steps and not reference_steps:
        return 1.0
    if not predicted_steps or not reference_steps:
        return 0.0
    pred_emb = model.encode(predicted_steps)
    ref_emb = model.encode(reference_steps)
    sim_matrix = cosine_similarity(pred_emb, ref_emb)
    # Greedy 1:1 matching: pair highest-similarity steps first
    matched_pred, matched_ref = set(), set()
    pairs = [(sim_matrix[i, j], i, j)
             for i in range(len(predicted_steps))
             for j in range(len(reference_steps))
             if sim_matrix[i, j] >= THRESHOLD]
    pairs.sort(reverse=True)
    for score, i, j in pairs:
        if i not in matched_pred and j not in matched_ref:
            matched_pred.add(i)
            matched_ref.add(j)
    tp = len(matched_pred)
    precision = tp / max(len(predicted_steps), 1)
    recall = tp / max(len(reference_steps), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
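The abstract also describes an Ordered Match F1 that penalizes disordered chains; its exact definition is not reproduced here. One plausible sketch (an assumption, not the paper's code) keeps only the matched pairs whose relative order agrees between prediction and reference, via a longest-increasing-subsequence count, and substitutes that count for `tp` in the F1 above:

```python
from bisect import bisect_left

def ordered_tp(matched_pairs):
    """Count of matches that can be kept while preserving order.

    matched_pairs is a list of (pred_index, ref_index) tuples, e.g. the
    pairs retained by the greedy matching. Sorting by pred_index and
    taking the longest strictly increasing run of ref_index keeps only
    matches whose relative order agrees in both sequences.
    """
    ref_order = [j for _, j in sorted(matched_pairs)]
    tails = []  # tails[k] = smallest tail of an increasing run of length k+1
    for j in ref_order:
        pos = bisect_left(tails, j)
        if pos == len(tails):
            tails.append(j)
        else:
            tails[pos] = j
    return len(tails)
```

For example, if prediction steps 0, 1, 2 match reference steps 1, 0, 2, one match must be discarded to restore order, so only 2 of the 3 matches count toward the ordered score.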
# CPR (Causal Process Reward) calculation
def compute_cpr(answer_correct, f1_step, aw=0.65, sw=0.35, lambda_penalty=0.3):
    if answer_correct:
        return aw * 1.0 + sw * f1_step
    else:
        return sw * f1_step * lambda_penalty
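Plugging the default weights into the reward formula makes its asymmetry concrete: a correct answer earns the full answer weight plus a step-alignment bonus, while an incorrect answer keeps only a heavily discounted step term.

```python
# Reward values under the defaults aw=0.65, sw=0.35, lambda_penalty=0.3,
# for a trajectory whose step-level Match F1 is 0.8.
aw, sw, lam = 0.65, 0.35, 0.3
f1 = 0.8

reward_correct = aw * 1.0 + sw * f1   # answer right: aw + sw * f1
reward_incorrect = sw * f1 * lam      # answer wrong: discounted step term only

print(reward_correct)    # ~0.93
print(reward_incorrect)  # ~0.084
```

Roughly an 11x gap between the two cases, so the reward still favors correct answers strongly while leaving a small gradient toward better reasoning steps even on failures.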
Original Abstract
We introduce **CRYSTAL** (**C**lear **R**easoning via **Y**ielded **S**teps, **T**raceability and **L**ogic), a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. We propose two complementary metrics: *Match F1*, which scores step-level precision and recall via semantic similarity matching, and *Ordered Match F1*, which further penalizes disordered reasoning chains. References are constructed through a Delphi-inspired pipeline where four independent MLLMs generate trajectories, aggregated via semantic clustering and validated through human quality gates. Evaluation of 20 MLLMs, including commercial frontier systems not used during benchmark construction, reveals systematic failures invisible to accuracy: universal cherry-picking (precision far exceeds recall), non-monotonic scaling trade-offs, and disordered reasoning where no competitive model preserves more than 60% of matched steps in correct order. Beyond evaluation, we propose the **Causal Process Reward (CPR)**, a multiplicative reward that couples answer correctness with step-level alignment, and **CPR-Curriculum**, which progressively increases reasoning difficulty during training. CPR-Curriculum achieves +32% Match F1 via GRPO where additive reward strategies fail, improving reasoning without manual step annotation.