Linking Perception, Confidence and Accuracy in MLLMs
TL;DR Highlight
Found a bug where multimodal LLMs stay overconfident even with blurry images, fixed it with RL, and built a Test-Time Scaling framework on top of it.
Who Should Read
Researchers working on multimodal LLM reliability and calibration, and teams building vision-language systems where image quality varies (medical imaging, surveillance, etc.).
Core Mechanics
- Discovered that multimodal LLMs remain confidently wrong when input images are degraded (blurry, low-res, corrupted) — they don't express uncertainty
- This overconfidence bug persists across major multimodal models tested
- Used RL with uncertainty-aware reward signals to teach models to express appropriate confidence given image quality
- Built a Test-Time Scaling framework that allocates extra computation to the cases the model flags as uncertain
- Calibrated uncertainty enables better routing: high-confidence cases take a fast inference path, low-confidence cases get more inference compute
- Post-RL models show dramatically improved calibration on degraded images while maintaining accuracy on clear images
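The uncertainty-aware reward idea can be sketched as a toy function. This is an illustrative version for intuition only, not the paper's exact CDRL reward:

```python
def confidence_reward(correct: bool, confidence: float) -> float:
    """Toy calibration reward: confidence in [0, 1] is rewarded when
    the answer is correct and penalized when it is wrong, so the
    highest expected reward comes from confidence that tracks accuracy."""
    assert 0.0 <= confidence <= 1.0
    return confidence if correct else -confidence
```

Under a reward like this, staying confident on a degraded image where the answer is likely wrong yields negative expected return, which pushes the policy toward expressing lower confidence on low-quality inputs.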
Evidence
- Pre-fix models maintain high confidence on blurry images where ground truth accuracy drops significantly
- RL-based fix improves calibration metrics (ECE, reliability diagrams) substantially
- Test-Time Scaling with uncertainty routing improves overall task accuracy vs flat compute allocation
- The fix generalizes across image degradation types (blur, noise, compression artifacts)
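For reference, the ECE metric mentioned above can be computed with a standard textbook implementation (this is not code from the paper):

```python
import numpy as np

def expected_calibration_error(confidences, corrects, n_bins=10):
    """Standard ECE: bin predictions by confidence, compare mean
    confidence to empirical accuracy in each bin, and weight each
    bin's gap by the fraction of samples it contains."""
    confidences = np.asarray(confidences, dtype=float)
    corrects = np.asarray(corrects, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        acc = corrects[mask].mean()    # empirical accuracy in this bin
        conf = confidences[mask].mean()  # average stated confidence
        ece += mask.mean() * abs(acc - conf)
    return ece
```

A perfectly calibrated model (90% confidence, 90% accuracy) scores 0; an overconfident one (100% confidence, 50% accuracy) scores 0.5.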
How to Apply
- If your vision-language pipeline gets variable-quality images, check calibration on degraded samples — you may have silent overconfidence bugs
- Apply an RL-based calibration fine-tune to teach models uncertainty awareness without losing accuracy on clean inputs
- Use the uncertainty signal for Test-Time Scaling: route uncertain predictions to a slower, more careful inference path
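The routing step reduces to a threshold check. A minimal sketch, assuming an NMLP-style score (negative mean log-probability, lower = more certain) and an illustrative threshold:

```python
def route_by_confidence(nmlp: float, threshold: float = 1.0) -> str:
    """Route by NMLP (lower = more certain): confident predictions
    take the fast path, uncertain ones are escalated to the heavier
    test-time-scaling pipeline. The threshold value is illustrative
    and should be tuned on a held-out calibration set."""
    return "fast_path" if nmlp < threshold else "tts_pipeline"
```

Usage: a cheap single-pass decode serves the `"fast_path"` cases, while `"tts_pipeline"` cases get sampling, voting, and reflection.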
Code Example
# CA-TTS Self-Consistency + Confidence Voting core logic example
import numpy as np

def compute_confidence(logprobs: list[float]) -> float:
    """NMLP-based confidence calculation (lower = more certain)."""
    return -np.mean(logprobs)  # Negative Mean Log-Probability

def confidence_weighted_voting(samples: list[dict]) -> dict:
    """
    samples: [{'answer': 'A', 'confidence': float}, ...]
    Returns a dict mapping each answer to its accumulated vote weight.
    """
    vote_dict = {}
    for s in samples:
        ans = s['answer']
        conf = s['confidence']
        # Low NMLP = high certainty -> larger weight
        weight = 1.0 / (conf + 1e-8)
        vote_dict[ans] = vote_dict.get(ans, 0) + weight
    return vote_dict

# Critic Expert Prompt (Self-Reflection stage)
CRITIC_PROMPT = """
Given the following information:
Image: {image}
Question: {question}
Model Answer: {model_answer}
Model Confidence: {confidence}
Please generate a self-reflection critique.
Critique: Based on this question, your answer is "{model_answer}",
<fill in concise critique here>
"""

# Voter Expert Prompt (Self-Consistency stage)
VOTER_PROMPT = """
Image: {image}
Question: {question}
Candidate options: {options_list}
Generate normalized confidence (probability) for each option.
Sum must equal 1. Output ONLY the array:
[p_1, p_2, ..., p_n]
"""

# Full CA-TTS flow (pseudo)
def ca_tts(image, question, base_model, expert_model, n_samples=8):
    # 1. Generate n samples and compute per-sample confidence
    samples = []
    for _ in range(n_samples):
        output, logprobs = base_model.generate(image, question)
        conf = compute_confidence(logprobs)
        samples.append({'answer': output, 'confidence': conf})

    # 2. Self-Consistency: confidence-weighted voting
    vote_dict = confidence_weighted_voting(samples)

    # 3. Expert Voter: external verification over distinct candidates
    candidates = list(set(s['answer'] for s in samples))
    expert_probs = expert_model.vote(image, question, candidates, VOTER_PROMPT)
    tau1 = 0.5
    for k, p in zip(candidates, expert_probs):
        vote_dict[k] = vote_dict.get(k, 0) + tau1 * p

    # 4. Self-Reflection: critique and regenerate the least-confident sample
    low_conf_sample = max(samples, key=lambda x: x['confidence'])  # high NMLP = low certainty
    critique = expert_model.critique(image, question, low_conf_sample, CRITIC_PROMPT)
    reflected_answer, _ = base_model.generate(image, question, context=critique)
    tau2 = 0.5
    vote_dict[reflected_answer] = vote_dict.get(reflected_answer, 0) + tau2

    # 5. Return the answer with the highest total vote weight
    return max(vote_dict, key=vote_dict.get)
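A quick standalone check of why confidence weighting differs from plain majority voting, using toy data (the voting rule is restated here so the snippet runs on its own):

```python
def confidence_weighted_voting(samples):
    """Same voting rule as the CA-TTS example: weight = 1 / (NMLP + eps)."""
    votes = {}
    for s in samples:
        votes[s['answer']] = votes.get(s['answer'], 0.0) + 1.0 / (s['confidence'] + 1e-8)
    return votes

# Toy data: 'B' wins a plain majority 3-2, but its samples carry high
# NMLP (low certainty), while the two 'A' samples are far more certain.
samples = [
    {'answer': 'B', 'confidence': 2.0},
    {'answer': 'B', 'confidence': 2.0},
    {'answer': 'B', 'confidence': 2.0},
    {'answer': 'A', 'confidence': 0.2},
    {'answer': 'A', 'confidence': 0.2},
]

votes = confidence_weighted_voting(samples)
winner = max(votes, key=votes.get)  # 'A': total weight ~10.0 vs 'B': ~1.5
```

Majority voting would return 'B'; the confidence-weighted rule returns 'A' because each certain vote carries weight 1/0.2 = 5.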
Original Abstract
Recent advances in Multi-modal Large Language Models (MLLMs) have predominantly focused on enhancing visual perception to improve accuracy. However, a critical question remains unexplored: Do models know when they do not know? Through a probing experiment, we reveal a severe confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model's confidence. Beyond training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. An Expert Model acts in multiple roles (e.g., Planner, Critic, Voter) to schedule these modules and provide external verification. Our integrated framework establishes new state-of-the-art results with consistent 8.8% gains across four benchmarks. More ablation studies demonstrate the effectiveness of each module and scaling superiority.