Linking Perception, Confidence and Accuracy in MLLMs
TL;DR Highlight
Found a bug where multimodal LLMs stay overconfident even with blurry images, fixed it with RL, and built a Test-Time Scaling framework on top of it.
Who Should Read
Researchers working on multimodal LLM reliability and calibration, and teams building vision-language systems where image quality varies (medical imaging, surveillance, etc.).
Core Mechanics
- Discovered that multimodal LLMs remain confidently wrong when input images are degraded (blurry, low-res, corrupted) — they don't express uncertainty
- This overconfidence bug persists across major multimodal models tested
- Used RL with uncertainty-aware reward signals to teach models to express appropriate confidence given image quality
- Built a Test-Time Scaling framework that allocates extra computation to the cases the model flags as uncertain
- Calibrated uncertainty enables better routing: high-confidence cases take a fast inference path, low-confidence cases get more inference compute
- Post-RL models show dramatically improved calibration on degraded images while maintaining accuracy on clear images
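The uncertainty-aware reward idea can be sketched as a toy function. This is an illustrative version for intuition only, not the paper's exact CDRL reward:

```python
def confidence_reward(correct: bool, confidence: float) -> float:
    """Toy calibration reward: confidence in [0, 1] is rewarded when
    the answer is correct and penalized when it is wrong, so the
    highest expected reward comes from confidence that tracks accuracy."""
    assert 0.0 <= confidence <= 1.0
    return confidence if correct else -confidence
```

Under a reward like this, staying confident on a degraded image where the answer is likely wrong yields negative expected return, which pushes the policy toward expressing lower confidence on low-quality inputs.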
Evidence
- Pre-fix models maintain high confidence on blurry images where ground truth accuracy drops significantly
- RL-based fix improves calibration metrics (ECE, reliability diagrams) substantially
- Test-Time Scaling with uncertainty routing improves overall task accuracy vs flat compute allocation
- The fix generalizes across image degradation types (blur, noise, compression artifacts)
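For reference, the ECE metric mentioned above can be computed with a standard textbook implementation (this is not code from the paper):

```python
import numpy as np

def expected_calibration_error(confidences, corrects, n_bins=10):
    """Standard ECE: bin predictions by confidence, compare mean
    confidence to empirical accuracy in each bin, and weight each
    bin's gap by the fraction of samples it contains."""
    confidences = np.asarray(confidences, dtype=float)
    corrects = np.asarray(corrects, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        acc = corrects[mask].mean()    # empirical accuracy in this bin
        conf = confidences[mask].mean()  # average stated confidence
        ece += mask.mean() * abs(acc - conf)
    return ece
```

A perfectly calibrated model (90% confidence, 90% accuracy) scores 0; an overconfident one (100% confidence, 50% accuracy) scores 0.5.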
How to Apply
- If your vision-language pipeline gets variable-quality images, check calibration on degraded samples — you may have silent overconfidence bugs
- Apply an RL-based calibration fine-tune to teach models uncertainty awareness without losing accuracy on clean inputs
- Use the uncertainty signal for Test-Time Scaling: route uncertain predictions to a slower, more careful inference path
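The routing step reduces to a threshold check. A minimal sketch, assuming an NMLP-style score (negative mean log-probability, lower = more certain) and an illustrative threshold:

```python
def route_by_confidence(nmlp: float, threshold: float = 1.0) -> str:
    """Route by NMLP (lower = more certain): confident predictions
    take the fast path, uncertain ones are escalated to the heavier
    test-time-scaling pipeline. The threshold value is illustrative
    and should be tuned on a held-out calibration set."""
    return "fast_path" if nmlp < threshold else "tts_pipeline"
```

Usage: a cheap single-pass decode serves the `"fast_path"` cases, while `"tts_pipeline"` cases get sampling, voting, and reflection.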
Code Example
# CA-TTS Self-Consistency + Confidence Voting core logic example
import numpy as np

def compute_confidence(logprobs: list[float]) -> float:
    """NMLP-based confidence calculation (lower = more certain)."""
    return -np.mean(logprobs)  # Negative Mean Log-Probability

def confidence_weighted_voting(samples: list[dict]) -> dict:
    """
    samples: [{'answer': 'A', 'confidence': float}, ...]
    Returns a dict mapping each answer to its accumulated vote weight.
    """
    vote_dict = {}
    for s in samples:
        ans = s['answer']
        conf = s['confidence']
        # Low NMLP = high certainty -> larger weight
        weight = 1.0 / (conf + 1e-8)
        vote_dict[ans] = vote_dict.get(ans, 0) + weight
    return vote_dict

# Critic Expert Prompt (Self-Reflection stage)
CRITIC_PROMPT = """
Given the following information:
Image: {image}
Question: {question}
Model Answer: {model_answer}
Model Confidence: {confidence}
Please generate a self-reflection critique.
Critique: Based on this question, your answer is "{model_answer}",
<fill in concise critique here>
"""

# Voter Expert Prompt (Self-Consistency stage)
VOTER_PROMPT = """
Image: {image}
Question: {question}
Candidate options: {options_list}
Generate normalized confidence (probability) for each option.
Sum must equal 1. Output ONLY the array:
[p_1, p_2, ..., p_n]
"""

# Full CA-TTS flow (pseudo)
def ca_tts(image, question, base_model, expert_model, n_samples=8):
    # 1. Generate n samples and compute per-sample confidence
    samples = []
    for _ in range(n_samples):
        output, logprobs = base_model.generate(image, question)
        conf = compute_confidence(logprobs)
        samples.append({'answer': output, 'confidence': conf})

    # 2. Self-Consistency: confidence-weighted voting
    vote_dict = confidence_weighted_voting(samples)

    # 3. Expert Voter: external verification over distinct candidates
    candidates = list(set(s['answer'] for s in samples))
    expert_probs = expert_model.vote(image, question, candidates, VOTER_PROMPT)
    tau1 = 0.5
    for k, p in zip(candidates, expert_probs):
        vote_dict[k] = vote_dict.get(k, 0) + tau1 * p

    # 4. Self-Reflection: critique and regenerate the least-confident sample
    low_conf_sample = max(samples, key=lambda x: x['confidence'])  # high NMLP = low certainty
    critique = expert_model.critique(image, question, low_conf_sample, CRITIC_PROMPT)
    reflected_answer, _ = base_model.generate(image, question, context=critique)
    tau2 = 0.5
    vote_dict[reflected_answer] = vote_dict.get(reflected_answer, 0) + tau2

    # 5. Return the answer with the highest total vote weight
    return max(vote_dict, key=vote_dict.get)
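A quick standalone check of why confidence weighting differs from plain majority voting, using toy data (the voting rule is restated here so the snippet runs on its own):

```python
def confidence_weighted_voting(samples):
    """Same voting rule as the CA-TTS example: weight = 1 / (NMLP + eps)."""
    votes = {}
    for s in samples:
        votes[s['answer']] = votes.get(s['answer'], 0.0) + 1.0 / (s['confidence'] + 1e-8)
    return votes

# Toy data: 'B' wins a plain majority 3-2, but its samples carry high
# NMLP (low certainty), while the two 'A' samples are far more certain.
samples = [
    {'answer': 'B', 'confidence': 2.0},
    {'answer': 'B', 'confidence': 2.0},
    {'answer': 'B', 'confidence': 2.0},
    {'answer': 'A', 'confidence': 0.2},
    {'answer': 'A', 'confidence': 0.2},
]

votes = confidence_weighted_voting(samples)
winner = max(votes, key=votes.get)  # 'A': total weight ~10.0 vs 'B': ~1.5
```

Majority voting would return 'B'; the confidence-weighted rule returns 'A' because each certain vote carries weight 1/0.2 = 5.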
Original Abstract
Recent advances in Multi-modal Large Language Models (MLLMs) have predominantly focused on enhancing visual perception to improve accuracy. However, a critical question remains unexplored: Do models know when they do not know? Through a probing experiment, we reveal a severe confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model's confidence. Beyond training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. An Expert Model acts in multiple roles (e.g., Planner, Critic, Voter) to schedule these modules and provide external verification. Our integrated framework establishes new state-of-the-art results with consistent 8.8% gains across four benchmarks. More ablation studies demonstrate the effectiveness of each module and scaling superiority.