Confidence over Time: Confidence Calibration with Temporal Logic for Large Language Model Reasoning
TL;DR Highlight
A method that predicts 'this answer is likely wrong' far more accurately by treating how LLM confidence evolves step-by-step during reasoning as a time series
Who Should Read
ML engineers or researchers working on hallucination detection and answer confidence scoring in LLM-based services. Especially relevant for developers building regenerate-or-accept logic in math/science reasoning pipelines.
Core Mechanics
- Don't collapse LLM confidence into a single number; treat it as a step-by-step time-series signal. Existing single-score methods are miscalibrated because response length and writing style skew the score
- Uses Signal Temporal Logic (STL) to automatically discover patterns of 'correct answer confidence flows' vs 'wrong answer patterns' from data — e.g., SharpDrop (sudden decline) or EndLow (low at final step) signals danger
- Wrong-answer STL patterns are reusable across different tasks (Qwen3-8B Jaccard similarity 0.811), while correct-answer patterns vary significantly by task — failure patterns are more universal than success patterns
- Even with the same STL structure, optimal parameters (thresholds etc.) vary significantly per question, requiring a hypernetwork (small auxiliary network) that takes questions as input and predicts STL parameters
- Outperforms existing self-consistency, SAR, Self-Eval methods across 3 models × 4 benchmarks on ECE and Brier Score
- Average inference time 0.55s/example, much faster than multi-sampling self-consistency while achieving better calibration from a single sample
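The per-question parameter prediction in the fourth bullet can be sketched as a tiny hypernetwork mapping a question embedding to STL thresholds. The network sizes, embedding dimension, and output ranges below are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(0)

class STLHypernetwork:
    """Tiny MLP: question embedding -> per-question STL parameters (mu, epsilon).

    Illustrative sketch only; layer sizes and squashing ranges are assumptions.
    """
    def __init__(self, embed_dim: int = 16, hidden: int = 32):
        self.W1 = rng.normal(0, 0.1, (embed_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.1, (hidden, 2))  # two outputs: mu, epsilon
        self.b2 = np.zeros(2)

    def __call__(self, q_embed: np.ndarray) -> tuple[float, float]:
        h = np.tanh(q_embed @ self.W1 + self.b1)
        raw = h @ self.W2 + self.b2
        sig = 1.0 / (1.0 + np.exp(-raw))  # squash each output to (0, 1)
        mu = 0.5 + 0.5 * sig[0]           # EndLow threshold, kept in (0.5, 1.0)
        epsilon = 0.05 + 0.25 * sig[1]    # SharpDrop size, kept in (0.05, 0.3)
        return float(mu), float(epsilon)

hyper = STLHypernetwork()
# In practice q_embed would be an encoder embedding of the question text
mu, epsilon = hyper(rng.normal(size=16))
```

The same STL structure is then evaluated with these predicted `mu`/`epsilon` values instead of global constants, which is what lets one formula adapt to questions of different difficulty.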
Evidence
- Wrong-answer STL pattern cross-task Jaccard similarity: Qwen3 0.811, Gemma3 0.789, Llama 0.744 — much higher than correct-answer patterns (0.47-0.55), confirming universal reusability
- CLadder benchmark Qwen3 ECE: AveLogit 0.200 → Ours 0.035, Self-Consistency 0.223 → Ours 0.035, ~6x improvement
- BBH benchmark Qwen3 ECE: AveLogit 0.339 → Ours 0.052, Self-Eval 0.197 → Ours 0.052
- STL template count experiment: 10 templates (5 correct-pattern + 5 wrong-pattern) is optimal, giving ECE 0.020 and AUROC 0.617; too few or too many templates degrades performance
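For reference, the ECE numbers above measure the gap between stated confidence and empirical accuracy. A minimal equal-width-bin implementation (10 bins is a common default; the paper's exact binning scheme is an assumption):

```python
import numpy as np

def expected_calibration_error(conf: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """Equal-width-bin ECE: bin-weight times |accuracy - mean confidence| per bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        # half-open bins (lo, hi], with 0.0 included in the first bin
        mask = (conf > lo) & (conf <= hi) if lo > 0 else (conf >= lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)

conf = np.array([0.9, 0.8, 0.7, 0.6])       # stated confidence per answer
correct = np.array([1.0, 1.0, 0.0, 1.0])    # 1.0 if the answer was correct
print(round(expected_calibration_error(conf, correct), 3))  # 0.35
```

Lower is better: a perfectly calibrated scorer (e.g., 80% of answers with confidence 0.8 are correct) gets ECE 0.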
How to Apply
- Split CoT responses into sentence/step segments, compute average token probability per segment as a confidence time series, then flag low confidence or trigger regeneration when SharpDrop/EndLow/Recovery patterns are detected
- Wrong-answer STL patterns are reusable across tasks, so pre-mine negative STL patterns from public datasets like BBH and directly transplant them as confidence filters in other domain services
- Extract logprobs from open-source LLMs (Qwen3-8B, Llama-3-8B, etc.) and feed them into a lightweight STL-based confidence scorer to pre-filter hallucination-risk answers in RAG pipeline answer verification or agent self-review stages
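The segmentation step in the first bullet can be sketched as below. Splitting on newline tokens is an assumption (sentence-level splitting works too), and the token strings and logprobs would come from your LLM API:

```python
import math

def step_boundaries_from_tokens(tokens: list[str],
                                delimiter: str = "\n") -> list[tuple[int, int]]:
    """Split a token sequence into (start, end) index ranges, one per reasoning step.

    A step ends at any token containing the delimiter; trailing tokens form the
    final step.
    """
    boundaries, start = [], 0
    for i, tok in enumerate(tokens):
        if delimiter in tok:
            if i + 1 > start:
                boundaries.append((start, i + 1))
            start = i + 1
    if start < len(tokens):
        boundaries.append((start, len(tokens)))
    return boundaries

def probs_from_logprobs(logprobs: list[float]) -> list[float]:
    """Most APIs return log-probabilities; convert to plain probabilities."""
    return [math.exp(lp) for lp in logprobs]

tokens = ["Step", " 1:", " compute", "\n", "Step", " 2:", " answer"]
print(step_boundaries_from_tokens(tokens))  # [(0, 4), (4, 7)]
```

The resulting `(start, end)` ranges and probabilities feed directly into the confidence-extraction code in the next section.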
Code Example
import numpy as np

def extract_step_confidence(token_probs: list[float],
                            step_boundaries: list[tuple]) -> np.ndarray:
    """Average token probability per reasoning step -> confidence time series."""
    step_conf = []
    for start, end in step_boundaries:
        chunk = [p for p in token_probs[start:end] if p > 0]
        step_conf.append(np.mean(chunk) if chunk else 0.0)
    return np.array(step_conf)

def stl_end_low(signal: np.ndarray, k: int = 2, mu: float = 0.7) -> float:
    """EndLow pattern: all of the last k steps are below mu (positive = risk detected)."""
    last_k = signal[-k:]
    return float(np.min(mu - last_k))  # > 0 means EndLow pattern detected

def stl_sharp_drop(signal: np.ndarray, epsilon: float = 0.1) -> float:
    """SharpDrop pattern: some step-to-step drop of epsilon or more (positive = detected)."""
    if len(signal) < 2:
        return float("-inf")  # fewer than two steps: no drop possible
    diffs = np.diff(signal)
    return float(np.max(-diffs - epsilon))  # > 0 means SharpDrop detected

def is_likely_incorrect(token_probs, step_boundaries, mu=0.7, epsilon=0.1) -> bool:
    signal = extract_step_confidence(token_probs, step_boundaries)
    end_low_score = stl_end_low(signal, k=2, mu=mu)
    sharp_drop_score = stl_sharp_drop(signal, epsilon=epsilon)
    return end_low_score > 0 or sharp_drop_score > 0

# Usage example (token probabilities for a 3-step CoT response)
token_probs = [0.92, 0.88, 0.91, 0.85, 0.60, 0.55, 0.52]
step_boundaries = [(0, 2), (2, 4), (4, 7)]  # token index range per step
signal = extract_step_confidence(token_probs, step_boundaries)
print(f"Step confidence: {signal}")  # approx [0.9, 0.88, 0.557]
print(f"SharpDrop: {stl_sharp_drop(signal):.3f}")  # > 0 -> sharp drop detected
print(f"EndLow: {stl_end_low(signal):.3f}")  # > 0 -> low at the end
print(f"Regeneration needed: {is_likely_incorrect(token_probs, step_boundaries)}")  # True
Original Abstract
Large Language Models (LLMs) increasingly rely on long-form, multi-step reasoning to solve complex tasks such as mathematical problem solving and scientific question answering. Despite strong performance, existing confidence estimation methods typically reduce an entire reasoning process to a single scalar score, ignoring how confidence evolves throughout the generation. As a result, these methods are often sensitive to superficial factors such as response length or verbosity, and struggle to distinguish correct reasoning from confidently stated errors. We propose to characterize the stepwise confidence signal using Signal Temporal Logic (STL). Using a discriminative STL mining procedure, we discover temporal formulas that distinguish confidence signals of correct and incorrect responses. Our analysis found that the STL patterns generalize across tasks, and numeric parameters exhibit sensitivity to individual questions. Based on these insights, we develop a confidence estimation approach that informs STL blocks with parameter hypernetworks. Experiments on multiple reasoning tasks show our confidence scores are more calibrated than the baselines.