Confidence over Time: Confidence Calibration with Temporal Logic for Large Language Model Reasoning
TL;DR Highlight
A method that predicts 'this answer is likely wrong' far more accurately by treating how LLM confidence evolves step-by-step during reasoning as a time series
Who Should Read
ML engineers or researchers working on hallucination detection and answer confidence scoring in LLM-based services. Especially relevant for developers building regenerate-or-accept logic in math/science reasoning pipelines.
Core Mechanics
- Don't collapse LLM confidence into a single number; treat it as a step-by-step time-series signal. Existing single-score methods are miscalibrated because response length and writing style skew the score
- Uses Signal Temporal Logic (STL) to automatically discover patterns of 'correct answer confidence flows' vs 'wrong answer patterns' from data — e.g., SharpDrop (sudden decline) or EndLow (low at final step) signals danger
- Wrong-answer STL patterns are reusable across different tasks (Qwen3-8B Jaccard similarity 0.811), while correct-answer patterns vary significantly by task — failure patterns are more universal than success patterns
- Even with the same STL structure, optimal parameters (thresholds etc.) vary significantly per question, requiring a hypernetwork (small auxiliary network) that takes questions as input and predicts STL parameters
- Outperforms existing self-consistency, SAR, Self-Eval methods across 3 models × 4 benchmarks on ECE and Brier Score
- Average inference time 0.55s/example, much faster than multi-sampling self-consistency while achieving better calibration from a single sample
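The per-question parameter prediction in the fourth bullet can be sketched as a tiny hypernetwork mapping a question embedding to STL thresholds. The network sizes, embedding dimension, and output ranges below are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(0)

class STLHypernetwork:
    """Tiny MLP: question embedding -> per-question STL parameters (mu, epsilon).

    Illustrative sketch only; layer sizes and squashing ranges are assumptions.
    """
    def __init__(self, embed_dim: int = 16, hidden: int = 32):
        self.W1 = rng.normal(0, 0.1, (embed_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.1, (hidden, 2))  # two outputs: mu, epsilon
        self.b2 = np.zeros(2)

    def __call__(self, q_embed: np.ndarray) -> tuple[float, float]:
        h = np.tanh(q_embed @ self.W1 + self.b1)
        raw = h @ self.W2 + self.b2
        sig = 1.0 / (1.0 + np.exp(-raw))  # squash each output to (0, 1)
        mu = 0.5 + 0.5 * sig[0]           # EndLow threshold, kept in (0.5, 1.0)
        epsilon = 0.05 + 0.25 * sig[1]    # SharpDrop size, kept in (0.05, 0.3)
        return float(mu), float(epsilon)

hyper = STLHypernetwork()
# In practice q_embed would be an encoder embedding of the question text
mu, epsilon = hyper(rng.normal(size=16))
```

The same STL structure is then evaluated with these predicted `mu`/`epsilon` values instead of global constants, which is what lets one formula adapt to questions of different difficulty.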
Evidence
- Wrong-answer STL pattern cross-task Jaccard similarity: Qwen3 0.811, Gemma3 0.789, Llama 0.744 — much higher than correct-answer patterns (0.47-0.55), confirming universal reusability
- CLadder benchmark Qwen3 ECE: AveLogit 0.200 → Ours 0.035, Self-Consistency 0.223 → Ours 0.035, ~6x improvement
- BBH benchmark Qwen3 ECE: AveLogit 0.339 → Ours 0.052, Self-Eval 0.197 → Ours 0.052
- STL template count experiment: 10 templates (5 correct-pattern + 5 wrong-pattern) is optimal, giving ECE 0.020 and AUROC 0.617; too few or too many templates degrades performance
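For reference, the ECE numbers above measure the gap between stated confidence and empirical accuracy. A minimal equal-width-bin implementation (10 bins is a common default; the paper's exact binning scheme is an assumption):

```python
import numpy as np

def expected_calibration_error(conf: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """Equal-width-bin ECE: bin-weight times |accuracy - mean confidence| per bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        # half-open bins (lo, hi], with 0.0 included in the first bin
        mask = (conf > lo) & (conf <= hi) if lo > 0 else (conf >= lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)

conf = np.array([0.9, 0.8, 0.7, 0.6])       # stated confidence per answer
correct = np.array([1.0, 1.0, 0.0, 1.0])    # 1.0 if the answer was correct
print(round(expected_calibration_error(conf, correct), 3))  # 0.35
```

Lower is better: a perfectly calibrated scorer (e.g., 80% of answers with confidence 0.8 are correct) gets ECE 0.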
How to Apply
- Split CoT responses into sentence/step segments, compute average token probability per segment as a confidence time series, then flag low confidence or trigger regeneration when SharpDrop/EndLow/Recovery patterns are detected
- Wrong-answer STL patterns are reusable across tasks, so pre-mine negative STL patterns from public datasets like BBH and directly transplant them as confidence filters in other domain services
- Extract logprobs from open-source LLMs (Qwen3-8B, Llama-3-8B, etc.) and feed them into a lightweight STL-based confidence scorer to pre-filter hallucination-risk answers in RAG pipeline answer verification or agent self-review stages
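The segmentation step in the first bullet can be sketched as below. Splitting on newline tokens is an assumption (sentence-level splitting works too), and the token strings and logprobs would come from your LLM API:

```python
import math

def step_boundaries_from_tokens(tokens: list[str],
                                delimiter: str = "\n") -> list[tuple[int, int]]:
    """Split a token sequence into (start, end) index ranges, one per reasoning step.

    A step ends at any token containing the delimiter; trailing tokens form the
    final step.
    """
    boundaries, start = [], 0
    for i, tok in enumerate(tokens):
        if delimiter in tok:
            if i + 1 > start:
                boundaries.append((start, i + 1))
            start = i + 1
    if start < len(tokens):
        boundaries.append((start, len(tokens)))
    return boundaries

def probs_from_logprobs(logprobs: list[float]) -> list[float]:
    """Most APIs return log-probabilities; convert to plain probabilities."""
    return [math.exp(lp) for lp in logprobs]

tokens = ["Step", " 1:", " compute", "\n", "Step", " 2:", " answer"]
print(step_boundaries_from_tokens(tokens))  # [(0, 4), (4, 7)]
```

The resulting `(start, end)` ranges and probabilities feed directly into the confidence-extraction code in the next section.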
Code Example
import numpy as np

def extract_step_confidence(token_probs: list[float],
                            step_boundaries: list[tuple]) -> np.ndarray:
    """Average token probability per reasoning step -> confidence time series."""
    step_conf = []
    for start, end in step_boundaries:
        chunk = [p for p in token_probs[start:end] if p > 0]
        step_conf.append(np.mean(chunk) if chunk else 0.0)
    return np.array(step_conf)

def stl_end_low(signal: np.ndarray, k: int = 2, mu: float = 0.7) -> float:
    """EndLow pattern: all of the last k steps are below mu (positive = risk detected)."""
    last_k = signal[-k:]
    return float(np.min(mu - last_k))  # > 0 means EndLow pattern detected

def stl_sharp_drop(signal: np.ndarray, epsilon: float = 0.1) -> float:
    """SharpDrop pattern: some step-to-step drop of epsilon or more (positive = detected)."""
    if len(signal) < 2:
        return float("-inf")  # fewer than two steps: no drop possible
    diffs = np.diff(signal)
    return float(np.max(-diffs - epsilon))  # > 0 means SharpDrop detected

def is_likely_incorrect(token_probs, step_boundaries, mu=0.7, epsilon=0.1) -> bool:
    signal = extract_step_confidence(token_probs, step_boundaries)
    end_low_score = stl_end_low(signal, k=2, mu=mu)
    sharp_drop_score = stl_sharp_drop(signal, epsilon=epsilon)
    return end_low_score > 0 or sharp_drop_score > 0

# Usage example (token probabilities for a 3-step CoT response)
token_probs = [0.92, 0.88, 0.91, 0.85, 0.60, 0.55, 0.52]
step_boundaries = [(0, 2), (2, 4), (4, 7)]  # token index range per step
signal = extract_step_confidence(token_probs, step_boundaries)
print(f"Step confidence: {signal}")  # approx [0.9, 0.88, 0.557]
print(f"SharpDrop: {stl_sharp_drop(signal):.3f}")  # > 0 -> sharp drop detected
print(f"EndLow: {stl_end_low(signal):.3f}")  # > 0 -> low at the end
print(f"Regeneration needed: {is_likely_incorrect(token_probs, step_boundaries)}")  # True
Original Abstract
Large Language Models (LLMs) increasingly rely on long-form, multi-step reasoning to solve complex tasks such as mathematical problem solving and scientific question answering. Despite strong performance, existing confidence estimation methods typically reduce an entire reasoning process to a single scalar score, ignoring how confidence evolves throughout the generation. As a result, these methods are often sensitive to superficial factors such as response length or verbosity, and struggle to distinguish correct reasoning from confidently stated errors. We propose to characterize the stepwise confidence signal using Signal Temporal Logic (STL). Using a discriminative STL mining procedure, we discover temporal formulas that distinguish confidence signals of correct and incorrect responses. Our analysis found that the STL patterns generalize across tasks, and numeric parameters exhibit sensitivity to individual questions. Based on these insights, we develop a confidence estimation approach that informs STL blocks with parameter hypernetworks. Experiments on multiple reasoning tasks show our confidence scores are more calibrated than the baselines.