Confidence Estimation for LLMs in Multi-turn Interactions
TL;DR Highlight
The first systematic study measuring whether chatbots produce well-calibrated confidence scores as conversations grow longer — existing methods all fall short, and the newly proposed P(SUFFICIENT) is the best alternative.
Who Should Read
Developers thinking through when an LLM should act with confidence in agent pipelines. ML engineers designing hallucination detection logic for multi-turn chatbots or human-in-the-loop systems.
Core Mechanics
- All existing confidence estimation methods tested (verbalized, self-consistency, P(TRUE)) show poor calibration in multi-turn conversations, with InfoECE ranging from roughly 40% to 80%
- The newly proposed P(SUFFICIENT) is a logit-based probe that asks 'Are the hints gathered so far sufficient to uniquely confirm this answer?' — it performs best on both monotonicity and calibration
- For Llama3.1-70B, P(SUFFICIENT) achieves an InfoECE of 5.27% on the GUESS dataset, far lower than the other methods (up to 79.97%)
- In placebo (fake hint) experiments, only P(SUFFICIENT) actually lowers confidence on meaningless turns — it tracks real information rather than just growing more confident with turn count
- When comparing multi-turn conversation history against a single-turn summary, accuracy differs by less than 1%, but confidence signals vary significantly depending on the method
- Larger models (70B/72B) show significantly higher monotonicity τ for P(SUFFICIENT) — τ=93.91% on Qwen2.5-72B GUESS
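InfoECE is a length-normalized variant of Expected Calibration Error; the exact normalization follows the paper, but the underlying ECE it extends can be sketched in a few lines. This is a toy standard-ECE implementation for intuition, not the paper's InfoECE:

```python
def ece(confidences, correct, n_bins=10):
    """Bin predictions by confidence; ECE is the bin-size-weighted
    mean |accuracy - average confidence| gap across bins."""
    n = len(confidences)
    total = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # half-open bins (lo, hi]; 0.0 is assigned to the first bin
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        total += len(idx) / n * abs(acc - avg_conf)
    return total

# Well-calibrated toy data yields a small ECE; overconfident data a large one.
well_calibrated = ece([0.95, 0.95, 0.05], [1, 1, 0])
overconfident = ece([0.9, 0.9], [0, 0])
```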
Evidence
- P(SUFFICIENT) InfoECE on Llama3.1-70B GUESS: 5.27% (vs. VANILLA-VERB 65.52%, SC 56.88%)
- Kendall's τ (ground truth) for P(SUFFICIENT): Llama3.1-70B — 20Q 91.62%, GUESS 86.55%, GRACE 85.90%
- P(TRUE) shows confidence increase even with placebo hints — Llama3.1-8B GUESS: +11.75, Qwen2.5-72B: +14.61 (p<10⁻⁶)
- P(SUFFICIENT) actually decreases with placebo hints — Llama3.1-70B GUESS: 14.27→2.97 (p<0.05)
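The monotonicity numbers above are Kendall's τ. As a rough intuition for the metric (the paper computes τ against ground-truth information content, not simply against turn index as in this toy example), a minimal pairwise implementation:

```python
def kendall_tau(x, y):
    """Kendall's tau: (concordant pairs - discordant pairs) / total pairs.
    +1.0 means y is perfectly monotonically increasing in x."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Confidence that rises with every turn gives tau = 1.0
turns = [1, 2, 3, 4, 5]
conf = [0.1, 0.2, 0.4, 0.7, 0.9]
tau = kendall_tau(turns, conf)
```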
How to Apply
- When an agent decides whether to make a tool call, replacing 'Is this answer correct?' (P(TRUE)) with 'Is the information gathered so far sufficient to confirm this answer?' yields a more reliable confidence signal
- In RAG or clarification loops where the LLM decides whether to request more information, you can add a P(SUFFICIENT)-style probe at each turn and implement logic that triggers follow-up questions when confidence doesn't exceed a threshold
- When compressing multi-turn conversation history into a single summary before passing it to the model, smaller models (8B-class) suffer degraded calibration — preserving the original turn structure is safer
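The clarification-loop pattern above can be sketched as follows. Both `estimate_confidence` and `ask_followup` are hypothetical callbacks (not APIs from the paper): the former would wrap a P(SUFFICIENT)-style probe, the latter would generate a clarifying question.

```python
def clarification_loop(estimate_confidence, ask_followup, history,
                       threshold=0.5, max_turns=5):
    """Keep requesting clarification until confidence clears the threshold
    or the turn budget runs out. Returns (history, confident_enough)."""
    for _ in range(max_turns):
        if estimate_confidence(history) >= threshold:
            return history, True      # sufficient information: act on the answer
        history = history + [ask_followup(history)]  # gather one more hint
    return history, False             # budget exhausted: escalate or abstain

# Toy run: confidence grows with each gathered hint
conf = lambda h: min(1.0, 0.2 * len(h))
ask = lambda h: f"hint-{len(h)}"
hist, ok = clarification_loop(conf, ask, ["question"])
```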
Code Example
# Example of estimating confidence using the P(SUFFICIENT) approach
# A binary probe that checks whether the hints/dialogue so far uniquely confirm the answer
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

PSUFFICIENT_PROMPT = """
{dialogue_history}
Based only on the information and hints provided above,
does that information sufficiently entail that the correct answer is exactly {answer}?
A. Yes — the information is sufficient to conclude {answer}.
B. No — the information is insufficient, allows alternatives, or contradicts {answer}.
Output format: **A** or **B** only (single uppercase letter; no spaces, punctuation, or explanation):
"""

def get_p_sufficient(model, tokenizer, dialogue_history: str, answer: str) -> float:
    prompt = PSUFFICIENT_PROMPT.format(
        dialogue_history=dialogue_history,
        answer=answer,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits[0, -1, :]  # next-token logits at the last prompt position
    token_A = tokenizer.encode("A", add_special_tokens=False)[0]
    token_B = tokenizer.encode("B", add_special_tokens=False)[0]
    # Renormalize over just the two option tokens
    probs = torch.softmax(logits[[token_A, token_B]], dim=0)
    return probs[0].item()  # P(A) = P(SUFFICIENT)
# Usage example
# confidence = get_p_sufficient(model, tokenizer, history, current_answer)
# if confidence < 0.5: agent requests additional clarification
Original Abstract
While confidence estimation is a promising direction for mitigating hallucinations in Large Language Models (LLMs), current research dominantly focuses on single-turn settings. The dynamics of model confidence in multi-turn conversations, where context accumulates and ambiguity is progressively resolved, remain largely unexplored. Reliable confidence estimation in multi-turn settings is critical for many downstream applications, such as autonomous agents and human-in-the-loop systems. This work presents the first systematic study of confidence estimation in multi-turn interactions, establishing a formal evaluation framework grounded in two key desiderata: per-turn calibration and monotonicity of confidence as more information becomes available. To facilitate this, we introduce novel metrics, including a length-normalized Expected Calibration Error (InfoECE), and a new "Hinter-Guesser" paradigm for generating controlled evaluation datasets. Our experiments reveal that widely-used confidence techniques struggle with calibration and monotonicity in multi-turn dialogues. We propose P(Sufficient), a logit-based probe that achieves comparatively better performance, although the task remains far from solved. Our work provides a foundational methodology for developing more reliable and trustworthy conversational agents.