Confidence Estimation for LLMs in Multi-turn Interactions
TL;DR Highlight
The first systematic study measuring whether chatbots produce well-calibrated confidence scores as conversations grow longer — existing methods all fall short, and the newly proposed P(SUFFICIENT) is the best alternative.
Who Should Read
Developers thinking through when an LLM should act with confidence in agent pipelines. ML engineers designing hallucination detection logic for multi-turn chatbots or human-in-the-loop systems.
Core Mechanics
- All existing confidence estimation methods tested (verbalized, self-consistency, P(TRUE)) show poor calibration in multi-turn conversations, with InfoECE ranging from roughly 40% to 80%
- The newly proposed P(SUFFICIENT) is a logit-based probe that asks 'Are the hints gathered so far sufficient to uniquely confirm this answer?' — it performs best on both monotonicity and calibration
- For Llama3.1-70B, P(SUFFICIENT) achieves an InfoECE of 5.27% on the GUESS dataset, far lower than the other methods (up to 79.97%)
- In placebo (fake hint) experiments, only P(SUFFICIENT) actually lowers confidence on meaningless turns — it tracks real information rather than just growing more confident with turn count
- When comparing multi-turn conversation history against a single-turn summary, accuracy differs by less than 1%, but confidence signals vary significantly depending on the method
- Larger models (70B/72B) show significantly higher monotonicity τ for P(SUFFICIENT) — τ=93.91% on Qwen2.5-72B GUESS
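InfoECE is a length-normalized variant of Expected Calibration Error; the exact normalization follows the paper, but the underlying ECE it extends can be sketched in a few lines. This is a toy standard-ECE implementation for intuition, not the paper's InfoECE:

```python
def ece(confidences, correct, n_bins=10):
    """Bin predictions by confidence; ECE is the bin-size-weighted
    mean |accuracy - average confidence| gap across bins."""
    n = len(confidences)
    total = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # half-open bins (lo, hi]; 0.0 is assigned to the first bin
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        total += len(idx) / n * abs(acc - avg_conf)
    return total

# Well-calibrated toy data yields a small ECE; overconfident data a large one.
well_calibrated = ece([0.95, 0.95, 0.05], [1, 1, 0])
overconfident = ece([0.9, 0.9], [0, 0])
```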
Evidence
- P(SUFFICIENT) InfoECE on Llama3.1-70B GUESS: 5.27% (vs. VANILLA-VERB 65.52%, SC 56.88%)
- Kendall's τ (ground truth) for P(SUFFICIENT): Llama3.1-70B — 20Q 91.62%, GUESS 86.55%, GRACE 85.90%
- P(TRUE) shows confidence increase even with placebo hints — Llama3.1-8B GUESS: +11.75, Qwen2.5-72B: +14.61 (p<10⁻⁶)
- P(SUFFICIENT) actually decreases with placebo hints — Llama3.1-70B GUESS: 14.27→2.97 (p<0.05)
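The monotonicity numbers above are Kendall's τ. As a rough intuition for the metric (the paper computes τ against ground-truth information content, not simply against turn index as in this toy example), a minimal pairwise implementation:

```python
def kendall_tau(x, y):
    """Kendall's tau: (concordant pairs - discordant pairs) / total pairs.
    +1.0 means y is perfectly monotonically increasing in x."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Confidence that rises with every turn gives tau = 1.0
turns = [1, 2, 3, 4, 5]
conf = [0.1, 0.2, 0.4, 0.7, 0.9]
tau = kendall_tau(turns, conf)
```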
How to Apply
- When an agent decides whether to make a tool call, replacing 'Is this answer correct?' (P(TRUE)) with 'Is the information gathered so far sufficient to confirm this answer?' yields a more reliable confidence signal
- In RAG or clarification loops where the LLM decides whether to request more information, you can add a P(SUFFICIENT)-style probe at each turn and implement logic that triggers follow-up questions when confidence doesn't exceed a threshold
- When compressing multi-turn conversation history into a single summary before passing it to the model, smaller models (8B-class) suffer degraded calibration — preserving the original turn structure is safer
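The clarification-loop pattern above can be sketched as follows. Both `estimate_confidence` and `ask_followup` are hypothetical callbacks (not APIs from the paper): the former would wrap a P(SUFFICIENT)-style probe, the latter would generate a clarifying question.

```python
def clarification_loop(estimate_confidence, ask_followup, history,
                       threshold=0.5, max_turns=5):
    """Keep requesting clarification until confidence clears the threshold
    or the turn budget runs out. Returns (history, confident_enough)."""
    for _ in range(max_turns):
        if estimate_confidence(history) >= threshold:
            return history, True      # sufficient information: act on the answer
        history = history + [ask_followup(history)]  # gather one more hint
    return history, False             # budget exhausted: escalate or abstain

# Toy run: confidence grows with each gathered hint
conf = lambda h: min(1.0, 0.2 * len(h))
ask = lambda h: f"hint-{len(h)}"
hist, ok = clarification_loop(conf, ask, ["question"])
```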
Code Example
# Example of estimating confidence using the P(SUFFICIENT) approach
# A binary probe that checks whether the hints/dialogue so far uniquely confirm the answer
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

PSUFFICIENT_PROMPT = """
{dialogue_history}
Based only on the information and hints provided above,
does that information sufficiently entail that the correct answer is exactly {answer}?
A. Yes — the information is sufficient to conclude {answer}.
B. No — the information is insufficient, allows alternatives, or contradicts {answer}.
Output format: **A** or **B** only (single uppercase letter; no spaces, punctuation, or explanation):
"""

def get_p_sufficient(model, tokenizer, dialogue_history: str, answer: str) -> float:
    prompt = PSUFFICIENT_PROMPT.format(
        dialogue_history=dialogue_history,
        answer=answer,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits[0, -1, :]  # next-token logits at the last prompt position
    token_A = tokenizer.encode("A", add_special_tokens=False)[0]
    token_B = tokenizer.encode("B", add_special_tokens=False)[0]
    # Renormalize over just the two option tokens
    probs = torch.softmax(logits[[token_A, token_B]], dim=0)
    return probs[0].item()  # P(A) = P(SUFFICIENT)
# Usage example
# confidence = get_p_sufficient(model, tokenizer, history, current_answer)
# if confidence < 0.5: agent requests additional clarification
Original Abstract
While confidence estimation is a promising direction for mitigating hallucinations in Large Language Models (LLMs), current research dominantly focuses on single-turn settings. The dynamics of model confidence in multi-turn conversations, where context accumulates and ambiguity is progressively resolved, remain largely unexplored. Reliable confidence estimation in multi-turn settings is critical for many downstream applications, such as autonomous agents and human-in-the-loop systems. This work presents the first systematic study of confidence estimation in multi-turn interactions, establishing a formal evaluation framework grounded in two key desiderata: per-turn calibration and monotonicity of confidence as more information becomes available. To facilitate this, we introduce novel metrics, including a length-normalized Expected Calibration Error (InfoECE), and a new "Hinter-Guesser" paradigm for generating controlled evaluation datasets. Our experiments reveal that widely-used confidence techniques struggle with calibration and monotonicity in multi-turn dialogues. We propose P(Sufficient), a logit-based probe that achieves comparatively better performance, although the task remains far from solved. Our work provides a foundational methodology for developing more reliable and trustworthy conversational agents.