Causal Evidence that Language Models use Confidence to Drive Behavior
TL;DR Highlight
A four-phase experiment provides causal evidence that major LLMs such as GPT-4o and Gemma 3 27B actually use internal confidence signals to decide whether to answer.
Who Should Read
AI engineers applying LLM hallucination reduction or abstention features in production. ML researchers and system architects who want to understand LLM metacognition and design more reliable AI systems.
Core Mechanics
- GPT-4o holds an implicit threshold — without explicit instructions, it auto-abstains when internal confidence drops below ~77%
- Confidence's effect size for predicting abstention (βstd=0.99) is ~10x larger than RAG score, sentence embeddings, or problem difficulty
- Injecting high/low confidence vectors into Gemma 3 27B via activation steering shifts abstention rate by 59.5pp (66.5%→7.0%) — direct causal evidence
- Mediation analysis: 67.1% of the activation steering effect is transmitted via confidence redistribution, 26.2% via decision policy change
- Explicitly instructing the model with a confidence threshold (0–100%) adjusts abstention accordingly; prior confidence remains a strong predictor — Stage 1 (confidence formation) and Stage 2 (threshold policy) operate independently
- GPT-4o weights internal confidence 1.8x more than instructed threshold; abstention baselines vary dramatically across models (DeepSeek 82%, GPT-4o 56.6%, Qwen 80B 43.8%, Gemma 3 27B 27.2%)
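The two-stage mechanics above can be sketched as a toy decision rule. This is purely illustrative: the ~1.8:1 weighting of internal confidence over instructed threshold comes from the GPT-4o finding above, but the logistic form, the standardization constants, and the zero intercept are assumptions, not the paper's fitted model.

```python
import math

def p_abstain(confidence: float, instructed_threshold: float) -> float:
    """Toy probability of abstention; both inputs on a 0-100 scale."""
    # Crude standardization (assumed mean 50, SD 25 -- an assumption).
    z_conf = (confidence - 50.0) / 25.0
    z_thresh = (instructed_threshold - 50.0) / 25.0
    # Weights mirror the reported ~1.8:1 ratio of internal confidence
    # to instructed threshold; the zero intercept is an arbitrary choice.
    logit = 1.0 * z_thresh - 1.8 * z_conf
    return 1.0 / (1.0 + math.exp(-logit))

# Low confidence under a strict threshold favors abstention; high
# confidence under the same threshold favors answering.
print(p_abstain(confidence=30, instructed_threshold=70))  # ~0.90
print(p_abstain(confidence=90, instructed_threshold=70))  # ~0.11
```

The point of the sketch is the interaction: because confidence carries the larger weight, moving confidence by one standard deviation moves the abstention decision almost twice as far as moving the instructed threshold by the same amount.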
Evidence
- Phase 3 activation steering shifts abstention: low-confidence (-2) 66.5% vs high-confidence (+2) 7.0%, a 59.5pp difference (r = -0.99, p < 0.001)
- Phase 2 logistic regression: confidence |βstd|=0.99 vs RAG(0.102), difficulty(0.110), embeddings(0.106) — roughly 9–10x larger
- Phase 4: adding confidence to the threshold-only model improves fit: ΔAIC = -1953, pseudo-R² 0.11→0.24 (χ²(1) = 1955, p < 0.001)
- Mediation analysis: confidence redistribution indirect path explains 67.1% of total steering effect (indirect effect a1×b1=-0.55, 95% CI [-0.65, -0.47])
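The Phase 2 style of analysis (comparing standardized logistic-regression coefficients) can be reproduced in miniature on synthetic data. The simulated coefficients below are chosen so that confidence dominates, as reported; none of this is the paper's actual data or fitted model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
confidence = rng.uniform(0, 100, n)   # simulated internal confidence
rag_score = rng.uniform(0, 1, n)      # simulated retrieval accessibility
difficulty = rng.uniform(0, 1, n)     # simulated problem difficulty

# Simulated ground truth: abstention driven almost entirely by confidence,
# with an implicit threshold near 77 (per the GPT-4o finding above).
logit = -0.04 * (confidence - 77) + 0.3 * difficulty - 0.3 * rag_score
abstained = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

# Standardize predictors so coefficients are comparable (beta_std).
X = np.column_stack([confidence, rag_score, difficulty])
Xz = (X - X.mean(axis=0)) / X.std(axis=0)

model = LogisticRegression().fit(Xz, abstained)
for name, b in zip(["confidence", "rag", "difficulty"], model.coef_[0]):
    print(f"{name:11s} |beta_std| = {abs(b):.2f}")
```

Because the predictors are standardized first, the fitted coefficient magnitudes are directly comparable, which is what licenses the "~10x larger" comparison in the bullet above.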
How to Apply
- To improve production LLM reliability, add an "abstain if confidence < T%" instruction to the prompt (Phase 4 style); each 10-20 point threshold increase trades roughly a 1.1% accuracy gain for a higher abstention rate
- If a model rarely abstains (like Gemma 3 27B <5%), generate 20 paraphrases of the same abstention-instruction prompt and select the version with the highest abstention rate
- For open-source models with internal activation access (e.g. Gemma 3 27B): extract steering vectors by contrasting residual streams from high vs low confidence trials, inject at layers 30–40 during inference to programmatically control abstention rate
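The steering recipe in the last bullet can be sketched end to end with hook-based activation injection. This is a minimal sketch: the toy residual network stands in for Gemma 3 27B (whose real layers 30-40 would be targeted), and the "high/low confidence" inputs, layer choice, and ±2 scaling are stand-ins; only the mechanics (contrast activations, take the mean difference, add the vector back during inference) reflect the technique itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyBlock(nn.Module):
    """Residual block standing in for one transformer layer."""
    def __init__(self, d):
        super().__init__()
        self.lin = nn.Linear(d, d)
    def forward(self, x):
        return x + torch.relu(self.lin(x))

d = 16
model = nn.Sequential(*[ToyBlock(d) for _ in range(4)])
target = model[2]  # stand-in for a mid/late layer (Gemma: layers 30-40)

with torch.no_grad():
    # 1) Record activations at the target block under the two conditions.
    recorded = []
    rec = target.register_forward_hook(
        lambda m, i, o: recorded.append(o.detach()))
    model(torch.randn(32, d) + 1.0)   # simulated high-confidence trials
    model(torch.randn(32, d) - 1.0)   # simulated low-confidence trials
    rec.remove()

    # 2) Steering vector = mean difference of the contrasted activations.
    v_steer = recorded[0].mean(0) - recorded[1].mean(0)

    # 3) Inject alpha * v_steer into the residual stream at inference time
    #    (a forward hook that returns a value replaces the layer's output).
    def make_steer_hook(alpha):
        return lambda m, i, o: o + alpha * v_steer

    x = torch.randn(8, d)
    base = model(x)
    h = target.register_forward_hook(make_steer_hook(2.0))  # "+2" condition
    steered = model(x)
    h.remove()
    print(float((steered - base).norm()))  # nonzero: injection shifted outputs
```

With a real model, step 1 would record residual-stream states from high- vs low-confidence trials, and sweeping alpha in step 3 would trace out the abstention-rate curve the paper reports (66.5% at -2 down to 7.0% at +2).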
Code Example
# Phase 4 approach: Add confidence threshold instruction to prompt
prompt_template = """
You will be given a 4-way multiple choice question, with options 1-4.
First, internally estimate the probability (0-100) that your answer is correct.
Then:
- If your confidence is MORE than {threshold}%, output only the number of your answer.
- If your confidence is LESS than {threshold}%, output '5' to indicate you want to abstain.
Output only a single number.
Question: {question}
1) {option1}
2) {option2}
3) {option3}
4) {option4}
Answer:"""
# Sweeping the threshold from 0 to 100 controls the accuracy/coverage tradeoff
# High threshold (e.g., 80%) → accuracy↑, coverage↓
# Low threshold (e.g., 30%) → accuracy↓, coverage↑
for threshold in range(0, 101, 10):
    prompt = prompt_template.format(
        threshold=threshold,
        question="Who won the Nobel Prize in Physics in 2024?",
        option1="Geoffrey Hinton",
        option2="Yann LeCun",
        option3="John Hopfield",
        option4="Andrew Ng"
    )
    # response = llm.generate(prompt)
    # A returned '5' means abstention; 1-4 is the corresponding answer
Original Abstract
Metacognition -- the ability to assess one's own cognitive performance -- is documented across species, with internal confidence estimates serving as a key signal for adaptive behavior. While confidence can be extracted from Large Language Model (LLM) outputs, whether models actively use these signals to regulate behavior remains a fundamental question. We investigate this through a four-phase abstention paradigm. Phase 1 established internal confidence estimates in the absence of an abstention option. Phase 2 revealed that LLMs apply implicit thresholds to these estimates when deciding to answer or abstain. Confidence emerged as the dominant predictor of behavior, with effect sizes an order of magnitude larger than knowledge retrieval accessibility (RAG scores) or surface-level semantic features. Phase 3 provided causal evidence through activation steering: manipulating internal confidence signals correspondingly shifted abstention rates. Finally, Phase 4 demonstrated that models can systematically vary abstention policies based on instructed thresholds. Our findings indicate that abstention arises from the joint operation of internal confidence representations and threshold-based policies, mirroring the two-stage metacognitive control found in biological systems. This capacity is essential as LLMs transition into autonomous agents that must recognize their own uncertainty to decide when to act or seek help.