Predicting LLM Chain-of-Thought Reasoning Reliability from Entropy Trajectory Shape

Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought

Mar 19, 2026 • Xinghao Zhao

TL;DR Highlight

Checking whether the model's uncertainty falls monotonically at each step of CoT reasoning gives a cheap predictor of answer correctness, without resorting to expensive self-consistency.

Who Should Read

ML engineers and AI service developers who run LLM-based math or reasoning systems and want more reliable answers, especially those looking to cut the cost of multi-sample methods such as self-consistency.

Core Mechanics

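At each step of a CoT (Chain-of-Thought) chain, sample m=5 short completions and compute the Shannon entropy of their answers. Chains whose entropy falls at every step are +21.9 pp more accurate than chains whose entropy does not (Qwen2.5-7B-Instruct, GSM8K).
Total entropy reduction (scalar coherence) is uncorrelated with accuracy (ρ=−0.06, p=0.31): what matters is whether entropy falls at every step, not how much it falls overall.
Broken down by violation count, accuracy drops from 68.8% to 50.8% to 28.6% for 0, 1, and 2 violations; the magnitude of a violation has no predictive power.
Confidence based on token log-probability becomes worse calibrated as reasoning gets deeper (ECE rises from 0.186 at step 0 to 0.312 at step 7).
Replicating on Mistral-7B-Instruct-v0.3 gives an even larger gap of +34.7 pp (OR=4.33), which suggests the effect is not specific to one model family.
Cost is about 1,500 tokens per question: half of SC@10 (~3,000 tokens) and one eighth of SC@40 (~12,000 tokens).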
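
The shape-over-magnitude point can be illustrated with a minimal sketch. The two trajectories below are invented for illustration (they are not data from the paper), and the helper names are hypothetical. Both trajectories lose the same total entropy, yet only one of them is monotone.

```python
def total_reduction(entropies: list[float]) -> float:
    """Scalar summary: first-step entropy minus last-step entropy."""
    return entropies[0] - entropies[-1]


def count_violations(entropies: list[float], eps: float = 0.0) -> int:
    """Number of transitions where entropy rises by more than eps."""
    return sum(
        1 for prev, cur in zip(entropies, entropies[1:]) if cur > prev + eps
    )


# Two hypothetical trajectories (values in nats, invented for illustration).
steady = [1.0, 0.6, 0.3, 0.0]  # falls at every step
bumpy = [1.0, 0.3, 0.6, 0.0]   # same endpoints, one rise in the middle

for name, traj in [("steady", steady), ("bumpy", bumpy)]:
    print(name, total_reduction(traj), count_violations(traj))
```

A scalar such as total reduction cannot tell these two chains apart, while the violation count can, which is the distinction the findings above rely on.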

Evidence

Qwen2.5-7B-Instruct on GSM8K (n=300): monotone chains 68.8% vs. non-monotone 46.8%, a +21.9 pp gap (Fisher's exact p=0.0005, OR=2.50)
Mistral-7B-Instruct-v0.3 replication: 72.3% vs. 37.6%, +34.7 pp (OR=4.33, p<10⁻⁸)
Equal-budget comparison (73.7% coverage): entropy monotonicity 68.8% vs. SC@3 70.6% and SC@5 69.7%, so it is competitive for its cost
Changing the sample count to m=3/5/10 moves the gap by less than 1.5 pp (+22.5/+21.9/+21.2 pp), and the gap stays at +21.9 pp across ε=0 to 0.10
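
The coverage figure can be tied back to the +5.8 pp selective-accuracy gain quoted in the abstract with a short calculation. This sketch assumes that "coverage" means the fraction of chains classified as monotone, which is an interpretation, not something the summary states explicitly.

```python
# Figures reported in the study for Qwen2.5-7B-Instruct on GSM8K.
coverage = 0.737          # assumed: fraction of chains classified monotone
acc_monotone = 0.688
acc_non_monotone = 0.468

# Accuracy over all chains is the coverage-weighted average of both groups.
acc_all = coverage * acc_monotone + (1 - coverage) * acc_non_monotone
gain_pp = (acc_monotone - acc_all) * 100

print(f"accuracy over all chains: {acc_all:.3f}")
print(f"gain from keeping only monotone chains: {gain_pp:+.1f} pp")
```

Under that assumption the numbers are mutually consistent: answering only the monotone 73.7% of questions lifts accuracy from about 63.0% to 68.8%.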

How to Apply

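After generating the CoT, sample m=5 short completions (max 150 tokens) at temperature=0.7 from the prefix ending at each reasoning step and compute the Shannon entropy of the answers. If no transition shows an entropy increase (zero violations), classify the chain as monotone and treat its answer as high-confidence.
To cut cost further, use the prefix variant that checks only the first 2 transitions: it recovers 76% of the final gap at 60% of the full cost, which makes it useful for early-exit triage.
A hybrid strategy that applies self-consistency only to questions classified as non-monotone can bring the average token cost well below that of running full SC on every question.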
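
The hybrid strategy could be sketched roughly as follows. Everything here is an assumption made for illustration: `hybrid_answer`, `expected_tokens`, and the `sample_chain` callback are hypothetical names, and `fake_sampler` is a stand-in so the sketch runs without a model. The cost figures are the approximate per-question token counts quoted in this article.

```python
from collections import Counter
from typing import Callable


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer (ties go to the first one seen)."""
    return Counter(answers).most_common(1)[0][0]


def hybrid_answer(
    question: str,
    cot_answer: str,
    step_entropies: list[float],
    sample_chain: Callable[[str], str],
    n_chains: int = 10,
    eps: float = 0.0,
) -> tuple[str, str]:
    """Keep the CoT answer if the trajectory is monotone, else fall back to SC."""
    violations = sum(
        1 for prev, cur in zip(step_entropies, step_entropies[1:])
        if cur > prev + eps
    )
    if violations == 0:
        return cot_answer, "monotone"
    votes = [sample_chain(question) for _ in range(n_chains)]
    return majority_vote(votes), "self-consistency"


def expected_tokens(monotone_rate: float, check_cost: int, sc_cost: int) -> float:
    """Average tokens per question when only non-monotone chains trigger SC."""
    return check_cost + (1 - monotone_rate) * sc_cost


# Stand-in sampler (hypothetical); replace with a real model call.
def fake_sampler(question: str) -> str:
    return "42"


print(hybrid_answer("q", "40", [0.5, 0.9, 0.0], fake_sampler))
print(hybrid_answer("q", "42", [0.9, 0.5, 0.0], fake_sampler))
print(expected_tokens(0.737, 1_500, 12_000))
```

With a 73.7% monotone rate, a 1,500-token check, and SC@40 at about 12,000 tokens, the expected cost comes to roughly 4,700 tokens per question. This is only arithmetic on the quoted figures; real savings depend on the monotone rate of the workload.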

Code Example

import math
from collections import Counter


def compute_entropy(answers: list[str]) -> float:
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(answers)
    total = len(answers)
    probs = [c / total for c in counts.values()]
    # max() turns the -0.0 produced by a unanimous sample into a clean 0.0.
    return max(0.0, -sum(p * math.log(p) for p in probs))


def is_monotone_chain(prefix_entropies: list[float], eps: float = 0.01) -> bool:
    """Return True if entropy never rises by more than eps between steps."""
    for i in range(len(prefix_entropies) - 1):
        if prefix_entropies[i + 1] > prefix_entropies[i] + eps:
            return False
    return True


def count_violations(prefix_entropies: list[float], eps: float = 0.01) -> int:
    """Count entropy increases (0 = high confidence, 1 = medium, 2+ = low)."""
    return sum(
        1 for i in range(len(prefix_entropies) - 1)
        if prefix_entropies[i + 1] > prefix_entropies[i] + eps
    )


# Usage example.
# For each step, sample m=5 answers from the text up to that step's prefix,
# then compute the entropy of those answers.
step_answers = [
    ["42", "40", "38", "42", "40"],  # step 0: uncertain (three distinct answers)
    ["42", "42", "42", "42", "40"],  # step 1: more certain
    ["42", "42", "42", "42", "42"],  # step 2: unanimous
]

entropies = [compute_entropy(answers) for answers in step_answers]
print(f"Entropy trajectory: {[round(h, 3) for h in entropies]}")
print(f"Monotone: {is_monotone_chain(entropies)}")  # True means high confidence
print(f"Violation count: {count_violations(entropies)}")  # 0=high, 1=mid, 2+=low
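
The cheaper prefix variant from How to Apply, which inspects only the first two transitions, might look like the sketch below. `prefix_is_monotone` and both trajectories are hypothetical, made up for illustration.

```python
def prefix_is_monotone(
    entropies: list[float], n_transitions: int = 2, eps: float = 0.01
) -> bool:
    """Check only the first n_transitions steps of the entropy trajectory."""
    head = entropies[: n_transitions + 1]
    return all(cur <= prev + eps for prev, cur in zip(head, head[1:]))


# Hypothetical trajectories: the first rises early, the second rises late.
early_rise = [0.95, 1.33, 0.50, 0.00]
late_rise = [1.33, 0.95, 0.50, 0.67]

print(prefix_is_monotone(early_rise))  # False: flagged within two transitions
print(prefix_is_monotone(late_rise))   # True: the late rise is never inspected
```

The second case shows the trade-off: stopping early saves sampling cost, but a violation that occurs after the inspected prefix goes unnoticed.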

Terminology

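Chain-of-Thought (CoT): A technique in which the LLM generates intermediate working ("Step 1: ..., Step 2: ...") instead of giving the final answer directly, much as a person writes out the calculation when solving a problem.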

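Shannon Entropy: A number that measures how uncertain a probability distribution is. If all 5 samples give the same answer, entropy is 0 (certain); if all 5 give different answers, entropy is at its maximum (uncertain).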
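
As a worked example for m=5 samples, the three cases below use the natural logarithm, as the code example in this article does. The notation $p(a)$ for the share of samples giving answer $a$ is an assumption made here for illustration.

```latex
\begin{align*}
H &= -\sum_{a} p(a)\,\ln p(a) \\
\text{all five agree:}\quad H &= -(1 \cdot \ln 1) = 0 \\
\text{4 vs. 1 split:}\quad H &= -(0.8 \ln 0.8 + 0.2 \ln 0.2) \approx 0.500 \\
\text{all five differ:}\quad H &= -5 \cdot (0.2 \ln 0.2) = \ln 5 \approx 1.609
\end{align*}
```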

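Self-Consistency: Generating several answers to the same question and taking the most frequent one as the final answer. It improves accuracy but is expensive, costing roughly 10 to 40 times the tokens of a single chain.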

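ECE (Expected Calibration Error): A calibration error that measures whether a model that says "I am 90% sure" is in fact right 90% of the time. The lower it is, the better the model's confidence matches its actual accuracy.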
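
A common way to compute ECE is to bucket predictions by confidence and average the gap between confidence and accuracy in each bucket. The sketch below uses equal-width bins; the bin count and the toy data are assumptions, and the paper's exact binning is not given in this summary.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins: list[list[int]] = [[] for _ in range(n_bins)]
    for i, conf in enumerate(confidences):
        # Clamp so that conf == 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append(i)

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(correct[i] for i in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy data: a model that claims 0.9 confidence but is right 3 times out of 5.
confs = [0.9, 0.9, 0.9, 0.9, 0.9]
hits = [True, True, True, False, False]
print(round(expected_calibration_error(confs, hits), 3))  # 0.3
```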

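Monotonicity: The property that uncertainty (entropy) never rises as reasoning proceeds and keeps going down. Intuitively, the model builds confidence steadily without getting confused partway through.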

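Odds Ratio (OR): The ratio of the odds of an event between two groups. OR=2.50 means the odds of a correct answer (correct : incorrect) for monotone chains are 2.5 times those for non-monotone chains. This is not the same as being 2.5 times as likely to be correct.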
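
The difference between an odds ratio and a plain ratio of accuracies can be checked from the reported figures. This sketch works from the rounded percentages, so it only approximates the published OR values, which were presumably computed from raw counts.

```python
def odds(p: float) -> float:
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1 - p)


def odds_ratio(p_a: float, p_b: float) -> float:
    """Ratio of the odds of success in group A to those in group B."""
    return odds(p_a) / odds(p_b)


# Accuracies reported in the study (rounded, so results are approximate).
qwen = odds_ratio(0.688, 0.468)
mistral = odds_ratio(0.723, 0.376)

print(f"Qwen2.5-7B OR ~ {qwen:.2f}, accuracy ratio ~ {0.688 / 0.468:.2f}")
print(f"Mistral-7B OR ~ {mistral:.2f}, accuracy ratio ~ {0.723 / 0.376:.2f}")
```

For Qwen the odds ratio comes out near 2.5 while the accuracy ratio is only about 1.47, which is why the two should not be read interchangeably.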

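Token Log-Probability: The log of the probability the LLM assigns to each token it generates. It is the most basic signal of model confidence, but this paper shows that it becomes less trustworthy as reasoning gets deeper.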

Original Abstract

Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply remains elusive. We study whether the shape of uncertainty dynamics across reasoning steps--captured by sampling a few answer completions per step--predicts correctness. We introduce entropy-trajectory monotonicity: a chain is monotone if its per-step answer-distribution entropy decreases at every step. On GSM8K (n=300) with Qwen2.5-7B-Instruct, monotone chains achieve 68.8% accuracy vs. 46.8% for non-monotone chains (+21.9 pp; Fisher's p=0.0005; OR=2.50). Critically, total entropy reduction is not predictive (ρ=−0.06, p=0.31), revealing a shape-over-magnitude dissociation: whether entropy decreases at every step matters, not how much. Violation count 0/1/2 gives 68.8%/50.8%/28.6% accuracy. Token log-probability confidence worsens in calibration with step depth (ECE: 0.186→0.312), and monotonicity achieves +5.8 pp at 73.7% coverage, outperforming scalar baselines at approximately 1,500 tokens/question--1/8 the cost of 40-chain self-consistency. Results replicate on Mistral-7B (n=300): monotone chains reach 72.3% vs. 37.6% (+34.7 pp; OR=4.33). Structural properties of uncertainty trajectories are thus more informative than aggregate measures.