The First Token Knows: Single-Decode Confidence for Hallucination Detection | AI Paper Digest

TL;DR Highlight

The probability distribution at the first token of an LLM's answer is enough to detect hallucinations about as well as semantic self-consistency, which needs 10 sampled generations.

Who Should Read

Backend and ML engineers building hallucination-detection pipelines for LLM services, especially those who need a confidence measure while cutting inference cost.

Core Mechanics

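First-token confidence (ϕfirst) is computed from the normalized entropy of the top-100 logits at the first content-bearing token of a greedy-decoded answer. The score is oriented so that values closer to 1 mean the model is more confident.
ϕfirst needs only a single greedy decode with no extra sampling, so its generation cost is roughly 1/11 that of semantic self-consistency (1 greedy decode + 10 samples, followed by NLI clustering).
Across three models (Llama-3.1-8B, Mistral-7B-v0.3, Qwen2.5-7B) and two benchmarks (PopQA and TriviaQA, n=1000 each), ϕfirst reaches a mean AUROC of 0.820, ahead of semantic self-consistency (0.793) and surface-form self-consistency (0.791).
ϕfirst has the highest AUROC in 5 of the 6 dataset-model combinations and is within 0.002 of the best method in the remaining one.
Combining ϕfirst with semantic AU (the semantic agreement score) in a logistic ensemble raises AUROC by only +0.021 on average, which suggests that ϕfirst already carries most of the information in semantic AU.
Verbalized confidence, where the model is asked directly how sure it is, has the lowest mean AUROC at 0.700, reconfirming that an LLM's self-reported confidence is hard to trust.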
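
The score described in the first item can be sketched without any model, using only a raw logit vector. This is a minimal illustration, not the paper's reference code: reporting confidence as one minus the normalized entropy is an assumption chosen so that values near 1 mean a confident model, and the function name `phi_first_from_logits` and the toy logit vectors are hypothetical.

```python
import math
from typing import Sequence


def phi_first_from_logits(logits: Sequence[float], k: int = 100) -> float:
    """Confidence in [0, 1]: 1 - H(softmax over top-k logits) / log(number kept)."""
    top = sorted(logits, reverse=True)[:k]
    if len(top) < 2:
        return 1.0  # a single candidate carries no uncertainty
    m = top[0]  # subtract the max for a numerically stable softmax
    exps = [math.exp(x - m) for x in top]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Skip zero probabilities: they contribute nothing to the entropy.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # log(len(top)) is the maximum entropy over the kept candidates.
    return 1.0 - entropy / math.log(len(top))


peaked = [12.0] + [0.0] * 99  # one token dominates: confidence near 1
flat = [1.0] * 100            # all tokens equally likely: confidence near 0
print(round(phi_first_from_logits(peaked), 3))
print(round(phi_first_from_logits(flat), 3))
```

A peaked distribution scores close to 1 and a uniform one close to 0, which matches the orientation used in this digest.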

Evidence

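Overall mean AUROC: ϕfirst 0.820 vs. semantic AU 0.793 vs. AU-full 0.791. ϕfirst leads by +0.036 on PopQA and by +0.016 on TriviaQA.
The Pearson correlation between ϕfirst and semantic AU ranges from 0.54 to 0.76 (mean 0.67), and combining the two improves AUROC by only +0.021 on average, which supports subsumption: ϕfirst largely contains the semantic AU signal.
In paired bootstrap tests, ϕfirst is significantly better (p<0.05) than AU-full in 4 of the 6 cells and than semantic AU in 3; the remaining cells are judged equivalent.
The raw correlation between answer length and ϕfirst is -0.11 to -0.25, but after controlling for correctness it falls to nearly zero on PopQA (-0.02 to -0.04), indicating that the apparent length bias is actually driven by differences in accuracy.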
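
To make the AUROC and paired-bootstrap comparisons above concrete, here is a small standard-library sketch. The digest does not describe the paper's exact bootstrap procedure, so the percentile interval, the resample count, and the function names `auroc` and `paired_bootstrap_delta` are assumptions, and the scores and labels are made-up toy values rather than results from the paper.

```python
import random
from typing import List, Sequence, Tuple


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a correct answer (label 1) outscores an incorrect one (label 0).

    Ties count as half a win (the Mann-Whitney formulation of AUROC).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos) * len(neg))


def paired_bootstrap_delta(
    a: Sequence[float], b: Sequence[float], labels: Sequence[int],
    n_boot: int = 1000, seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed, 2.5th percentile, 97.5th percentile) of AUROC(a) - AUROC(b).

    Both scores are evaluated on the same resampled questions (paired).
    """
    observed = auroc(a, labels) - auroc(b, labels)
    rng = random.Random(seed)
    n = len(labels)
    deltas: List[float] = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lacks one class, so AUROC is undefined
        deltas.append(auroc([a[i] for i in idx], ys) - auroc([b[i] for i in idx], ys))
    deltas.sort()
    return observed, deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot) - 1]


# Toy data: 1 = answer was correct, 0 = hallucinated.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
phi_scores = [0.9, 0.8, 0.7, 0.2, 0.3, 0.1, 0.6, 0.4]
other_scores = [0.9, 0.4, 0.7, 0.5, 0.3, 0.1, 0.6, 0.8]
print(auroc(phi_scores, labels), auroc(other_scores, labels))
print(paired_bootstrap_delta(phi_scores, other_scores, labels, n_boot=200))
```

The pairwise loop is quadratic in the number of questions, which is fine for toy inputs; at n=1000 with many resamples a rank-based implementation such as scikit-learn's `roc_auc_score` would be the more practical choice.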

How to Apply

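When the model API exposes logits (for example Hugging Face transformers), take the top-100 logits at the first content-bearing token of the generated answer and compute their normalized entropy. This yields a hallucination risk score with no additional sampling.
If you currently run temperature sampling 10 times to measure self-consistency, measure ϕfirst first as a baseline, check whether sampling improves detection enough to justify its extra cost, and only then decide how many samples to draw.
For closed-book QA (no RAG; the model answers from parametric knowledge alone) that needs confidence filtering, set a ϕfirst threshold (for example, tag answers at or below 0.7 as 'uncertain') and route low-confidence answers to separate handling.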
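
The threshold-based filtering in the last item could be sketched as follows. Rather than hard-coding a cutoff, this picks one from a small labeled validation set. The selection rule, the function names `pick_threshold` and `tag_answer`, and the validation values are all hypothetical and not from the paper.

```python
from typing import Dict, Optional, Sequence, Union


def pick_threshold(scores: Sequence[float], labels: Sequence[int],
                   target_accuracy: float = 0.9) -> Optional[float]:
    """Lowest cutoff t such that answers with score >= t reach the target accuracy.

    Returns None when no cutoff meets the target on this validation set.
    """
    best = None
    for t in sorted(set(scores), reverse=True):
        kept = [y for s, y in zip(scores, labels) if s >= t]
        if sum(kept) / len(kept) >= target_accuracy:
            best = t  # remember the lowest cutoff so far that meets the target
    return best


def tag_answer(answer: str, score: float,
               threshold: float) -> Dict[str, Union[str, float]]:
    """Attach a status tag so low-confidence answers can be routed elsewhere."""
    status = "confident" if score >= threshold else "uncertain"
    return {"answer": answer, "phi_first": score, "status": status}


# Toy validation set: score per answer and whether the answer was correct.
val_scores = [0.95, 0.9, 0.8, 0.75, 0.6, 0.5, 0.3]
val_labels = [1, 1, 1, 0, 1, 0, 0]
threshold = pick_threshold(val_scores, val_labels, target_accuracy=0.75)
print(threshold)
print(tag_answer("William Shakespeare", 0.88, threshold))
```

A cutoff chosen this way depends on the model and dataset, so it would need to be re-derived whenever either changes.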

Code Example

import math

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

def compute_phi_first(question: str, K: int = 100) -> float:
    """
    Confidence score from the normalized entropy of the top-K logits at the
    first answer token. Returns a value from 0 (uncertain) to 1 (confident).
    Simplification: the position right after the prompt is treated as the
    first content-bearing token.
    """
    prompt = f"Answer the following question in a short phrase.\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model(**inputs)
        # Logits for the token that follows the last prompt token (the first generated token).
        # Cast to float32: in float16 the 1e-12 epsilon below rounds to zero,
        # so an underflowed probability would give log(0) and a NaN score.
        logits = outputs.logits[0, -1, :].float()  # shape: [vocab_size]

    # Take the top-K logits and renormalize over them
    top_k_logits, _ = torch.topk(logits, K)
    top_k_probs = torch.softmax(top_k_logits, dim=-1)

    # Normalized entropy: divide by log(K), the maximum entropy over K outcomes
    H = -torch.sum(top_k_probs * torch.log(top_k_probs + 1e-12)).item()
    H_normalized = H / math.log(K)

    phi_first = 1.0 - H_normalized  # closer to 1 means more confident
    return phi_first

# Usage example
question = "Who wrote Hamlet?"
phi = compute_phi_first(question)
print(f"ϕfirst: {phi:.3f}")
if phi < 0.5:  # illustrative cutoff; tune it on labeled data
    print("⚠️ The model is uncertain. The answer may be unreliable.")
else:
    print("✅ The model is confident.")
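
The example above reads the distribution at the position right after the prompt, which is only the first content-bearing token if the model does not open with whitespace, a newline, or a quote. One possible heuristic for locating that token in a greedy continuation is sketched below. The digest does not give the paper's definition of "content-bearing", so the rule (first decoded token containing a letter or digit) and the name `first_content_index` are assumptions. It expects decoded token text, not raw vocabulary strings, because marker characters used by some tokenizers (such as `Ġ`) count as letters.

```python
from typing import Optional, Sequence


def first_content_index(decoded_tokens: Sequence[str]) -> Optional[int]:
    """Index of the first decoded token that contains a letter or digit, else None."""
    for i, tok in enumerate(decoded_tokens):
        if any(ch.isalnum() for ch in tok):
            return i
    return None


# Toy continuation: the model emits a newline and a quote before the answer.
tokens = ["\n", ' "', "William", " Shakespeare", '"']
print(first_content_index(tokens))
```

With transformers, per-step scores for a greedy continuation can be requested via `generate(..., do_sample=False, output_scores=True, return_dict_in_generate=True)`, and the entropy would then be computed at the step this index points to.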

Terminology

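AUROC: A score for how well a classifier separates two classes. 0.5 is coin-flip level and 1.0 is perfect separation; 0.8 or above is usually considered usable.

semantic self-consistency: Sample answers to the same question several times, group those with similar meaning, and take the most frequent meaning as the answer; how strongly the samples agree serves as the confidence signal. An NLI model judges semantic similarity.

NLI: Natural Language Inference. A technique for deciding whether one sentence entails, contradicts, or is unrelated to another.

greedy decode: Always choosing the single highest-probability token at each generation step. It is deterministic, so the output is the same every time.

logit: The raw score a model assigns to each token (word piece) before it is converted into a probability. Applying softmax turns logits into probabilities.

entropy: A numeric measure of uncertainty. It is high when probability is spread evenly across the possibilities (uncertain) and low when it is concentrated on one (confident).

subsumption test: An analysis that checks whether A already contains the information in B, that is, whether A alone suffices and adding B brings no meaningful improvement.

verbalized confidence: Asking the model directly, for example "On a scale of 0 to 100, how confident are you?". It performed worst in this paper.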
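
As an illustration of the surface-form self-consistency baseline that the sampling-based methods build on, the sketch below scores agreement by exact match after light normalization. The normalization rules and the names `normalize` and `surface_agreement` are assumptions for illustration; the paper's exact agreement measure is not described in this digest.

```python
import re
import string
from typing import Sequence


def normalize(answer: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, collapse whitespace."""
    text = answer.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def surface_agreement(greedy: str, samples: Sequence[str]) -> float:
    """Fraction of sampled answers whose normalized text equals the greedy answer's."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    target = normalize(greedy)
    return sum(normalize(s) == target for s in samples) / len(samples)


samples = ["william shakespeare.", "Shakespeare", "William Shakespeare", "Christopher Marlowe"]
print(surface_agreement("William Shakespeare", samples))
```

Here "Shakespeare" counts as a disagreement even though it means the same thing, which is the sensitivity to lexical variation that clustering answers with an NLI model is meant to address.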

Original Abstract

Self-consistency detects hallucinations by generating multiple sampled answers to a question and measuring agreement, but this requires repeated decoding and can be sensitive to lexical variation. Semantic self-consistency improves this by clustering sampled answers by meaning using natural language inference, but it adds both sampling cost and external inference overhead. We show that first-token confidence, phi_first, computed from the normalized entropy of the top-K logits at the first content-bearing answer token of a single greedy decode, matches or modestly exceeds semantic self-consistency on closed-book short-answer factual question answering. Across three 7-8B instruction-tuned models and two benchmarks, phi_first achieves a mean AUROC of 0.820, compared with 0.793 for semantic agreement and 0.791 for standard surface-form self-consistency. A subsumption test shows that phi_first is moderately to strongly correlated with semantic agreement, and combining the two signals yields only a small AUROC improvement over phi_first alone. These results suggest that much of the uncertainty information captured by multi-sample agreement is already available in the model's initial token distribution. We argue that phi_first should be reported as a default low-cost baseline before invoking sampling-based uncertainty estimation.