Noise-Aware Verbal Confidence Calibration for LLMs in RAG Systems

NAACL: Noise-AwAre Verbal Confidence Calibration for LLMs in RAG Systems

Jan 16, 2026 • Jiayu Liu, Rui Wang, Qing Zong, and 7 others

TL;DR Highlight

When bad retrieval results get mixed into RAG, LLMs become confidently wrong. This paper fixes that by fine-tuning on about 2K examples.

Who Should Read

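Backend and ML engineers who run a RAG pipeline in production and want to reduce cases where the model gives wrong answers with high confidence. Especially relevant for teams working in domains where model reliability matters, such as healthcare, law, and finance.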

Core Mechanics

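In the RAG setting, every model tested (Llama-3.1-8B, Qwen2.5-7B, and the DeepSeek-R1-Distill family) has an average ECE (a measure of the gap between confidence and accuracy) above 0.4, far beyond the 0.25 reference threshold and a sign of severe miscalibration.
When counterfactual passages are mixed in, the model keeps high confidence even in wrong answers: conflicting evidence does not lower its confidence, which produces overconfidence.
Mixing in irrelevant passages actually raises confidence: the more text in the context, the more certain the model becomes, without any supporting evidence.
Three NAACL Rules are proposed: (1) when evidence conflicts, rely on internal knowledge instead of the external passages; (2) ignore irrelevant passages; (3) when no valid passage exists, fall back to parametric knowledge.
LoRA SFT on about 2K HotpotQA examples teaches the model to judge passage quality and adjust its confidence accordingly. No external teacher model is needed.
NAACL outperforms 'Label-only SFT', which trains only on answer and confidence labels. The calibration gain comes from the noise-aware reasoning process, not from fitting labels.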
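
The three rules above can be read as a small decision procedure. The following is a minimal sketch of that reading, not the paper's implementation: the `Relevance` enum, the `choose_knowledge_source` helper, and the `passages_conflict` flag are hypothetical names introduced for illustration, and in NAACL the model performs this judgment itself in its reasoning rather than through external code.

```python
from enum import Enum


class Relevance(Enum):
    # Label names mirror the categories used in the noise-aware prompt.
    HIGHLY_RELEVANT = "highly_relevant"
    RELEVANT = "relevant"
    IRRELEVANT = "irrelevant"


def choose_knowledge_source(labels, passages_conflict):
    """Return "parametric" or "passage" (hypothetical helper, illustration only).

    labels: one Relevance label per retrieved passage.
    passages_conflict: True if the highly relevant passages contradict each other.
    """
    # Rule 2: passages that are not highly relevant are ignored for answering.
    useful = [label for label in labels if label is Relevance.HIGHLY_RELEVANT]
    # Rule 3: no valid passage, so fall back to parametric knowledge.
    if not useful:
        return "parametric"
    # Rule 1: conflicting evidence, so prefer internal knowledge.
    if len(useful) > 1 and passages_conflict:
        return "parametric"
    # Otherwise the answer can be grounded in the retrieved evidence.
    return "passage"
```

The reported confidence would then correspond to whichever source was chosen, which is the behavior the fine-tuned model is trained to produce.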

Evidence

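With NAACL, ECE improves by 10.9% on average in-domain and by 8.0% out-of-domain (across 4 benchmarks).
On Llama-3.1-8B, ECE drops from 0.377 to 0.266 relative to CoT (about 0.11 in absolute terms), and AUROC rises from 0.591 to 0.751.
Adding counterfactual noise raises ECE relative to the gold-only setting by 31.6% on average for Llama-3.1-8B and by 35.1% for DeepSeek-R1-Distill-Llama-8B.
Training used k=3 passages, yet testing with k=5 still improves ECE by 8% on average over Vanilla, which suggests the method generalizes.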
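
To reproduce this kind of measurement on your own RAG logs, the two metrics can be computed from pairs of (stated confidence, was the answer correct). The sketch below uses the standard binned ECE and a pairwise AUROC. The choice of 10 equal-width bins and the function names are assumptions for illustration; the paper's exact evaluation settings are not stated in this summary.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - avg_conf)
    return ece


def auroc(confidences, correct):
    """Probability that a correct answer gets higher confidence than a wrong one.

    Ties count as 0.5. Quadratic in the number of answers, fine for small logs.
    """
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

A model that always says 90% but is right only a quarter of the time gets an ECE of 0.65 under this definition, which is the overconfidence pattern described above.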

How to Apply

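Prompt-only approach, usable immediately: use the paper's noise-aware prompt (Figure 8) as is. Instructing the model to classify each passage as 'Highly Relevant / Relevant / Irrelevant' and to follow the three rules improves ECE over CoT without any fine-tuning.
With fine-tuning: on 2K HotpotQA samples, synthesize passages for three scenarios (counterfactual, consistent, irrelevant), then run LoRA SFT using the public code (https://github.com/HKUST-KnowComp/NAACL) and LLaMA-Factory.
When exposing RAG answer confidence to users, have the model explicitly judge the passages before stating its confidence. This also improves interpretability: you can trace why the confidence is what it is, rather than seeing only a number.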
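
For the fine-tuning route, each training question needs a passage set for a given noise scenario. The sketch below shows one possible way to assemble such a set. It is not the paper's pipeline: the helper name `build_passage_set`, its parameters, and in particular the composition of each scenario (for example, treating 'consistent' as one gold passage plus fillers) are assumptions for illustration. The repository linked above defines the actual recipe.

```python
import random


def build_passage_set(scenario, gold, counterfactual, irrelevant_pool, k=3, seed=0):
    """Assemble k passages for one noise scenario (illustrative assumption only)."""
    rng = random.Random(seed)
    if scenario == "consistent":
        core = [gold]                  # evidence supports the answer
    elif scenario == "counterfactual":
        core = [gold, counterfactual]  # evidence conflicts
    elif scenario == "irrelevant":
        core = []                      # no usable evidence at all
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    # Fill the remaining slots with irrelevant passages, then shuffle so that
    # position in the context does not reveal which passage is which.
    fillers = rng.sample(irrelevant_pool, k - len(core))
    passages = core + fillers
    rng.shuffle(passages)
    return passages
```

The default k=3 matches the number of passages reported for training.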

Code Example

# Noise-aware prompt (usable immediately, no fine-tuning required)
SYSTEM_PROMPT = """
You will be asked a question with 3 retrieved passages.
Classify each passage:
- Highly Relevant: directly states or strongly indicates an answer
- Relevant: shares topic/keywords but lacks specific answer info
- Irrelevant: no shared topic or keywords

Rules:
1. If multiple passages are Highly Relevant AND contradictory:
   → Use your own knowledge, report corresponding confidence
2. If exactly one passage is Highly Relevant:
   → Answer based on that passage, report corresponding confidence
3. If no passage is Highly Relevant:
   → Use your own knowledge, report corresponding confidence

Think step by step, classify each passage, then output:
Final Answer: [answer]
Confidence: [0%-100%]
"""

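import re

# Usage example: build the user message from a question and its retrieved passages
def build_user_message(question: str, passages: list[str]) -> str:
    header = f"Question: {question}\nRetrieved Passages:\n"
    return header + "\n".join(f"[P{i}] {p}" for i, p in enumerate(passages, start=1))

# Confidence parsing: accepts "Confidence: 85%", "Confidence: [85%]", or "Confidence: 85.5%"
def parse_confidence(response: str) -> float:
    match = re.search(r'Confidence:\s*\[?\s*(\d+(?:\.\d+)?)\s*%', response)
    if match is None:
        return 0.5  # neutral fallback when the model omits a confidence line
    return min(float(match.group(1)), 100.0) / 100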

Terminology

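ECE: Expected Calibration Error. Measures whether a model that says it is '70% confident' is actually right about 70% of the time. Closer to 0 is better; a value of 0.4 means confidence and actual performance are far apart.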

AUROC: Area Under the ROC Curve. Indicates how well the model's confidence separates correct answers from incorrect ones. 0.5 is chance level; 1.0 is perfect separation.

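verbal confidence: Confidence that the model states directly in its output, as a number or in words, such as '70%' or 'very certain'. Because it needs no access to internal logits, it also works with closed models such as GPT-4.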

calibration: How well the model's confidence matches its actual accuracy. A well-calibrated model that says '80% confident' is right about 80 times out of 100. Overconfidence means showing high confidence even when wrong.

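counterfactual passage: A passage that is related to the question but contains plausible information contradicting the correct answer. Example: a fabricated passage stating 'the Premier League airs on TF1' instead of 'the Premier League airs on Canal+'.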

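SFT: Supervised Fine-Tuning. Training the model by showing it worked examples to imitate. It is simpler and cheaper than reinforcement learning; this paper trains efficiently with LoRA on only about 2K examples.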

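LoRA: Low-Rank Adaptation. A fine-tuning technique that trains only a small number of parameters. Instead of updating the whole model, it adds and trains small adapter layers, which saves substantial GPU memory and time.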

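parametric knowledge: Knowledge stored in the model's weights through training; in RAG, what the model knows on its own without external retrieval. Falling back to this internal knowledge when all retrieved passages are useless or conflicting is one of the paper's core strategies.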

Related Resources

https://github.com/HKUST-KnowComp/NAACL

Original Abstract

Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks, revealing that LLMs exhibit poor calibration performance due to noisy retrieved contexts. Specifically, contradictory or irrelevant evidence tends to inflate the model's false certainty, leading to severe overconfidence. To address this, we propose NAACL Rules (Noise-AwAre Confidence CaLibration Rules) to provide a principled foundation for resolving overconfidence under noise. We further design NAACL, a noise-aware calibration framework that synthesizes supervision from about 2K HotpotQA examples guided by these rules. By performing supervised fine-tuning (SFT) with this data, NAACL equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NAACL yields substantial gains, improving ECE scores by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NAACL paves the way for both accurate and epistemically reliable LLMs.