Reassessing Uncertainty Quantification for LLM Agents

Position: Uncertainty Quantification Needs Reassessment for Large-language Model Agents

May 28, 2025 • Michael Kirchhof, Gjergji Kasneci, Enkelejda Kasneci

TL;DR Highlight

A position paper arguing that the traditional aleatoric/epistemic uncertainty dichotomy does not fit conversational LLM agents, and that three new directions for uncertainty research are needed.

Who Should Read

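AI engineers building chatbots or conversational LLM agents who are working out how to tell users when the model is likely to be wrong and how confident it is. Also useful for product teams designing hallucination detection or confidence-display features.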

Core Mechanics

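The traditional split of uncertainty into aleatoric (irreducible) and epistemic (reducible through learning) is already theoretically inconsistent, because different schools of thought define the two terms in conflicting ways.
Empirically, aleatoric and epistemic estimates show rank correlations of 0.8 to 0.999, so they order inputs almost identically and separating them adds little information in practice.
In conversational LLMs, a follow-up question can turn aleatoric uncertainty into epistemic uncertainty and vice versa, so the dichotomy itself loses its meaning.
GPT-3.5-Turbo-16k detects ambiguous questions with only 57% accuracy, barely above the 50% random baseline.
Three new research directions are proposed: (1) detecting underspecification uncertainty, which arises when users do not provide the full task or all information up front; (2) reducing uncertainty through interactive learning, that is, by asking clarifying questions; (3) expressing uncertainty richly in language or speech rather than as a single number.
When an LLM agent explains "I'm not sure" in text instead of a number (why it is uncertain, which options exist), people are better able to judge the answer.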
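
To make the rank-correlation point concrete, the sketch below computes a Spearman rank correlation between two sets of per-input uncertainty scores. The scores and function names are made-up illustrations, not data or code from the paper or from Mucsányi et al. (2024). The sketch only shows what it means for two estimators to rank inputs almost identically.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-input scores from two estimators (illustrative numbers only).
aleatoric_scores = [0.10, 0.35, 0.20, 0.80, 0.55, 0.90]
epistemic_scores = [0.05, 0.30, 0.32, 0.70, 0.40, 0.95]

rho = spearman(aleatoric_scores, epistemic_scores)
print(f"Spearman rank correlation: {rho:.3f}")
```

In this toy data only two inputs swap places between the two rankings, which is already enough to land in the 0.8 to 0.999 range reported above.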

Evidence

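The rank correlation between aleatoric and epistemic estimates is 0.8 to 0.999: two measures that are separate in theory carry nearly the same information in practice (Mucsányi et al., 2024; deep ensemble experiments on ImageNet-1k).
GPT-3.5-Turbo-16k detects ambiguous questions with 57% accuracy (random: 50%), and human raters judged the follow-up questions useful in only 53% of cases (Zhang et al., 2024).
In 56% of the Natural Questions test set the question is ambiguous and has more than one correct answer, which indicates the scale of the underspecification problem that current LLMs face (Min et al., 2020).
As of 2024, roughly one paper per day with "aleatoric" or "epistemic" in its title or abstract appears on arXiv, yet this paper argues that the direction itself is flawed.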

How to Apply

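Design the prompt so that the chatbot asks a clarifying question before answering, for example: "The answer depends on the country, which this question does not specify. Which country do you mean?" This reduces underspecification uncertainty.
Instead of reporting "probability 0.6" for an uncertain answer, add a system-prompt instruction to respond in the form "There are two possibilities: if A, then ...; if B, then .... Tell me which situation applies and I can answer more precisely." This helps users judge the answer.
In automated pipelines (where the agent talks to another system rather than a person), keep the existing numeric uncertainty scores, and switch to text-based uncertainty expression in user-facing interfaces. This dual strategy covers both settings.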
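
The dual strategy can be sketched as one answer object with two renderers. Everything here is a hypothetical illustration: the `UncertainAnswer` class, its field names, and the emergency-number example are assumptions, and the numeric `confidence` is presumed to come from whatever UQ method is already in use.

```python
import json
from dataclasses import dataclass


@dataclass
class UncertainAnswer:
    """Hypothetical container for an answer plus its uncertainty."""
    candidates: dict          # condition -> answer under that condition
    confidence: float         # numeric score from an existing UQ method
    missing_info: str = ""    # what the user left unspecified, if anything


def render_for_pipeline(ans: UncertainAnswer) -> str:
    """Machine-facing output: keep the numeric score."""
    return json.dumps({"candidates": ans.candidates, "confidence": ans.confidence})


def render_for_user(ans: UncertainAnswer) -> str:
    """User-facing output: describe the alternatives instead of a number."""
    if len(ans.candidates) == 1:
        return next(iter(ans.candidates.values()))
    lines = [f"There are {len(ans.candidates)} possibilities:"]
    lines += [f"- If {cond}: {answer}" for cond, answer in ans.candidates.items()]
    if ans.missing_info:
        lines.append(f"Could you tell me {ans.missing_info}? Then I can answer more precisely.")
    return "\n".join(lines)


# Illustrative question: "What is the emergency phone number?" with no country given.
answer = UncertainAnswer(
    candidates={
        "you are in the United States": "911",
        "you are in the United Kingdom": "999",
    },
    confidence=0.6,
    missing_info="which country you are in",
)
print(render_for_pipeline(answer))
print(render_for_user(answer))
```

Keeping both renderers on the same object means the user-facing text and the machine-facing score cannot drift apart, which is the main reason to model it this way rather than with two separate code paths.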

Code Example

# Example system prompt that expresses uncertainty in text rather than as a number
SYSTEM_PROMPT = """
You are a helpful assistant. When you are uncertain about an answer:
1. DO NOT just say a confidence score like '70% confident'.
2. Instead, explain WHY you are uncertain.
3. List the competing possibilities and what distinguishes them.
4. Ask ONE clarifying question if missing information is the key issue.

Example of good uncertainty expression:
"There are two likely answers depending on context:
- If you're asking about the London premiere: November 4, 2001
- If you're asking about the US theatrical release: November 16, 2001
Could you clarify which one you're asking about?"

Example of bad uncertainty expression:
"I'm about 70% confident the answer is November 2001."
"""

Terminology

Uncertainty Quantification (UQ): Techniques for expressing numerically how confident a model is. Just as a doctor might say "90% chance it's a cold," an AI system reports how reliable its own answer is.

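Aleatoric Uncertainty: Uncertainty that comes from ambiguity inherent in the data itself and cannot be reduced by further training. Like a coin toss, it stems from something fundamentally unpredictable.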

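Epistemic Uncertainty: Uncertainty that arises because the model has not yet learned enough. It reflects a lack of knowledge and shrinks with more learning, much as reading more books does for a person.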

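Underspecification Uncertainty: Uncertainty that arises when the user asks an ambiguous question or leaves out information. Example: "When did Harry Potter come out?" without saying which country's release date is meant.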

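Interactive Learning: A concept inspired by active learning, in which the AI asks the user clarifying questions to reduce uncertainty in the current conversation. The questions do not update the model as a whole; they serve to resolve the conversation at hand.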
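
As a minimal sketch of this idea, the loop below asks one clarifying question per missing piece of information and records the replies in a per-conversation context. Treating underspecification as a simple missing-field check is an assumption made for illustration, since the paper frames detection as an open research problem. The function names and the `ask` callback are hypothetical.

```python
def missing_slots(context: dict, required: list) -> list:
    """Return the required fields the user has not specified yet."""
    return [slot for slot in required if slot not in context]


def resolve_by_asking(context: dict, required: list, ask) -> dict:
    """Ask one clarifying question per missing field and record the replies.

    Only the per-conversation context changes; no model weights are updated.
    """
    resolved = dict(context)
    for slot in missing_slots(resolved, required):
        resolved[slot] = ask(f"Which {slot} do you mean?")
    return resolved


# Scripted replies stand in for a real user; pass `input` as `ask` to try it interactively.
scripted_replies = iter(["United Kingdom"])
context = {"title": "Harry Potter"}
resolved = resolve_by_asking(context, ["title", "country"], lambda question: next(scripted_replies))
print(resolved)
```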

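Calibration: The property that when a model says it is 80% confident, it is correct about 80% of the time. A weather forecast of "80% chance of rain" is well calibrated if it actually rains on about 8 out of 10 such days.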
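
One common way to measure calibration is the expected calibration error (ECE), sketched below with the rain-forecast example. ECE is a standard metric rather than something this paper proposes, and the bin count and toy data are assumptions chosen for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index; 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


# Ten hypothetical forecasts of "80% chance of rain"; it rained on 8 of the days.
well_calibrated = expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2)
# Same forecasts, but it rained on only 4 days: the forecaster is overconfident.
overconfident = expected_calibration_error([0.8] * 10, [1] * 4 + [0] * 6)
print(f"well calibrated: {well_calibrated:.2f}, overconfident: {overconfident:.2f}")
```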

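Conformal Prediction: A technique that outputs a set of candidate answers with a statistical coverage guarantee. For example, it returns "the answer is either A or B," a set that contains the correct answer with a chosen probability (such as 90%) rather than with certainty.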
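
A minimal sketch of split conformal prediction for a multiple-choice setting follows. It uses the standard recipe (nonconformity score = 1 minus the probability of the true label, thresholded at a finite-sample quantile). The calibration scores, label probabilities, and function names are made up for illustration and do not come from the paper.

```python
import math


def conformal_threshold(calibration_scores, alpha=0.1):
    """Score threshold for at least (1 - alpha) marginal coverage, assuming exchangeable data."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return float("inf")               # too few calibration points: keep every label
    return sorted(calibration_scores)[k - 1]


def prediction_set(label_probs, threshold):
    """Keep every label whose nonconformity score (1 - prob) is within the threshold."""
    return {label for label, p in label_probs.items() if 1 - p <= threshold}


# Hypothetical calibration scores: 1 - probability the model gave the true answer.
calibration_scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.40, 0.55, 0.60]
threshold = conformal_threshold(calibration_scores, alpha=0.2)

# Hypothetical model probabilities for a new question.
new_question = {"A": 0.50, "B": 0.48, "C": 0.02}
print(sorted(prediction_set(new_question, threshold)))
```

A smaller `alpha` raises the threshold and so yields larger sets, which is how the method trades precision for coverage.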

Original Abstract

Large-language models (LLMs) and chatbot agents are known to provide wrong outputs at times, and it was recently found that this can never be fully prevented. Hence, uncertainty quantification plays a crucial role, aiming to quantify the level of ambiguity in either one overall number or two numbers for aleatoric and epistemic uncertainty. This position paper argues that this traditional dichotomy of uncertainties is too limited for the open and interactive setup that LLM agents operate in when communicating with a user, and that we need to research avenues that enrich uncertainties in this novel scenario. We review the literature and find that popular definitions of aleatoric and epistemic uncertainties directly contradict each other and lose their meaning in interactive LLM agent settings. Hence, we propose three novel research directions that focus on uncertainties in such human-computer interactions: Underspecification uncertainties, for when users do not provide all information or define the exact task at the first go, interactive learning, to ask follow-up questions and reduce the uncertainty about the current context, and output uncertainties, to utilize the rich language and speech space to express uncertainties as more than mere numbers. We expect that these new ways of dealing with and communicating uncertainties will lead to LLM agent interactions that are more transparent, trustworthy, and intuitive.