일찍 물어라, 늦게 물어라, 제때 물어라: Long-Horizon Agent에서 Clarification 타이밍은 언제 중요한가?

TL;DR Highlight

AI 에이전트가 작업 중 언제 사용자에게 질문해야 하는지 처음으로 수치화한 연구 — 목표 clarification은 실행 10% 이내, 입력 clarification은 50% 이내가 한계.

Who Should Read

복잡한 멀티스텝 워크플로우를 실행하는 AI 에이전트를 개발하거나 설계하는 백엔드/AI 엔지니어. 에이전트가 불명확한 지시를 받았을 때 언제 사용자에게 되물을지 정책을 고민하는 분.

Core Mechanics

'일찍 물어볼수록 좋다'는 상식이 틀렸음 — 무엇이 빠졌느냐에 따라 최적 타이밍이 완전히 달라짐.
목표(goal) clarification은 실행의 10% 이내에 받아야 효과가 있음 (pass@3: 0.78 vs. 아무것도 안 물으면 0.40). 30%만 지나도 절반 수준으로 떨어짐.
입력(input) clarification은 실행 50%까지는 효과가 유지되지만, 70%를 넘으면 아무것도 안 물은 것보다 오히려 나빠짐.
제약(constraint) clarification은 도중에 끼워넣으면 오히려 기존 작업과 충돌해서 점수가 떨어지는 경우가 있음 — 시작 전에 물어보는 게 최선.
어떤 타이밍이든 궤적 중반(50%)을 넘겨서 clarification하면 성능이 아예 안 물은 것보다 낮아짐.
테스트한 프론티어 모델 중 최적 타이밍에 질문하는 모델은 하나도 없음: GPT-5.2는 과도하게 질문(세션의 52%), Claude Sonnet 4.5는 소극적(23%), Gemini 3 Flash는 아예 안 물어봄(0%).

Evidence

MCP-Atlas 벤치마크에서 목표 clarification을 10%에 주입하면 pass@3 = 0.78, oracle = 0.80이지만, 70%에 주입하면 0.39로 no-clarification 기준선(0.40)과 동일해짐.
입력 clarification은 Inj-10에서 0.46, Inj-50에서 0.36으로 NC(0.33) 위를 유지하지만, Inj-90에서는 0.25로 NC 아래로 떨어짐.
낭비 컴퓨팅(wasted compute): TheAgentCompany에서 주입 시점이 늦어질수록 Inj-10: 0%, Inj-90: 21.7%로 선형 증가. SWE-Bench Pro는 Inj-10에서 0.7 액션, Inj-90에서 10.4 액션 낭비.
동일 태스크 셋을 공유하는 3개 모델 간 Kendall τ 상관계수 0.78~0.87 — 타이밍 효과는 모델이 아니라 태스크 자체에 내재된 특성임을 보여줌.

How to Apply

에이전트 파이프라인 설계 시, 작업 시작 직후 첫 1~2번의 액션 안에 '목표가 명확한가?' 를 체크하는 early goal validation 단계를 강제로 삽입하라. 10% 넘어서 물어보면 이미 늦음.
입력 데이터 소스나 파일 위치가 불명확한 경우, 에이전트가 탐색 액션을 몇 번 한 후 (전체 궤적의 30~50% 이내) 질문하도록 정책을 설계하면 됨 — 50%를 넘기지 않는 것이 핵심.
규칙/제약(constraint) 정보가 빠진 경우, 작업 중간에 끼워넣으면 기존 결과물과 충돌해서 오히려 역효과. 프롬프트 앞단에서 제약 조건을 먼저 확인하거나, 실행 전 pre-flight check 단계를 별도로 두어라.

Code Example

snippet

# 에이전트 clarification 타이밍 정책 예시
# forced-injection 논문의 ask_user 도구 스키마를 참고한 구현

ask_user_tool = {
    "name": "ask_user",
    "description": "Ask the user a clarifying question to get more information about the task. Use this when the task is ambiguous or you need specific details to proceed.",
    "parameters": {
        "properties": {
            "question": {
                "type": "string",
                "description": "The clarifying question to ask."
            },
            "context": {
                "type": "string",
                "description": "Optional additional context.",
                "default": ""
            }
        },
        "required": ["question"]
    }
}

# 타이밍 정책: 정보 유형별 최대 허용 시점
CLARIFICATION_DEADLINES = {
    "goal":       0.10,  # 전체 예상 액션의 10% 이내에 물어볼 것
    "input":      0.50,  # 50% 이내
    "constraint": 0.00,  # 실행 전(pre-execution)에만 허용
    "context":    0.10,  # 10% 이내 (MCP-Atlas 기준)
}

def should_ask_clarification(
    missing_info_type: str,
    current_action: int,
    estimated_total_actions: int
) -> bool:
    """현재 시점에 clarification을 요청해야 하는지 판단"""
    deadline = CLARIFICATION_DEADLINES.get(missing_info_type, 0.10)
    current_progress = current_action / max(estimated_total_actions, 1)
    
    if current_progress > deadline:
        print(f"[WARN] {missing_info_type} clarification이 너무 늦었습니다 "
              f"(현재: {current_progress:.0%}, 마감: {deadline:.0%}). "
              f"묻지 않는 것보다 성능이 낮아질 수 있습니다.")
        return False
    return True

# 사용 예시
if should_ask_clarification("goal", current_action=1, estimated_total_actions=20):
    # 첫 번째 액션 직후 목표 확인 질문
    response = call_llm_with_tool(ask_user_tool, 
        prompt="Should I output the result in CSV or JSON format?")

Terminology

Long-Horizon Agent수십~수백 개의 순차적 액션을 연속으로 실행하는 AI 에이전트. 단순 Q&A가 아니라 파일 수정, 웹 탐색, DB 조회 등을 연달아 수행하는 복잡한 워크플로우를 처리함.

Clarification에이전트가 불명확한 지시를 받았을 때 사용자에게 되묻는 행동. '어떤 포맷으로 드릴까요?' 같은 질문.

VOI (Value of Information)특정 시점에 정보를 얻는 것이 얼마나 가치 있는지를 수치화한 개념. 정보를 늦게 받을수록 이미 잘못된 방향으로 일을 진행한 뒤라 가치가 떨어짐.

pass@3같은 작업을 3번 시도했을 때 최소 1번 성공할 확률. 에이전트 성능을 측정하는 지표.

Forced Injection이 논문에서 고안한 실험 방법. 에이전트에게 특정 시점에 강제로 정보를 주입해서, 타이밍에 따라 성능이 어떻게 달라지는지 측정함.

Kendall τ (타우)두 모델이 같은 문제에서 비슷한 순위를 매기는지 측정하는 통계 지표. 1에 가까울수록 두 모델의 평가가 일치함.

Wasted Compute잘못된 가정으로 에이전트가 수행한 액션 중 나중에 버려진 것의 비율. 늦게 clarification할수록 이 낭비가 커짐.

Oracle완전한 정보를 미리 받은 이상적인 에이전트. 실험에서 성능의 상한선으로 사용됨.

Related Resources

Original Abstract (Expand)

Long-horizon AI agents execute complex workflows spanning hundreds of sequential actions, yet a single wrong assumption early on can cascade into irreversible errors. When instructions are incomplete, the agent must decide not only whether to ask for clarification but when, and no prior work measures how clarification value changes over the course of execution. We introduce a forced-injection framework that provides ground-truth clarifications at controlled points in the agent's trajectory across four information dimensions (goal, input, constraint, context), three agent benchmarks, and four frontier models (three per benchmark; one on a single benchmark only; 84 task variants; 6,000+ runs). Counter to the common intuition that "earlier is always better," we find that the value of clarification depends sharply on what information is missing: goal clarification loses nearly all value after 10% of execution (pass@3 drops from 0.78 to baseline), while input clarification retains value through roughly 50%. Deferring any clarification type past mid-trajectory degrades performance below never asking at all. Cross-model Kendall tau correlations (0.78-0.87 among models sharing identical task coverage; 0.34-0.67 across the full 4-model panel) confirm these timing profiles are substantially task-intrinsic. A complementary study of 300 unscripted sessions reveals that no current frontier model asks within the empirically optimal window, with strategies ranging from over-asking (52% of sessions) to never asking at all. These empirical demand curves provide the quantitative foundation that existing theoretical frameworks require but have lacked, and establish concrete design targets for timing-aware clarification policies. Code and data will be publicly released.