In-Context Prompting Replaces Agent Orchestration for Procedural Tasks

TL;DR Highlight

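Instead of an agent orchestration framework such as LangGraph, put the entire procedure in the system prompt: in the paper's tests, quality was higher and the failure rate was lower.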

Who Should Read

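Backend and AI developers building customer service bots or procedural workflows with agent frameworks such as LangGraph or CrewAI. Senior developers and AI architects who need to decide whether to adopt such a framework.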

Core Mechanics

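Instead of an external orchestrator such as LangGraph, the whole procedure (every node, edge, and condition) goes directly into the system prompt and the model manages the flow itself. This approach performed consistently better.
In all three domains tested (travel booking, 14 nodes; Zoom technical support, 14 nodes; insurance claims, 55 nodes), the in-context approach scored higher than LangGraph on all five evaluation metrics, winning 15 of 15 comparisons.
Orchestration sharply raises the failure rate: 24% vs 11.5% for travel booking, 9% vs 0.5% for Zoom support (an 18x gap), and 17% vs 5% for insurance. These failures stem from the architecture itself.
The orchestrator fails for three reasons: reasoning fragmentation (a separate LLM call per node splits the reasoning, so no single call sees the whole context), mistakes in the routing LLM calls themselves, and node template injection that disrupts the model's natural conversation.
The orchestrator also makes 1.2 to 1.7 times more LLM calls: 17.3 vs 10.0 in the insurance domain. It calls the model more often and still delivers lower quality.
The in-context approach costs 1.3 to 1.4 times more, because the entire procedure is included in every API call: $0.22 vs $0.17 in the insurance domain. Given the quality gap, the premium is well worth accepting.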
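
As a rough way to weigh the cost premium against the failure rates above, the sketch below computes the cost per successful conversation for the insurance domain. It assumes the reported dollar figures are per-conversation averages and that a failed conversation delivers no value. Both are assumptions made here for illustration, not claims from the paper, and `cost_per_success` is a hypothetical helper.

```python
def cost_per_success(cost_per_conversation: float, failure_rate: float) -> float:
    """Average spend needed to obtain one successful conversation."""
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    return cost_per_conversation / (1.0 - failure_rate)


# Insurance-domain figures quoted above; the per-conversation unit is an assumption.
in_context = cost_per_success(0.22, 0.05)
orchestrated = cost_per_success(0.17, 0.17)

print(f"in-context:   ${in_context:.3f} per successful conversation")
print(f"orchestrated: ${orchestrated:.3f} per successful conversation")
print(f"premium:      {in_context / orchestrated:.2f}x")
```

Under these assumptions the in-context approach is still the more expensive one, but the premium is smaller than the raw cost ratio once failed conversations are discounted.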

Evidence

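With Claude as judge, in-context wins all 15 comparisons at p < 0.005 (Mann-Whitney U with Holm-Bonferroni correction), with effect sizes of d = 0.37 to 1.01. In an independent re-evaluation with GPT-4.1 as judge, in-context is significantly better in 11 of 15 comparisons, and orchestration is better in none.
Consistency score: 4.83 to 4.99 for in-context vs 4.32 to 4.55 for LangGraph. This is the metric with the largest gap, which supports the claim that the orchestrator's node-by-node generation prevents it from tracking the overall conversation flow.
Zoom domain failure rate: 9% (18 conversations) for LangGraph vs 0.5% (1 conversation) for in-context, an 18x gap. It points to a structural problem: every branch point in technical-diagnosis routing becomes an additional point of failure.
Average turns per conversation in the 55-node insurance domain: 26.4 for LangGraph vs 19.0 for in-context. The orchestrator generates unnecessary turns by repeatedly making extra routing decisions.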
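
The Holm-Bonferroni correction mentioned above can be sketched in a few lines of plain Python. This is a generic step-down implementation, not the paper's analysis code; the function name `holm_bonferroni` and the example p-values are made up for illustration.

```python
def holm_bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return, for each p-value, whether its null hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The (rank + 1)-th smallest p-value is compared against alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: every larger p-value is also not rejected
    return reject


# Hypothetical p-values, not taken from the paper.
example = [0.001, 0.04, 0.03, 0.005]
decisions = holm_bonferroni(example)
print(decisions)
```

With 15 simultaneous comparisons, the smallest p-value would have to clear alpha / 15, the next alpha / 14, and so on, which is stricter than testing each comparison at alpha on its own.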

How to Apply

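If you are about to build a customer service bot with LangGraph or CrewAI, first serialize the entire workflow (nodes, edges, conditions, terminal states) into structured text, put it in the system prompt as a whole, and test that. Plain API calls with no framework may already give better results.
Check that the procedure fits in the context window. In the paper, the 55-node procedure was about 4,000 tokens, only 2% of a 200K context. Placing the procedure near the start of the system prompt also helps avoid the 'Lost in the Middle' problem.
If cost is a concern (the in-context approach is 1.3 to 1.4 times more expensive), consider the companion direction the paper mentions: fine-tuning the procedure into the weights of a small model. That work reportedly compiled a procedure into an 8B model and reached the quality of the LangGraph orchestrator at 128 to 462 times lower cost.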
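
The first two steps above (serialize the workflow, then check that it fits the context window) can be sketched as follows. The `Node` dataclass, `serialize`, and `rough_token_estimate` are hypothetical names introduced here, the output layout mirrors the PROCEDURE text in the Code Example section, and the 4-characters-per-token ratio is only a crude assumption; use the model's real tokenizer or token-counting API for an accurate figure.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    step: int
    name: str
    kind: str                        # e.g. "AGENT", "AGENT - DECISION POINT", "TERMINAL - EXIT"
    instruction: str = ""
    next_step: Optional[int] = None  # unconditional transition, if any
    routes: dict[str, int] = field(default_factory=dict)  # condition -> target step


def serialize(title: str, nodes: list[Node]) -> str:
    """Render the workflow graph as structured plain text for the system prompt."""
    lines = [f"PROCEDURE: {title}", "=" * 40]
    for n in nodes:
        lines.append(f"Step {n.step}: {n.name} [{n.kind}]")
        if n.instruction:
            lines.append(n.instruction)
        if n.routes:
            lines.append("ROUTES:")
            lines += [f"- If {cond}: Go to Step {target}" for cond, target in n.routes.items()]
        elif n.next_step is not None:
            lines.append(f"-> Go to Step {n.next_step}")
        lines.append("")  # blank line between steps
    return "\n".join(lines).rstrip() + "\n"


def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Crude size estimate; the characters-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token) + 1


nodes = [
    Node(1, "OPENING", "AGENT", "Greet the user warmly and ask how you can help.", next_step=2),
    Node(3, "ASSESS", "AGENT - DECISION POINT", "Assess the conversation state.",
         routes={"missing_info": 4, "info_complete": 6}),
    Node(7, "ABANDON", "TERMINAL - EXIT", "Close gracefully."),
]
procedure = serialize("travel_booking_v2", nodes)
tokens = rough_token_estimate(procedure)
CONTEXT_WINDOW = 200_000  # context size quoted above

print(procedure)
print(f"~{tokens} tokens, {tokens / CONTEXT_WINDOW:.2%} of the context window")
```

Keeping the graph as data and generating the prompt text from it means the same workflow definition can be reused for either approach when comparing them.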

Code Example

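# In-context approach: put the entire procedure in the system prompt
# instead of using a LangGraph orchestrator.
# Requires the `anthropic` package and an ANTHROPIC_API_KEY environment variable.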

import anthropic

# The workflow expressed as serialized text
PROCEDURE = """
PROCEDURE: travel_booking_v2
========================================
Step 1: OPENING [AGENT]
Greet the user warmly and ask how you can help.
-> Go to Step 2

Step 2: USER_INITIAL [USER waits for input]
-> Go to Step 3

Step 3: ASSESS [AGENT - DECISION POINT]
Assess the conversation state.
REQUIRED INFO: Destination, Travel dates, Number of travelers, Budget range.
ROUTES:
- If missing_info: Go to Step 4 (ask for missing info)
- If needs_clarification: Go to Step 5 (ask for clarification)
- If info_complete: Go to Step 6 (present options)
- If user_abandoning: Go to Step 7 (close gracefully)

Step 4: GATHER_INFO [AGENT]
Ask for the specific missing required information.
-> Go to Step 3

Step 5: CLARIFY [AGENT]
Ask for clarification on unclear information.
-> Go to Step 3

Step 6: PRESENT_OPTIONS [AGENT]
Present 2-3 tailored travel options based on gathered info.
-> Go to Step 8

Step 7: ABANDON [TERMINAL - EXIT]
Close gracefully.

Step 8: HANDLE_RESPONSE [AGENT - DECISION POINT]
ROUTES:
- If option_selected: Go to Step 9
- If needs_revision: Go to Step 6
- If user_abandoning: Go to Step 7

Step 9: FINALIZE [AGENT]
Confirm choice, summarize details, ask if ready to book.
-> Go to Step 10

Step 10: SUCCESS [TERMINAL - SUCCESS]
Confirm booking, provide tips, close warmly.
"""

SYSTEM_PROMPT = f"""You are a helpful travel booking assistant.

You must follow this procedure step by step. Determine which step you're at 
based on the conversation context and respond accordingly.
At decision points, choose the route that best matches the user's situation.

{PROCEDURE}

IMPORTANT RULES:
- Follow the procedure from Step 1 through to a terminal state
- At decision points, determine the best route based on context
- Do NOT mention step numbers or that you're following a script
- Be natural and conversational while adhering to the procedure
"""

client = anthropic.Anthropic()
conversation_history = []

def chat(user_message: str) -> str:
    conversation_history.append({"role": "user", "content": user_message})
    
    # Single API call per turn; no extra call for routing
    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        system=SYSTEM_PROMPT,  # the entire procedure lives in the system prompt
        messages=conversation_history
    )
    
    assistant_message = response.content[0].text
    conversation_history.append({"role": "assistant", "content": assistant_message})
    return assistant_message

# Usage example
print(chat("Hi, I want to book a trip to Japan for 2 people"))

Terminology

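Agent Orchestration: An architecture that places an external controller above the LLM. At each step of the conversation, the controller decides which node to move to and injects that node's instructions into the LLM. It is like a traffic officer directing every vehicle (the LLM) individually.

In-Context Prompting: An approach with no external controller. The entire procedure is placed in the prompt from the start, and the LLM works out the flow and proceeds on its own. It is like handing a cook the whole recipe up front.

LLM-as-Judge: A method in which another LLM, rather than a person, rates the quality of LLM-generated conversations. Note that it may carry a self-preference bias, where a model rates output from its own model family more highly.

Decision Hub: A node in the workflow graph that can branch in several directions. For example, the point that judges whether enough information has been gathered and picks one of several next steps.

Directed Graph: A structure that represents a procedure as nodes (steps) and edges (transition conditions). It is the same idea as a flowchart: it states which step follows under which condition.

Mann-Whitney U: A nonparametric test of whether the score distributions of two groups differ significantly. It is an alternative to the t-test that does not assume a normal distribution.

Cohen's d: An effect size measure that standardizes the difference between two groups. By convention, 0.2 is a small difference, 0.5 is medium, and 0.8 or more is large.

Holm-Bonferroni: A correction that controls the chance of getting a significant result by luck when many statistical tests are run at once. Here it addresses the multiple-comparisons problem that arises from running 15 comparisons together.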
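
To make the Cohen's d entry above concrete, here is a minimal sketch using the pooled sample standard deviation. The paper's exact variant of d is not stated in this summary, so the pooled form is an assumption, and the scores below are made-up 5-point ratings, not data from the paper.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a: list[float], b: list[float]) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical judge scores on a 5-point scale.
in_context_scores = [5, 5, 4, 5, 5, 4, 5, 5]
orchestrated_scores = [4, 5, 4, 4, 5, 3, 4, 5]

d = cohens_d(in_context_scores, orchestrated_scores)
print(f"Cohen's d = {d:.2f}")
```

For these invented scores d comes out at about 0.84, which the convention above would call a large effect.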

Original Abstract

Agent orchestration frameworks -- LangGraph, CrewAI, Google ADK, OpenAI Agents SDK, and others -- place an external orchestrator above the LLM, tracking state and injecting routing instructions at every turn. We present a controlled comparison showing that for procedural tasks, this architecture is dominated by a simpler alternative: putting the entire procedure in the system prompt and letting the model self-orchestrate. Across three domains -- travel booking (14 nodes), Zoom technical support (14 nodes), and insurance claims processing (55 nodes) -- we evaluate 200 conversations per condition using LLM-as-judge scoring on five quality criteria. The in-context approach scores 4.53--5.00 on a 5-point scale while a LangGraph orchestrator using the same model scores 4.17--4.84. The orchestrated system fails on 24% of travel, 9% of Zoom, and 17% of insurance conversations, compared to 11.5%, 0.5%, and 5% for the in-context baseline. While external orchestration may have been necessary for earlier models, advances in frontier model capabilities have made it unnecessary for multi-turn conversations following a defined procedure.